nextstrain / mpox

Nextstrain build for mpox virus
https://nextstrain.org/mpox
MIT License

CI is using an incompatible version of the Conda runtime #177

Closed victorlin closed 1 year ago

victorlin commented 1 year ago

Currently (as observed in #176), the Conda runtime job instance of pathogen-ci is failing with the following error:

Current augur version: 22.1.0. Minimum required: 22.2.0

Augur version 22.1.0 is coming from this version of the Conda runtime: nextstrain-base 20230717T174555Z.

This used to work without any noticeable changes. Example: when the Augur minimum version was bumped to 22.2.0, Augur version 22.2.0 was available in this CI run. Notably, the version of the Conda runtime is nextstrain-base 20230731T212806Z.

This also seems to be working fine in the ncov repo, where the latest run resolved to nextstrain-base 20230830T164409Z.

My outstanding question is: why is an older version of the Conda runtime being resolved now, and seemingly only in this repo?

tsibley commented 1 year ago

Weird. I can reproduce this locally. Notably, when I run nextstrain update conda after initial setup, we do install the latest version (20230830T164409Z). This is what I'd expect since we explicitly resolve the version to update to ourselves. So signs point to Micromamba not resolving to the latest version on the initial micromamba create for some reason.

It would be good to understand what's going on here—is it a Micromamba bug? are we using it wrong?—but I imagine regardless of what's happening, we might still want to change nextstrain setup conda to use the same logic for figuring out the latest version as nextstrain update does rather than leaving it to Micromamba.
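A minimal sketch of resolving the latest version ourselves from the channel index instead of leaving the choice to the solver. It works on an already-parsed repodata.json and assumes that nextstrain-base versions are UTC timestamps that sort correctly as plain strings. `latest_version` is a hypothetical helper for illustration, not the actual logic in Nextstrain CLI, and the toy index keys are placeholders.

```python
def latest_version(repodata: dict, name: str) -> str:
    """Return the highest version of `name` listed in a parsed repodata.json.

    Assumes versions are timestamps like 20230830T164409Z, which compare
    correctly as strings (an assumption, not the CLI's real logic).
    """
    versions = [
        entry["version"]
        for section in ("packages", "packages.conda")
        for entry in repodata.get(section, {}).values()
        if entry["name"] == name
    ]
    if not versions:
        raise LookupError(f"no builds of {name} in index")
    return max(versions)


# Toy index holding only the fields used above; keys are placeholders.
index = {
    "packages": {
        "build-a": {"name": "nextstrain-base", "version": "20230717T174555Z"},
        "build-b": {"name": "nextstrain-base", "version": "20230830T164409Z"},
        "build-c": {"name": "nextstrain-base", "version": "20230731T212806Z"},
    }
}

# Pin the resolved version explicitly in the spec handed to the solver.
spec = f"nextstrain-base =={latest_version(index, 'nextstrain-base')}"
```

Pinning this way makes a solver conflict surface as an error instead of a silent fallback to an older version.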

victorlin commented 1 year ago

Hmm, that's helpful info. It doesn't explain the behavior in the ncov run though? That one resolved to nextstrain-base 20230830T164409Z during nextstrain setup conda.

tsibley commented 1 year ago

Yeah. Weird.

joverlee521 commented 1 year ago

Seeing the same issue in the seasonal-flu CI now, so no longer just limited to this repo.

tsibley commented 1 year ago

I can reproduce this locally even with the standalone install of Nextstrain CLI by setting up a new Conda runtime from scratch, which makes sense given we think this is a Micromamba issue.

joverlee521 commented 1 year ago

Quick fix for the CI while we figure out the underlying issue.

tsibley commented 1 year ago

Thanks for the hot fix for CI!

I started digging into what's going on inside Micromamba by doing roughly this:

$ cd $(mktemp -dt)
$ export NEXTSTRAIN_HOME=$PWD
$ nextstrain debugger
(Pdb) interact
>>> from nextstrain.cli.runner.conda import micromamba, setup_micromamba
>>> setup_micromamba()
>>> micromamba("create", "-vvv", "--dry-run", "nextstrain-base")
…

I checked whether the package index it's using, https://conda.anaconda.org/nextstrain/linux-64/repodata.json, contains the latest package version. It does. Then I diffed the two index entries to see if anything stood out, but nothing did.

Next to dig into the actual solver logs.

tsibley commented 1 year ago

The solver starts by considering the latest version, 20230830T164409Z. It finds a conflict when solving deps between the suitesparse 5.10.1 and metis 5.1.1 packages, even though it should be fine to just install the exact versions listed in the nextstrain-base spec. The solver resolves that conflict by ruling out 20230830T164409Z and repeating the process with the next-highest version, all the way down the line, until it gets to 20230717T174555Z, which is the latest version with metis 5.1.0.

Since nextstrain-base is the only package not fully-constrained by a version and build in this solving operation, it's likely the only flexibility the solver has to address dep resolution conflicts.
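The fallback described above can be sketched as a toy model. This is not how libsolv works internally (it is a SAT-style solver, not a linear walk), and the metis bounds passed in are placeholders chosen for illustration. The pin for the middle version is an assumption; the thread only states the pins for the newest version and for 20230717T174555Z.

```python
# Toy model: the only free choice is which nextstrain-base version to
# install, so the search walks down from the newest version until the
# pinned metis satisfies whatever range suitesparse allows.

# metis version pinned by each nextstrain-base version (illustrative data).
PINNED_METIS = {
    "20230830T164409Z": (5, 1, 1),
    "20230731T212806Z": (5, 1, 1),  # assumed, not stated in the thread
    "20230717T174555Z": (5, 1, 0),
}


def newest_installable(pins, lower, upper):
    """Return the newest version whose metis pin lies in [lower, upper)."""
    for version in sorted(pins, reverse=True):  # timestamps sort as strings
        if lower <= pins[version] < upper:
            return version
    return None


# A bound that only admits metis 5.1.0 rules out the newer versions.
strict = newest_installable(PINNED_METIS, (5, 1, 0), (5, 1, 1))
# A looser bound (< 5.2.0) lets the newest version through.
loose = newest_installable(PINNED_METIS, (5, 1, 0), (5, 2, 0))
print(strict, loose)
```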

I'm guessing that some difference in the solver or resolution algorithm between the conda-base builds and this version of Micromamba is causing the former to produce something the latter thinks is in conflict. Since they should use broadly the same solver/algo (libmamba → libsolv), this would imply that using a newer Micromamba version might fix this.

But also, I expect pinning the nextstrain-base version on setup would also do the trick, and is more explicitly what we want anyway. Not doing it was kind of an oversight on my part in the https://github.com/nextstrain/cli/pull/280 work. (Very understandable oversight though!)

tsibley commented 1 year ago

But also, I expect pinning the nextstrain-base version on setup would also do the trick, and is more explicitly what we want anyway.

…but actually is not sufficient on its own:

>>> micromamba("create", "-vvv", "--dry-run", "nextstrain-base ==20230830T164409Z hb0f4dca_0_locked")
…
    Encountered problems while solving:
      - package nextstrain-base-20230830T164409Z-hb0f4dca_0_locked requires suitesparse ==5.10.1 h9e50725_1, but none of the providers can be installed

    The environment can't be solved, aborting the operation

…
```
info libsolv number of solvables: 642231, memory used: 35122 K
info libsolv number of ids: 234927 + 362461
info libsolv string memory used: 917 K array + 3573 K data, rel memory used: 4247 K array
info libsolv string hash memory: 2048 K, rel hash memory : 4096 K
info libsolv provide ids: 31552
info libsolv provide space needed: 673785 + 724922
info libsolv shrunk whatprovidesdata from 673785 to 673785
info libsolv shrunk whatprovidesauxdata from 673785 to 642230
info libsolv whatprovides memory used: 2337 K id array, 5463 K data
info libsolv whatprovidesaux memory used: 917 K id array, 2508 K data
info libsolv createwhatprovides took 32 ms
info libmamba Parsing MatchSpec nextstrain-base ==20230830T164409Z hb0f4dca_0_locked
info libmamba Parsing MatchSpec nextstrain-base ==20230830T164409Z hb0f4dca_0_locked
info libmamba Adding job: nextstrain-base ==20230830T164409Z hb0f4dca_0_locked
info libsolv solver started
info libsolv dosplitprovides=0, noupdateprovide=0, noinfarchcheck=0
info libsolv allowuninstall=1, allowdowngrade=1, allownamechange=1, allowarchchange=0, allowvendorchange=0
info libsolv dupallowdowngrade=1, dupallownamechange=1, dupallowarchchange=1, dupallowvendorchange=1
info libsolv promoteepoch=0, forbidselfconflicts=0
info libsolv obsoleteusesprovides=0, implicitobsoleteusesprovides=0, obsoleteusescolors=0, implicitobsoleteusescolors=0
info libsolv dontinstallrecommended=0, addalreadyrecommended=0 onlynamespacerecommended=0
info libsolv obsoletes data: 1 entries
info libsolv added 0 pkg rules for installed solvables
info libsolv added 0 pkg rules for updaters of installed solvables
info libsolv added 6019942 pkg rules for packages involved in a job
info libsolv added 0 pkg rules because of weak dependencies
info libsolv 28943 of 642230 installable solvables considered for solving
info libsolv pruned rules from 6019943 to 6008658
info libsolv binary: 5881377
info libsolv normal: 127280, 9351992 literals
info libsolv pkg rule memory used: 140827 K
info libsolv pkg rule creation took 3585 ms
info libsolv job: install providing nextstrain-base ==20230830T164409Z hb0f4dca_0_locked
info libsolv - job Rule #6008666:
info libsolv nextstrain-base-20230830T164409Z-hb0f4dca_0_locked [5] (w1)
info libsolv next rules: 0 0
info libsolv choice rule creation took 3526 ms
info libsolv 6008657 pkg rules, 2 * 4 update rules, 1 job rules, 0 infarch rules, 0 dup rules, 0 choice rules, 0 best rules, 0 yumobs rules
info libsolv 0 black rules, 0 recommends rules, 104 repo priority rules
info libsolv overall rule memory used: 140830 K
info libsolv solving...
info libsolv ANALYZE UNSOLVABLE ----------------------
info libsolv Rule #3297886:
info libsolv !suitesparse-5.10.1-h9e50725_1 [281208] Install.level1
info libsolv metis-5.1.0-0 [141705] (w2) Conflict.level1
info libsolv metis-5.1.0-1 [141706] (w1) Conflict.level1
info libsolv metis-5.1.0-2 [141707] Conflict.level1
info libsolv metis-5.1.0-3 [141708] Conflict.level1
info libsolv metis-5.1.0-h470a237_3 [141709] Conflict.level1
info libsolv metis-5.1.0-h58526e2_1006 [141710] Conflict.level1
info libsolv metis-5.1.0-he1b5a44_1004 [141711] Conflict.level1
info libsolv metis-5.1.0-he1b5a44_1005 [141712] Conflict.level1
info libsolv metis-5.1.0-he1b5a44_1006 [141713] Conflict.level1
info libsolv metis-5.1.0-hf484d3e_1003 [141714] Conflict.level1
info libsolv metis-5.1.0-hfc679d8_3 [141715] Conflict.level1
info libsolv metis-5.1.0-h59595ed_1007 [349083] Conflict.level1
info libsolv next rules: 0 3297913
info libsolv Rule #2908664:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-0 [141705] (w2) Conflict.level1
info libsolv next rules: 0 2908677
info libsolv Rule #2908663:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-1 [141706] (w2) Conflict.level1
info libsolv next rules: 2908664 2908676
info libsolv Rule #2908662:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-2 [141707] (w2) Conflict.level1
info libsolv next rules: 2908663 2908675
info libsolv Rule #2908661:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-3 [141708] (w2) Conflict.level1
info libsolv next rules: 2908662 2908674
info libsolv Rule #2908660:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-h470a237_3 [141709] (w2) Conflict.level1
info libsolv next rules: 2908661 2908673
info libsolv Rule #2908659:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-h58526e2_1006 [141710] (w2) Conflict.level1
info libsolv next rules: 2908660 2908672
info libsolv Rule #2908658:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-he1b5a44_1004 [141711] (w2) Conflict.level1
info libsolv next rules: 2908659 2908671
info libsolv Rule #2908657:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-he1b5a44_1005 [141712] (w2) Conflict.level1
info libsolv next rules: 2908658 2908670
info libsolv Rule #2908656:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-he1b5a44_1006 [141713] (w2) Conflict.level1
info libsolv next rules: 2908657 2908669
info libsolv Rule #2908655:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-hf484d3e_1003 [141714] (w2) Conflict.level1
info libsolv next rules: 2908656 2908668
info libsolv Rule #2908654:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-hfc679d8_3 [141715] (w2) Conflict.level1
info libsolv next rules: 2908655 2908667
info libsolv Rule #2908653:
info libsolv !metis-5.1.1-h59595ed_1 [349085] (w1) Install.level1
info libsolv !metis-5.1.0-h59595ed_1007 [349083] (w2) Conflict.level1
info libsolv next rules: 2908654 2908666
info libsolv Rule #6008481:
info libsolv !nextstrain-base-20230830T164409Z-hb0f4dca_0_locked [5] (w1) Install.level1
info libsolv metis-5.1.1-h59595ed_1 [349085] (w2) Install.level1
info libsolv next rules: 6008482 0
info libsolv Rule #6008406:
info libsolv !nextstrain-base-20230830T164409Z-hb0f4dca_0_locked [5] (w1) Install.level1
info libsolv suitesparse-5.10.1-h9e50725_1 [281208] (w2) Install.level1
info libsolv next rules: 6008407 0
info libsolv JOB Rule #6008666:
info libsolv nextstrain-base-20230830T164409Z-hb0f4dca_0_locked [5] (w1) Install.level1
info libsolv next rules: 0 0
info libsolv enabledisablelearntrules called
info libsolv resolving job rules
info libsolv resolving installed packages
info libsolv deciding unresolved rules
info libsolv installing recommended packages
info libsolv deciding orphaned packages
info libsolv solver statistics: 0 learned rules, 1 unsolvable, 0 minimization steps
info libsolv done solving.
info libsolv solver took 28 ms
info libsolv final solver statistics: 1 problems, 0 learned rules, 1 unsolvable
info libsolv solver_solve took 7192 ms
info libmamba Problem count: 1
error libmamba Could not solve for environment specs
    Encountered problems while solving:
      - package nextstrain-base-20230830T164409Z-hb0f4dca_0_locked requires suitesparse ==5.10.1 h9e50725_1, but none of the providers can be installed

    The environment can't be solved, aborting the operation
```

tsibley commented 1 year ago

Ok, my reading of the libsolv details in the previous comment and double checking the two suitesparse 5.10.1 packages available on conda-forge has me thinking that boa (used in conda-base builds) is producing a bad solve for the versions of suitesparse and metis. Micromamba seems correct here. (But then again, it also does the upgrade to the latest conda-base just fine?? I'm still confused by that.)

I upgraded Micromamba to 1.5.0 (latest version), and it still doesn't like the latest package, but at least it has a better error message:

error    libmamba Could not solve for environment specs
    The following package could not be installed
    └─ nextstrain-base ==20230830T164409Z hb0f4dca_0_locked is not installable because it requires
       ├─ metis ==5.1.1 h59595ed_1, which can be installed;
       └─ suitesparse ==5.10.1 h9e50725_1, which requires
          └─ metis >=5.1.0,<5.1.1.0a0 , which conflicts with any installable versions previously reported.

This matches my reading of libsolv above.
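To make the conflict concrete, here is a rough check of metis versions against the two bounds that show up in this thread. It is a simplification of Conda's version ordering, not a reimplementation: a trailing pre-release tag such as `a0` on a bound is dropped, which gives the right answer only when the candidate version is a plain release like 5.1.1. `parse` and `satisfies` are hypothetical helpers.

```python
import re


def parse(version: str) -> tuple:
    """Parse a dotted version into a 4-tuple of ints.

    Drops a trailing pre-release tag (e.g. 'a0') and pads with zeros, so
    '5.1.1.0a0' and '5.1.1' both become (5, 1, 1, 0). Simplification only.
    """
    numeric = re.sub(r"[a-z]+\d*$", "", version)
    parts = [int(p) for p in numeric.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


def satisfies(version: str, spec: str) -> bool:
    """Check a release version against a spec like '>=5.1.0,<5.2.0a0'."""
    v = parse(version)
    for clause in spec.split(","):
        if clause.startswith(">="):
            ok = v >= parse(clause[2:])
        elif clause.startswith("<"):
            ok = v < parse(clause[1:])
        else:
            raise ValueError(f"unsupported clause: {clause}")
        if not ok:
            return False
    return True


# Bound reported by the solver: metis 5.1.1 is excluded.
print(satisfies("5.1.1", ">=5.1.0,<5.1.1.0a0"))
# A looser bound of < 5.2.0a0 would admit it.
print(satisfies("5.1.1", ">=5.1.0,<5.2.0a0"))
```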

tsibley commented 1 year ago

Still very confused how "install old, update to latest" works and how other CI jobs installed the latest just fine (e.g.). This feels like something changing at a distance.

tsibley commented 1 year ago

I'd thought maybe Nextstrain CLI 7.2.0's relatively recent upgrade of Micromamba 1.0.0 → 1.1.0 might have been implicated, but 1.0.0 exhibits the same issues locally and besides, 7.2.0 was released 2 weeks ago, well before recent CI jobs like the one linked above passed.

joverlee521 commented 1 year ago

This feels like something changing at a distance.

Looks like new builds of metis 5.1.0 and 5.1.1 were released a couple days ago, maybe some changes in dependencies there?

Edit: Oh wait, I see. It is using the latest metis build but still able to install suitesparse. Huh...

tsibley commented 1 year ago

I think I have it figured out. Don't think it's our fault. Let me confirm.

tsibley commented 1 year ago

Anaconda appears to have incorrect indexing of (both builds of) suitesparse 5.10.1.

Compare the metadata for the distribution (used post-install to solve deps for subsequent installs):

$ curl https://api.anaconda.org/release/conda-forge/suitesparse/5.10.1 | jq '
>   .distributions | map(
>       select(.attrs.subdir == "linux-64")
>     | [
>       .full_name,
>       (.attrs.depends | map(select(startswith("metis "))))
>     ]
>   )
> '
[
  [
    "conda-forge/suitesparse/5.10.1/linux-64/suitesparse-5.10.1-h9e50725_1.tar.bz2",
    [
      "metis >=5.1.0,<5.2.0a0"
    ]
  ],
  [
    "conda-forge/suitesparse/5.10.1/linux-64/suitesparse-5.10.1-hd8046ac_0.tar.bz2",
    [
      "metis >=5.1.0,<5.2.0a0"
    ]
  ]
]

with the metadata in the channel index (used pre-install to solve deps):

$ curl https://conda.anaconda.org/conda-forge/linux-64/repodata.json.zst | zstdcat | jq '
>   .packages | to_entries | map(
>       select(.key | contains("suitesparse-5.10.1"))
>     | [
>       .key,
>       (.value.depends | map(select(startswith("metis "))))
>     ]
>   )
> '
[
  [
    "suitesparse-5.10.1-h9e50725_1.tar.bz2",
    [
      "metis >=5.1.0,<5.1.1.0a0"
    ]
  ],
  [
    "suitesparse-5.10.1-hd8046ac_0.tar.bz2",
    [
      "metis >=5.1.0,<5.1.1.0a0"
    ]
  ]
]

This is why initial install of nextstrain-base ==20230830T164409Z fails but upgrade to that same version succeeds: the former uses the channel index metadata for suitesparse, the latter the locally installed distribution metadata (e.g. ${prefix}/conda-meta/suitesparse-5.10.1-h9e50725_1.json).

I confirmed that the distribution metadata API is indeed returning the metadata from the actual distribution:

$ curl https://conda.anaconda.org/conda-forge/linux-64/suitesparse-5.10.1-h9e50725_1.tar.bz2 \
> | tar -xjO info/index.json \
> | jq '.depends | map(select(startswith("metis ")))'
[
  "metis >=5.1.0,<5.2.0a0"
]

$ curl https://conda.anaconda.org/conda-forge/linux-64/suitesparse-5.10.1-h9e50725_1.tar.bz2 \
> | tar -xjO info/recipe/meta.yaml \
> | yq '.requirements.run | map(select(startswith("metis ")))'
[
  "metis >=5.1.0,<5.2.0a0",
  "metis >=5.1.0,<5.2.0a0"
]

and that it's all the same as what's in the local install:

$ jq '.depends | map(select(startswith("metis ")))' "$NEXTSTRAIN_HOME"/runtimes/conda/env/conda-meta/suitesparse-5.10.1-h9e50725_1.json
[
  "metis >=5.1.0,<5.2.0a0"
]

In short:

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. —Leon Bambrick, via Martin Fowler

and we're hitting caching issues. (An index is a cache.)
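The comparison above can be sketched as a small drift check between the two metadata sources. The inputs are the `depends` lists quoted in this comment; `dependency_drift` is a hypothetical helper and the fetching and parsing steps are left out.

```python
def dependency_drift(index_depends, dist_depends, name):
    """Return (index, distribution) constraints on `name` if they differ,
    else None. Both arguments are lists of strings like 'metis >=5.1.0'."""
    def pick(depends):
        return [d for d in depends if d.startswith(name + " ")]

    from_index, from_dist = pick(index_depends), pick(dist_depends)
    return None if from_index == from_dist else (from_index, from_dist)


# Values quoted above for suitesparse-5.10.1-h9e50725_1.
index_depends = ["metis >=5.1.0,<5.1.1.0a0"]  # channel index (repodata.json)
dist_depends = ["metis >=5.1.0,<5.2.0a0"]     # distribution (info/index.json)

drift = dependency_drift(index_depends, dist_depends, "metis")
print(drift)
```

A non-None result means a fresh install (solved against the index) and an upgrade (solved against installed metadata) can reach different conclusions about the same package.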

tsibley commented 1 year ago

It gets messier: the difference in the index vs. distribution metadata is not accidental, but intentional.

I went to read about channel indexing and noticed this step (emphasis mine):

For each subdir:

  1. Look at all the packages that exist in the subdir.
  2. Generate a list of packages to add/update/remove.
  3. Remove all packages that need to be removed.
  4. For all packages that need to be added/updated:
    • Extract the package to access metadata, including full package name, file modification time (mtime), size, and index.json.
      • Aggregate package metadata to repodata collection.
  5. Apply repodata hotfixes (patches).
  6. Compute and save the reduced current_index.json index.

That raised my eyebrows. So I read further about repodata patching, which mentioned how conda-forge applies repodata patches using https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/.

Any sign of suitesparse or metis in there? Oh, you bet!

https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/commit/2a2c288c1cfa2e73873e04478ba16576a762e829

Committed just a couple days ago. So this is intentional, to fix an actual ABI breakage, but it has the side-effect of breaking a previously-working combination of packages. This unfortunate risk is noted by the Conda docs linked to above:

Hotfixing is tricky, as it has the potential to break environments that have worked, but it is also sometimes necessary to fix environments that are known not to work.

I think if we rebuild conda-base again, now that the hotfixing is in place, we'll be ok for new installs again. Will confirm that next.
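As a rough illustration of step 5 in the indexing procedure quoted above, the sketch below overrides fields of a package's index entry from a patch mapping. It is a simplified model under stated assumptions: the patch is a mapping from package filename to fields to override, and `apply_patch` is a hypothetical helper. Real hotfixes may do more than override fields.

```python
import copy


def apply_patch(repodata: dict, patch: dict) -> dict:
    """Return a copy of `repodata` with per-package fields overridden.

    `patch` is assumed to look like {"packages": {filename: {field: value}}}.
    The distribution files themselves are never touched, only the index.
    """
    patched = copy.deepcopy(repodata)
    for filename, overrides in patch.get("packages", {}).items():
        if filename in patched["packages"]:
            patched["packages"][filename].update(overrides)
    return patched


# Index entry as extracted from the distribution's own metadata.
repodata = {
    "packages": {
        "suitesparse-5.10.1-h9e50725_1.tar.bz2": {
            "depends": ["metis >=5.1.0,<5.2.0a0"],
        }
    }
}
# Hotfix tightening the metis bound, as seen in the channel index.
patch = {
    "packages": {
        "suitesparse-5.10.1-h9e50725_1.tar.bz2": {
            "depends": ["metis >=5.1.0,<5.1.1.0a0"],
        }
    }
}

patched = apply_patch(repodata, patch)
```

Because only the index changes, an environment that already has the package installed keeps seeing the original, looser constraint.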

tsibley commented 1 year ago

That conda-forge repodata patches change was merged 30 Aug at about 10:19 US/Pacific. To take effect it then would have to be built, uploaded, and finally used by Anaconda during index update.

Our latest nextstrain-base version (20230830T164409Z) started building at 9:44 and finished by around 9:55, so wouldn't have seen the new hotfix patch to the repodata. orz
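The timing can be checked from the version string itself, assuming it encodes the UTC build start and that US/Pacific was on daylight time (UTC-7) on that date. The merge time is the approximate one given above.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for US/Pacific daylight time (assumed for late August).
PDT = timezone(timedelta(hours=-7), "PDT")


def build_start(version: str) -> datetime:
    """Interpret a version like 20230830T164409Z as a UTC time, in PDT."""
    utc = datetime.strptime(version, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(PDT)


start = build_start("20230830T164409Z")
merged = datetime(2023, 8, 30, 10, 19, tzinfo=PDT)  # approximate merge time

print(start.strftime("%H:%M"), start < merged)  # prints "09:44 True"
```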

tsibley commented 1 year ago

Rebuild is looking promising already.

tsibley commented 1 year ago

So assuming that rebuild mostly* resolves the issue for now, how do we avoid similar issues in the future?

One way might be having scheduled CI in conda-base that regularly tests if the latest package version is still initially installable (similar to how Nextstrain CLI regularly tests if its standalone installers still work, since they're also dependent on external resources). If that test breaks, we get an early warning to see what's up. If we're really fancy, we could potentially even try to detect certain kinds of breakages like this one and automatically remediate them by kicking off another package build.

* nextstrain-base versions between (20230717T174555Z, 20230830T164409Z] are still forever broken for initial installs, but could be upgraded to.
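A minimal sketch of what such a scheduled check could run. It wraps the same kind of `micromamba create --dry-run` call used earlier in this thread. The channel list and environment name are assumptions, as is relying on a non-zero exit status to signal an unsolvable spec. `dry_run_command` and `is_installable` are hypothetical helpers.

```python
import subprocess


def dry_run_command(spec, channels=("nextstrain", "conda-forge", "bioconda")):
    """Build a micromamba dry-run command for `spec`.

    The channels and the environment name are assumptions for illustration.
    """
    cmd = ["micromamba", "create", "--dry-run", "--yes",
           "--name", "installability-check"]
    for channel in channels:
        cmd += ["--channel", channel]
    return cmd + [spec]


def is_installable(spec, run=subprocess.run):
    """Return True if a fresh solve of `spec` succeeds.

    `run` is injectable so the check can be exercised without micromamba.
    """
    result = run(dry_run_command(spec), capture_output=True)
    return result.returncode == 0
```

A scheduled job could call `is_installable` with the explicitly pinned latest version and fail loudly when it returns False, so a silent fallback to an older version cannot mask the breakage.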

tsibley commented 1 year ago

This looks resolved by the just-released nextstrain-base 20230901T214523Z.

tsibley commented 1 year ago

Closing this as this reported issue is resolved. We'd maybe like to do more to prevent it from happening in the future, but I opened a conda-base issue for that: https://github.com/nextstrain/conda-base/issues/41

victorlin commented 1 year ago

https://github.com/nextstrain/monkeypox/issues/177#issuecomment-1701497622: we might still want to change nextstrain setup conda to use the same logic for figuring out the latest version as nextstrain update does rather than leaving it to Micromamba.

https://github.com/nextstrain/cli/issues/318