nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License
35 stars 20 forks source link

Fix docker build #416

Closed victorlin closed 4 months ago

victorlin commented 11 months ago

Description of proposed changes

Fix the Docker build by switching away from Pipenv and removing some unused dependencies.

Related issue(s)

Fixes #413

Testing

corneliusroemer commented 4 months ago

Any progress here? Our CI is still broken 🙃

victorlin commented 4 months ago

@corneliusroemer not yet. I'll prioritize this for myself since I'm now realizing that ncov-ingest has been stuck using old versions of Nextclade and Augur. Since Nextclade 3 is now the default in the newer nextstrain/base images, we may have to bundle some Nextclade-related changes with this PR to at least use nextclade2 explicitly.

joverlee521 commented 4 months ago

@corneliusroemer not yet. I'll prioritize this for myself since I'm now realizing that ncov-ingest has been stuck using old versions of Nextclade and Augur. Since Nextclade 3 is now the default in the newer nextstrain/base images, we may have to bundle some Nextclade-related changes with this PR to at least use nextclade2 explicitly.

ncov-ingest currently downloads Nextclade executable as part of the pipeline, so it's not using Nextclade version from the nextstrain/base image. @corneliusroemer is already migrating this to Nextclade V3 in https://github.com/nextstrain/ncov-ingest/pull/435.

victorlin commented 4 months ago

Oh thanks for clarifying! Is it common to have a Snakemake rule for env setup? I would expect it to be in the Dockerfile - that used to be the case but it was removed in 2ec2a6b22ad9d99a236a5202b255af5af9134c09. My guess is that Docker caching made the "latest" download outdated. Either an exact version needs to be specified or cache should be invalidated for the "latest" download (this is what we do for nextstrain/base).

joverlee521 commented 4 months ago

Is it common to have a Snakemake rule for env setup?

I think this is left over from historical use of Nextclade before it was part of nextstrain/base. Previous conversation on this makes it seem like we can remove this from the Snakemake workflow itself and just use the Nextclade version available in the nextstrain/base image.

corneliusroemer commented 4 months ago

Yes, we can drop in the future if we have a new image, but right now it obviously doesn't work yet 😀

victorlin commented 4 months ago

I see it's on the task list for #375. Thanks for filling me in here! I'll still look into fixing the image build.

victorlin commented 4 months ago

I'm going to rebase this onto latest master, run GenBank fetch and ingest (on branch), then merge if everything succeeds.

~EDIT: GenBank fetch and ingest running: GitHub workflow, AWS Batch~

EDIT 2: Oops, I did the test wrong. Need to run the workflow on the PR image using aab1f5ff2787e161fabc3ecba809e8143c6e332c, not the PR source branch. Running: GitHub workflow, AWS Batch

joverlee521 commented 4 months ago

Ah, error comes from the removal of rm -rf ~/.cache, see details in https://github.com/nextstrain/ncov-ingest/commit/fd2b922c0b88d21c8245ab8c45797d93f2de4706.

Will make a follow-up PR shortly.

victorlin commented 4 months ago

Oh sorry! In aab1f5ff2787e161fabc3ecba809e8143c6e332c I changed fetch-and-ingest-genbank-master.yml instead of fetch-and-ingest-genbank-branch.yml 🤦

joverlee521 commented 4 months ago

No worries @victorlin, the set up of GH Action workflows in this repo is a bit outdated. I'm planning on doing updates to reduce potential confusion like this.