Seq counts workflow

joverlee521 commented 1 year ago

Description of proposed changes

Convert the sequence counts workflow from GitHub Action steps to a Snakemake workflow, which makes it easier to run the workflow locally and to use the pathogen-repo-build workflow to launch the job on AWS Batch. This fixes the GISAID workflow, which has been failing with out-of-disk-space errors since August 21, 2023.

Keeping the GH Action attached to the AWS Batch job, since the jobs should only take a couple of minutes and we need to trigger the downstream model jobs when the job completes.

Checklist

[x] Checks pass
[x] GISAID test run
[x] Open test run

joverlee521 commented 1 year ago

There's a lot of cruft to remove since I vendored the shared ingest repo, but I'll do that in a separate PR so that this PR is not blocked by that cleanup work.

joverlee521 commented 1 year ago

After both test runs completed, I did a comparison of the sequence counts files by checking their hashes and diffing their contents.

All of the open data files had equal hashes. The GISAID files currently in production were created before the latest metadata file was uploaded with the latest data dump from GISAID, so their hashes differed. The diff of the production and trial files showed an increase in a subset of sequence counts and new rows of data, which is expected with the new data dump.
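The comparison script itself is not shown in this thread. As a rough illustration only, the hash-then-diff check described above could be sketched in Python as follows, assuming both files have already been downloaded locally. The function names (`sha256_of`, `read_lines`, `compare_counts`) are hypothetical and are not taken from the repository.

```python
import difflib
import gzip
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw (still compressed) bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_lines(path: Path) -> list[str]:
    """Decompress a gzipped text file and return its lines, newlines kept."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.readlines()


def compare_counts(production: Path, trial: Path) -> list[str]:
    """Return [] when the hashes match, otherwise the unified diff lines.

    Hypothetical helper: the diff is taken over the decompressed contents,
    so two files that differ only in gzip header metadata also yield [].
    """
    if sha256_of(production) == sha256_of(trial):
        return []
    return list(
        difflib.unified_diff(
            read_lines(production),
            read_lines(trial),
            fromfile=str(production),
            tofile=str(trial),
            n=0,  # no context lines: show only changed or added rows
        )
    )
```

Calling `compare_counts(Path("production.tsv.gz"), Path("trial.tsv.gz"))` on a pair of local files would return an empty list for identical files. For files that differ, rows prefixed with `-` come from production and rows prefixed with `+` come from the trial run, so a raised count shows up as a `-`/`+` pair and a new row as a lone `+`.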

Detailed outputs

Output from when I ran the script:

```
Object: files/workflows/forecasts-ncov/open/nextstrain_clades/global.tsv.gz
Hashes equal
Object: files/workflows/forecasts-ncov/open/nextstrain_clades/usa.tsv.gz
Hashes equal
Object: files/workflows/forecasts-ncov/open/pango_lineages/global.tsv.gz
Hashes equal
Object: files/workflows/forecasts-ncov/open/pango_lineages/usa.tsv.gz
Hashes equal
Object: files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global.tsv.gz
Different hashes, downloading and diffing files
ingest/vendored/download-from-s3: line 23: trial-gisaid-nextstrain_clades-global.tsv: No such file or directory
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/trial/seq-counts-workflow/gisaid/nextstrain_clades/global.tsv.gz → trial-gisaid-nextstrain_clades-global.tsv
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global.tsv.gz → production-gisaid-nextstrain_clades-global.tsv
Object: files/workflows/forecasts-ncov/gisaid/nextstrain_clades/usa.tsv.gz
Different hashes, downloading and diffing files
ingest/vendored/download-from-s3: line 23: trial-gisaid-nextstrain_clades-usa.tsv: No such file or directory
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/trial/seq-counts-workflow/gisaid/nextstrain_clades/usa.tsv.gz → trial-gisaid-nextstrain_clades-usa.tsv
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/usa.tsv.gz → production-gisaid-nextstrain_clades-usa.tsv
Object: files/workflows/forecasts-ncov/gisaid/pango_lineages/global.tsv.gz
Different hashes, downloading and diffing files
ingest/vendored/download-from-s3: line 23: trial-gisaid-pango_lineages-global.tsv: No such file or directory
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/trial/seq-counts-workflow/gisaid/pango_lineages/global.tsv.gz → trial-gisaid-pango_lineages-global.tsv
ingest/vendored/download-from-s3: line 23: production-gisaid-pango_lineages-global.tsv: No such file or directory
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid/pango_lineages/global.tsv.gz → production-gisaid-pango_lineages-global.tsv
Object: files/workflows/forecasts-ncov/gisaid/pango_lineages/usa.tsv.gz
Different hashes, downloading and diffing files
ingest/vendored/download-from-s3: line 23: trial-gisaid-pango_lineages-usa.tsv: No such file or directory
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/trial/seq-counts-workflow/gisaid/pango_lineages/usa.tsv.gz → trial-gisaid-pango_lineages-usa.tsv
ingest/vendored/download-from-s3: line 23: production-gisaid-pango_lineages-usa.tsv: No such file or directory
[ INFO] Downloading s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid/pango_lineages/usa.tsv.gz → production-gisaid-pango_lineages-usa.tsv
```

Output diff files:

[gisaid-nextstrain_clades-global.tsv.diff.txt](https://github.com/nextstrain/forecasts-ncov/files/12529290/gisaid-nextstrain_clades-global.tsv.diff.txt)
[gisaid-nextstrain_clades-usa.tsv.diff.txt](https://github.com/nextstrain/forecasts-ncov/files/12529291/gisaid-nextstrain_clades-usa.tsv.diff.txt)
[gisaid-pango_lineages-global.tsv.diff.txt](https://github.com/nextstrain/forecasts-ncov/files/12529292/gisaid-pango_lineages-global.tsv.diff.txt)
[gisaid-pango_lineages-usa.tsv.diff.txt](https://github.com/nextstrain/forecasts-ncov/files/12529293/gisaid-pango_lineages-usa.tsv.diff.txt)

joverlee521 commented 1 year ago

Merging to use the updated workflow for today's sequence counts updates.

trvrb commented 1 year ago

Generally great to have this as snakemake to make it easier to run locally as well. Thank you Jover!

nextstrain / forecasts-ncov

Seq counts workflow #60
