Closed joverlee521 closed 1 year ago
There's a lot of cruft to remove since I vendored the shared ingest repo, but I'll do that separately so that this PR is not blocked by the cleanup work.
After both test runs completed, I did a comparison of the sequence counts files by checking their hashes and diffing their contents.
All of the open data files had equal hashes. The GISAID files currently in production were created before the latest metadata file was uploaded with the latest data dump from GISAID, so their hashes differed. The diff of the production and trial files showed an increase in a subset of sequence counts and new rows of data, which is expected with the new data dump.
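For anyone who wants to repeat this kind of check, here is a minimal sketch of a hash-then-diff comparison. It is not the exact commands I ran; the function names `sha256_of` and `compare_counts` and the file paths in the usage note are hypothetical, and it assumes the sequence counts files are plain text small enough to read into memory.

```python
import difflib
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_counts(production, trial):
    """Return (hashes_match, diff_lines) for two sequence counts files.

    The diff is only computed when the hashes differ, so identical
    files are confirmed without loading their contents as text.
    """
    if sha256_of(production) == sha256_of(trial):
        return True, []
    old = Path(production).read_text().splitlines()
    new = Path(trial).read_text().splitlines()
    diff = list(
        difflib.unified_diff(old, new, "production", "trial", lineterm="")
    )
    return False, diff
```

Calling `compare_counts("production/counts.tsv", "trial/counts.tsv")` (hypothetical paths) returns `(True, [])` for identical files. Otherwise it returns `False` plus unified diff lines, where `+` lines are rows that are new or changed in the trial output and `-` lines are the production rows they replace.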
Merging to use the updated workflow for today's sequence counts updates.
Generally great to have this as a Snakemake workflow, which makes it easier to run locally as well. Thank you Jover!
Description of proposed changes
Convert the sequence counts workflow from GitHub Action steps to a Snakemake workflow, which makes it easier to run the workflow locally and to use the pathogen-repo-build workflow to launch the job on AWS Batch. This fixes the GISAID workflow, which has been failing with out-of-disk-space errors since August 21, 2023.
Keeping the GH Action attached to the AWS Batch job, since the job should only take a couple of minutes and we need to trigger the downstream model jobs when it completes.
Checklist