joverlee521 opened 3 weeks ago
It may be time to revisit https://github.com/nextstrain/ncov-ingest/issues/240
My comment from a related Slack thread:
In the absence of benchmark files, I'm just scanning the Snakemake log files from the workflows for some general timings:
- ~1.5h - downloading data from GISAID/S3
- ~4h - transform-gisaid
- ~1h - filter fasta for new sequences to run through Nextclade
- ~0.5h - joining metadata + nextclade

= ~7h of data munging - this is about the same with/without new data
The rest of the workflow is just uploading files to S3!!
Without new data, it still takes ~2h to generate the hash for sequences.fasta to check against S3. With new data, it takes ~4h to upload sequences.fasta to S3.
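The change-detection step above boils down to hashing the local file and comparing against a stored digest. A minimal sketch of that idea, assuming sha256 and illustrative file names (the real workflow's hash scheme may differ):

```shell
# Create a tiny stand-in FASTA; the real file is multiple GB.
printf '>seq1\nACGT\n' > sequences.fasta

# First run: store the digest alongside the data.
sha256sum sequences.fasta | cut -d' ' -f1 > sequences.fasta.sha256

# Later run: recompute and compare before uploading.
if sha256sum sequences.fasta | cut -d' ' -f1 | cmp -s - sequences.fasta.sha256; then
  echo "unchanged: skip upload"
fi
```

Note the hash itself still requires a full pass over the file, which is why this step alone takes ~2h on the real data.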
Ah, this also doesn't account for the full run that gets triggered when a new Nextclade dataset is released.
The last full run, on April 16, 2024, ran for ~15h.
Bumping the CPUs in #447 decreased the build time by ~1h, which came from parallelizing the download of data at the beginning of the workflow.
We will still run over the 12h limit for full Nextclade runs, so I'm going to work on https://github.com/nextstrain/ingest/issues/41
Thanks @joverlee521 for the summary!
How hard is it to parallelize `~4h - transform-gisaid`?
As this operates on ndjson lines, it might be parallelizable, or at least some part of it.
The obvious way to do so would be to have a split rule to divide input files into N chunks, run transform, and merge back.
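The split/transform/merge idea can be sketched as a shell pipeline, assuming the transform is line-independent as noted above. The `sed` command here is a trivial stand-in for the real transform-gisaid step, and all file names and N are illustrative:

```shell
# Stand-in NDJSON input; the real input is the full GISAID dump.
printf '{"id":1}\n{"id":2}\n{"id":3}\n{"id":4}\n' > input.ndjson

N=2
# GNU split: divide into N chunks on line boundaries (chunk.00, chunk.01, ...).
split --number="l/$N" --numeric-suffixes input.ndjson chunk.

# Transform each chunk in parallel; sed is a placeholder for transform-gisaid.
for f in chunk.??; do
  sed 's/}$/,"transformed":true}/' "$f" > "$f.out" &
done
wait

# Merge the transformed chunks back into one NDJSON file.
cat chunk.??.out > output.ndjson
```

This only works if no per-record logic depends on seeing the whole file (e.g. deduplication across records), which would have to be pulled out into a separate pass.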
> How hard is it to parallelize ~4h - transform-gisaid?
@corneliusroemer I honestly have no idea...I made https://github.com/nextstrain/ncov-ingest/issues/448 to track this separately.
Speeding up upload-to-s3 is not as straightforward as initially thought...
For now, I'm sidestepping the issue by creating a nextstrain-ncov-ingest IAM user and adding its credentials to the repo secrets, so the workflow is able to run without any time limits. I've added a post clean-up list above to remove those credentials and delete the user once we've resolved this issue.
Context
Our automated workflows use short-lived AWS credentials for sessions, which are limited by max_session_duration to 12 hours, the maximum allowed by AWS.
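For illustration, this is what requesting a role session at that ceiling looks like with the AWS CLI; the role ARN and session name are placeholders, not the actual values used by the workflow:

```shell
# Request temporary credentials at the 12h maximum; any larger value is
# rejected by AWS regardless of the role's configuration.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/example-ingest-role \
  --role-session-name ncov-ingest \
  --duration-seconds 43200   # 12h * 3600s
```

Once the session expires mid-run, subsequent S3 operations fail, which is what the errors below show.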
The GISAID workflow hit this max yesterday and ran into errors:
TODOs
Post clean-up