joverlee521 opened 3 weeks ago
It may be time to revisit https://github.com/nextstrain/ncov-ingest/issues/240
My comment from a related Slack thread:
In the absence of benchmark files, I'm just scanning the Snakemake log files from the workflows for some general timings:
- ~1.5h - downloading data from GISAID/S3
- ~4h - transform-gisaid
- ~1h - filter fasta for new sequences to run through Nextclade
- ~0.5h - joining metadata + nextclade

= ~7h of data munging - this is about the same with/without new data
The rest of the workflow is just uploading files to S3!!
Without new data, it still takes ~2h to generate the hash for sequences.fasta to check against S3. With new data, it takes ~4h to upload sequences.fasta to S3.
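The change-detection step above boils down to hashing the local file and comparing against a stored digest. A minimal sketch of that idea, assuming sha256 and illustrative file names (the real workflow's hash scheme may differ):

```shell
# Create a tiny stand-in FASTA; the real file is multiple GB.
printf '>seq1\nACGT\n' > sequences.fasta

# First run: store the digest alongside the data.
sha256sum sequences.fasta | cut -d' ' -f1 > sequences.fasta.sha256

# Later run: recompute and compare before uploading.
if sha256sum sequences.fasta | cut -d' ' -f1 | cmp -s - sequences.fasta.sha256; then
  echo "unchanged: skip upload"
fi
```

Note the hash itself still requires a full pass over the file, which is why this step alone takes ~2h on the real data.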
Ah, this also doesn't account for the full run that gets triggered when a new Nextclade dataset is released.
The last full run, on April 16, 2024, ran for ~15h.
Bumping the CPUs in #447 decreased the build time by ~1h, which came from parallelizing the download of data at the beginning of the workflow.
We will still run over the 12h limit for full Nextclade runs, so I'm going to work on https://github.com/nextstrain/ingest/issues/41
Thanks @joverlee521 for the summary!
How hard is it to parallelize `~4h - transform-gisaid`?
As this operates on ndjson lines, it might be parallelizable, or at least some part of it.
The obvious way to do so would be to have a split rule to divide input files into N chunks, run transform, and merge back.
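The split/transform/merge idea can be sketched as a shell pipeline, assuming the transform is line-independent as noted above. The `sed` command here is a trivial stand-in for the real transform-gisaid step, and all file names and N are illustrative:

```shell
# Stand-in NDJSON input; the real input is the full GISAID dump.
printf '{"id":1}\n{"id":2}\n{"id":3}\n{"id":4}\n' > input.ndjson

N=2
# GNU split: divide into N chunks on line boundaries (chunk.00, chunk.01, ...).
split --number="l/$N" --numeric-suffixes input.ndjson chunk.

# Transform each chunk in parallel; sed is a placeholder for transform-gisaid.
for f in chunk.??; do
  sed 's/}$/,"transformed":true}/' "$f" > "$f.out" &
done
wait

# Merge the transformed chunks back into one NDJSON file.
cat chunk.??.out > output.ndjson
```

This only works if no per-record logic depends on seeing the whole file (e.g. deduplication across records), which would have to be pulled out into a separate pass.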
> How hard is it to parallelize ~4h - transform-gisaid?
@corneliusroemer I honestly have no idea...I made https://github.com/nextstrain/ncov-ingest/issues/448 to track this separately.
Speeding up upload-to-s3 is not as straightforward as initially thought...
For now, I'm sidestepping the issue by creating a nextstrain-ncov-ingest IAM user and adding its credentials to the repo secrets, so the workflow is able to run without any time limits. I've added a post clean-up list above to remove those credentials and delete the user once we've resolved this issue.
Context
Our automated workflows use short-lived AWS credentials for sessions, which are limited by max_session_duration to 12 hours, the maximum allowed by AWS.
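For illustration, this is what requesting a role session at that ceiling looks like with the AWS CLI; the role ARN and session name are placeholders, not the actual values used by the workflow:

```shell
# Request temporary credentials at the 12h maximum; any larger value is
# rejected by AWS regardless of the role's configuration.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/example-ingest-role \
  --role-session-name ncov-ingest \
  --duration-seconds 43200   # 12h * 3600s
```

Once the session expires mid-run, subsequent S3 operations fail, which is what the errors below show.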
The GISAID workflow hit this max yesterday and ran into errors:
TODOs
Post clean-up