**Open** · joverlee521 opened 5 months ago
This is a great summary! Thanks @joverlee521
To avoid confusion, I'll use the following definitions, as per a recent Slack conversation.
Thinking about the ideal automated usage of a repo, I'd advocate for the following:
- Ingest (may require multiple invocations, e.g. ncov)
- phylo
Ideally, failure to produce one auspice dataset would not result in unnecessary failure of the others. For invocations which use wildcard targets, I think we could achieve this with `--keep-going` plus modified upload rules, but I haven't investigated. (This would make reporting on the status a bit more complicated.) For multiple invocations, achieving this independence is easier.
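A minimal sketch of how that `--keep-going` invocation might look as an Actions step, with made-up dataset names (`--keep-going` is Snakemake's flag for continuing with independent jobs after a failure):

```yaml
# Hypothetical workflow step; the auspice dataset paths are placeholders.
- name: Build all auspice datasets, tolerating independent failures
  run: |
    # Extra arguments to `nextstrain build` are passed through to snakemake,
    # so --keep-going lets unaffected datasets finish if one dataset fails.
    nextstrain build . --keep-going \
      auspice/pathogen_segment-a.json \
      auspice/pathogen_segment-b.json
```

The modified upload rules mentioned above would still be needed so that each finished dataset gets uploaded even when a sibling target fails.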
I don't know if I have strong opinions on whether we bundle multiple invocations up into AWS jobs or keep each invocation as its own separate AWS job, or how to handle the dependency graph, except to say that the simpler the solution the better!
My answers to your open questions would be:
¹ This could be different from an invocation of `nextstrain build`, because you can get that to run multiple snakemake invocations in the same AWS job.
I think I generally concur with @jameshadfield's comments above. One minor exception: I'd use the term *build* as we define it in our glossary, where it encompasses everything that goes into and results in an Auspice dataset JSON, not just the JSON itself.
@joverlee521 wrote:

> […] Coordinating ingest + phylo sounds a lot like a DAG so can we use Snakemake to help here?
There's definitely a dependency graph. But ISTM that expressing it in Snakemake only makes sense if we're running everything in a single giant job (i.e. the answer to questions 1 and 3 is "yes"). If we're not running everything in a single giant job (and I think we shouldn't), then it makes more sense to me to manage this dependency graph in the overarching "orchestration" layer, which for us is GitHub Actions workflows. This would be difficult without the GitHub job staying attached to the AWS Batch job, but we plan to do that anyway!
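For illustration only, the orchestration-layer dependency graph could be expressed with `needs:` between GitHub Actions jobs (the job names and reusable-workflow path below are assumptions based on this thread):

```yaml
# Sketch: the ingest → phylo edge lives in GitHub Actions, not Snakemake,
# regardless of whether each job then launches an AWS Batch job.
jobs:
  ingest:
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    # with: …inputs for the ingest invocation…

  phylo:
    needs: [ingest]   # only runs after ingest completes successfully
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    # with: …inputs for the phylo invocation…
```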
Another thought, prompted by following the builddir issue:
Many of our pathogens¹ could be run in their entirety (ingest + phylo) within a single GitHub Actions job (no AWS involved). This would be even simpler if we used a single snakemake workflow² for the entire repo (e.g.), and could use a single invocation of `pathogen-repo-build` as currently written (e.g.). But that's not a hard requirement; we could also rejig `pathogen-repo-build` to handle multiple workflow invocations with the necessary queuing / conditional execution logic described above. For more complex pathogen repos, even if we used a single Snakefile we would still want multiple invocations to deploy separate parts of the snakemake workflow³ to different AWS jobs.
For collaborators, this means any workflow that fits in a single Actions job (under 6h, under some memory ceiling) could easily be automated and monitored with only a few minor modifications to `pathogen-repo-build`, such as handling nextstrain.org credentials to allow dataset upload to their group, saving the working directory as an artifact, etc.⁴
¹ Probably all or most of the following: mumps, measles, zika, ebola, rsv, hepB, wnv, dengue. Memory requirements may force some of them to be run on AWS.
² This would require phylo to not use separate configs to define separate auspice datasets. I.e. how rsv works, not how mpox does it. But I think this is cleaner anyway.
³ Using targets to separate out which parts of the workflow to run in each job
⁴ And if it doesn't run in a single action job we have docs on how they can set up their own AWS environment.
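A rough sketch of footnote ³: the same Snakefile deployed as separate AWS Batch jobs by requesting different targets in each invocation (the target names are made up; `--aws-batch` selects the Nextstrain CLI's AWS Batch runtime):

```yaml
# Hypothetical workflow steps, one AWS Batch job per set of targets.
- run: nextstrain build --aws-batch . auspice/pathogen_genome.json
- run: nextstrain build --aws-batch . auspice/pathogen_segment.json
```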
> One minor exception is I'd think to use the term build as we define in our glossary, where it encompasses everything that goes into and results in an Auspice dataset JSON, not just the JSON itself.
Thanks -- I've updated my post to use "auspice dataset" to avoid any confusion.
Yeah, agreed that not all pathogens need to run on AWS Batch. ncov-ingest ran on GitHub Actions for the first year before I lifted-and-shifted it to AWS Batch due to disk space.
The memory requirements aren't the only thing to consider: GitHub Actions CPUs are limited in number (4) and in my experience can be slower than elsewhere. Likely not a deal breaker, but something to consider. Disk space can also be an issue for larger pathogens.
Even without Batch involved, though, GitHub Actions is still our base "orchestration" layer, and it'd be a speedup to hoist top-level fully-independent jobs out of the Snakemake workflow and into parallel GitHub Actions jobs, regardless of whether they then launch on AWS Batch.
Some related thoughts I jotted down before lunch.
We'll be using our noon (Seattle) slot tomorrow to discuss this topic in general.
Zika's GH Action workflows are my latest iteration on the standard GH Action workflows. This was built on @tsibley's ideas and the discussion we had back in February.
There are a couple things I'd still like to work through:
1. Should we make `ingest.yaml`, `phylogenetic.yaml`, and `ingest-to-phylogenetic.yaml` reusable workflows, or should they just be workflow templates? My main concern with making them reusable workflows is the variation in pathogen workflow configs, which makes it hard to standardize the config overrides in `ingest.yaml` and `phylogenetic.yaml`.
2. Instead of having a separate `ingest-to-phylogenetic.yaml` workflow that runs on a schedule and glues the two other workflows together, what about having `ingest.yaml` run on that schedule (`on: schedule: …`) and `phylogenetic.yaml` run after `ingest.yaml` (`on: workflow_run: …`) when it a) is successful and b) found new data? This means the manual and automatic workflows are the same, and we'd avoid issues of drift, additional overhead, and multiple sources of workflow run history.
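The trigger in `phylogenetic.yaml` could look roughly like this (`workflow_run` and the `conclusion` check are standard GitHub Actions syntax; the ingest workflow name and the "found new data" signal are assumptions):

```yaml
on:
  workflow_run:
    workflows: ["Ingest"]   # assumed name of the ingest workflow
    types: [completed]

jobs:
  phylogenetic:
    # a) only run if ingest succeeded; b) checking whether ingest actually
    # found new data would need an extra signal from ingest (e.g. an
    # artifact or a cache comparison), which is not shown here.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "trigger the phylogenetic workflow here"
```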
Context
In a pathogen repo, there will technically be 3 separate workflows that can be run: ingest, phylogenetic, and nextclade. It's unclear how we want to set up the standardized GitHub Action workflows for automating them. This was originally brought up in our discussion of the pathogen-repo-template.
Existing workflows
The existing GitHub Action workflows for automated pathogen repos vary mostly for historical reasons:
Open questions
How should multiple `nextstrain build` invocations be maintained? Do we run these as a matrix within the action (i.e. independent AWS jobs), or do we extend the reusable workflow to accommodate this?
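The matrix option could be sketched like this (the build names are hypothetical; `fail-fast: false` keeps the builds independent of each other's failures):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false   # one failed build should not cancel the others
      matrix:
        build: [genome, segment-ha, segment-na]
    steps:
      # Each matrix entry becomes its own Actions job (and AWS Batch job).
      - run: nextstrain build --aws-batch . ${{ matrix.build }}
```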