nextstrain / pathogen-repo-guide


How should we standardize GitHub Action workflows per repo? #25

Open · joverlee521 opened this issue 5 months ago

joverlee521 commented 5 months ago

Context

In a pathogen repo, there will technically be 3 separate workflows that can be run: ingest, phylogenetic, and nextclade. It's unclear how we want to set up the standardized GitHub Action workflows for automating them. This was originally brought up in our discussion of the pathogen-repo-template.

Existing workflows

The existing GitHub Action workflows for automated pathogen repos vary mostly for historical reasons:

  1. ncov-ingest/ncov - Initial GISAID data concerns led to separate repos, which resulted in separate GH Action workflows. ncov-ingest has daily scheduled GH Action workflows for GISAID and GenBank. Within the Snakemake workflows, they upload results to S3 and trigger the ncov GH Action workflows for GISAID or open builds.
  2. mpox - ingest and phylogenetic workflows are in the same repo, but kept the same S3 upload and trigger setup for GH Action workflows to mirror ncov. There is only a single GenBank ingest GH Action workflow, scheduled to run daily, that triggers three different phylogenetic GH Action workflows with slightly different hard-coded configs (rebuild-mpxv, rebuild-hmpxv1, rebuild-hmpxv1-big).
  3. rsv - ingest and phylogenetic workflows are in the same repo and S3 uploads are kept, but no triggers are used. The ingest GH Action workflow is scheduled to run once a week, and the phylogenetic GH Action workflow is scheduled to run 3 hours after the ingest (sketched below).
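For reference, the rsv-style scheduling in item 3 amounts to a pair of offset cron schedules with no explicit link between the runs. A minimal sketch, assuming hypothetical cron times (the actual repos may use different times):

```yaml
# ingest.yaml -- hypothetical weekly schedule
on:
  schedule:
    - cron: "0 0 * * 0"   # Sundays 00:00 UTC
---
# phylogenetic.yaml -- scheduled 3 hours later; nothing checks whether
# the ingest run actually succeeded or produced new data
on:
  schedule:
    - cron: "0 3 * * 0"   # Sundays 03:00 UTC
```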

Open questions

  1. Do we run a single action for ingest + phylo? Are these separate jobs, e.g. uploading data to S3 and then downloading it for the next job? Or do we have one job which runs all the workflow invocations? There are lots of considerations here about failures and where they occur to think through.
  2. Coordinating ingest + phylo sounds a lot like a DAG, so can we use Snakemake to help here? (previous discussion on Slack)
  3. How should ~~multiple phylogenetic builds~~ phylo builds that require multiple nextstrain build invocations be maintained? Do we run these as a matrix within the action (i.e. independent AWS jobs; see the sketch below), or do we extend the reusable workflow to accommodate this?
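As a concrete illustration of the "matrix within the action" option in question 3, here is a minimal sketch using the build names implied by the mpox workflows above; the config file paths, the runner setup, and the use of nextstrain build --aws-batch are illustrative assumptions, not the current configuration:

```yaml
# hypothetical phylogenetic.yaml fragment
jobs:
  phylogenetic:
    strategy:
      fail-fast: false                   # one failed build shouldn't cancel its siblings
      matrix:
        build: [mpxv, hmpxv1, hmpxv1-big]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run phylogenetic build
        # each matrix entry launches its own independent AWS Batch job
        # (assumes the Nextstrain CLI and AWS credentials are already set up)
        run: |
          nextstrain build --aws-batch phylogenetic \
            --configfile phylogenetic/config/config_${{ matrix.build }}.yaml
```

The alternative in question 3 would push this fan-out into the reusable workflow itself, so each caller repo wouldn't need its own matrix.
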
jameshadfield commented 5 months ago

This is a great summary! Thanks @joverlee521

To avoid confusion, I'll use the following definitions as per a recent Slack conversation:


Thinking about the ideal automated usage of a repo, I'd advocate for the following:

Ingest (may require multiple invocations, e.g. ncov)

phylo: Ideally, failure to produce any auspice dataset would not result in unnecessary failure of the others. For invocations which use wildcard targets, I think we could achieve this with --keep-going + modified upload rules, but I haven't investigated. (This would make reporting on the status a bit more complicated.) For multiple invocations, achieving this independence is easier.

I don't know if I have strong opinions on whether we bundle multiple invocations up into AWS jobs or keep each invocation as its own separate AWS job, or on how to handle the dependency graph, except to say that the simpler the solution the better!

My answers to your open questions would be:

  1. Keep them [edit: them = the AWS jobs] separate. This helps with varying resource requirements and makes the (common) situation of one ingest invocation → many phylo invocations easier to manage.
  2. Here you're talking about a snakemake workflow which runs n snakemake invocations? I'd say no, but I'm interested in others' opinions.
  3. To handle varying resource requirements, I'd say separate AWS jobs. I don't know if the complexity can be lifted into the reusable workflow or if it's better in the caller workflow.

¹ This could be made different from an invocation of nextstrain build because you can get that to run multiple snakemake invocations in the same AWS job.

tsibley commented 5 months ago

I think I generally concur with @jameshadfield's comments above. One minor exception: I'd use the term build as we define it in our glossary, where it encompasses everything that goes into and results in an Auspice dataset JSON, not just the JSON itself.

@joverlee521 wrote:

  1. […] Coordinating ingest + phylo sounds a lot like a DAG so can we use Snakemake to help here?

There's definitely a dependency graph. But ISTM that expressing it in Snakemake only makes sense if we're running everything in a single giant job (i.e. the answer to questions 1 and 3 is "yes"). If we're not running everything in a single giant job (and I think we shouldn't), then it makes more sense to me to manage this dependency graph in the overarching "orchestration" layer, which for us is GitHub Actions workflows. This would be difficult without the GitHub job staying attached to the AWS Batch job, but we plan to do that anyway!
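A minimal sketch of what expressing that dependency graph in the GitHub Actions layer might look like, using ordinary job-level needs:; the job bodies are placeholders, not the real workflows:

```yaml
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the ingest workflow here (possibly attached to an AWS Batch job)"

  phylogenetic:
    needs: ingest               # only starts once the ingest job has succeeded
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the phylogenetic workflow here"
```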

jameshadfield commented 5 months ago

Another thought, prompted by following the builddir issue:

Many of our pathogens¹ could be run in their entirety (ingest + phylo) within a single GitHub Actions job (no AWS involved). This would be even simpler if we used a single snakemake workflow² for the entire repo (e.g.), and could use a single invocation of pathogen-repo-build as currently written (e.g.). But that's not a hard requirement; we could also rejig pathogen-repo-build to handle multiple workflow invocations with the necessary queuing / conditional execution logic described above. For more complex pathogen repos, even if we used a single Snakefile we would still want to use multiple invocations to deploy separate parts of the snakemake workflow³ to different AWS jobs.

For collaborators, this means any job which fits in a single action job (under 6h, under some memory ceiling) could easily be automated and monitored with only a few minor modifications to pathogen-repo-build, such as handling nextstrain.org credentials to allow dataset upload to their group, saving the working directory as an artifact, etc.⁴ (A rough sketch of such a single-job setup follows the footnotes below.)

¹ probably all or most of the following: mumps, measles, zika, ebola, rsv, hepB, wnv, dengue. Memory requirements may force some of them to be run on AWS.

² This would require phylo to not use separate configs to define separate auspice datasets. I.e. how rsv works, not how mpox does it. But I think this is cleaner anyway.

³ Using targets to separate out which parts of the workflow to run in each job

⁴ And if it doesn't run in a single action job we have docs on how they can set up their own AWS environment.
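
To make the "single GitHub Actions job, no AWS involved" idea above concrete, a rough sketch assuming the repo keeps its workflows in ingest/ and phylogenetic/ directories and that the Nextstrain runtime is already set up on the runner:

```yaml
# hypothetical combined workflow for a small pathogen
jobs:
  ingest-and-phylo:
    runs-on: ubuntu-latest      # subject to the ~6h and memory/disk limits noted above
    steps:
      - uses: actions/checkout@v4
      - name: Ingest
        run: nextstrain build ingest
      - name: Phylogenetic
        run: nextstrain build phylogenetic
```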


@tsibley wrote:

  One minor exception is I'd think to use the term build as we define in our glossary, where it encompasses everything that goes into and results in an Auspice dataset JSON, not just the JSON itself.

Thanks -- I've updated my post to use "auspice dataset" to avoid any confusion.

tsibley commented 5 months ago

Yeah, agreed that not all pathogens need to run on AWS Batch. ncov-ingest ran on GitHub Actions for the first year before I lifted-and-shifted it to AWS Batch due to disk space.

The memory requirements aren't the only thing to consider: GitHub Actions CPUs are limited in number (4) and in my experience can be slower than elsewhere. Likely not a deal breaker, but something to consider. Disk space can also be an issue for larger pathogens.

Even without Batch involved, though, GitHub Actions is still our base "orchestration" layer, and it'd be a speed-up to hoist top-level, fully-independent jobs out of the Snakemake workflow and into parallel GitHub Actions jobs, regardless of whether they then launch on AWS Batch.

tsibley commented 5 months ago

Some related thoughts I jotted down before lunch.

We'll be using our noon (Seattle) slot tomorrow to discuss this topic in general.

joverlee521 commented 5 months ago

Notes from our internal discussion

joverlee521 commented 2 months ago

Zika's GH Action workflows are my latest iteration on the standard GH Action workflows. This was built on @tsibley's ideas and the discussion we had back in February.

There are a couple of things I'd still like to work through:

tsibley commented 2 months ago

Instead of having a separate ingest-to-phylogenetic.yaml workflow that runs on a schedule and glues the two other workflows together, what about having ingest.yaml run on that schedule (on: schedule: …) and phylogenetic.yaml run after ingest.yaml (on: workflow_run: …) whenever that run (a) is successful and (b) found new data?
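A minimal sketch of that trigger arrangement, assuming ingest.yaml's workflow is named "Ingest"; detecting "found new data" is left open here, since it would need an extra signal (e.g. a job output or artifact) from the ingest run:

```yaml
# hypothetical phylogenetic.yaml trigger
on:
  workflow_run:
    workflows: ["Ingest"]
    types: [completed]

jobs:
  phylogenetic:
    # (a) only run when the triggering ingest run succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the phylogenetic workflow here"
```

One caveat worth noting: workflow_run triggers only fire for workflows on the repository's default branch.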

This means the manual and automatic workflows are the same, and we'd avoid issues of drift, additional overhead, and multiple sources of workflow run history.