nextstrain / seasonal-flu

Scripts, config, and snakefiles for seasonal-flu Nextstrain builds

Add GitHub Action for Nextclade annotations #158

Closed huddlej closed 5 months ago

huddlej commented 5 months ago

Description of proposed changes

Adds rules, config, and GitHub Action file to support running Nextclade on all available HA and NA sequences for H1N1pdm, H3N2, and Vic.

The new Snakemake logic lives in a custom build config named nextclade which skips most of the standard workflow build logic, using only the "download from S3" rules and its own custom rules to run Nextclade for the lineages and segments defined in the build config. This custom logic runs Nextclade with the default dataset per lineage and segment just like the flu_frequencies workflow and uploads the results to S3. Once we have merged this PR, we should be able to automatically run Nextclade with each ingest of new data and run flu_frequencies from the resulting files on S3.
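
For a concrete picture of that custom logic, a minimal sketch of per-lineage, per-segment Nextclade rules is shown below. This is an illustration only, not the exact rules in this PR: the rule names, wildcard layout, file paths, and dataset names are assumptions.

```
# Minimal sketch of per-lineage, per-segment Nextclade rules.
# Rule names, wildcards, paths, and dataset names are assumptions.
LINEAGES = ["h1n1pdm", "h3n2", "vic"]
SEGMENTS = ["ha", "na"]

rule all_nextclade:
    input:
        expand(
            "nextclade/{lineage}/{segment}/nextclade.tsv",
            lineage=LINEAGES,
            segment=SEGMENTS,
        ),

rule get_nextclade_dataset:
    output:
        dataset=directory("nextclade/{lineage}/{segment}/dataset"),
    params:
        # Hypothetical mapping to default Nextclade dataset names (e.g., "flu_h3n2_ha").
        name=lambda wildcards: f"flu_{wildcards.lineage}_{wildcards.segment}",
    shell:
        "nextclade dataset get --name {params.name} --output-dir {output.dataset}"

rule run_nextclade:
    input:
        sequences="data/{lineage}/{segment}.fasta",
        dataset="nextclade/{lineage}/{segment}/dataset",
    output:
        annotations="nextclade/{lineage}/{segment}/nextclade.tsv",
        alignment="nextclade/{lineage}/{segment}/aligned.fasta",
    shell:
        """
        nextclade run \
            --input-dataset {input.dataset} \
            --output-tsv {output.annotations} \
            --output-fasta {output.alignment} \
            {input.sequences}
        """
```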

This PR includes a couple of minor changes to other parts of the standard workflow to allow the Nextclade build config YAML to be as simple as possible and also to allow all workflows to download the parsed sequences and metadata from S3 instead of downloading the raw sequences and parsing them again locally.
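
The second of those changes could look roughly like the rules below; the bucket name, key layout, and xz compression are placeholder assumptions rather than the workflow's actual values.

```
# Sketch of pulling already-parsed inputs from S3 instead of re-parsing locally.
# The bucket name, key layout, and compression are placeholder assumptions.
rule download_parsed_sequences:
    output:
        sequences="data/{lineage}/{segment}.fasta",
    shell:
        "aws s3 cp s3://example-bucket/{wildcards.lineage}/{wildcards.segment}/sequences.fasta.xz - "
        "| xz -dc > {output.sequences}"

rule download_parsed_metadata:
    output:
        metadata="data/{lineage}/metadata.tsv",
    shell:
        "aws s3 cp s3://example-bucket/{wildcards.lineage}/metadata.tsv.xz - "
        "| xz -dc > {output.metadata}"
```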

To run the Nextclade workflow, use the following command:

nextstrain build . upload_all_nextclade_files --configfile profiles/nextclade.yaml

This uploads Nextclade annotations and alignment files to S3 per lineage and segment like seasonal-flu/vic/ha/nextclade.tsv.xz and seasonal-flu/vic/ha/aligned.fasta.xz.

This PR does not produce a merged metadata and Nextclade annotations file like the one used in the flu_frequencies workflow or the analogous ncov metadata files with Nextclade annotations included. I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.
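
To make the trade-off concrete, a per-lineage, per-segment merge might look like the sketch below; the file paths and key columns ("strain" in the metadata, "seqName" in Nextclade's output) are assumptions. Every segment's merged file would carry a full copy of the lineage-level metadata columns, which is the duplication in question.

```
# Hypothetical per-lineage, per-segment merge of metadata with Nextclade
# annotations. Paths and column names are assumptions, not the workflow's.
import pandas as pd

metadata = pd.read_csv("data/vic/metadata.tsv", sep="\t")            # one file per lineage
nextclade = pd.read_csv("nextclade/vic/ha/nextclade.tsv", sep="\t")  # one file per segment

merged = metadata.merge(
    nextclade,
    left_on="strain",    # assumed metadata key column
    right_on="seqName",  # assumed Nextclade sequence name column
    how="left",
)

# Each segment's merged file repeats all of the lineage-level metadata columns.
merged.to_csv("metadata_with_nextclade_ha.tsv", sep="\t", index=False)
```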

Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.

Related issue(s)

Related to #144

Checklist

joverlee521 commented 5 months ago

Not doing an in-depth review, just reading over this PR because I'm interested in how you handled the different segments.

I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.

It'd be good to discuss what's easiest for downstream users here (granted, that may currently only be the flu_frequencies workflow?).

Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.

It's possible to test a new GitHub Action with either Tom's or Victor's workaround.

huddlej commented 5 months ago

It'd be good to discuss what's easiest for downstream users here

Totally agree, @joverlee521! @plsteinberg is one such downstream user who has to manually join the metadata and HA Nextclade annotations for their project, so we might have an idea of how that could be improved based on their feedback. šŸ˜„

It's possible to test a new GitHub Action with either Tom's or Victor's workaround.

Gah, I knew this existed but forgot the mechanics. Thank you for the reminder! Trying with Victor's approach now.

huddlej commented 5 months ago

I'm going to merge this now, so I can start running the GitHub Action after our weekly ingests. We should continue to discuss this implementation in the future, though, since there is likely still room to improve the user experience.