nextstrain/avian-flu

Nextstrain build for avian influenza viruses
http://nextstrain.org/avian-flu

Download data from S3 to start workflow #22

Closed: huddlej closed this 1 month ago

huddlej commented 2 months ago

Description of proposed changes

Replaces the workflow logic for downloading data from fauna with:

1. a custom workflow that downloads from fauna, parses sequences and metadata, and uploads the results to S3, and
2. new main workflow logic that downloads the parsed sequences/metadata from S3 and filters to the requested subtype before continuing with the rest of the workflow.

One major change in this implementation is the replacement of one metadata file per subtype and segment with a single metadata file across all subtypes and segments. The metadata file includes an n_segments column with the number of segment sequences available for each metadata record, which allows the original "same strains" path through the phylogenetic workflow to work.
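As a rough sketch of how that column could be used (not part of this PR's rules; the file names and the subtype column here are hypothetical), augur filter's --query option could restrict the combined metadata to records of a given subtype that have sequences for all eight segments:

augur filter \
    --metadata metadata.tsv \
    --query "subtype == 'h5n1' & n_segments == 8" \
    --output-strains same_strains.txt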

To run the upload to S3:

cd ingest
nextstrain build \
    --env RETHINK_HOST \
    --env RETHINK_AUTH_KEY \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    .

See the ingest README for more details.

After the upload, S3 will have one metadata file covering all subtypes and segments and one sequences file per gene segment across all subtypes.
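As a sketch, the uploaded files can be listed with the AWS CLI; the prefix under the bucket here is a hypothetical placeholder, not necessarily the layout this workflow uses:

aws s3 ls s3://nextstrain-data-private/files/workflows/avian-flu/ --recursive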

What this means for users

The changes in this PR will be breaking changes for some users, including people who currently have credentials to access fauna but do not have AWS credentials to access the private bucket. We will need to issue these users AWS credentials that provide at least read access to nextstrain-data-private, and they will need to learn how to pass those credentials to tools like the Nextstrain CLI (e.g., through the --env or --envdir arguments).
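For example, a user with read-only credentials already exported in their shell could forward them into the build environment:

nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    .

or pass them from a directory of files (the directory name here is only a placeholder):

nextstrain build --envdir ~/nextstrain-aws-creds .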

Users who want to run the upload workflow will need read/write access to the private bucket. Ideally, we could limit the number of users who need these permissions by building the GitHub Action described under "Next steps" below.

Next steps

One immediate improvement to the user experience of running the "upload" workflow would be to expose it through a GitHub Action in this repository, such that running the workflow only entails an authorized GitHub user clicking a "Run" button. Once this Action is in place, it could easily be expanded to automatically trigger new phylogenetic builds when the upload completes, just like we do in the seasonal-flu workflow.
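If the Action is set up with a workflow_dispatch trigger, the same run could also be started from the command line with the GitHub CLI (the workflow file name here is hypothetical):

gh workflow run upload-to-s3.yaml --repo nextstrain/avian-flu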

Related issue(s)

Checklist

jameshadfield commented 2 months ago

> This approach keeps a separate metadata file per segment to simplify replacement of the fauna download logic in the original workflow and to allow existing rules that expect segment-specific metadata (e.g., adding segment counts) to work without additional changes.

This doesn't have to be part of this PR, but a nicer interface to aim towards would be using a single metadata file and adding the segment counts to that file. It would simplify the Snakemake workflow a bit. I'm not sure whether metadata fields would have to be joined across the inputs (i.e., is there metadata that's only supplied for some segments and not others?).
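For what it's worth, a very rough sketch of deriving those counts from the per-segment metadata files (file names are hypothetical, and this assumes the strain name is the first column of each TSV):

awk -F'\t' 'FNR > 1 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' \
    results/metadata_*.tsv > results/n_segments.tsv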

huddlej commented 2 months ago

@jameshadfield Good call. The first commit was my attempt to get S3-based data working without breaking any downstream steps in the workflow. But @trvrb had the same request for a single metadata file, so I'll try this out for this PR. Maybe we can chat tomorrow about specifics, though?

In the meantime, I'll also fix the paths to input data for the CI builds.

trvrb commented 2 months ago

I think we can plan to merge this PR, when we're happy with it, to include a single metadata file on S3. Then, in a separate PR, we can update the workflow to use the S3 files and switch to using that single metadata file.