Closed — @huddlej closed this pull request 1 month ago
This approach keeps a separate metadata file per segment to simplify the replacement of fauna download logic in the original workflow and to allow existing rules that expect segment-specific metadata (e.g., rules that add segment counts) to work without additional changes.
This doesn't have to be part of this PR, but a nicer interface to aim towards would be using a single metadata file and adding the segment counts to that file. That would simplify the Snakemake workflow a bit. I'm not sure whether metadata fields would have to be joined across the inputs (i.e., is there metadata that's supplied only for some segments and not others?).
@jameshadfield Good call. The first commit was my attempt to get S3-based data working without breaking any downstream steps in the workflow. But @trvrb had the same request for a single metadata file, so I'll try this out for this PR. Maybe we can chat tomorrow about specifics, though?
In the meantime, I'll also fix the paths to input data for the CI builds.
I think we can plan to merge this PR, once we're happy with it, to include a single metadata file on S3. Then, in a separate PR, we can update the workflow to use the S3 files and switch to using a single metadata file.
Description of proposed changes
Replaces the workflow logic for downloading data from fauna with 1) a custom workflow that downloads from fauna, parses sequences and metadata, and uploads the results to S3, and 2) new logic in the main workflow that downloads the parsed sequences/metadata from S3 and filters to the requested subtype before continuing with the rest of the workflow.
One major change in this implementation is the replacement of one metadata file per subtype and segment with a single metadata file across all segments. The metadata file includes an `n_segments` column with the number of segment sequences available for each metadata record, which allows the original "same strains" path through the phylogenetic workflow to work.

To run the upload to S3:
See the ingest README for more details.
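The single-metadata-file approach described above can be sketched roughly as follows. This is a minimal illustration with made-up strain names and fields; the real workflow's column names, file formats, and join logic may differ:

```python
from collections import Counter

# Hypothetical per-segment metadata, keyed on strain name; the real
# workflow reads these from per-segment TSV files produced by fauna parsing.
segment_metadata = {
    "ha":  {"A/x/2021": {"region": "asia"}, "A/y/2021": {"region": "europe"}},
    "na":  {"A/x/2021": {"region": "asia"}},
    "pb2": {"A/x/2021": {"region": "asia"}, "A/y/2021": {"region": "europe"}},
}

# Count how many segments each strain has a sequence for.
n_segments = Counter(
    strain
    for records in segment_metadata.values()
    for strain in records
)

# Merge into a single metadata table, one row per strain, with an
# n_segments column so "same strains" filtering still works downstream.
# Records present for only some segments are kept (outer-join semantics).
combined = {}
for records in segment_metadata.values():
    for strain, fields in records.items():
        combined.setdefault(strain, dict(fields))["n_segments"] = n_segments[strain]

for strain, row in sorted(combined.items()):
    print(strain, row)
```

A downstream "same strains" rule could then filter on `n_segments == 8` (all segments present) without needing per-segment metadata files.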
After the upload, S3 will have one metadata file for all subtypes and segments and one sequences file per gene segment across all subtypes, like:
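For illustration, the layout might look something like the following. These keys are hypothetical; see the ingest README for the actual paths:

```
s3://nextstrain-data-private/files/workflows/avian-flu/metadata.tsv.zst
s3://nextstrain-data-private/files/workflows/avian-flu/ha/sequences.fasta.zst
s3://nextstrain-data-private/files/workflows/avian-flu/na/sequences.fasta.zst
...
```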
What this means for users
The changes in this PR will be breaking changes for some users, including people who currently have credentials to access fauna but do not have AWS credentials to access the private bucket above. We will need to issue these users AWS credentials that provide at least read access to `nextstrain-data-private`, and they will need to learn how to pass those credentials to tools like the Nextstrain CLI (e.g., through the envdir argument).

Users who want to run the upload workflow will need read/write access to the private bucket. Ideally, we could limit the number of users who need these permissions by implementing the GitHub Action described in the next steps below.
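Passing credentials via an envdir could look roughly like this (the file layout and flag are illustrative; check `nextstrain build --help` for the options your CLI version supports):

```sh
# One file per environment variable, following the envdir convention.
# The values below are placeholders, not real credentials.
mkdir -p aws-creds
echo "PLACEHOLDER_KEY_ID"     > aws-creds/AWS_ACCESS_KEY_ID
echo "PLACEHOLDER_SECRET_KEY" > aws-creds/AWS_SECRET_ACCESS_KEY

# Forward those variables into the build runtime.
nextstrain build --envdir aws-creds .
```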
Next steps
One immediate improvement to the user experience of running the "upload" workflow would be to expose it through a GitHub Action in this repository, so that running the workflow only requires an authorized GitHub user clicking a "Run" button. Once this Action is in place, it could easily be expanded to automatically trigger new phylogenetic builds when the upload completes, just as we do in the seasonal-flu workflow.
Related issue(s)
Checklist