Closed by jameshadfield 1 month ago
If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.
I really like the top-level per-source folders idea... let's do this. I have always found the data vs results distinction within ingest to be not quite right, so I'm not thrilled about recreating these within each source directory, but I also don't want to get sidetracked from making progress here.
@joverlee521 do you want to take over this PR and build your NCBI work on top of it?
If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.
+1 for this. It's an approach that's worked well for me in the past when ingesting disparate sources into a single database. Each source has bespoke inputs and processing but emits conventional/standardized outputs which can be used and aggregated by downstream steps.
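To make that pattern concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the source names (`source_a`, `source_b`), the field names, and the sample records are hypothetical and are not taken from this repo or from the actual ingest workflows. The idea is just that each source keeps its own bespoke parsing, while downstream aggregation only ever sees one conventional record shape.

```python
# Hypothetical sketch: source names, field names, and sample records are
# illustrative only and do not come from the avian-flu repository.
from typing import Dict, List

# The conventional output fields every source is expected to emit.
STANDARD_FIELDS = ("strain", "date", "segment", "source")

Record = Dict[str, str]


def ingest_source_a(raw: List[dict]) -> List[Record]:
    """Bespoke processing for a hypothetical source that uses its own field names."""
    return [
        {
            "strain": row["name"],
            "date": row["collected"],
            "segment": row["seg"].lower(),
            "source": "source_a",
        }
        for row in raw
    ]


def ingest_source_b(raw: List[str]) -> List[Record]:
    """Bespoke processing for a hypothetical source that ships pipe-delimited text."""
    records = []
    for line in raw:
        strain, segment, date = line.split("|")
        records.append(
            {"strain": strain, "date": date, "segment": segment, "source": "source_b"}
        )
    return records


def validate(records: List[Record]) -> List[Record]:
    """Reject any record that does not match the conventional output fields."""
    for record in records:
        if set(record) != set(STANDARD_FIELDS):
            raise ValueError(f"non-standard record: {record}")
    return records


def aggregate(*batches: List[Record]) -> List[Record]:
    """Downstream step: merge standardized outputs without knowing their origin."""
    merged: List[Record] = []
    for batch in batches:
        merged.extend(validate(batch))
    return merged


# Made-up inputs, one per hypothetical source.
merged = aggregate(
    ingest_source_a([{"name": "A/example/1", "collected": "2024-01-01", "seg": "HA"}]),
    ingest_source_b(["A/example/2|ha|2024-02-01"]),
)
print(len(merged))
```

Adding another source under this scheme would mean writing one more `ingest_*` function, without touching `aggregate`, which is roughly the appeal of putting each source at the top level.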
Tested fauna changes locally with
nextstrain build \
    --envdir ../env.d/seasonal-flu/ \
    ingest upload_all \
    --config "s3_dst=s3://nextstrain-data-private/files/workflows/avian-flu/trial/ingest-fixes" "segments=['ha']"
which successfully uploaded the ha files to the trial prefix.
Tested andersen-lab changes locally with
nextstrain build ingest merge_andersen_segment_metadata
which successfully completed the ingest.
I'm planning to merge this tomorrow morning. I'll make NCBI ingest changes separately.
This PR contains the first 3 commits of #35, as that PR may never be merged.