nextstrain / avian-flu

Nextstrain build for avian influenza viruses

http://nextstrain.org/avian-flu

Ingest fixes #36

Closed. jameshadfield closed this 1 month ago

jameshadfield commented 1 month ago

This PR contains the first 3 commits of #35, since that PR may never be merged.

jameshadfield commented 1 month ago

> If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

I really like the top-level per-source folders idea... let's do this. I have always found the data vs results distinction within ingest to be not quite right, so I'm not thrilled about recreating these within each source directory, but I also don't want to get sidetracked on making progress here.

@joverlee521 do you want to take over this PR and build your NCBI work on top of it?

tsibley commented 1 month ago

> If we are namespacing every independent ingest workflow, I wonder if it would make more sense for the data "source" to be at the top level.

+1 for this. It's an approach that's worked well for me in the past when ingesting disparate sources into a single database. Each source has bespoke inputs and processing but emits conventional/standardized outputs which can be used and aggregated by downstream steps.
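To illustrate the pattern described above, here is a minimal Python sketch, not code from this repository. Everything in it is hypothetical: the `Record` schema, the input column names (`locus`, `isolate`, `gene`), and the function names are all assumptions chosen for illustration. Each source gets its own bespoke parser, every parser emits the same standardized record, and the downstream step depends only on that shared shape.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Record:
    """Standardized output emitted by every source (hypothetical schema)."""
    strain: str
    segment: str
    source: str


def ingest_fauna(rows: Iterable[dict]) -> List[Record]:
    # Bespoke parsing: assumes fauna-style rows with "strain" and "locus" keys.
    return [Record(r["strain"], r["locus"].lower(), "fauna") for r in rows]


def ingest_andersen_lab(rows: Iterable[dict]) -> List[Record]:
    # Bespoke parsing: assumes rows with "isolate" and "gene" keys.
    return [Record(r["isolate"], r["gene"].lower(), "andersen-lab") for r in rows]


def aggregate(batches: Iterable[List[Record]]) -> List[Record]:
    # Downstream step: knows nothing about any source's input format,
    # only the shared Record shape.
    merged = [rec for batch in batches for rec in batch]
    return sorted(merged, key=lambda rec: (rec.segment, rec.strain))
```

Under these assumptions, adding a new source (NCBI, say) would mean writing one more `ingest_*` function, while `aggregate` stays unchanged.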

joverlee521 commented 1 month ago

Tested fauna changes locally with:

nextstrain build \
    --envdir ../env.d/seasonal-flu/ \
    ingest upload_all \
        --config "s3_dst=s3://nextstrain-data-private/files/workflows/avian-flu/trial/ingest-fixes" "segments=['ha']"

which successfully uploaded the ha files to the trial prefix.

Tested andersen-lab changes locally with:

nextstrain build ingest merge_andersen_segment_metadata

which successfully completed the ingest.

I'm planning to merge this tomorrow morning. I'll make NCBI ingest changes separately.