Closed joverlee521 closed 2 weeks ago
Noting changes in the new file:
1. US State column -> we will lose division data
2. Date column is Collection_Date -> only includes year, we will lose specific dates
3. Host column includes a lot more host values!
4. geo_loc_name_country and geo_loc_name_country_continent columns are included, so region and country no longer have to be hardcoded in curate-andersen-lab-data.

For [1] and [2], @trvrb noted that this is fine and we can include the additional data through annotations. This will require adding ./vendored/merge-user-metadata to the andersen-lab curation rule.
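As a rough illustration of what merging annotations would accomplish, here is a minimal sketch of overlaying per-strain annotation values onto curated records. It is not the vendored merge-user-metadata script, whose interface is not shown in this thread. The function name, the field names (strain, division, date), and the sample values are all hypothetical.

```python
def apply_annotations(records, annotations):
    """Overlay per-strain annotation values onto metadata records.

    records: iterable of dicts, each with a 'strain' key (assumed ID field).
    annotations: iterable of (strain, field, value) tuples.
    """
    # Group annotations by strain so each record needs a single lookup.
    by_strain = {}
    for strain, field, value in annotations:
        by_strain.setdefault(strain, {})[field] = value

    merged = []
    for record in records:
        updated = dict(record)  # copy so the input records are not mutated
        updated.update(by_strain.get(record["strain"], {}))
        merged.append(updated)
    return merged


# Hypothetical example values, not real data.
records = [{"strain": "sample-1", "date": "2024", "division": ""}]
annotations = [
    ("sample-1", "division", "Texas"),
    ("sample-1", "date", "2024-03-20"),
]
merged = apply_annotations(records, annotations)
```

Annotation values win over the curated values here, which is what we would want for restoring division and full collection dates.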
For [3], I plan to remove the custom host parsing in curate-andersen-lab-data and switch to using the transform-host script used in the NCBI GenBank ingest. This way they can share a single host map and have the same standard values.
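The shared host map idea could look roughly like the sketch below. This is not the actual transform-host script. The map entries, the function name, and the host field name are assumptions made for illustration only.

```python
# Hypothetical host map: raw values (lowercased) -> standard values.
HOST_MAP = {
    "bos taurus": "Cattle",
    "cattle": "Cattle",
    "gallus gallus": "Chicken",
    "chicken": "Chicken",
}


def standardize_host(record, host_map=HOST_MAP, field="host"):
    """Return a copy of record with its host value mapped to a standard name."""
    raw = (record.get(field) or "").strip()
    # Case-insensitive lookup; unmapped values pass through unchanged
    # so they can be reviewed and added to the shared map later.
    standardized = host_map.get(raw.lower(), raw)
    return {**record, field: standardized}
```

Because both ingests would read from the same map, a new host value only has to be added in one place.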
@trvrb pointed out on Slack that they are now automatically updating SraRunTable_PRJNA1102327_automated.csv.
The ingest workflow is still using the old metadata file from the Andersen-lab repo that was extracted
https://github.com/nextstrain/avian-flu/blob/047a3a23716804628369b8b236a49f9f9354dd57/ingest/build-configs/ncbi/rules/ingest_andersen_lab.smk#L32
To get all of the latest data in our ingest, we need to switch over to the new CSV file.
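For reference, a minimal sketch of reading the columns mentioned in this thread from the new CSV and renaming them to metadata fields. The source column names come from the notes in this issue. The output field names, the function name, and the sample row are assumptions, and the real curation rule may differ.

```python
import csv
import io

# Source column -> metadata field (output names are assumed).
COLUMN_MAP = {
    "Collection_Date": "date",
    "Host": "host",
    "geo_loc_name_country": "country",
    "geo_loc_name_country_continent": "region",
}


def curate_rows(csv_file):
    """Yield one dict per CSV row, keeping only the mapped columns."""
    for row in csv.DictReader(csv_file):
        # Columns absent from the header fall back to an empty string.
        yield {new: row.get(old, "") for old, new in COLUMN_MAP.items()}


# Hypothetical sample row, not real data.
sample = io.StringIO(
    "Collection_Date,Host,geo_loc_name_country,geo_loc_name_country_continent\n"
    "2024,Cattle,USA,North America\n"
)
curated = list(curate_rows(sample))
```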