nextstrain / avian-flu

Nextstrain build for avian influenza viruses
http://nextstrain.org/avian-flu

ingest/andersen-lab: Switch to automated CSV file #53

Closed: joverlee521 closed this 2 weeks ago

joverlee521 commented 2 weeks ago

@trvrb pointed out on Slack that the Andersen lab is now automatically updating SraRunTable_PRJNA1102327_automated.csv.

The ingest workflow still uses the old, previously extracted metadata file from the Andersen-lab repo:

https://github.com/nextstrain/avian-flu/blob/047a3a23716804628369b8b236a49f9f9354dd57/ingest/build-configs/ncbi/rules/ingest_andersen_lab.smk#L32

To get all of the latest data in our ingest, we need to switch over to the new CSV file.

joverlee521 commented 2 weeks ago

Noting changes in the new file:

  1. There is no US State column -> we will lose division data.
  2. The date column is now Collection_Date, which only includes the year -> we will lose specific dates.
  3. The Host column includes many more host values.
  4. geo_loc_name_country and geo_loc_name_country_continent have been added, so region and country no longer have to be hardcoded in curate-andersen-lab-data.
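To make the column changes concrete, here is a rough sketch of how a row from the new CSV might map onto curated fields. It is not the actual curate-andersen-lab-data code: the `Run` ID column, the output field names, the sample row, and the choice to pad year-only dates into Augur's ambiguous `YYYY-XX-XX` form are all assumptions for illustration.

```python
import csv
import io

# Made-up sample row. "Run" as the ID column is an assumption; the other
# column names are the ones noted in the list above.
SAMPLE_CSV = (
    "Run,Collection_Date,Host,geo_loc_name_country,geo_loc_name_country_continent\n"
    "SRR0000001,2024,Cattle,USA,North America\n"
)


def curate_rows(handle):
    """Map raw run-table columns onto curated metadata fields."""
    records = []
    for row in csv.DictReader(handle):
        year = row["Collection_Date"].strip()
        records.append({
            "accession": row["Run"],
            # Year-only dates are padded into Augur's ambiguous date form.
            "date": f"{year}-XX-XX" if len(year) == 4 and year.isdigit() else year,
            "host": row["Host"],
            # Region and country come straight from the file, not hardcoded.
            "country": row["geo_loc_name_country"],
            "region": row["geo_loc_name_country_continent"],
            # No US State column in the new file, so division starts empty.
            "division": "",
        })
    return records


records = curate_rows(io.StringIO(SAMPLE_CSV))
print(records[0])
```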

For [1] and [2], @trvrb noted that this is fine and we can include the additional data through annotations. This will require adding ./vendored/merge-user-metadata to the andersen-lab curation rule.
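As a rough illustration of what restoring division and exact dates through annotations could look like, here is a minimal sketch of overlaying per-record annotations onto curated metadata. It does not reproduce the vendored merge-user-metadata script or its interface: the (id, field, value) annotation layout, the `accession` ID field, and all sample values are assumptions.

```python
def apply_annotations(records, annotations, id_field="accession"):
    """Overlay (id, field, value) annotations onto matching records."""
    overrides = {}
    for record_id, field, value in annotations:
        overrides.setdefault(record_id, {})[field] = value
    # Records without annotations pass through unchanged.
    return [{**record, **overrides.get(record[id_field], {})} for record in records]


# Hypothetical curated records with year-only dates and no division.
curated = [
    {"accession": "SRR0000001", "date": "2024-XX-XX", "division": ""},
    {"accession": "SRR0000002", "date": "2024-XX-XX", "division": ""},
]

# Hypothetical annotations that fill in the data lost in the new CSV.
annotations = [
    ("SRR0000001", "division", "Texas"),
    ("SRR0000001", "date", "2024-03-15"),
]

merged = apply_annotations(curated, annotations)
print(merged)
```

Keeping the overrides in a separate annotations file means the automated CSV can be re-ingested as-is while the manual corrections stay under version control.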

For [3], I plan to remove the custom host parsing in curate-andersen-lab-data and switch to the transform-host script used in the NCBI GenBank ingest. This way both ingests can share a single host map and produce the same standard values.
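The shared host map idea can be sketched as a single lookup table that both ingests consult. This is only an illustration, not the transform-host script: the map entries, the case-insensitive matching, and the pass-through behavior for unmapped hosts are all assumptions.

```python
# Hypothetical host map; the real entries live in the shared host map file.
HOST_MAP = {
    "bos taurus": "Cattle",
    "cattle": "Cattle",
    "dairy cattle": "Cattle",
    "gallus gallus": "Avian",
}


def standardize_host(raw_host, host_map=HOST_MAP):
    """Return the standard host value, or the raw value if it is unmapped."""
    key = raw_host.strip().lower()
    return host_map.get(key, raw_host)


# Both ingests would call the same function with the same map.
for raw in ["Bos taurus", "dairy cattle", "Unmapped host"]:
    print(raw, "->", standardize_host(raw))
```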