nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.

Add genbank_accession_rev field #359

Closed (chaoran-chen closed 1 year ago)

chaoran-chen commented 1 year ago

Context

For the "open" instance of LAPIS, we use data from https://data.nextstrain.org/files/ncov/open/. It would be very useful if users could see which version of each sequence they are getting from LAPIS. Unfortunately, the genbank_accession field in the metadata.tsv file contains only the accession, without the version number suffix (e.g., OV377246).

The mpox data already contains a field with the version number.

Description

It would be great to have an additional column (genbank_accession_rev?) that contains the version number as well (e.g., OV377246.1).
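To illustrate how a consumer such as LAPIS might use the proposed column, here is a minimal Python sketch. It assumes the new column is named genbank_accession_rev and holds values of the form ACCESSION.VERSION; the helper names, the sample rows, the strain names, and the XX000000 accession are made up for the example and are not taken from ncov-ingest or from the real metadata.tsv.

```python
import csv
import io
from typing import Dict, Optional, Tuple


def split_accession(accession_rev: str) -> Tuple[str, Optional[int]]:
    """Split 'OV377246.1' into ('OV377246', 1); version is None if absent."""
    base, sep, version = accession_rev.strip().partition(".")
    if sep and version.isdigit():
        return base, int(version)
    return base, None


# Hypothetical metadata excerpt; the second row has no versioned accession.
SAMPLE_TSV = (
    "strain\tgenbank_accession\tgenbank_accession_rev\n"
    "hypothetical/strain-1\tOV377246\tOV377246.1\n"
    "hypothetical/strain-2\tXX000000\t\n"
)


def read_versions(tsv_text: str) -> Dict[str, Tuple[str, Optional[int]]]:
    """Map each strain to (accession, version), falling back to the
    unversioned genbank_accession when genbank_accession_rev is empty
    or the column is missing (assumed behaviour, not the pipeline's)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        rev = row.get("genbank_accession_rev") or row["genbank_accession"]
        result[row["strain"]] = split_accession(rev)
    return result


versions = read_versions(SAMPLE_TSV)
```

With a column like this, the version travels with each record, so downstream tools would not need to query GenBank separately to find out which revision they hold.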

(cc @cecivale)

j23414 commented 1 year ago

Looks like we're pulling "genbank_accession_rev" here:

But somehow it's not being passed onward to the metadata. For Dengue, I had to add it to the list of output columns like

But in ncov-ingest I think we'd have to add it to:

@joverlee521 does that look right?