ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

`Submitter Names` lack spaces in `dataformat tsv virus-genome` compared to genbank file serialization #336

Open corneliusroemer opened 8 months ago

corneliusroemer commented 8 months ago

Describe the bug

The `Submitter Names` field has a not-very-robust serialization format of LAST NAME,FIRST NAME INITIALS,LAST NAME,FIRST NAME INITIALS... that does not separate individuals. Is this on purpose? If so, why?

When I look up the original GenBank file for a sequence, there is a space after the initials, before the next last name.

Compare output from

    datasets download virus genome taxon 186538 --no-progressbar --filename results/ncbi_dataset.zip
    dataformat tsv virus-genome --package results/ncbi_dataset.zip --fields submitter-names

for e.g. OR084927 with what's shown for the corresponding .gb file.

CLI output: Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E.,Amuri-Aziza,A.,Muyembe-Mawete,F.,Makangara-Cigolo,J.C.,...

Genbank file: Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E., Amuri-Aziza,A., Muyembe-Mawete,F., Makangara-Cigolo,J.C.,

Note that the GenBank file separates names with a space, which is prudent: without it, the only way to recover individuals is to pair up every two comma-separated tokens and hope that the parity holds for long strings.
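To illustrate the parity concern, here is a minimal Python sketch of the two parsing strategies. The function names `split_spaced` and `split_by_parity` are hypothetical helpers written for this comment, not part of `datasets` or `dataformat`, and the failing input at the end is a made-up example (an entry without initials), not something observed in real output.

```python
def split_spaced(names: str) -> list[str]:
    """Split a GenBank-style string, where ', ' separates individuals."""
    parts = (p.strip().rstrip(",") for p in names.split(", "))
    return [p for p in parts if p]


def split_by_parity(names: str) -> list[str]:
    """Split the space-less TSV string by pairing every two comma tokens.

    This assumes every individual is exactly 'LAST,INITIALS'. An odd token
    count is detectable, but an even count with a shifted pairing is not.
    """
    tokens = [t for t in names.split(",") if t]
    if len(tokens) % 2 != 0:
        raise ValueError(f"odd number of tokens ({len(tokens)}), cannot pair")
    return [f"{tokens[i]},{tokens[i + 1]}" for i in range(0, len(tokens), 2)]


if __name__ == "__main__":
    genbank = "Kinganda-Lusamaki,E., Whitmer,S., Lokilo-Lofiko,E.,"
    tsv = "Kinganda-Lusamaki,E.,Whitmer,S.,Lokilo-Lofiko,E."
    print(split_spaced(genbank))
    print(split_by_parity(tsv))

    # Hypothetical input: one entry without initials breaks the pairing.
    try:
        split_by_parity("Smith,J.,Consortium,Doe,A.")
    except ValueError as err:
        print("parity broken:", err)
```

With the space present, a single missing or extra field affects only one name. Without it, one irregular entry shifts the pairing for every name that follows, and when the token count happens to stay even the error goes unnoticed.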

olearyna commented 7 months ago

Hi corneliusroemer,

Thanks, we'll look into it.

Nuala