ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Flag obsolete records in dataset #365

Closed joverlee521 closed 5 months ago

joverlee521 commented 5 months ago

Hi NCBI Datasets team,

Is it possible to flag obsolete GenBank records in a downloaded dataset?

PP767032 has been replaced by PP766984, and this replacement is currently shown on the record's webpage:

[Image: Screenshot 2024-05-20 at 4 44 32 PM]

However, if I download both records via the CLI, I cannot find this information in the metadata.

$ datasets download virus genome accession PP766984.1 PP767032.1 --filename test.zip

I thought I could rely on GenBank accessions to be unique records in the dataset, but that is not the case here. It would be helpful to either filter out obsolete records with a --no-obsolete flag or have a field in the data_report.jsonl that flags obsolete records.
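Until something like this exists, a client-side filter is one possible workaround. The following is only a minimal sketch, not part of NCBI Datasets: it assumes each line of data_report.jsonl is a JSON object with a top-level "accession" field, and `split_obsolete` and the `replaced_by` mapping are hypothetical names for a replacement list I would have to maintain by hand.

```python
import json


def split_obsolete(report_lines, replaced_by):
    """Split data-report records into (current, obsolete) lists.

    report_lines: iterable of JSON Lines strings, one record per line.
    replaced_by:  hypothetical, hand-maintained dict mapping an obsolete
                  accession (without version suffix) to its replacement.
    """
    current, obsolete = [], []
    for line in report_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: every record has a top-level "accession" field,
        # e.g. "PP767032.1"; drop the ".1" version suffix before lookup.
        base = record["accession"].split(".")[0]
        if base in replaced_by:
            obsolete.append(record)
        else:
            current.append(record)
    return current, obsolete


# Two minimal, made-up report lines for the accessions in this issue.
lines = [
    '{"accession": "PP766984.1"}',
    '{"accession": "PP767032.1"}',
]
current, obsolete = split_obsolete(lines, {"PP767032": "PP766984"})
print([r["accession"] for r in current])
print([r["accession"] for r in obsolete])
```

In practice the lines would come from the data_report.jsonl file inside the downloaded archive. This only works for replacements I already know about, which is why a field in the report itself would be preferable.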

olearyna commented 5 months ago

Hi joverlee521,

Thank you for bringing this issue to our attention. I have reported the problem to the NCBI Virus group, and they are currently investigating why the replaced sequence has not been removed. We will keep this issue open until it is resolved.

Nuala

Nuala A. O'Leary, PhD
Product Owner, NCBI Datasets
National Center for Biotechnology Information, NLM, NIH, DHHS

olearyna commented 5 months ago

Hi joverlee521,

The replaced sequences have been removed from NCBI Datasets, so I am closing this issue.