ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Version/date specifier to download old instead of latest genome annotations #393

Closed gatoniel closed 1 month ago

gatoniel commented 1 month ago

Is your feature request related to a problem? Please describe. We are using the Vibrio cholerae annotation GCF_000006745.1 (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006745.1/). In our lab, we have recently discovered that the latest annotation (from 08/24/2023) differs from the previous annotation (from 08/10/2022). For reproducibility, we would like to re-download the old version as some of our analysis was based on it.
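One way to notice this kind of silent change early is to record a checksum of the annotation file an analysis was actually run against. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `annotation_unchanged` are hypothetical, and the file path is whatever local copy of the annotation was downloaded.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def annotation_unchanged(path, recorded_digest):
    """Return True if the file at `path` still matches the recorded digest."""
    return sha256_of_file(path) == recorded_digest
```

Storing the digest next to the analysis results does not bring back an old annotation, but it makes it obvious when a fresh download no longer matches the one the results were based on.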

Describe the solution you'd like Would it be possible to add a date or version specifier to the datasets CLI, so that one can download the genome annotation that was the latest for a given accession number on a certain date? This would allow us to test the reproducibility of an analysis pipeline. Currently, there seems to be no way to control for updates to the database, which is a problem for reproducibility.
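In the absence of such a specifier, a pipeline can at least check which annotation release it received. The sketch below is an illustration under stated assumptions, not a feature of the datasets CLI: it assumes the JSON printed by `datasets summary genome accession GCF_000006745.1` has a top-level `reports` list whose entries carry `accession` and `annotation_info.release_date`. Those field names, the ISO date format, and the helper names are assumptions that should be checked against real output; the sample data is hand-written.

```python
import json


def annotation_release_dates(summary_json):
    """Map each assembly accession to its annotation release date.

    Assumes (not verified here) a top-level "reports" list in which each
    report has "accession" and "annotation_info" -> "release_date".
    """
    data = json.loads(summary_json)
    dates = {}
    for report in data.get("reports", []):
        accession = report.get("accession")
        release = report.get("annotation_info", {}).get("release_date")
        if accession is not None:
            dates[accession] = release
    return dates


def check_pinned(summary_json, accession, pinned_date):
    """Return True if the accession's annotation date equals the pinned one."""
    return annotation_release_dates(summary_json).get(accession) == pinned_date


# Hand-written sample in the assumed shape; this is not real CLI output.
SAMPLE = json.dumps({
    "reports": [
        {"accession": "GCF_000006745.1",
         "annotation_info": {"release_date": "2023-08-24"}}
    ]
})

if __name__ == "__main__":
    print(annotation_release_dates(SAMPLE))
    print(check_pinned(SAMPLE, "GCF_000006745.1", "2022-08-10"))
```

A check like this only detects that the annotation has moved on from the pinned date; it does not retrieve the older annotation.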

It could be that I completely misunderstand something about the genome annotation database. If so, it would be great if you could help me correct my wrong assumptions.

EDIT: I just found out how to access old versions via the browser: https://www.ncbi.nlm.nih.gov/nuccore/15600771?report=girevhist. However, this goes through nuccore and not through datasets/genome.

EDIT2: I've also found the --released-after specifier, but it seems to do something different.

ericcox1 commented 1 month ago

Hi @gatoniel,

Thanks for opening this issue.

As you have already discovered, the previous annotations are available in Nucleotide. Here are links to the full GenBank flat files for the previous annotations from 8/10/2022, for chromosome I (NC_002505.1) and chromosome II (NC_002506.1): NC_002505.1 NC_002506.1

We are not planning to make this data available via NCBI Datasets in the near term, but we may reconsider in the future if there is sufficient user demand.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets NIH/NLM/NCBI eric.cox@nih.gov