ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Include md5sum in JSON or as other output #302

Open aboffin opened 8 months ago

aboffin commented 8 months ago

Hi,

Thank you for your team's commendable work on datasets, which finally provides a single, comprehensive way to download data from NCBI. Previously one had to resort to a multitude of EUtils/Perl/Python scripts that output something almost, but not quite, entirely unlike what we wanted. However, as with other tools, reliability seems to be an issue.

Is there a way to check the integrity of the downloads? In the typical example that is given, this information does not exist:

./datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
unzip human_GRCh38_dataset.zip -d GRCh38
./datasets rehydrate --directory GRCh38

cd GRCh38/ncbi_dataset/data
grep md5 *json
# outputs nothing

I am perplexed that a mechanism as simple as a checksum was not provided, considering that networks do fail and partial downloads may lead to confusion at best, and incorrect results at worst, when such genomes are used for further analyses.
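
In the absence of checksums in the package itself, one stopgap is to hash the rehydrated files locally and compare them against a trusted list obtained some other way. The following is only a minimal sketch under stated assumptions: it assumes a manifest in the plain `md5sum` text format (`<hex digest>  <relative path>`) exists for the files in question, which datasets does not supply per this issue. The function names `md5_of_file`, `parse_manifest`, and `verify` are hypothetical helpers, not part of any NCBI tool.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse md5sum-style lines ("<hex digest>  <relative path>") into a dict.

    The manifest itself is an assumption: it must come from a trusted source
    outside the downloaded package.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading "*" on the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(root, expected):
    """Return (path, reason) pairs for files that are missing or do not match."""
    problems = []
    for name, digest in sorted(expected.items()):
        target = Path(root) / name
        if not target.is_file():
            problems.append((name, "missing"))
        elif md5_of_file(target) != digest:
            problems.append((name, "checksum mismatch"))
    return problems
```

With a manifest in hand, something like `verify("GRCh38/ncbi_dataset/data", parse_manifest(Path("md5sum.txt").read_text()))` would return an empty list when every listed file is present and intact; the file name `md5sum.txt` here is illustrative only. A truncated download shows up as a mismatch, since the digest covers the full file contents.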

I see that issue #206 raised the same question but it was closed without any definitive answer regarding md5sum.

olearyna commented 8 months ago

Hi aboffin,

Thanks for highlighting this issue. I understand this is an important feature. The NCBI Datasets team is actively exploring the implementation of a checksum mechanism. I'll leave this issue open until it is addressed.

All the best, Nuala

Nuala A. O'Leary, PhD
Product Owner, NCBI Datasets
National Center for Biotechnology Information, NLM, NIH, DHHS