merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
423 stars 144 forks source link

Add checksum checking for NCI downloads in `anvi-setup-ncbi-cogs` #2110

Closed Ge0rges closed 1 year ago

Ge0rges commented 1 year ago

I added checksum verification to anvi-setup-ncbi-cogs. If fails, it prints an error. This handles the fact that COG2014 does not have a checksum file as well. This feature is to handle the incorrect downloads often placing this command due to NCBI servers. See #1738.

While not added here, I find it surprising that the function download_file in utils.py does not ever check if file_size == downloaded_size. I leave it up to the maintainer to determine if that check is necessary, or if it exists elsewhere.

PS: I was able to test both successful and unsuccesful.

meren commented 1 year ago

Thank you very much for this PR, @Ge0rges! It looks reasonable to me and I will merge it immediately so we can test it a bit. It is a shame that COG14 does not have checksums. Since those files never change, in theory, one could generate a checksums file from a successful download, store that file in anvi'o codebase (somewhere under data/misc, I guess), and set the access for that file for the COG14 entry in self.cog_files. This way even if people are working with COG14, they still could benefit from the same solution.

While not added here, I find it surprising that the function download_file in utils.py does not ever check if file_size == downloaded_size. I leave it up to the maintainer to determine if that check is necessary, or if it exists elsewhere.

That is true. Content-Length is often not in response headers, so such a check would only work for a fraction of files that function downloads, but still, that could have been useful at least in some instances :)

Thank you again!