nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Add an option to the file method to check for md5sum #2491

Open JoseEspinosa opened 2 years ago

JoseEspinosa commented 2 years ago

After a discussion on the nf-core Slack, we (@ewels, @mahesh-panchal) think it would be useful to add a native option to the file method that checks the integrity of staged files. Taking the example shown in the documentation, an option (possibly named checksum) would let the user provide the expected hash, as in the code below:

pdb = file('http://files.rcsb.org/header/5FID.pdb', checksum: 'ba45addcc599af2ac71492f0f55da866')

The idea is that the hash of the file is calculated either when the file is staged or when it is already present in the cache from a previous execution. If the hash does not match the one provided by the user, an exception is raised, similar to what happens when the checkIfExists option is set to true and the file is not found on the system.
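To make the proposed behaviour concrete, here is a minimal Python sketch of the check described above. It is not Nextflow code and not a proposed implementation. The names `check_file`, `md5_of_file` and `ChecksumMismatchError` are hypothetical, and a temporary file stands in for a staged or cached input.

```python
import hashlib
import os
import tempfile


class ChecksumMismatchError(Exception):
    """Hypothetical error type, raised when a file's hash differs from the expected one."""


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path, checksum=None):
    """Return the path if its MD5 matches `checksum`; raise otherwise.

    When no checksum is given, the file is returned unchecked.
    """
    if checksum is not None:
        actual = md5_of_file(path)
        if actual != checksum.lower():
            raise ChecksumMismatchError(
                f"{path}: expected md5 {checksum}, got {actual}"
            )
    return path


# Demo on a temporary file standing in for a staged or cached input.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

good = hashlib.md5(b"hello").hexdigest()
assert check_file(demo_path, checksum=good) == demo_path

try:
    check_file(demo_path, checksum="0" * 32)
    mismatch_detected = False
except ChecksumMismatchError:
    mismatch_detected = True

os.remove(demo_path)
```

Reading in chunks keeps memory use flat for large inputs. The same check would apply whether the file was just downloaded or was found in the cache.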

YPHa commented 2 years ago

I was recently looking for such functionality. However, files from Ensembl should be hashed using "sum" instead of md5sum, so some flexibility in this regard would also be highly appreciated (maybe with an additional argument specifying which kind of hash to compute?).

stevekm commented 1 year ago

To save a little time, you might consider using something faster, like sha1, instead of md5 for this.

bentsherman commented 1 year ago

We might need to support both anyway, since checksums might be provided in either format.

jordeu commented 1 year ago

I'll add support for md5, sha256 and sha1.
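One way to handle several algorithms without an extra argument is to infer the algorithm from the length of the hex digest. The Python sketch below illustrates that idea only. It is an assumption, not what Nextflow does, and `infer_algorithm` and `matches` are hypothetical names. Length-based inference is ambiguous in general, so an explicit algorithm argument, as suggested earlier in the thread, remains the safer design.

```python
import hashlib

# Hex digest length -> algorithm name, for the three algorithms named in this thread.
ALGORITHM_BY_HEX_LENGTH = {32: "md5", 40: "sha1", 64: "sha256"}


def infer_algorithm(checksum):
    """Guess the hash algorithm from the length of a hex digest (hypothetical helper)."""
    value = checksum.strip().lower()
    if any(c not in "0123456789abcdef" for c in value):
        raise ValueError(f"not a hex digest: {checksum!r}")
    try:
        return ALGORITHM_BY_HEX_LENGTH[len(value)]
    except KeyError:
        raise ValueError(f"unsupported digest length {len(value)}") from None


def matches(data, checksum, algorithm=None):
    """Check bytes against a checksum, inferring the algorithm unless one is given."""
    name = algorithm or infer_algorithm(checksum)
    return hashlib.new(name, data).hexdigest() == checksum.strip().lower()


# Round-trip demo: each digest is recognised by its length and then verified.
payload = b"example payload"
for name in ("md5", "sha1", "sha256"):
    expected = hashlib.new(name, payload).hexdigest()
    assert infer_algorithm(expected) == name
    assert matches(payload, expected)
```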