sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Should md5s include file names when they are created? #2359

Open dkoslicki opened 1 year ago

dkoslicki commented 1 year ago

I ran into an interesting situation:

$ grep 58654c15655966372a5eccb6666c5b03 MANIFEST.csv
formatted_db.sig,58654c15655966372a5eccb6666c5b03,58654c15,31,DNA,0,1000,4785,1,GCF_008631595.1_ASM863159v1_genomic.fna,formatted_db.fasta
formatted_db.sig,58654c15655966372a5eccb6666c5b03,58654c15,31,DNA,0,1000,4785,1,GCF_008631565.1_ASM863156v1_genomic.fna,formatted_db.fasta

These two entries have exactly the same md5, and yet the underlying files are different. At this scale factor and k-mer size, the two genomes have Jaccard/containment == 1, but on closer inspection the files themselves do differ.
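One way to find other cases like this is to group manifest rows by md5 and report any md5 that is attached to more than one signature name. The following is a minimal sketch, not part of sourmash: it reads the two rows quoted above from an in-memory string, and the column positions (`MD5_COL`, `NAME_COL`) and the helper names are assumptions inferred from those rows.

```python
import csv
import io
from collections import defaultdict

# The two manifest rows quoted above, without a header line.
ROWS = """\
formatted_db.sig,58654c15655966372a5eccb6666c5b03,58654c15,31,DNA,0,1000,4785,1,GCF_008631595.1_ASM863159v1_genomic.fna,formatted_db.fasta
formatted_db.sig,58654c15655966372a5eccb6666c5b03,58654c15,31,DNA,0,1000,4785,1,GCF_008631565.1_ASM863156v1_genomic.fna,formatted_db.fasta
"""

MD5_COL = 1   # assumed position of the md5 column
NAME_COL = 9  # assumed position of the signature name column


def names_by_md5(handle):
    """Map each md5 to the set of signature names that carry it."""
    groups = defaultdict(set)
    for row in csv.reader(handle):
        if len(row) <= NAME_COL:
            continue  # skip lines too short to hold both columns
        groups[row[MD5_COL]].add(row[NAME_COL])
    return groups


def shared_md5s(handle):
    """Keep only the md5s that are attached to more than one name."""
    return {
        md5: names
        for md5, names in names_by_md5(handle).items()
        if len(names) > 1
    }


dupes = shared_md5s(io.StringIO(ROWS))
for md5, names in dupes.items():
    print(md5, sorted(names))
```

To run this against a real manifest, pass an open file handle in place of the `io.StringIO` object. A header row would form its own single-name group and be filtered out by `shared_md5s`.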

dkoslicki commented 1 year ago

I.e., I was originally under the assumption that the md5 is formed using the entire signature (including the fields hash_function, name, etc.), but given the above, it may only use the signatures field. Looks like it just uses the mins, so two signatures that have the same mins but different abundances would also get the same md5.
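The behaviour described here can be illustrated with a toy model. This is a sketch under stated assumptions, not sourmash's actual implementation: `content_md5` hashes only the k-mer size and the hash values, while `full_md5` is a hypothetical alternative that also folds in the name and abundances. The signatures, names, and values are made up for illustration.

```python
import hashlib


def content_md5(ksize, mins):
    """Hash only the k-mer size and the sorted hash values (the "content")."""
    h = hashlib.md5()
    h.update(str(ksize).encode("utf-8"))
    for value in sorted(mins):
        h.update(str(value).encode("utf-8"))
    return h.hexdigest()


def full_md5(ksize, mins, name, abundances):
    """Hypothetical alternative: also fold in the name and abundances."""
    h = hashlib.md5()
    h.update(content_md5(ksize, mins).encode("utf-8"))
    h.update(name.encode("utf-8"))
    for value in sorted(mins):
        h.update(str(abundances[value]).encode("utf-8"))
    return h.hexdigest()


# Two made-up signatures: identical mins, different names and abundances.
sig_a = {"name": "genome_A", "ksize": 31, "abund": {11: 1, 42: 5, 97: 2}}
sig_b = {"name": "genome_B", "ksize": 31, "abund": {11: 3, 42: 1, 97: 1}}

content_a = content_md5(sig_a["ksize"], sig_a["abund"].keys())
content_b = content_md5(sig_b["ksize"], sig_b["abund"].keys())
full_a = full_md5(sig_a["ksize"], sig_a["abund"].keys(), sig_a["name"], sig_a["abund"])
full_b = full_md5(sig_b["ksize"], sig_b["abund"].keys(), sig_b["name"], sig_b["abund"])

print("content-only md5s match:", content_a == content_b)
print("metadata-aware md5s match:", full_a == full_b)
```

Under the content-only scheme, the two signatures are indistinguishable by md5. The metadata-aware variant separates them, but it would also give two copies of the same sketch different md5s whenever only the name changed.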

ctb commented 1 year ago

interesting - hadn't thought about the abundance situation!

but, yes, the idea I think I had when I was designing md5sum was that it would be a hash of the content only, not the "metadata".

note that md5sum is used extensively in picklists to select signatures.

I'll provide more perspective and link out to other relevant issues in a bit :).