sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

rust load_signature trusts & caches (without checking) the md5sum in the signature file #1180

Open ctb opened 4 years ago

ctb commented 4 years ago

on latest,

cd tests/test-data/
jq . < genome-s10+s11.sig > jq.sig
sourmash search genome-s10+s11.sig jq.sig -k 21 --dna

yields:

1 matches:
similarity   match
----------   -----
 96.8%       s10+s11

wat?

ctb commented 4 years ago

sourmash sig describe shows identical minhashes based on md5sum:

signature filename: jq.sig
signature: s10+s11
source file: -
md5: 8a619747693c045afde376263841806b
k=21 molecule=DNA num=500 scaled=0 seed=42 track_abundance=0
size: 500
signature license: CC0

vs

signature filename: genome-s10+s11.sig
signature: s10+s11
source file: -
md5: 8a619747693c045afde376263841806b
k=21 molecule=DNA num=500 scaled=0 seed=42 track_abundance=0
size: 500
signature license: CC0
ctb commented 4 years ago

(boggle)

olgabot commented 4 years ago

???? this is bizzare ..

ctb commented 4 years ago

I wrote a script --

#! /usr/bin/env python
import sourmash
old = sourmash.load_one_signature('old1.sig', ksize=21, select_moltype='dna')
jq = sourmash.load_one_signature('jq1.sig', ksize=21, select_moltype='dna')

print(old.minhash.similarity(jq.minhash))

x = set(old.minhash.hashes)
y = set(jq.minhash.hashes)
print(f'{len(x - y)} in old but not in new')
print(f'{len(y - x)} in new but not in old')
print(x - y)
print(y - x)

print(old.md5sum())
print(jq.md5sum())

and got the following output:

0.968
14 in old but not in new
14 in new but not in old
{9363903249964323, 9324819557339109, 9177767560730791, 9335360823692169, 9029543384217101, 9123156062912911, 9360058626564465, 9144313946356755, 9108143333754387, 9292443588546677, 9366616833410327, 9260099400862233, 9353758096949499, 9124442343550655}
{9124442343550656, 9366616833410328, 9363903249964324, 9324819557339108, 9177767560730792, 9335360823692168, 9029543384217100, 9360058626564464, 9123156062912912, 9292443588546676, 9144313946356756, 9108143333754388, 9260099400862232, 9353758096949500}
8a619747693c045afde376263841806b
8a619747693c045afde376263841806b

two things stand out --

ctb commented 4 years ago

I wrote some more python code:


import hashlib
m = hashlib.md5()
m.update(str(old.minhash.ksize).encode('utf-8'))
for hash in old.minhash.hashes:
    m.update(str(hash).encode('utf-8'))
print(m.hexdigest())

m = hashlib.md5()
m.update(str(jq.minhash.ksize).encode('utf-8'))
for hash in jq.minhash.hashes:
    m.update(str(hash).encode('utf-8'))
print(m.hexdigest())

and got

8a619747693c045afde376263841806b
3ed333703b853431dd822fa4cbee95e3

so that's good.

ctb commented 4 years ago

ok, more breadcrumbs -- I added

mh = jq.minhash.copy_and_clear()
mh.add_many(jq.minhash.hashes)

from sourmash._lowlevel import ffi, lib
from sourmash.utils import RustObject, rustcall, decode_str
print(decode_str(mh._methodcall(lib.kmerminhash_md5sum)))

and we get

3ed333703b853431dd822fa4cbee95e3
1.0

so it appears that when you force recalculation of the md5sum in the "new" (jq-transformed) sketch, it agrees with the Python one that is calculated on the hashes. So there is a problem in the Rust code somewhere where the md5sum from the JSON is being loaded and cached without being checked or reset.

This is in addition to jq just transforming some numbers for the heck of it.

ctb commented 4 years ago

I think the jq thing is caused by precision issues, see https://github.com/stedolan/jq/issues/1741.

ctb commented 4 years ago

so, umm, I think I give up on the jq problem. Will leave this here for the Rust md5 issue.

ctb commented 4 years ago

I went in and hand-edited the md5sum in the JSON signature file, and verified that sourmash signature describe loaded and displayed the incorrect md5sum without checking it. So this is now officially a specific bug on the rust side :)

luizirber commented 4 years ago

so it appears that when you force recalculation of the md5sum in the "new" (jq-transformed) sketch, it agrees with the Python one that is calculated on the hashes. So there is a problem in the Rust code somewhere where the md5sum from the JSON is being loaded and cached without being checked or reset.

Yup, it loads the md5sum from the signature without checking. Probably needs to replace this block with something that build the MinHash with an empty md5sum, trigger the md5sum calculation, and then assert that the new one is the same as the one read from the file. Maybe:

let result = KmerMinHash {
    num,
    ksize: tmpsig.ksize,
    seed: tmpsig.seed,
    max_hash: tmpsig.max_hash,
    md5sum: Mutex::new(None),
    mins,
    abunds,
    hash_function,
};

let new_md5sum = result.md5sum();

// TODO: return an error instead of using assert_eq!
assert_eq!(new_md5sum, tmpsig.md5sum);

Ok(result)
ctb commented 4 years ago

...as an old collaborator once said after I introduced him to testing, "ok, I can see that if it's not tested, the functionality will be lost." 😆