sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
472 stars 80 forks source link

gather is calculating `remaining_bp` incorrectly #3194

Closed ctb closed 4 months ago

ctb commented 4 months ago

It looks like it is not taking into account the unidentified minhashes that are subtracted from the query sketch before entering into GatherResult.

this may be the source of the difference noticed between Rust-based gather in https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/331 and sourmash gather output.

more here soon!

ctb commented 4 months ago

Using the same brutal debugging tactics from #3193, the following script:

import sourmash

combined_sig = list(sourmash.load_file_as_signatures('combined.sig',
                                                     ksize=21))[0]
combined_mh = combined_sig.minhash
combined_set = set(combined_mh.hashes)

match1_sig = list(sourmash.load_file_as_signatures('match1_NC_000853.1.sig',
                                                   ksize=21))[0]
match1_mh = match1_sig.minhash
match1_set = set(match1_mh.hashes)

match2_sig = list(sourmash.load_file_as_signatures('match2_NC_011978.1.sig',
                                                   ksize=21))[0]

match2_mh = match2_sig.minhash
match2_set = set(match2_mh.hashes)

match5_sig = list(sourmash.load_file_as_signatures('match5_NC_009486.1.sig',
                                                   ksize=21))[0]
match5_mh = match5_sig.minhash
match5_set = set(match5_mh.hashes)

###

print('orig len:', len(combined_set) * combined_mh.scaled)

f_match_1 = len(match1_set.intersection(combined_set)) / len(match1_set)
combined_set -= match1_set

print('after first match removed:', len(combined_set) * combined_mh.scaled)

f_match_2 = len(match2_set.intersection(combined_set)) / len(match2_set)
combined_set -= match2_set
print('after second match removed:', len(combined_set) * combined_mh.scaled)

f_match_5 = len(match5_set.intersection(combined_set)) / len(match5_set)
combined_set -= match5_set

print('after third match removed:', len(combined_set) * combined_mh.scaled)

yields:

orig len: 14660000
after first match removed: 12740000
after second match removed: 11050000
after third match removed: 10130000

which is what calculate_gather_stats returns (tested in commit 6abb5cf9 in #3193).

However, sourmash gather returns numbers based on the identified portion of the sketch only:

image
ctb commented 4 months ago

Fixing in https://github.com/sourmash-bio/sourmash/pull/3195