Closed ctb closed 4 months ago
Using the same brutal debugging tactics from #3193, the following script:
import sourmash
combined_sig = list(sourmash.load_file_as_signatures('combined.sig',
ksize=21))[0]
combined_mh = combined_sig.minhash
combined_set = set(combined_mh.hashes)
match1_sig = list(sourmash.load_file_as_signatures('match1_NC_000853.1.sig',
ksize=21))[0]
match1_mh = match1_sig.minhash
match1_set = set(match1_mh.hashes)
match2_sig = list(sourmash.load_file_as_signatures('match2_NC_011978.1.sig',
ksize=21))[0]
match2_mh = match2_sig.minhash
match2_set = set(match2_mh.hashes)
match5_sig = list(sourmash.load_file_as_signatures('match5_NC_009486.1.sig',
ksize=21))[0]
match5_mh = match5_sig.minhash
match5_set = set(match5_mh.hashes)
###
print('orig len:', len(combined_set) * combined_mh.scaled)
f_match_1 = len(match1_set.intersection(combined_set)) / len(match1_set)
combined_set -= match1_set
print('after first match removed:', len(combined_set) * combined_mh.scaled)
f_match_2 = len(match2_set.intersection(combined_set)) / len(match2_set)
combined_set -= match2_set
print('after second match removed:', len(combined_set) * combined_mh.scaled)
f_match_5 = len(match5_set.intersection(combined_set)) / len(match5_set)
combined_set -= match5_set
print('after third match removed:', len(combined_set) * combined_mh.scaled)
yields:
orig len: 14660000
after first match removed: 12740000
after second match removed: 11050000
after third match removed: 10130000
which is what calculate_gather_stats
returns (tested in commit 6abb5cf9 in #3193).
However, sourmash gather
returns numbers based on the identified portion of the sketch only:
It looks like it is not taking into account the unidentified minhashes that are subtracted from the query sketch before entering into
GatherResult
.this may be the source of the difference noticed between Rust-based gather in https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/331 and sourmash gather output.
more here soon!