Right now we're doing purely set-based calculations, but we also have count information for each item in the set that we could use to improve the overall distance estimate. I wrote the following (not working anymore) code out as test, but something like this may work decently (as compared to e.g. cosine distances that may have different sets of issues).
// if the original max_min parameter is larger than the max hash in either
// table, then this will panic b/c i or j will point outside; note that i and j at the
// end are slightly off from what they should be too (see the current distance
// function for better accounting)
while (sketch1[i].hash <= max_hash_value) && (sketch2[j].hash <= max_hash_value) {
if sketch1[i].hash < sketch2[j].hash {
s1_complement += u64::from(sketch1[i].count);
} else if sketch2[j].hash < sketch1[i].hash {
s2_complement += u64::from(sketch2[j].count);
} else {
s1_intersect += u64::from(sketch1[i].count);
s2_intersect += u64::from(sketch2[j].count);
i += 1;
j += 1;
}
}
// the normalization here is kind of made-up, but we do need some way to
// make sure that sequencing depth differences don't impact the comparison
let containment: f64 = s1_intersect as f64 / (s1_intersect as f64 + s1_complement as f64);
let common = s1_intersect;
// we're scaling the s2 counts to the range of s1 that they share here
let total = s1_complement + s1_intersect + (s2_intersect + s2_complement) * s1_intersect / s2_intersect;
let jaccard: f64 = common as f64 / total as f64;
Have you implement this weighted idea and merged into the main branch? except the one you point out, baminhash and probminhash are also weighted set method.
Right now we're doing purely set-based calculations, but we also have count information for each item in the set that we could use to improve the overall distance estimate. I wrote the following (not working anymore) code out as test, but something like this may work decently (as compared to e.g. cosine distances that may have different sets of issues).
Ioffe also has a paper on weighted MinHash that might be useful to understand here too: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36928.pdf