onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License
92 stars 8 forks source link

Better distances using counts #41

Open bovee opened 4 years ago

bovee commented 4 years ago

Right now we're doing purely set-based calculations, but we also have count information for each item in the set that we could use to improve the overall distance estimate. I wrote the following (not working anymore) code out as test, but something like this may work decently (as compared to e.g. cosine distances that may have different sets of issues).

    // if the original max_min parameter is larger than the max hash in either
    // table, then this will panic b/c i or j will point outside; note that i and j at the
    // end are slightly off from what they should be too (see the current distance
    // function for better accounting)
    while (sketch1[i].hash <= max_hash_value) && (sketch2[j].hash <= max_hash_value) {
        if sketch1[i].hash < sketch2[j].hash {
            s1_complement += u64::from(sketch1[i].count);
        } else if sketch2[j].hash < sketch1[i].hash {
            s2_complement += u64::from(sketch2[j].count);
        } else {
            s1_intersect += u64::from(sketch1[i].count);
            s2_intersect += u64::from(sketch2[j].count);
            i += 1;
            j += 1;
        }
    }
    // the normalization here is kind of made-up, but we do need some way to
    // make sure that sequencing depth differences don't impact the comparison
    let containment: f64 = s1_intersect as f64 / (s1_intersect as f64 + s1_complement as f64);
    let common = s1_intersect;
    // we're scaling the s2 counts to the range of s1 that they share here
    let total = s1_complement + s1_intersect + (s2_intersect + s2_complement) * s1_intersect / s2_intersect;
    let jaccard: f64 = common as f64 / total as f64;

Ioffe also has a paper on weighted MinHash that might be useful to understand here too: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36928.pdf

jianshu93 commented 2 years ago

Hello,

Have you implement this weighted idea and merged into the main branch? except the one you point out, baminhash and probminhash are also weighted set method.

Jianshu