sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
465 stars 78 forks source link

Return k-mers in addition to sketch/hash values #2073

Open dkoslicki opened 2 years ago

dkoslicki commented 2 years ago

This is a feature request (as discussed with @bluegenes earlier today) to have the k-mers that correspond to the hash values also returned when doing a sourmash sketch. I was made aware of sourmash sig kmers, but this is a two-pass method that can be quite slow on large files (the intended use-case). There exists in the API a way to do this, but it would be most useful as a flag to sourmash sketch (say, --save-kmers).

The intended use cases include:

  1. Progress towards ArgFracMinHash as discussed in #2039
  2. I plan on implementing a ternary search tree data structure that will help with large databases of sketches. This will be used for functional profiling investigations, and may help @bluegenes too in speeding up large-scale queries.
ctb commented 2 years ago

what format would be desirable here? CSV hashes <=> k-mers?

dkoslicki commented 2 years ago

I was initially thinking another entry in the JSON output, similar to how the mins and counts are currently stored as ordered lists.

I also forgot to mention, in order to get #2039 to to work properly, you'll want to make a uniform decision about what to do with collisions. The most straightforward way to do this is that if two or more k-mers hash to the same value, then store the lexicographically smallest k-mer.

mr-eyes commented 2 years ago

I think this can be achieved by creating another vector of kmers in the Rust side alongside the two main vectors mins and abunds.https://github.com/sourmash-bio/sourmash/blob/13bf0d55b5c9c49b0702f223e22617585eff33bb/src/core/src/sketch/minhash.rs#L50-L53

Maybe saving it in memory as a two-bit representation? but this will limit the kmer-size. As long as the output is a text format, then just the kmer characters will be okay to be saved in the JSON.

mr-eyes commented 2 years ago

Sorry for the confusion. Two-bit representation can't work due to encodings other than DNA.

@dkoslicki Will sorting the kmers as a post-process after saving the sketch work? Instead of serializing the kmers in their lexicographical order?

dkoslicki commented 2 years ago

@mr-eyes I don't know if I quite understand. Do you mean some sort of second pass (when you mentioned post processing)? I imagine that during the min hashing, when inserting hashes, you keep another list in parallel and insert k-mers in it.

The lexicographic part is only for resolving collisions. So the k-mers are kept in the same order as the hashes. It's just if ever you have two distinct k-mers hashing to the same value, just save the k-mers with the smaller lexico value.

Or do I misunderstand your question?

mr-eyes commented 2 years ago

@dkoslicki @ctb Please correct me if I am wrong. My guess about collisions is that minhashes are less likely to produce collisions, especially when they are scaled.

If my assumption is correct, I would suggest implementing this part without adding extra conditions to resolve collisions for speed.

I imagine that during the min hashing, when inserting hashes, you keep another list in parallel and insert k-mers in it.

Yes, exactly!

ctb commented 2 years ago

@dkoslicki @ctb Please correct me if I am wrong. My guess about collisions is that minhashes are less likely to produce collisions, especially when they are scaled.

collisions do happen - we've found one or two in the wild, over in spacegraphcats! - and since scaled is applied after hashing, the scaled factor does not affect them.

I was initially thinking another entry in the JSON output, similar to how the mins and counts are currently stored as ordered lists.

we've talked about this elsewhere - originally in https://github.com/sourmash-bio/sourmash/issues/211, and there's an extended discussion here.

I think the place to start is to return them when sketching. Saving them into the signature JSON is a whole separate thing and likely to take a while to happen; if there's an intermediate output format that would work for you, @dkoslicki, that would probably be much easier.

dkoslicki commented 2 years ago

Collisions will invariably happen; you'll have a list of hashes/mins with values [v1,v2,v3] and a list of corresponding k-mers [k1, k2, k3], and then you have another k-mer k4 with hash v2. Yet k4 != k2, so what to do!? Replace k2 with k4, or leave k2 as is? The idea is to check which is lexicographically smaller: k2 or k4. If k4 is smaller, then your lists become [v1,v2,v3] and [k1,k4,k3], if k2 is smaller, then the lists stay [v1,v2,v3] and [k1,k2,k3].

There is no speed penalty since there is exactly one extra comparison to be had (is k2<=k4 or k2>k4)

dkoslicki commented 2 years ago

Saving them into the signature JSON is a whole separate thing and likely to take a while to happen; if there's an intermediate output format that would work for you, @dkoslicki, that would probably be much easier.

I see! Then a simple csv in the same order as the JSON hashes would be great