one2all/new2all output counts for common k-mers between multiple db-samples

refresh-bio / kmer-db

Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).

GNU General Public License v3.0


one2all/new2all output counts for common k-mers between multiple db-samples #12

Open mihkelvaher opened 4 years ago

mihkelvaher commented 4 years ago

Hi!

The one2all/new2all modes report the intersection sizes with a new sample like this:

s1: 100/150
s2: 200/300
s3: 50/1000
...

Is there any way to get more detailed information about which k-mers are shared? For example, given these counts, I have no idea whether the 50 k-mers seen in s3 are also present in s1 or s2. The preferred output would be something like this:

s1: 50/50
s2: 200/300
s3: 0/900
s1 AND s3: 50/100
...

This can be achieved by creating all of the intersections beforehand, but looking at the kmer-db database structure, I was hoping to skip that step.
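To make the request concrete, here is a minimal sketch of that breakdown on plain Python sets. It does not use kmer-db or its database format; `membership`, `exclusive_counts`, and the toy k-mers are hypothetical names and data chosen only for illustration. Each k-mer is assigned to the exact set of samples that contain it, so every k-mer is counted once and only the combinations that actually occur are produced.

```python
from collections import Counter

def membership(kmer, samples):
    """Return the exact set of sample names whose k-mer set contains kmer."""
    return frozenset(name for name, kmers in samples.items() if kmer in kmers)

def exclusive_counts(query, samples):
    """Map each occurring sample combination to (found_in_query, total_in_db)."""
    totals = Counter()
    for kmer in set().union(*samples.values()):
        totals[membership(kmer, samples)] += 1
    found = Counter(membership(kmer, samples) for kmer in query)
    # Only combinations present in the database are reported;
    # query k-mers that match no sample are ignored.
    return {combo: (found[combo], totals[combo]) for combo in totals}

# Toy data (hypothetical), standing in for db-samples and a new sample.
samples = {
    "s1": {"AAA", "AAC", "AAG"},
    "s2": {"AAC", "ACC"},
    "s3": {"AAG", "AGG", "ATT"},
}
query = {"AAA", "AAG", "ACC", "TTT"}

result = exclusive_counts(query, samples)
for combo, (hit, total) in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(" AND ".join(sorted(combo)), f"{hit}/{total}")
```

Because the grouping is driven by the k-mers themselves, the number of reported combinations can never exceed the number of distinct k-mers, even though the number of possible combinations is much larger.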

Regards, Mihkel

agudys commented 4 years ago

Dear Mihkel,

We can consider adding the functionality you mentioned to kmer-db. However, the number of all possible intersections grows exponentially with the number of queries. Wouldn't it be better to let the user state explicitly which intersections they are interested in?
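As a rough illustration of both points: n sets have 2^n - 1 non-empty combinations, while an explicit list keeps the work proportional to what was asked for. The sketch below uses plain Python sets, not kmer-db; `combination_count`, `requested_intersections`, and the toy data are hypothetical and only meant to show the idea. Here an intersection means the k-mers present in all named samples, regardless of other samples.

```python
def combination_count(n):
    """Number of non-empty combinations of n sets."""
    return 2 ** n - 1

def requested_intersections(query, samples, requested):
    """For each requested tuple of sample names, return (found_in_query, shared_total)."""
    report = {}
    for names in requested:
        # k-mers common to every sample named in this request
        common = set.intersection(*(samples[name] for name in names))
        report[names] = (len(query & common), len(common))
    return report

# Toy data (hypothetical).
samples = {
    "s1": {"AAA", "AAC", "AAG"},
    "s2": {"AAC", "ACC"},
    "s3": {"AAG", "AGG", "ATT"},
}
query = {"AAA", "AAG", "ACC", "TTT"}

report = requested_intersections(query, samples, [("s1", "s3"), ("s1", "s2")])
print(combination_count(3), combination_count(20))
print(report)
```

With 20 samples there are already more than a million possible combinations, which is why enumerating all of them up front is impractical.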

Regards, Adam

mihkelvaher commented 4 years ago

Hi!

The number of intersections does indeed grow fast. Could the reported intersections be limited by the number of k-mers shared by the references? For example, if s1, s2 and s3 share fewer than 1000 k-mers, that intersection would not be shown. Also, showing only the intersections where something was actually found during the search would reduce the output size significantly.
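A minimal sketch of these two filters, assuming a report that maps a tuple of sample names to a (found, total) pair. The function `filter_report` and its parameters are hypothetical, and the numbers are the example counts from the preferred output earlier in this thread, not real results.

```python
def filter_report(report, min_shared=1000, hits_only=True):
    """Drop small multi-sample intersections and, optionally, rows with no hits."""
    kept = {}
    for names, (found, total) in report.items():
        # The size threshold applies only to intersections of several references.
        if len(names) > 1 and total < min_shared:
            continue
        # Optionally hide rows where the query matched nothing.
        if hits_only and found == 0:
            continue
        kept[names] = (found, total)
    return kept

# Example counts taken from the preferred output above.
report = {
    ("s1",): (50, 50),
    ("s2",): (200, 300),
    ("s3",): (0, 900),
    ("s1", "s3"): (50, 100),
}

kept = filter_report(report, min_shared=1000)
print(kept)
```

With a threshold of 1000 the `s1 AND s3` row is dropped because only 100 k-mers are shared, and `s3` is dropped because nothing was found in it. Lowering `min_shared` to 100 would keep the `s1 AND s3` row.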

blahah commented 3 years ago

Just run all2all as well :)