sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

Add k-mer positions while sketching #2985

Open mr-eyes opened 7 months ago

mr-eyes commented 7 months ago

I have a suggestion that may be considered for a future version of sourmash. The current Minhash includes hashes and abundances. I propose adding an optional array called positions, which would consist of a float number in the format of {sequence_id}.{kmer_position}. This additional information could be used in so many applications.

I got this idea when I was using the k-mer position information to make the following plot.

image

ctb commented 7 months ago

thanks @mr-eyes! I love the idea! One thing here is that the current arrays collapse duplicates (while counting multiplicity) so this would have to be a whole separate array.

I think a really useful intermediate idea would be to support an expanded hashing generator function - one that, fed a sequence, iterates across the sequence and yields a tuple of (hashval, sequence, strand, position). This would enable a lot of useful features like the above without changing the sketch format itself.

mr-eyes commented 7 months ago

Thank you! That's a good approach. Another approach that I should have been more clearer about to solve the deduplication problem:

Unique Hashes Array: [10, 20, 30, 40]

This array represents the hashes of unique kmers found across 2 sequences.

Abundances Array: [2, 3, 1, 4]

Positions Array (With sequence_id.kmer_position Format):

[1.3, 2.5, 1.7, 1.9, 2.2, 2.8, 2.14, 1.12, 1.15, 2.18]

This array should be read as follows:

•   The first two floats 1.3 and 2.5 correspond to the positions of the kmer with hash 10 in sequence 1 and sequence 2, respectively.
•   The next three floats 1.7, 1.9, 2.2 correspond to the positions of the kmer with hash 20 across the sequences, indicating it was found twice in sequence 1 and once in sequence 2.
•   The single float 2.8 corresponds to the kmer with hash 30, found once in sequence 2.
•   The last four floats 2.14, 1.12, 1.15, 2.18 correspond to the kmer with hash 40, indicating it was found across both sequences with varying positions.
ctb commented 7 months ago

additional thought - this would primarily be useful for reference genomes, not for shotgun sequence. although I could see it being useful for long reads!

more more thought - there's no reason we couldn't have a new binary container that supported this format, but that was readable as straight FracMinHash sketches via a plug-in. That way you could have this information today, but still be able to use all the sourmash search, query, and introspection functionality.