oxli-bio / oxli

k-mers and the like
BSD 3-Clause "New" or "Revised" License
15 stars 0 forks source link

implement a `kmers()` method #74

Open dr-joe-wirth opened 4 weeks ago

dr-joe-wirth commented 4 weeks ago

Hello,

I use khmer in primerForge to retrieve kmers that appear exactly once in the genome My current workflow is this:

from khmer import Countgraph

def getKmersThatAppearOnce(seq:str, k:int) -> set[str]:
    """gets all the kmers that appears exactly once in a sequence

    Args:
        seq (str): the sequence to get kmers for
        k (int): the kmer length

    Returns:
        set[str]: a set of all the kmers that appear exactly once in the sequence
    """
    # create a countgraph
    kh = Countgraph(k, 1e7, 1)

    # consume the sequence so that kmers can be counted
    kh.consume(seq)

    # extract the kmers that appear once on each strand
    return {x for x in kh.get_kmers(seq) if kh.get(x) == 1}

I am very interested in replacing khmer in my program with oxli, but currently oxli does not have a get_kmers method.

I'd really like if you could implement this. Or at least if you could add how to get all the kmers in a sequence to the documentation if this is not something you wish to implement in oxli.

Thanks so much for building this!

edit: I simplified the code to more clearly demonstrate my use case

Adamtaranto commented 4 weeks ago

Hi Joe,

A kmer-tracking feature is in the works in PR #70 that will store the canonical k-mers corresponding to hashes.

I've got the functionality down but it is in need of optimisation (currently much slower than just counting with hashes) - so khmer may be faster atm.

Usage would be something like:

from oxli import KmerCountTable

def getKmersThatAppearOnce(seq:str, k:int) -> set[str]:
    """gets all the canonical k-mers that appear exactly once in a sequence

    Args:
        seq (str): the sequence to get kmers for
        k (int): the kmer length

    Returns:
        set[str]: a set of all canonical k-mers that appear exactly once in the sequence
    """

    # Init new count table with k-mer tracking enabled
    kct = KmerCountTable(ksize=k, store_kmers=True)

    # Consume sequence string
    kct.consume(seq)

    # drop all records with count > 1
    kct.maxcut(1)

    # dump_kmers() will return a list of (kmer,count) tuples
    return { x for x, y in kct.dump_kmers(file=None)}

Would this do the job?

dr-joe-wirth commented 4 weeks ago

Oh fantastic! I'll try it out tomorrow. Looking forward to the optimized version.

Adamtaranto commented 4 weeks ago

Cool, let me know how the speed compares to khmer.

We still need to make consume() multithreaded as well. Probably only practical for bacterial genomes atm.

ctb commented 4 weeks ago

I hope to get a chance to work on oxli this sunday.

In the meantime, I would be surprised if oxli weren't faster than khmer when compiled in release mode (e.g. a wheel). khmer was not particularly optimized IIRC. But we can and should measure, and of course multithread!

Memory usage is a different issue... I would expect oxli to be nowhere near as memory efficient as khmer was. For now, at least.

dr-joe-wirth commented 3 weeks ago

I tried to use it using the pip install instructions for developmers; I cloned the main branch.

There is no method called dump_kmers so I couldn't compare. Let me know once this is implemented and I will happily do a quick benchmark.

Update:

I have also tried using the integrate_kmer_tracking branch and received the following error when calling the dump_kmers method:

ValueError: K-mer storage is disabled. No hash:kmer map is available.
Adamtaranto commented 3 weeks ago

I think you might have left out the store_kmers option when creating the KmerCountTable.

First, install oxli from the "integrate_kmer_tracking" feature branch.

# Clone this repo
git clone git@github.com:oxli-bio/oxli.git && cd oxli

# Create a dev env
conda env create -f environment.yml -n oxli

# Activate the new env
conda activate oxli

# Fetch list of remote branches
git fetch

# View all available branches
git branch -v -a

# Switch to the branch with kmer tracking features
git switch integrate_kmer_tracking

# Local install with testing dependancies
pip install -e '.[test]'

Test out the dump_kmers() function. This will only work if you set store_kmers = True when creating the KmerCountTable

# Create new table with kmer tracking enabled
kct = KmerCountTable(ksize=4, store_kmers=True)
seq = "AAAAAT"

# Consume sequence string
kct.consume(seq)

# Inspect kmers and counts 
print(kct.dump_kmers(file=None, sortcounts=False, sortkeys=False))

# drop all records with count > 1
kct.maxcut(1)

#Inspect remaining kmers
print(kct.dump_kmers(file=None, sortcounts=False, sortkeys=False))

Note: There was a bug in integrate_kmer_tracking branch that threw an error if you tried to call dump_kmers() on a KmerCountTable after removing records with maxcut(). Bug #75. Should be fixed now.

Error:

PanicException: called `Option::unwrap()` on a `None` value

Let me know if the above doesn't work and instead try something similar to your original solution:

kct = KmerCountTable(ksize=4, store_kmers=True)
seq = "AAAAAT"

# Consume sequence string
kct.consume(seq)

{ x for x,y, in kct.dump_kmers() if y == 1}
Adamtaranto commented 3 weeks ago

To the original request, do we want to add a function to report all the canonical kmers in the table? @ctb

An equivalent to hashes(), maybe kmers()?

Atm I haven't been editing the hash_to_kmer table to remove hash:kmer pairs when a key is dropped from the count table, so it is possible to have kmers in hash_to_kmer that have a count of 0 or have been deleted. I don't really want to update the drop functions to also edit the kmer map.

Using dump_kmers() as above gets around this issue by skipping any kmers that do not exist in the count table. Maybe this is good enough?

dr-joe-wirth commented 3 weeks ago

I was able to test the runtime for getting kmers. It did raise an error using maxcut I haven't reinstalled with the fixed version yet. Here is the code I used:

from khmer import Countgraph
from oxli import KmerCountTable

def khmer_test(seq:str, k:int) -> set[str]:
    """runs khmer

    Args:
        seq (str): sequence to get kmers
        k (int): the kmer length

    Returns:
        set[str]: a set of kmers that appear once
    """
    # create the count graph
    kh = Countgraph(k, 1e7, 1)

    # consume the sequences so kmers can be counted
    kh.consume(seq)

    # extract the kmers that appear once
    out = {x for x in kh.get_kmers(seq) if kh.get(x) == 1}

def oxli_test(seq:str, k:int) -> set[str]:
    """runs oxli in a similar manner to how i use khmer

    Args:
        seq (str): sequence to get kmers
        k (int): the kmer length

    Returns:
        set[str]: a set of kmers that appear once
    """
    # create the count table
    kt = KmerCountTable(k, store_kmers=True)

    # consume the sequence
    kt.consume(seq)

    # get the kmers that appear once
    out = {x for x,y in kt.dump_kmers(file=None) if y == 1}

I wrote another program (not included) to time runs on three different E. coli genomes using these two methods.

Here are the genomes I used:

genome name NCBI accession
i1.gbff GCF_001650295.1
i2.gbff GCF_008727175.1
i3.gbff GCF_002208865.2

Here are the results I obtained. Times in ss.ms format:

genome program runtime
i1.gbff khmer 7.77
i1.gbff oxli 12.61
i2.gbff khmer 7.79
i2.gbff oxli 11.66
i3.gbff khmer 7.79
i3.gbff oxli 11.18

Do you expect the maxcut way of doing things to be significantly faster? If so, I can repeat this with that implementation.

dr-joe-wirth commented 3 weeks ago

Something I noticed is that oxli finds different kmers than khmer. It appears that oxli considers only cannonical kmers while khmer finds all kmers.

I have deleted my previous comment because it was unnecessarily complicated. Here is a simple explanation of what I am seeing.

from Bio.Seq import Seq
from khmer import Countgraph
from oxli import KmerCountTable

# set values to work with
k = 20
fwd = 'ATTCCGGAACCTTGAGAGATTCACGG'
rev = str(Seq(fwd).reverse_complement())

# make graphs
fcg = Countgraph(k, 1e7, 1)
rcg = Countgraph(k, 1e7, 1)
kct = KmerCountTable(k, store_kmers=True)

# consume sequences; khmer requires both
fcg.consume(fwd)
rcg.consume(rev)
kct.consume(fwd)

# get the forward and reverse kmers using khmer
f_kmers = [x for x in fcg.get_kmers(fwd) if fcg.get(x) == 1]
r_kmers = [x for x in rcg.get_kmers(rev) if rcg.get(x) == 1]

# get the cannonical kmers using oxli
o_kmers = [x for x,y in kct.dump_kmers(file=None) if y == 1]

Here are the contents of the lists of kmers:

>>> o_kmers
['CGGAACCTTGAGAGATTCAC',
 'CGTGAATCTCTCAAGGTTCC',
 'CCGTGAATCTCTCAAGGTTC',
 'CCGGAACCTTGAGAGATTCA',
 'GAATCTCTCAAGGTTCCGGA',
 'ATCTCTCAAGGTTCCGGAAT',
 'AATCTCTCAAGGTTCCGGAA']

>>> f_kmers
['ATTCCGGAACCTTGAGAGAT',
 'TTCCGGAACCTTGAGAGATT',
 'TCCGGAACCTTGAGAGATTC',
 'CCGGAACCTTGAGAGATTCA',
 'CGGAACCTTGAGAGATTCAC',
 'GGAACCTTGAGAGATTCACG',
 'GAACCTTGAGAGATTCACGG']

>>> r_kmers
['CCGTGAATCTCTCAAGGTTC',
 'CGTGAATCTCTCAAGGTTCC',
 'GTGAATCTCTCAAGGTTCCG',
 'TGAATCTCTCAAGGTTCCGG',
 'GAATCTCTCAAGGTTCCGGA',
 'AATCTCTCAAGGTTCCGGAA',
 'ATCTCTCAAGGTTCCGGAAT']

Is it possible to get all kmers instead of just cannonical kmers? In my use case, it would be more useful to get all 14 kmers instead of just the 7 cannonical ones.

Adamtaranto commented 3 weeks ago

It might be a little faster to do maxcut with oxli if you you want to grab the latest version of that branch.

I think the extra time is in calculating the revcomp and then choosing the canonical kmer in oxli. This should be addressed in #70.

Only storing canonical kmers is pretty well baked into oxli, so I think the best path is to revcomp the kmers from dump_kmers() and add them to your final set.

dr-joe-wirth commented 3 weeks ago

Only storing canonical kmers is pretty well baked into oxli, so I think the best path is to revcomp the kmers from dump_kmers() and add them to your final set.

I see. If that is the case, then I don't think oxli will suit my purposes. I want kmers that appear exactly once on both strands.

The way I currently use khmer is like this:

# pseudocode
fwd = importSeq(fn)
rev = fwd.reverse_complement()

# use khmer to get kmers that appear once for each strand
fwd_kmers = getKmersThatAppearOnce(str(fwd))
rev_kmers = getKmersThatAppearOnce(str(rev)) 

# get all kmers that appear once across both strands
all_kmers = fwd_kmers.symmetric_difference(rev_kmers)

To do the same with oxli, I believe you are proposing something like this:

from Bio.Seq import Seq

# pseudocode
fwd = importSeq(fn)

# use oxli to get kmers that appear once
cannon_kmers = getKmersThatAppearOnce(str(fwd))

# get all kmers that appear once across both strands
all_kmers = cannon_kmers.union({str(Seq(x).reverse_complement()) for x in cannon_kmers}

when i apply these strategies to real sequence data (E. coli genome), i do not get equivalent sets. In my test, the number of kmers I got from the khmer method is 5,829,972. The number of kmers I got from the oxli method is 9,862,795. The oxli set contains all the kmers that the khmer set contains, but it also includes 4,032,823 additional kmers. Presumably these kmers appear more than once.

can you think of a way for me to resolve this issue? do you think i am approaching this the wrong way?

dr-joe-wirth commented 3 weeks ago

quick update as well.

i ran the two different oxli approaches on the same genbank files in my previous comment. oxli_1 is the same way as before. oxli_2 is the maxcut way. I also got kmers for both strands for both khmer and oxli approaches.

genome program time (ss.ms)
i1.gbff khmer 16.89
i1.gbff oxli_1 38.06
i1.gbff oxli_2 24.62
i2.gbff khmer 14.50
i2.gbff oxli_1 27.50
i2.gbff oxli_2 28.95
i3.gbff khmer 13.54
i3.gbff oxli_1 27.97
i3.gbff oxli_2 28.46
Adamtaranto commented 2 weeks ago

Hi Joe, thanks for the update.

Re your example code, below. When you use oxli to get cannon_kmers you have a set of all the kmers that genuinely occur once in the sequence (that is, considering both strands).

When you reverse complement the cannonical kmer set and then take the union, you will add in the rc for all the non-palindromic kmers. Palindromes will necessarily only occur once in a set.

from Bio.Seq import Seq

# pseudocode
fwd = importSeq(fn)

# use oxli to get kmers that appear once
cannon_kmers = getKmersThatAppearOnce(str(fwd))

# get all kmers that appear once across both strands
all_kmers = cannon_kmers.union({str(Seq(x).reverse_complement()) for x in cannon_kmers}

This should only really end up giving you fewer kmers than the khmer method, as khmer is also allowing kmers that occur once on the fwd and once on the reverse.

Do you see the same final count with both the maxcut() and the manual filter methods?

Could you provide a minimal sequence that replicates this issue? I will try to figure out where the difference is coming from.

dr-joe-wirth commented 2 weeks ago

yes i see the same using maxcut(). the results sets produced are equivalent with both methods.

something weird is definitely going on. get methods are not equivalent between oxli and khmer

from oxli import KmerCountTable
from khmer import Countgraph
from Bio.Seq import Seq

# constant
K = 20

# pseudo code
fwd = importSeq(fn)

# get oxli kmers
kct = KmerCountTable(K, store_kmers=True)
kct.consume(fwd)
kct.maxcut(1)

oxli_kmers = {x for x,y in kct.dump_kmers(file=None)}

# get khmer countgraph
fcg = Countgraph(K, 1e7, 1)
fcg.consume(fwd)

# save sets of kmers for each count
counts = dict()
for kmer in oxli_kmers:
    # get the count of the kmer in the forward strand using the `khmer` countgraph
    c = fcg.get(kmer)

    # save the kmer for this count; create an empty setif need be
    counts[c] = counts.get(c, set())
    counts[c].add(kmer)

# print
for k in sorted(counts.keys()):
    print(f'{k}:\t{len(counts[k])}')

When I printed the results, I had 37 different values for counts:

1:      2914995
2:      1437662
3:      423026
4:      104073
5:      26848
6:      8889
7:      4166
8:      4949
9:      2852
10:     1203
11:     733
12:     767
13:     353
14:     130
15:     53
16:     9
17:     19
18:     113
19:     103
20:     73
21:     46
22:     60
23:     38
24:     105
25:     63
26:     40
27:     21
28:     5
29:     2
30:     2
31:     1
33:     2
48:     1
50:     1
55:     1
60:     1
65:     3

i do not yet have a minimal dataset for you to evaluate.

dr-joe-wirth commented 2 weeks ago

here is a sequence and code that causes some sort of issue.

from Bio.Seq import Seq
from khmer import Countgraph
from oxli import KmerCountTable

# constants; 10,000bp sequences
FWD = 'GTATACAGATCGTGCGATCTACTGTGGATAACTCTGTCAGGAAGCTTGGATCAACCGGTAGTTATCCAAAGAACAACCGTTGTTCAGTTTTTGAGTTGTGTATAACCCCTCATTCTGATCCCAGCTTATACGGTCCAGGATCACCGATCATTCACAGTTAATGATCCTTTCCAGGTTGTTGATCTTAAAAGCCGGATCCTTGTTATCCACAGGGCAGTGCGATCCTAATAAGAGATAACAATAGAACAGATCTCTAAATAAATAGATCTTCTTTTTAATACCCAGGATCCCAGGTCTTTCTCAAGCCGACAAAGTTGAGTAGAATCCACGGCCCGGGCTTCAATCCATTTTCATACCGCGTTATGCGAGGCAATCACCATGTTTTATCCGGATCCTTTTGACGTCATCATCATTGGCGGGGGTCATGCAGGCACCGAGGCCGCGATGGCTGCGGCGCGTATGGGTCAACAGACTCTGCTTTTGACACACAATATCGACACTCTGGGGCAGATGAGCTGCAACCCGGCGATCGGCGGTATTGGGAAGGGACATCTGGTAAAAGAAGTGGATGCACTCGGCGGTCTGATGGCGAAAGCGATCGATCAGGCGGGTATCCAGTTTAGGATACTAAACGCAAGCAAGGGACCGGCAGTTCGCGCTACCCGAGCTCAGGCGGATCGTGTGCTCTACCGTCAGGCGGTACGTACGGCGCTGGAGAACCAACCGAACCTGATGATCTTCCAGCAGGCGGTTGAAGATCTTATTGTCGAAAACGATCGTGTGGTCGGTGCTGTTACCCAAATGGGACTGAAGTTCCGTGCCAAAGCCGTCGTGCTCACCGTTGGGACGTTCCTCGACGGTAAAATTCATATCGGTCTGGATAACTACAGCGGTGGCCGTGCTGGTGATCCGCCGTCCATTCCGCTTTCTCGCCGTTTGCGTGAACTTCCGCTGCGCGTTGGTCGTCTGAAAACCGGGACACCACCGCGTATTGATGCTCGAACCATCGACTTTAGCGTGCTTGCGCAACAGCATGGCGATAACCCAATGCCGGTATTCTCTTTTATGGGCAATGCGTCCCAGCATCCACAGCAGGTGCCGTGTTATATCACCCATACCAACGAGAAAACCCATGATGTGATCCGCAGTAACCTCGATCGTAGCCCAATGTACGCAGGGGTGATCGAAGGTGTCGGCCCACGCTACTGCCCGTCGATCGAAGACAAAGTCATGCGCTTCGCCGACAGAAATCAGCATCAGATCTTCCTTGAACCGGAAGGGCTGACTTCTAACGAAATTTATCCGAACGGTATCTCCACCAGCCTGCCGTTCGATGTGCAGATGCAAATCGTCCGCTCTATGCAGGGGATGGAAAACGCGAAGATCGTGCGTCCGGGTTATGCCATTGAGTATGACTTCTTCGATCCTCGCGACCTGAAACCGACGCTGGAGAGCAAGTTTATCCAGGGGCTGTTCTTTGCTGGTCAGATTAACGGCACTACCGGTTACGAAGAAGCCGCTGCGCAAGGTTTGCTGGCCGGTCTTAACGCTGCCCGTCTGTCTGATGACAAAGAAGGTTGGGCTCCGGCGCGTTCTCAGGCCTATCTCGGCGTACTGGTTGATGACCTGTGCACTTTAGGAACCAAAGAACCGTATCGTATGTTTACCTCGCGCGCAGAATATCGTCTGATGCTGCGCGAAGATAATGCGGATCTGCGTTTGACTGAAATCGGTCGTGAACTGGGCCTGGTGGATGACGAACGTTGGGCGCGCTTTAACGAGAAACTTGAGAATATCGAGCGTGAGCGTCAGCGTCTGAAATCGACCTGGGTAACCCCGTCGGCGGAAGCTGCAGCCGAAGTGAATGCTCACCTGACTGCGCCGCTTTCCCGTGAAGCCAGTGGTGAAGATCTGCTGCGTCGTCCGGAAATGACTTATGAAAAATTAACCACGCTGACGCCGTTTGCCCCTGCGTTGACAGACGAACAGGCGGCGGAACAGGTTGAGATTCAGGTTAAATACGAAGGTTATATCGCGCGCCAGCAAGATGAGATCGAAAAGCAGCTGCGTAACGAGAACACCTTGCTACCCGCGACACTGGATTACCGCCAGGTATCCGGTCTTTCTAACGAAGTGATCGCCAAACTTAACGATCACAAACCGGCCTCTATCGGCCAGGCTTCACGTATTTCTGGCGTCACGCCTGCGGCCATCTCCATTCTGCTGGTGTGGCTGAAAAAACAGGGTATGCTGCGTCGTAGCGCATAACGCATTAAAAATGCCTGGTAAGCCCCCGCTTACCAGGCAACGCATCAAGAACAGGTAATCACCGTGCTCAACAAACTCTCCTTACTGCTGAAAGACGCAGGTATTTCGCTTACCGATCACCAGAAAAACCAGCTTATTGCCTACGTGAATATGCTGCATAAATGGAACAAAGCGTACAACCTGACTTCGGTCCGCGATCCTAATGAGATGCTGGTACGCCATATTCTCGATAGCATTGTGGTGGCACCGTATCTGCAAGGTGAACGGTTTATCGATGTCGGCACCGGTCCTGGACTGCCTGGCATTCCACTCTCTATCGTGCGTCCTGAAGCCCATTTCACTCTGTTGGATAGCCTTGGTAAACGCGTGCGTTTCCTTCGTCAGGTGCAACATGAGCTTAAACTGGAGAATATTGAGCCAGTACAGAGCAGGGTAGAAGAGTTTCCTTCAGAGCCGCCATTTGATGGCGTAATTAGCCGCGCTTTTGCCTCTCTGAACGATATGGTGAGCTGGTGCCACCATCTTCCTGGTGAGCAAGGCCGTTTCTACGCGCTGAAAGGGCAAATGCCGGAAGATGAAATCGCTTTGTTGCCCGAAGAATATCAGGTCGAATCAGTGGTTAAACTTCAGGTTCCAGCCCTGGATGGCGAACGTCATCTGGTGGTGATTAAAGCAAATAAAATTTAATTTTTATCAAAAAAATCATAAAAAATTGACCGGTTAGACTGTTAACAACAACCAGGTTTTCTACTGATATAACTGGTTACATTTAACGCCACGTTCACTCTTTTGCATCAACAAGATAACGTGGCTTTTTTTGGTAAGCAGAAAATAAGTCATTAGTGAAAATATCAGTCTGCTAAAAATCGGCGCTAAGAACCATCATTGGCTGTTAAAACATTATTAAAAATGTCAATGGGTGGTTTTTGTTGTGTAAATGTCATTTATTAAAACAGTATCTGTTTTTAGACTGAAATATCATAAACTTGCAAAGGCATCATTTGCCAAGTAAATAAATATGCTGTGCGCGAACATGCGCAATATGTGATCTGAAGCACGCTTTATCACCAGTGTTTACGCGTTATTTACAGTTTTTCATGATCGAACAGGGTTAGTAGAAAAGTCGCAATTGTATGCACTGGAAAAATATTTAAACATTTATTCACCTTTTGGCTACTTATTGTTTGAAATCACGGGGGCGCACCGTATAATTTGACCGCTTTTTGATGCTTGACTCTAAGCCTTAAAGAAAGTTTTATACGACACGCGGCATACCTCGAAGGGAGCAGGAGTGAAAAACGTGATGTCTGTGTCGCTCGTGAGTCGAAACGTTGCTCGGAAGCTTCTGCTCGTTCAGTTACTGGTGGTGATAGCAAGTGGATTGCTGTTCAGCCTCAAAGACCCCTTCTGGGGCGTCTCTGCAATAAGCGGGGGCCTGGCAGTCTTTCTGCCTAACGTTTTGTTTATGATATTTGCCTGGCGTCACCAGGCGCATACACCAGCGAAAGGCCGGGTGGCCTGGACATTCGCATTTGGTGAAGCTTTCAAAGTTCTGGCGATGTTGGTGTTACTGGTGGTGGCGTTGGCGGTTTTAAAGGCGGTATTCTTGCCGCTGATCGTTACGTGGGTTTTGGTGCTGGTGGTTCAGATACTGGCACCGGCTGTAATTAACAACAAAGGGTAAAAGGCATCATGGCTTCAGAAAATATGACGCCGCAGGATTACATAGGACACCACCTGAATAACCTTCAGCTGGACCTGCGTACATTCTCGCTGGTGGATCCACAAAACCCCCCAGCCACCTTCTGGACAATCAATATTGACTCCATGTTCTTCTCGGTGGTGCTGGGTCTGTTGTTCCTGGTTTTATTCCGTAGCGTAGCCAAAAAGGCGACCAGCGGTGTGCCAGGTAAGTTTCAGACCGCGATTGAGCTGGTGATCGGCTTTGTTAATGGTAGCGTGAAAGACATGTACCATGGCAAAAGCAAGCTGATTGCTCCGCTGGCCCTGACGATCTTCGTCTGGGTATTCCTGATGAACCTGATGGATTTACTGCCTATCGACCTGCTGCCGTACATTGCTGAACATGTACTGGGTCTGCCTGCACTGCGTGTGGTTCCGTCTGCGGACGTGAACGTAACGCTGTCTATGGCACTGGGCGTATTTATCCTGATTCTGTTCTACAGCATCAAAATGAAAGGCATCGGCGGCTTCACGAAAGAGTTGACGCTGCAGCCGTTCAATCACTGGGCGTTCATTCCTGTCAACTTAATCCTTGAAGGGGTAAGCCTGCTGTCCAAACCAGTTTCACTCGGTTTGCGACTGTTCGGTAACATGTATGCCGGTGAGCTGATTTTCATTCTGATTGCTGGTCTGTTGCCGTGGTGGTCACAGTGGATCCTGAATGTGCCGTGGGCCATTTTCCACATCCTGATCATTACGCTGCAAGCCTTCATCTTCATGGTTCTGACGATCGTCTATCTGTCGATGGCGTCTGAAGAACATTAATTTACCAACACTACTACGTTTTAACTGAAACAAACTGGAGACTGTCATGGAAAACCTGAATATGGATCTGCTGTACATGGCTGCCGCTGTGATGATGGGTCTGGCGGCAATCGGTGCTGCGATCGGTATCGGCATCCTCGGGGGTAAATTCCTGGAAGGCGCAGCGCGTCAACCTGATCTGATTCCTCTGCTGCGTACTCAGTTCTTTATCGTTATGGGTCTGGTGGATGCTATCCCGATGATCGCTGTAGGTCTGGGTCTGTACGTGATGTTCGCTGTCGCGTAGTAAGCGTTGCTTTTATTTAAAGAGCAATATCAGAACGTTAACTAAATAGAGGCATTGTGCTGTGAATCTTAACGCAACAATCCTCGGCCAGGCCATCGCGTTTGTCCTGTTCGTTCTGTTCTGCATGAAGTACGTATGGCCGCCATTAATGGCAGCCATCGAAAAACGTCAAAAAGAAATTGCTGACGGCCTTGCTTCCGCAGAACGAGCACATAAGGACCTTGACCTTGCAAAGGCCAGCGCGACCGACCAGCTGAAAAAAGCGAAAGCGGAAGCCCAGGTAATCATCGAGCAGGCGAACAAACGCCGCTCGCAGATTCTGGACGAAGCGAAAGCTGAGGCAGAACAGGAACGTACTAAAATCGTGGCCCAGGCGCAGGCGGAAATTGAAGCCGAGCGTAAACGTGCCCGTGAAGAGCTGCGTAAGCAAGTTGCTATCCTGGCTGTTGCTGGCGCCGAGAAGATCATCGAACGTTCCGTGGATGAAGCTGCTAACAGCGACATCGTGGATAAACTTGTCGCTGAACTGTAAGGAGGGAGGGGCTGATGTCTGAATTTATTACGGTAGCTCGCCCCTACGCCAAAGCAGCTTTTGACTTTGCCGTCGAACACCAAAGTGTAGAACGCTGGCAGGACATGCTGGCGTTTGCCGCCGAGGTAACCAAAAACGAACAAATGGCAGAGCTTCTCTCTGGCGCGCTTGCGCCAGAAACGCTTGCCGAGTCGTTTATCGCAGTTTGTGGTGAGCAACTGGACGAAAACGGTCAGAACCTTATTCGGGTAATGGCTGAAAATGGTCGTCTTAACGCGCTCCCGGATGTTTTGGAGCAGTTTATTCACCTTCGTGCCGTGAGTGAGGCTACCGCTGAGGTAGACGTCATTTCCGCTGCCGCACTGAGTGAACAACAGCTCGCGAAAATTTCTGCTGCGATGGAAAAACGTCTGTCACGCAAAGTTAAGCTGAATTGCAAAATCGATAAGTCTGTAATGGCAGGCGTTATCATCCGAGCGGGTGATATGGTCATTGATGGCAGCGTACGCGGTCGTCTTGAGCGCCTTGCAGACGTCTTGCAGTCTTAAGGGGACTGGAGCATGCAACTGAATTCCACCGAAATCAGCGAACTGATCAAGCAGCGCATTGCTCAGTTCAATGTTGTGAGTGAAGCTCACAACGAAGGTACTATTGTTTCTGTAAGTGACGGTGTTATCCGCATTCACGGCCTGGCCGATTGTATGCAGGGTGAAATGATCTCCCTGCCGGGTAACCGTTACGCTATCGCACTGAACCTCGAGCGCGACTCTGTTGGTGCGGTTGTTATGGGTCCGTATGCTGACCTTGCCGAAGGCATGAAAGTTAAGTGTACTGGCCGTATCCTGGAAGTTCCGGTTGGCCGTGGCCTGCTGGGCCGTGTGGTTAACACTCTGGGTGCACCAATCGACGGTAAAGGTCCGCTGGATCACGACGGCTTCTCTGCTGTAGAAGCAATCGCTCCGGGCGTTATCGAACGTCAGTCCGTAGATCAGCCGGTACAGACCGGTTATAAAGCCGTTGACTCCATGATCCCAATCGGTCGTGGTCAGCGTGAATTGATCATCGGTGACCGTCAGACCGGTAAAACCGCACTGGCTATCGATGCCATCATCAACCAGCGCGATTCCGGTATCAAATGTATCTATGTCGCTATCGGCCAGAAAGCGTCCACCATTTCTAACGTGGTACGTAAACTGGAAGAGCACGGCGCACTGGCTAACACCATCGTTGTGGTAGCAACCGCGTCTGAATCCGCTGCACTGCAATACCTGGCACCGTATGCCGGTTGCGCAATGGGCGAATACTTCCGTGACCGCGGTGAAGATGCGCTGATCATTTACGATGATCTGTCTAAACAGGCTGTTGCTTACCGTCAGATCTCGCTGCTGCTCCGTCGTCCGCCAGGACGTGAAGCATTCCCGGGCGACGTATTCTACCTCCACTCTCGTCTGCTGGAGCGTGCTGCGCGTGTTAACGCCGAATACGTTGAAGCTTTCACCAAAGGTGAAGTGAAAGGGAAAACCGGTTCACTGACCGCGCTGCCGATTATCGAAACTCAAGCGGGTGACGTTTCTGCGTTCGTTCCGACCAACGTAATCTCCATTACCGATGGTCAGATCTTCCTGGAAACCAACCTGTTCAACGCCGGTATTCGTCCTGCGGTTAACCCGGGTATTTCCGTATCCCGTGTTGGTGGTGCAGCACAGACCAAGATCATGAAAAAACTGTCCGGTGGTATCCGTACCGCTCTGGCACAGTATCGTGAACTGGCAGCGTTCTCTCAGTTTGCATCCGACCTTGACGATGCAACACGTAAGCAGCTTGACCACGGTCAAAAAGTGACCGAACTGCTGAAACAGAAACAGTATGCGCCGATGTCCGTTGCGCAGCAGTCTCTGGTTCTGTTCGCAGCAGAACGTGGTTACCTGGCGGATGTTGAACTGTCGAAAATCGGCAGCTTCGAAGCCGCTCTGCTGGCTTACGTCGACCGTGATCACGCTCCGTTGATGCAAGAGATCAACCAGACCGGTGGCTACAACGACGAAATCGAAGGCAAGCTGAAAGGCATCCTCGATTCCTTCAAAGCAACCCAATCCTGGTAACGTCTGGCGGCTTGCCTTAGGGCAGGCCGCAAGGCATTGAGGAGAAGCTCATGGCCGGCGCAAAAGAGATACGTAGTAAGATCGCAAGCGTCCAGAACACGCAAAAGATCACTAAAGCGATGGAGATGGTCGCCGCTTCCAAAATGCGTAAATCGCAGGATCGCATGGCGGCCAGCCGTCCTTATGCAGAAACCATGCGCAAAGTGATTGGTCACCTTGCACACGGTAATCTGGAATATAAGCACCCTTACCTGGAAGACCGCGACGTTAAACGCGTGGGCTACCTGGTGGTGTCGACCGACCGTGGTTTGTGCGGTGGTTTGAACATTAACCTGTTCAAAAAACTGCTGGCGGAAATGAAGACCTGGACCGACAAAGGCGTTCAATGCGACCTCGCAATGATCGGCTCGAAAGGCGTGTCGTTCTTCAACTCCGTGGGCGGCAATGTTGTTGCCCAGGTCACCGGCATGGGGGATAACCCTTCCCTGTCCGAACTGATCGGTCCGGTAAAAGTGATGTTGCAGGCCTACGACGAAGGCCGTCTGGACAAGCTTTACATTGTCAGCAACAAATTTATTAACACCATGTCTCAGGTTCCGACCATCAGCCAGCTGCTGCCGTTACCGGCATCAGATGATGATGATCTGAAACATAAATCCTGGGATTACCTGTACGAACCCGATCCGAAGGCGTTGCTGGATACCCTGCTGCGTCGTTATGTCGAATCTCAGGTTTATCAGGGCGTGGTTGAAAACCTGGCCAGCGAGCAGGCCGCCCGTATGGTGGCGATGAAAGCCGCGACCGACAATGGCGGCAGCCTGATTAAAGAGCTGCAGTTGGTATACAACAAAGCTCGTCAGGCCAGCATTACTCAGGAACTCACCGAGATCGTCTCGGGGGCCGCCGCGGTTTAAACAGGTTATTTCGTAGAGGATTTAAGATGGCTACTGGAAAGATTGTCCAGGTAATCGGCGCCGTAGTTGACGTCGAATTCCCTCAGGATGCCGTACCGCGCGTGTACGATGCTCTTGAGGTGCAAAATGGTAATGAGCGTCTGGTGCTGGAAGTTCAGCAGCAGCTCGGCGGCGGTATCGTGCGTACCATCGCAATGGGTTCCTCCGACGGTCTGCGTCGCGGTCTGGATGTAAAAGACCTCGAACACCCGATCGAAGTCCCGGTAGGTAAAGCGACTCTGGGCCGTATCATGAACGTACTGGGTGAACCGGTCGACATGAAAGGCGAGATCGGTGAAGAAGAGCGTTGGGCGATTCACCGCGCAGCACCTTCCTACGAAGAGCTGTCAAACTCTCAGGAACTGCTGGAAACCGGTATCAAAGTTATCGACCTGATGTGTCCGTTCGCTAAGGGCGGTAAAGTTGGTCTGTTCGGTGGTGCGGGTGTAGGTAAAACCGTAAACATGATGGAGCTCATTCGTAACATCGCGATCGAGCACTCCGGTTACTCTGTGTTTGCGGGCGTAGGTGAACGTACTCGTGAGGGGAACGACTTCTACCACGAAATGACCGACTCCAACGTTATCGATAAAGTATCCCTGGTGTATGGCCAGATGAACGAGCCGCCGGGAAACCGTCTGCGCGTAGCTCTGACCGGTCTGACCATGGCTGAGAAATTCCGTGACGAAGGTCGTGACGTTCTGCTGTTCGTTGACAACATCTATCGTTACACCCTGGCCGGTACGGAAGTATCCGCACTGCTGGGCCGTATGCCTTCAGCGGTAGGTTATCAGCCGACCCTGGCGGAAGAGATGGGCGTTCTGCAGGAACGTATCACCTCCACCAAAACCGGTTCTATCACCTCCGTACAGGCAGTATACGTACCTGCGGATGACTTGACTGACCCGTCTCCGGCAACCACCTTTGCGCACCTTGACGCAACCGTGGTACTGAGCCGTCAGATCGCGTCTCTGGGTATCTACCCGGCCGTTGACCCGCTGGACTCCACCAGCCGTCAGCTGGACCCGCTGGTGGTTGGTCAGGAACACTACGACACCGCGCGTGGCGTTCAGTCCATCCTGCAACGTTATCAGGAACTGAAAGACATCATCGCCATCCTGGGTATGGATGAACTGTCTGAAGAAGACAAACTGGTGGTAGCGCGTGCTCGTAAGATCCAGCGCTTCCTGTCCCAGCCGTTCTTCGTGGCAGAAGTATTCACCGGTTCTCCGGGTAAATACGTCTCCCTGAAAGACACCATCCGTGGCTTTAAAGGCATCATGGAAGGCGAATACGATCACCTGCCGGAGCAGGCGTTCTACATGGTCGGTTCCATCGAAGAAGCTGTGGA'
REV = str(Seq(FWD).reverse_complement())
K = 20

# functions
def get_khmer_kmers(fwd:str, rev:str, k:int) -> set[str]:
    """gets the kmers that appear once in the genome

    Args:
        fwd (str): the forward sequence
        rev (str): the reverse sequence
        k (int): the kmer length

    Returns:
        set[str]: kmers that appear once in the genome
    """
    # make countgraphs
    fcg = Countgraph(k, 1e7, 1)
    rcg = Countgraph(k, 1e7, 1)

    # consume sequences
    fcg.consume(fwd)
    rcg.consume(rev)

    # get kmers that appear once on each strand; ignore chimeras made at junctions
    out = {x for x in fcg.get_kmers(fwd) if fcg.get(x) == 1}
    rks = {x for x in rcg.get_kmers(rev) if rcg.get(x) == 1}

    # only keep kmers that appear once across both strands
    out.symmetric_difference_update(rks)

    return out

def get_oxli_kmers(seq:str, k:int) -> set[str]:
    """gets the kmers that appears once in the genome

    Args:
        seq (str): the sequence
        k (int): the kmer length

    Returns:
        set[str]: kmers that appear once in the genome
    """
    # build count table
    kct = KmerCountTable(k, store_kmers=True)
    kct.consume(seq)

    # only keep kmers that appear once
    kct.maxcut(1)

    # get the kmers and add reverse complements to the set
    out = {x for x,y in kct.dump_kmers(file=None)}
    out.update({str(Seq(x).reverse_complement()) for x in out})

    return out

# test case
khmer_kmers = get_khmer_kmers(FWD, REV, K)
oxli_kmers = get_oxli_kmers(FWD, K)

Examining the sets reveals the following information:

>>> len(khmer_kmers)
19950
>>> len(oxli_kmers)
19962
>>> oxli_kmers.difference(khmer_kmers)
{'AACAACAGACCCAGCACCAC',
 'ATCGCAGCAGAAATTTTCGC',
 'CAACCCGGCGATCGGCGGTA',
 'CGGGCAGTAGCGTGGGCCGA',
 'GAGTACGCAGCAGAGGAATC',
 'GATTCCTCTGCTGCGTACTC',
 'GCGAAAATTTCTGCTGCGAT',
 'GTGGTGCTGGGTCTGTTGTT',
 'TAACGCGGTATGAAAATGGA',
 'TACCGCCGATCGCCGGGTTG',
 'TCCATTTTCATACCGCGTTA',
 'TCGGCCCACGCTACTGCCCG'}

I hope this helps.

Notes:

Adamtaranto commented 2 weeks ago

@dr-joe-wirth thanks for preparing this example.

@ctb I'm not familiar with the workings of khmer, can you take a look at this? Happy to jump in and chase down bugs if you have a lead.

ctb commented 2 weeks ago

The first thing that comes to mind is that khmer has false positives - the countgraph will in some cases overcount things. @dr-joe-wirth if you change the 1e7 to 1e8, you should see fewer overcounts.

Reading through the rest of the comments now ;)

ctb commented 2 weeks ago

More comments: regarding choosing the canonical k-mer, this is implemented on the oxli side, not in the hashing mechanism we're taking from sourmash. So I think it's not that difficult to expose all k-mers, not just the canonical k-mers.

The speed is a different problem. I'm surprised (and unhappy) that we're slower than khmer!

ctb commented 2 weeks ago

so I may start there... with the speed...

dr-joe-wirth commented 2 weeks ago

so I may start there... with the speed...

wonderful. i am really excited for this package.

Adamtaranto commented 6 days ago

@ctb I can work on supporting an "all-kmers" mode if you want to open a new issue.