Open dr-joe-wirth opened 4 weeks ago
Hi Joe,
A kmer-tracking feature is in the works in PR #70 that will store the canonical k-mers corresponding to hashes.
I've got the functionality down but it is in need of optimisation (currently much slower than just counting with hashes) - so khmer may be faster atm.
Usage would be something like:
from oxli import KmerCountTable
def getKmersThatAppearOnce(seq:str, k:int) -> set[str]:
"""gets all the canonical k-mers that appear exactly once in a sequence
Args:
seq (str): the sequence to get kmers for
k (int): the kmer length
Returns:
set[str]: a set of all canonical k-mers that appear exactly once in the sequence
"""
# Init new count table with k-mer tracking enabled
kct = KmerCountTable(ksize=k, store_kmers=True)
# Consume sequence string
kct.consume(seq)
# drop all records with count > 1
kct.maxcut(1)
# dump_kmers() will return a list of (kmer,count) tuples
return { x for x, y in kct.dump_kmers(file=None)}
Would this do the job?
Oh fantastic! I'll try it out tomorrow. Looking forward to the optimized version.
Cool, let me know how the speed compares to khmer.
We still need to make consume() multithreaded as well. Probably only practical for bacterial genomes atm.
I hope to get a chance to work on oxli this sunday.
In the meantime, I would be surprised if oxli weren't faster than khmer when compiled in release mode (e.g. a wheel). khmer was not particularly optimized IIRC. But we can and should measure, and of course multithread!
Memory usage is a different issue... I would expect oxli to be nowhere near as memory efficient as khmer was. For now, at least.
I tried to use it using the pip
install instructions for developmers; I cloned the main branch.
There is no method called dump_kmers
so I couldn't compare. Let me know once this is implemented and I will happily do a quick benchmark.
Update:
I have also tried using the integrate_kmer_tracking
branch and received the following error when calling the dump_kmers
method:
ValueError: K-mer storage is disabled. No hash:kmer map is available.
I think you might have left out the store_kmers
option when creating the KmerCountTable
.
First, install oxli from the "integrate_kmer_tracking" feature branch.
# Clone this repo
git clone git@github.com:oxli-bio/oxli.git && cd oxli
# Create a dev env
conda env create -f environment.yml -n oxli
# Activate the new env
conda activate oxli
# Fetch list of remote branches
git fetch
# View all available branches
git branch -v -a
# Switch to the branch with kmer tracking features
git switch integrate_kmer_tracking
# Local install with testing dependancies
pip install -e '.[test]'
Test out the dump_kmers()
function. This will only work if you set store_kmers = True
when creating the KmerCountTable
# Create new table with kmer tracking enabled
kct = KmerCountTable(ksize=4, store_kmers=True)
seq = "AAAAAT"
# Consume sequence string
kct.consume(seq)
# Inspect kmers and counts
print(kct.dump_kmers(file=None, sortcounts=False, sortkeys=False))
# drop all records with count > 1
kct.maxcut(1)
#Inspect remaining kmers
print(kct.dump_kmers(file=None, sortcounts=False, sortkeys=False))
Note: There was a bug in integrate_kmer_tracking
branch that threw an error if you tried to call dump_kmers()
on a KmerCountTable
after removing records with maxcut()
. Bug #75. Should be fixed now.
Error:
PanicException: called `Option::unwrap()` on a `None` value
Let me know if the above doesn't work and instead try something similar to your original solution:
kct = KmerCountTable(ksize=4, store_kmers=True)
seq = "AAAAAT"
# Consume sequence string
kct.consume(seq)
{ x for x,y, in kct.dump_kmers() if y == 1}
To the original request, do we want to add a function to report all the canonical kmers in the table? @ctb
An equivalent to hashes()
, maybe kmers()
?
Atm I haven't been editing the hash_to_kmer
table to remove hash:kmer pairs when a key is dropped from the count table, so it is possible to have kmers in hash_to_kmer
that have a count of 0 or have been deleted. I don't really want to update the drop functions to also edit the kmer map.
Using dump_kmers()
as above gets around this issue by skipping any kmers that do not exist in the count table. Maybe this is good enough?
I was able to test the runtime for getting kmers. It did raise an error using maxcut
I haven't reinstalled with the fixed version yet. Here is the code I used:
from khmer import Countgraph
from oxli import KmerCountTable
def khmer_test(seq:str, k:int) -> set[str]:
"""runs khmer
Args:
seq (str): sequence to get kmers
k (int): the kmer length
Returns:
set[str]: a set of kmers that appear once
"""
# create the count graph
kh = Countgraph(k, 1e7, 1)
# consume the sequences so kmers can be counted
kh.consume(seq)
# extract the kmers that appear once
out = {x for x in kh.get_kmers(seq) if kh.get(x) == 1}
def oxli_test(seq:str, k:int) -> set[str]:
"""runs oxli in a similar manner to how i use khmer
Args:
seq (str): sequence to get kmers
k (int): the kmer length
Returns:
set[str]: a set of kmers that appear once
"""
# create the count table
kt = KmerCountTable(k, store_kmers=True)
# consume the sequence
kt.consume(seq)
# get the kmers that appear once
out = {x for x,y in kt.dump_kmers(file=None) if y == 1}
I wrote another program (not included) to time runs on three different E. coli genomes using these two methods.
Here are the genomes I used:
genome name | NCBI accession |
---|---|
i1.gbff | GCF_001650295.1 |
i2.gbff | GCF_008727175.1 |
i3.gbff | GCF_002208865.2 |
Here are the results I obtained. Times in ss.ms
format:
genome | program | runtime |
---|---|---|
i1.gbff | khmer | 7.77 |
i1.gbff | oxli | 12.61 |
i2.gbff | khmer | 7.79 |
i2.gbff | oxli | 11.66 |
i3.gbff | khmer | 7.79 |
i3.gbff | oxli | 11.18 |
Do you expect the maxcut
way of doing things to be significantly faster? If so, I can repeat this with that implementation.
Something I noticed is that oxli
finds different kmers than khmer
. It appears that oxli
considers only cannonical kmers while khmer
finds all kmers.
I have deleted my previous comment because it was unnecessarily complicated. Here is a simple explanation of what I am seeing.
from Bio.Seq import Seq
from khmer import Countgraph
from oxli import KmerCountTable
# set values to work with
k = 20
fwd = 'ATTCCGGAACCTTGAGAGATTCACGG'
rev = str(Seq(fwd).reverse_complement())
# make graphs
fcg = Countgraph(k, 1e7, 1)
rcg = Countgraph(k, 1e7, 1)
kct = KmerCountTable(k, store_kmers=True)
# consume sequences; khmer requires both
fcg.consume(fwd)
rcg.consume(rev)
kct.consume(fwd)
# get the forward and reverse kmers using khmer
f_kmers = [x for x in fcg.get_kmers(fwd) if fcg.get(x) == 1]
r_kmers = [x for x in rcg.get_kmers(rev) if rcg.get(x) == 1]
# get the cannonical kmers using oxli
o_kmers = [x for x,y in kct.dump_kmers(file=None) if y == 1]
Here are the contents of the lists of kmers:
>>> o_kmers
['CGGAACCTTGAGAGATTCAC',
'CGTGAATCTCTCAAGGTTCC',
'CCGTGAATCTCTCAAGGTTC',
'CCGGAACCTTGAGAGATTCA',
'GAATCTCTCAAGGTTCCGGA',
'ATCTCTCAAGGTTCCGGAAT',
'AATCTCTCAAGGTTCCGGAA']
>>> f_kmers
['ATTCCGGAACCTTGAGAGAT',
'TTCCGGAACCTTGAGAGATT',
'TCCGGAACCTTGAGAGATTC',
'CCGGAACCTTGAGAGATTCA',
'CGGAACCTTGAGAGATTCAC',
'GGAACCTTGAGAGATTCACG',
'GAACCTTGAGAGATTCACGG']
>>> r_kmers
['CCGTGAATCTCTCAAGGTTC',
'CGTGAATCTCTCAAGGTTCC',
'GTGAATCTCTCAAGGTTCCG',
'TGAATCTCTCAAGGTTCCGG',
'GAATCTCTCAAGGTTCCGGA',
'AATCTCTCAAGGTTCCGGAA',
'ATCTCTCAAGGTTCCGGAAT']
Is it possible to get all kmers instead of just cannonical kmers? In my use case, it would be more useful to get all 14 kmers instead of just the 7 cannonical ones.
It might be a little faster to do maxcut with oxli if you you want to grab the latest version of that branch.
I think the extra time is in calculating the revcomp and then choosing the canonical kmer in oxli. This should be addressed in #70.
Only storing canonical kmers is pretty well baked into oxli, so I think the best path is to revcomp the kmers from dump_kmers()
and add them to your final set.
Only storing canonical kmers is pretty well baked into oxli, so I think the best path is to revcomp the kmers from dump_kmers() and add them to your final set.
I see. If that is the case, then I don't think oxli
will suit my purposes. I want kmers that appear exactly once on both strands.
The way I currently use khmer
is like this:
# pseudocode
fwd = importSeq(fn)
rev = fwd.reverse_complement()
# use khmer to get kmers that appear once for each strand
fwd_kmers = getKmersThatAppearOnce(str(fwd))
rev_kmers = getKmersThatAppearOnce(str(rev))
# get all kmers that appear once across both strands
all_kmers = fwd_kmers.symmetric_difference(rev_kmers)
To do the same with oxli
, I believe you are proposing something like this:
from Bio.Seq import Seq
# pseudocode
fwd = importSeq(fn)
# use oxli to get kmers that appear once
cannon_kmers = getKmersThatAppearOnce(str(fwd))
# get all kmers that appear once across both strands
all_kmers = cannon_kmers.union({str(Seq(x).reverse_complement()) for x in cannon_kmers}
when i apply these strategies to real sequence data (E. coli genome), i do not get equivalent sets. In my test, the number of kmers I got from the khmer
method is 5,829,972. The number of kmers I got from the oxli
method is 9,862,795. The oxli
set contains all the kmers that the khmer
set contains, but it also includes 4,032,823 additional kmers. Presumably these kmers appear more than once.
can you think of a way for me to resolve this issue? do you think i am approaching this the wrong way?
quick update as well.
i ran the two different oxli
approaches on the same genbank files in my previous comment. oxli_1
is the same way as before. oxli_2
is the maxcut
way. I also got kmers for both strands for both khmer
and oxli
approaches.
genome | program | time (ss.ms) |
---|---|---|
i1.gbff | khmer | 16.89 |
i1.gbff | oxli_1 | 38.06 |
i1.gbff | oxli_2 | 24.62 |
i2.gbff | khmer | 14.50 |
i2.gbff | oxli_1 | 27.50 |
i2.gbff | oxli_2 | 28.95 |
i3.gbff | khmer | 13.54 |
i3.gbff | oxli_1 | 27.97 |
i3.gbff | oxli_2 | 28.46 |
Hi Joe, thanks for the update.
Re your example code, below. When you use oxli to get cannon_kmers
you have a set of all the kmers that genuinely occur once in the sequence (that is, considering both strands).
When you reverse complement the cannonical kmer set and then take the union, you will add in the rc for all the non-palindromic kmers. Palindromes will necessarily only occur once in a set.
from Bio.Seq import Seq # pseudocode fwd = importSeq(fn) # use oxli to get kmers that appear once cannon_kmers = getKmersThatAppearOnce(str(fwd)) # get all kmers that appear once across both strands all_kmers = cannon_kmers.union({str(Seq(x).reverse_complement()) for x in cannon_kmers}
This should only really end up giving you fewer kmers than the khmer method, as khmer is also allowing kmers that occur once on the fwd and once on the reverse.
Do you see the same final count with both the maxcut()
and the manual filter methods?
Could you provide a minimal sequence that replicates this issue? I will try to figure out where the difference is coming from.
yes i see the same using maxcut()
. the results sets produced are equivalent with both methods.
something weird is definitely going on. get
methods are not equivalent between oxli
and khmer
from oxli import KmerCountTable
from khmer import Countgraph
from Bio.Seq import Seq
# constant
K = 20
# pseudo code
fwd = importSeq(fn)
# get oxli kmers
kct = KmerCountTable(K, store_kmers=True)
kct.consume(fwd)
kct.maxcut(1)
oxli_kmers = {x for x,y in kct.dump_kmers(file=None)}
# get khmer countgraph
fcg = Countgraph(K, 1e7, 1)
fcg.consume(fwd)
# save sets of kmers for each count
counts = dict()
for kmer in oxli_kmers:
# get the count of the kmer in the forward strand using the `khmer` countgraph
c = fcg.get(kmer)
# save the kmer for this count; create an empty setif need be
counts[c] = counts.get(c, set())
counts[c].add(kmer)
# print
for k in sorted(counts.keys()):
print(f'{k}:\t{len(counts[k])}')
When I printed the results, I had 37 different values for counts:
1: 2914995
2: 1437662
3: 423026
4: 104073
5: 26848
6: 8889
7: 4166
8: 4949
9: 2852
10: 1203
11: 733
12: 767
13: 353
14: 130
15: 53
16: 9
17: 19
18: 113
19: 103
20: 73
21: 46
22: 60
23: 38
24: 105
25: 63
26: 40
27: 21
28: 5
29: 2
30: 2
31: 1
33: 2
48: 1
50: 1
55: 1
60: 1
65: 3
i do not yet have a minimal dataset for you to evaluate.
here is a sequence and code that causes some sort of issue.
from Bio.Seq import Seq
from khmer import Countgraph
from oxli import KmerCountTable
# constants; 10,000bp sequences
FWD = 'GTATACAGATCGTGCGATCTACTGTGGATAACTCTGTCAGGAAGCTTGGATCAACCGGTAGTTATCCAAAGAACAACCGTTGTTCAGTTTTTGAGTTGTGTATAACCCCTCATTCTGATCCCAGCTTATACGGTCCAGGATCACCGATCATTCACAGTTAATGATCCTTTCCAGGTTGTTGATCTTAAAAGCCGGATCCTTGTTATCCACAGGGCAGTGCGATCCTAATAAGAGATAACAATAGAACAGATCTCTAAATAAATAGATCTTCTTTTTAATACCCAGGATCCCAGGTCTTTCTCAAGCCGACAAAGTTGAGTAGAATCCACGGCCCGGGCTTCAATCCATTTTCATACCGCGTTATGCGAGGCAATCACCATGTTTTATCCGGATCCTTTTGACGTCATCATCATTGGCGGGGGTCATGCAGGCACCGAGGCCGCGATGGCTGCGGCGCGTATGGGTCAACAGACTCTGCTTTTGACACACAATATCGACACTCTGGGGCAGATGAGCTGCAACCCGGCGATCGGCGGTATTGGGAAGGGACATCTGGTAAAAGAAGTGGATGCACTCGGCGGTCTGATGGCGAAAGCGATCGATCAGGCGGGTATCCAGTTTAGGATACTAAACGCAAGCAAGGGACCGGCAGTTCGCGCTACCCGAGCTCAGGCGGATCGTGTGCTCTACCGTCAGGCGGTACGTACGGCGCTGGAGAACCAACCGAACCTGATGATCTTCCAGCAGGCGGTTGAAGATCTTATTGTCGAAAACGATCGTGTGGTCGGTGCTGTTACCCAAATGGGACTGAAGTTCCGTGCCAAAGCCGTCGTGCTCACCGTTGGGACGTTCCTCGACGGTAAAATTCATATCGGTCTGGATAACTACAGCGGTGGCCGTGCTGGTGATCCGCCGTCCATTCCGCTTTCTCGCCGTTTGCGTGAACTTCCGCTGCGCGTTGGTCGTCTGAAAACCGGGACACCACCGCGTATTGATGCTCGAACCATCGACTTTAGCGTGCTTGCGCAACAGCATGGCGATAACCCAATGCCGGTATTCTCTTTTATGGGCAATGCGTCCCAGCATCCACAGCAGGTGCCGTGTTATATCACCCATACCAACGAGAAAACCCATGATGTGATCCGCAGTAACCTCGATCGTAGCCCAATGTACGCAGGGGTGATCGAAGGTGTCGGCCCACGCTACTGCCCGTCGATCGAAGACAAAGTCATGCGCTTCGCCGACAGAAATCAGCATCAGATCTTCCTTGAACCGGAAGGGCTGACTTCTAACGAAATTTATCCGAACGGTATCTCCACCAGCCTGCCGTTCGATGTGCAGATGCAAATCGTCCGCTCTATGCAGGGGATGGAAAACGCGAAGATCGTGCGTCCGGGTTATGCCATTGAGTATGACTTCTTCGATCCTCGCGACCTGAAACCGACGCTGGAGAGCAAGTTTATCCAGGGGCTGTTCTTTGCTGGTCAGATTAACGGCACTACCGGTTACGAAGAAGCCGCTGCGCAAGGTTTGCTGGCCGGTCTTAACGCTGCCCGTCTGTCTGATGACAAAGAAGGTTGGGCTCCGGCGCGTTCTCAGGCCTATCTCGGCGTACTGGTTGATGACCTGTGCACTTTAGGAACCAAAGAACCGTATCGTATGTTTACCTCGCGCGCAGAATATCGTCTGATGCTGCGCGAAGATAATGCGGATCTGCGTTTGACTGAAATCGGTCGTGAACTGGGCCTGGTGGATGACGAACGTTGGGCGCGCTTTAACGAGAAACTTGAGAATATCGAGCGTGAGCGTCAGCGTCTGAAATCGACCTGGGTAACCCCGTCGGCGGAAGCTGCAGCCGAAGTGAATGCTCACCTGACTGCGCCGCTTTCCCGTGAAGCCAGTGGTGAAGATCTGCTGCGTCGTCCGGAAATGACTTATGAAAAATTAACCACGCTGACGCCGTTTGCCCCTGCGTTGACAGACGAACAGGCGGCGGAACAGGTTGAGATTCAGGTTAAATACGAAGGTTATATCGCGCGCCAGCAAGATGAGATCGAAAAGCAGCTGCGTAACGAGAACACCTTGCTACCCGCGACACTGGATTACCGCCAGGTATCCGGTCTTTCTAACGAAGTGATCGCCAAACTTAACGATCACAAACCGGCCTCTATCGGCCAGGCTTCACGTATTTCTGGCGTCACGCCTGCGGCCATCTCCATTCTGCTGGTGTGGCTGAAAAAACAGGGTATGCTGCGTCGTAGCGCATAACGCATTAAAAATGCCTGGTAAGCCCCCGCTTACCAGGCAACGCATCAAGAACAGGTAATCACCGTGCTCAACAAACTCTCCTTACTGCTGAAAGACGCAGGTATTTCGCTTACCGATCACCAGAAAAACCAGCTTATTGCCTACGTGAATATGCTGCATAAATGGAACAAAGCGTACAACCTGACTTCGGTCCGCGATCCTAATGAGATGCTGGTACGCCATATTCTCGATAGCATTGTGGTGGCACCGTATCTGCAAGGTGAACGGTTTATCGATGTCGGCACCGGTCCTGGACTGCCTGGCATTCCACTCTCTATCGTGCGTCCTGAAGCCCATTTCACTCTGTTGGATAGCCTTGGTAAACGCGTGCGTTTCCTTCGTCAGGTGCAACATGAGCTTAAACTGGAGAATATTGAGCCAGTACAGAGCAGGGTAGAAGAGTTTCCTTCAGAGCCGCCATTTGATGGCGTAATTAGCCGCGCTTTTGCCTCTCTGAACGATATGGTGAGCTGGTGCCACCATCTTCCTGGTGAGCAAGGCCGTTTCTACGCGCTGAAAGGGCAAATGCCGGAAGATGAAATCGCTTTGTTGCCCGAAGAATATCAGGTCGAATCAGTGGTTAAACTTCAGGTTCCAGCCCTGGATGGCGAACGTCATCTGGTGGTGATTAAAGCAAATAAAATTTAATTTTTATCAAAAAAATCATAAAAAATTGACCGGTTAGACTGTTAACAACAACCAGGTTTTCTACTGATATAACTGGTTACATTTAACGCCACGTTCACTCTTTTGCATCAACAAGATAACGTGGCTTTTTTTGGTAAGCAGAAAATAAGTCATTAGTGAAAATATCAGTCTGCTAAAAATCGGCGCTAAGAACCATCATTGGCTGTTAAAACATTATTAAAAATGTCAATGGGTGGTTTTTGTTGTGTAAATGTCATTTATTAAAACAGTATCTGTTTTTAGACTGAAATATCATAAACTTGCAAAGGCATCATTTGCCAAGTAAATAAATATGCTGTGCGCGAACATGCGCAATATGTGATCTGAAGCACGCTTTATCACCAGTGTTTACGCGTTATTTACAGTTTTTCATGATCGAACAGGGTTAGTAGAAAAGTCGCAATTGTATGCACTGGAAAAATATTTAAACATTTATTCACCTTTTGGCTACTTATTGTTTGAAATCACGGGGGCGCACCGTATAATTTGACCGCTTTTTGATGCTTGACTCTAAGCCTTAAAGAAAGTTTTATACGACACGCGGCATACCTCGAAGGGAGCAGGAGTGAAAAACGTGATGTCTGTGTCGCTCGTGAGTCGAAACGTTGCTCGGAAGCTTCTGCTCGTTCAGTTACTGGTGGTGATAGCAAGTGGATTGCTGTTCAGCCTCAAAGACCCCTTCTGGGGCGTCTCTGCAATAAGCGGGGGCCTGGCAGTCTTTCTGCCTAACGTTTTGTTTATGATATTTGCCTGGCGTCACCAGGCGCATACACCAGCGAAAGGCCGGGTGGCCTGGACATTCGCATTTGGTGAAGCTTTCAAAGTTCTGGCGATGTTGGTGTTACTGGTGGTGGCGTTGGCGGTTTTAAAGGCGGTATTCTTGCCGCTGATCGTTACGTGGGTTTTGGTGCTGGTGGTTCAGATACTGGCACCGGCTGTAATTAACAACAAAGGGTAAAAGGCATCATGGCTTCAGAAAATATGACGCCGCAGGATTACATAGGACACCACCTGAATAACCTTCAGCTGGACCTGCGTACATTCTCGCTGGTGGATCCACAAAACCCCCCAGCCACCTTCTGGACAATCAATATTGACTCCATGTTCTTCTCGGTGGTGCTGGGTCTGTTGTTCCTGGTTTTATTCCGTAGCGTAGCCAAAAAGGCGACCAGCGGTGTGCCAGGTAAGTTTCAGACCGCGATTGAGCTGGTGATCGGCTTTGTTAATGGTAGCGTGAAAGACATGTACCATGGCAAAAGCAAGCTGATTGCTCCGCTGGCCCTGACGATCTTCGTCTGGGTATTCCTGATGAACCTGATGGATTTACTGCCTATCGACCTGCTGCCGTACATTGCTGAACATGTACTGGGTCTGCCTGCACTGCGTGTGGTTCCGTCTGCGGACGTGAACGTAACGCTGTCTATGGCACTGGGCGTATTTATCCTGATTCTGTTCTACAGCATCAAAATGAAAGGCATCGGCGGCTTCACGAAAGAGTTGACGCTGCAGCCGTTCAATCACTGGGCGTTCATTCCTGTCAACTTAATCCTTGAAGGGGTAAGCCTGCTGTCCAAACCAGTTTCACTCGGTTTGCGACTGTTCGGTAACATGTATGCCGGTGAGCTGATTTTCATTCTGATTGCTGGTCTGTTGCCGTGGTGGTCACAGTGGATCCTGAATGTGCCGTGGGCCATTTTCCACATCCTGATCATTACGCTGCAAGCCTTCATCTTCATGGTTCTGACGATCGTCTATCTGTCGATGGCGTCTGAAGAACATTAATTTACCAACACTACTACGTTTTAACTGAAACAAACTGGAGACTGTCATGGAAAACCTGAATATGGATCTGCTGTACATGGCTGCCGCTGTGATGATGGGTCTGGCGGCAATCGGTGCTGCGATCGGTATCGGCATCCTCGGGGGTAAATTCCTGGAAGGCGCAGCGCGTCAACCTGATCTGATTCCTCTGCTGCGTACTCAGTTCTTTATCGTTATGGGTCTGGTGGATGCTATCCCGATGATCGCTGTAGGTCTGGGTCTGTACGTGATGTTCGCTGTCGCGTAGTAAGCGTTGCTTTTATTTAAAGAGCAATATCAGAACGTTAACTAAATAGAGGCATTGTGCTGTGAATCTTAACGCAACAATCCTCGGCCAGGCCATCGCGTTTGTCCTGTTCGTTCTGTTCTGCATGAAGTACGTATGGCCGCCATTAATGGCAGCCATCGAAAAACGTCAAAAAGAAATTGCTGACGGCCTTGCTTCCGCAGAACGAGCACATAAGGACCTTGACCTTGCAAAGGCCAGCGCGACCGACCAGCTGAAAAAAGCGAAAGCGGAAGCCCAGGTAATCATCGAGCAGGCGAACAAACGCCGCTCGCAGATTCTGGACGAAGCGAAAGCTGAGGCAGAACAGGAACGTACTAAAATCGTGGCCCAGGCGCAGGCGGAAATTGAAGCCGAGCGTAAACGTGCCCGTGAAGAGCTGCGTAAGCAAGTTGCTATCCTGGCTGTTGCTGGCGCCGAGAAGATCATCGAACGTTCCGTGGATGAAGCTGCTAACAGCGACATCGTGGATAAACTTGTCGCTGAACTGTAAGGAGGGAGGGGCTGATGTCTGAATTTATTACGGTAGCTCGCCCCTACGCCAAAGCAGCTTTTGACTTTGCCGTCGAACACCAAAGTGTAGAACGCTGGCAGGACATGCTGGCGTTTGCCGCCGAGGTAACCAAAAACGAACAAATGGCAGAGCTTCTCTCTGGCGCGCTTGCGCCAGAAACGCTTGCCGAGTCGTTTATCGCAGTTTGTGGTGAGCAACTGGACGAAAACGGTCAGAACCTTATTCGGGTAATGGCTGAAAATGGTCGTCTTAACGCGCTCCCGGATGTTTTGGAGCAGTTTATTCACCTTCGTGCCGTGAGTGAGGCTACCGCTGAGGTAGACGTCATTTCCGCTGCCGCACTGAGTGAACAACAGCTCGCGAAAATTTCTGCTGCGATGGAAAAACGTCTGTCACGCAAAGTTAAGCTGAATTGCAAAATCGATAAGTCTGTAATGGCAGGCGTTATCATCCGAGCGGGTGATATGGTCATTGATGGCAGCGTACGCGGTCGTCTTGAGCGCCTTGCAGACGTCTTGCAGTCTTAAGGGGACTGGAGCATGCAACTGAATTCCACCGAAATCAGCGAACTGATCAAGCAGCGCATTGCTCAGTTCAATGTTGTGAGTGAAGCTCACAACGAAGGTACTATTGTTTCTGTAAGTGACGGTGTTATCCGCATTCACGGCCTGGCCGATTGTATGCAGGGTGAAATGATCTCCCTGCCGGGTAACCGTTACGCTATCGCACTGAACCTCGAGCGCGACTCTGTTGGTGCGGTTGTTATGGGTCCGTATGCTGACCTTGCCGAAGGCATGAAAGTTAAGTGTACTGGCCGTATCCTGGAAGTTCCGGTTGGCCGTGGCCTGCTGGGCCGTGTGGTTAACACTCTGGGTGCACCAATCGACGGTAAAGGTCCGCTGGATCACGACGGCTTCTCTGCTGTAGAAGCAATCGCTCCGGGCGTTATCGAACGTCAGTCCGTAGATCAGCCGGTACAGACCGGTTATAAAGCCGTTGACTCCATGATCCCAATCGGTCGTGGTCAGCGTGAATTGATCATCGGTGACCGTCAGACCGGTAAAACCGCACTGGCTATCGATGCCATCATCAACCAGCGCGATTCCGGTATCAAATGTATCTATGTCGCTATCGGCCAGAAAGCGTCCACCATTTCTAACGTGGTACGTAAACTGGAAGAGCACGGCGCACTGGCTAACACCATCGTTGTGGTAGCAACCGCGTCTGAATCCGCTGCACTGCAATACCTGGCACCGTATGCCGGTTGCGCAATGGGCGAATACTTCCGTGACCGCGGTGAAGATGCGCTGATCATTTACGATGATCTGTCTAAACAGGCTGTTGCTTACCGTCAGATCTCGCTGCTGCTCCGTCGTCCGCCAGGACGTGAAGCATTCCCGGGCGACGTATTCTACCTCCACTCTCGTCTGCTGGAGCGTGCTGCGCGTGTTAACGCCGAATACGTTGAAGCTTTCACCAAAGGTGAAGTGAAAGGGAAAACCGGTTCACTGACCGCGCTGCCGATTATCGAAACTCAAGCGGGTGACGTTTCTGCGTTCGTTCCGACCAACGTAATCTCCATTACCGATGGTCAGATCTTCCTGGAAACCAACCTGTTCAACGCCGGTATTCGTCCTGCGGTTAACCCGGGTATTTCCGTATCCCGTGTTGGTGGTGCAGCACAGACCAAGATCATGAAAAAACTGTCCGGTGGTATCCGTACCGCTCTGGCACAGTATCGTGAACTGGCAGCGTTCTCTCAGTTTGCATCCGACCTTGACGATGCAACACGTAAGCAGCTTGACCACGGTCAAAAAGTGACCGAACTGCTGAAACAGAAACAGTATGCGCCGATGTCCGTTGCGCAGCAGTCTCTGGTTCTGTTCGCAGCAGAACGTGGTTACCTGGCGGATGTTGAACTGTCGAAAATCGGCAGCTTCGAAGCCGCTCTGCTGGCTTACGTCGACCGTGATCACGCTCCGTTGATGCAAGAGATCAACCAGACCGGTGGCTACAACGACGAAATCGAAGGCAAGCTGAAAGGCATCCTCGATTCCTTCAAAGCAACCCAATCCTGGTAACGTCTGGCGGCTTGCCTTAGGGCAGGCCGCAAGGCATTGAGGAGAAGCTCATGGCCGGCGCAAAAGAGATACGTAGTAAGATCGCAAGCGTCCAGAACACGCAAAAGATCACTAAAGCGATGGAGATGGTCGCCGCTTCCAAAATGCGTAAATCGCAGGATCGCATGGCGGCCAGCCGTCCTTATGCAGAAACCATGCGCAAAGTGATTGGTCACCTTGCACACGGTAATCTGGAATATAAGCACCCTTACCTGGAAGACCGCGACGTTAAACGCGTGGGCTACCTGGTGGTGTCGACCGACCGTGGTTTGTGCGGTGGTTTGAACATTAACCTGTTCAAAAAACTGCTGGCGGAAATGAAGACCTGGACCGACAAAGGCGTTCAATGCGACCTCGCAATGATCGGCTCGAAAGGCGTGTCGTTCTTCAACTCCGTGGGCGGCAATGTTGTTGCCCAGGTCACCGGCATGGGGGATAACCCTTCCCTGTCCGAACTGATCGGTCCGGTAAAAGTGATGTTGCAGGCCTACGACGAAGGCCGTCTGGACAAGCTTTACATTGTCAGCAACAAATTTATTAACACCATGTCTCAGGTTCCGACCATCAGCCAGCTGCTGCCGTTACCGGCATCAGATGATGATGATCTGAAACATAAATCCTGGGATTACCTGTACGAACCCGATCCGAAGGCGTTGCTGGATACCCTGCTGCGTCGTTATGTCGAATCTCAGGTTTATCAGGGCGTGGTTGAAAACCTGGCCAGCGAGCAGGCCGCCCGTATGGTGGCGATGAAAGCCGCGACCGACAATGGCGGCAGCCTGATTAAAGAGCTGCAGTTGGTATACAACAAAGCTCGTCAGGCCAGCATTACTCAGGAACTCACCGAGATCGTCTCGGGGGCCGCCGCGGTTTAAACAGGTTATTTCGTAGAGGATTTAAGATGGCTACTGGAAAGATTGTCCAGGTAATCGGCGCCGTAGTTGACGTCGAATTCCCTCAGGATGCCGTACCGCGCGTGTACGATGCTCTTGAGGTGCAAAATGGTAATGAGCGTCTGGTGCTGGAAGTTCAGCAGCAGCTCGGCGGCGGTATCGTGCGTACCATCGCAATGGGTTCCTCCGACGGTCTGCGTCGCGGTCTGGATGTAAAAGACCTCGAACACCCGATCGAAGTCCCGGTAGGTAAAGCGACTCTGGGCCGTATCATGAACGTACTGGGTGAACCGGTCGACATGAAAGGCGAGATCGGTGAAGAAGAGCGTTGGGCGATTCACCGCGCAGCACCTTCCTACGAAGAGCTGTCAAACTCTCAGGAACTGCTGGAAACCGGTATCAAAGTTATCGACCTGATGTGTCCGTTCGCTAAGGGCGGTAAAGTTGGTCTGTTCGGTGGTGCGGGTGTAGGTAAAACCGTAAACATGATGGAGCTCATTCGTAACATCGCGATCGAGCACTCCGGTTACTCTGTGTTTGCGGGCGTAGGTGAACGTACTCGTGAGGGGAACGACTTCTACCACGAAATGACCGACTCCAACGTTATCGATAAAGTATCCCTGGTGTATGGCCAGATGAACGAGCCGCCGGGAAACCGTCTGCGCGTAGCTCTGACCGGTCTGACCATGGCTGAGAAATTCCGTGACGAAGGTCGTGACGTTCTGCTGTTCGTTGACAACATCTATCGTTACACCCTGGCCGGTACGGAAGTATCCGCACTGCTGGGCCGTATGCCTTCAGCGGTAGGTTATCAGCCGACCCTGGCGGAAGAGATGGGCGTTCTGCAGGAACGTATCACCTCCACCAAAACCGGTTCTATCACCTCCGTACAGGCAGTATACGTACCTGCGGATGACTTGACTGACCCGTCTCCGGCAACCACCTTTGCGCACCTTGACGCAACCGTGGTACTGAGCCGTCAGATCGCGTCTCTGGGTATCTACCCGGCCGTTGACCCGCTGGACTCCACCAGCCGTCAGCTGGACCCGCTGGTGGTTGGTCAGGAACACTACGACACCGCGCGTGGCGTTCAGTCCATCCTGCAACGTTATCAGGAACTGAAAGACATCATCGCCATCCTGGGTATGGATGAACTGTCTGAAGAAGACAAACTGGTGGTAGCGCGTGCTCGTAAGATCCAGCGCTTCCTGTCCCAGCCGTTCTTCGTGGCAGAAGTATTCACCGGTTCTCCGGGTAAATACGTCTCCCTGAAAGACACCATCCGTGGCTTTAAAGGCATCATGGAAGGCGAATACGATCACCTGCCGGAGCAGGCGTTCTACATGGTCGGTTCCATCGAAGAAGCTGTGGA'
REV = str(Seq(FWD).reverse_complement())
K = 20
# functions
def get_khmer_kmers(fwd:str, rev:str, k:int) -> set[str]:
"""gets the kmers that appear once in the genome
Args:
fwd (str): the forward sequence
rev (str): the reverse sequence
k (int): the kmer length
Returns:
set[str]: kmers that appear once in the genome
"""
# make countgraphs
fcg = Countgraph(k, 1e7, 1)
rcg = Countgraph(k, 1e7, 1)
# consume sequences
fcg.consume(fwd)
rcg.consume(rev)
# get kmers that appear once on each strand; ignore chimeras made at junctions
out = {x for x in fcg.get_kmers(fwd) if fcg.get(x) == 1}
rks = {x for x in rcg.get_kmers(rev) if rcg.get(x) == 1}
# only keep kmers that appear once across both strands
out.symmetric_difference_update(rks)
return out
def get_oxli_kmers(seq:str, k:int) -> set[str]:
"""gets the kmers that appears once in the genome
Args:
seq (str): the sequence
k (int): the kmer length
Returns:
set[str]: kmers that appear once in the genome
"""
# build count table
kct = KmerCountTable(k, store_kmers=True)
kct.consume(seq)
# only keep kmers that appear once
kct.maxcut(1)
# get the kmers and add reverse complements to the set
out = {x for x,y in kct.dump_kmers(file=None)}
out.update({str(Seq(x).reverse_complement()) for x in out})
return out
# test case
khmer_kmers = get_khmer_kmers(FWD, REV, K)
oxli_kmers = get_oxli_kmers(FWD, K)
Examining the sets reveals the following information:
>>> len(khmer_kmers)
19950
>>> len(oxli_kmers)
19962
>>> oxli_kmers.difference(khmer_kmers)
{'AACAACAGACCCAGCACCAC',
'ATCGCAGCAGAAATTTTCGC',
'CAACCCGGCGATCGGCGGTA',
'CGGGCAGTAGCGTGGGCCGA',
'GAGTACGCAGCAGAGGAATC',
'GATTCCTCTGCTGCGTACTC',
'GCGAAAATTTCTGCTGCGAT',
'GTGGTGCTGGGTCTGTTGTT',
'TAACGCGGTATGAAAATGGA',
'TACCGCCGATCGCCGGGTTG',
'TCCATTTTCATACCGCGTTA',
'TCGGCCCACGCTACTGCCCG'}
I hope this helps.
Notes:
.update
with .symmetric_difference_update
in the get_oxli_kmers
function does not change the results in any way.Countgraph.get
method reports that all twelve of these kmers appear twice in the sequences.khmer
. when i do FWD.count(kmer)
and REV.count(kmer)
for each of those 12 kmers, they only appear once on one strand. Not sure why khmer
is saying they appear twice.@dr-joe-wirth thanks for preparing this example.
@ctb I'm not familiar with the workings of khmer, can you take a look at this? Happy to jump in and chase down bugs if you have a lead.
The first thing that comes to mind is that khmer has false positives - the countgraph will in some cases overcount things. @dr-joe-wirth if you change the 1e7 to 1e8, you should see fewer overcounts.
Reading through the rest of the comments now ;)
More comments: regarding choosing the canonical k-mer, this is implemented on the oxli side, not in the hashing mechanism we're taking from sourmash. So I think it's not that difficult to expose all k-mers, not just the canonical k-mers.
The speed is a different problem. I'm surprised (and unhappy) that we're slower than khmer!
so I may start there... with the speed...
so I may start there... with the speed...
wonderful. i am really excited for this package.
@ctb I can work on supporting an "all-kmers" mode if you want to open a new issue.
Hello,
I use
khmer
inprimerForge
to retrieve kmers that appear exactly once in the genome My current workflow is this:I am very interested in replacing
khmer
in my program withoxli
, but currentlyoxli
does not have aget_kmers
method.I'd really like if you could implement this. Or at least if you could add how to get all the kmers in a sequence to the documentation if this is not something you wish to implement in
oxli
.Thanks so much for building this!
edit: I simplified the code to more clearly demonstrate my use case