refresh-bio / KMC

Fast and frugal disk based k-mer counter
263 stars 73 forks source link

how to get counts of selected kmers #43

Closed DaveRoe closed 6 years ago

DaveRoe commented 6 years ago

What is the best way to obtain a count of specified collection of kmers. Currently, I do a 'dump' and then extract the ones I want. Is there a better way?

marekkokot commented 6 years ago

Hi,

It is definitely a better (and faster) way. There is an C++ API to access KMC database (https://github.com/refresh-bio/KMC/blob/master/API.pdf). You should open the database in RandomAccess Mode:

bool OpenForRA(const std::string &file_name)

Then you may use:

bool CheckKmer(CKmerAPI &kmer, uint32 &count)

If you want I may prepare some example code for you.

On the other hand, if the collection of query k-mers is of similar size as the database itself, it may be faster to use kmc_tools. It requires having a query collection in KMC database format (but maybe it is your case already?).

If you could describe your use case more detailed, I could possibly suggest the best option.

Anyway, thanks for using KMC.

DaveRoe commented 6 years ago

I find that your second suggestion works for me. I make a database from my probes, then intersect with the larger database, and then dump the intersection. It seems to give me what I need in an efficient manner. Thank you for your help!

zhenzhenyang-psu commented 4 years ago

Hello there, I would like to extract the counts for a set of kmers of interest. The kmcdb from the input fastq files that I provided is so huge that it contains millions of kmer sequences and their counts. However, I am only interested in extracting the counts for ~4000 kmer sequences. I wanted to make a database using the kmer sequences of these ~4000 ones, however, I only seen from the manual of making kmc database using fastq reads.

Would it be possible for you to prepare some example code? thanks very much, best, Zhenzhen

marekkokot commented 4 years ago

Hi, you are right, it must be in fastq/fasta format. I attach an example (I assume Linux) how you may achieve what you need. example.zip

there is run.py which performs what is needed. First, it counts k-mers of input.fa, the result is stored in kmcdb (it is your huge database equivalent). In a file kmers-of-interests.txt there are k-mers you are looking for. After running ./run.py there will be result.txt which contains k-mers of interests (if present in kmcbd) with counters from kmcdb. If performance is your concernt it may be still more adequate to use KMC API CheckKmer method. If your database is so huge it does not fit into memory (although I think it should for millions of k-mers), you may open a database in listing mode and check each k-mer against set of k-mers of interests. In fact, it is hard to predict which method is faster: like in my example or using KMC API.

Let me know if it helps.

zhenzhenyang-psu commented 4 years ago

Hello there, Thanks for providing such a useful script and excellent test input and output.

It is impressively fast. However, I compared the results with those from manual search of the Kmer sequences from fasta sequences, the counts exported seems off.

The actual results.txt file is as follows:

result.txt

AAAAAAAAAAAAAAAAAAAGGTCAGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTCTGGGAGGCCGAGGCGGATGGAGCACGCGGTCAC 3 AAAAAAAAAAAAAAAAAAAGGTCAGGTGCGGTGGCTCACGCCTGTGATCCCAGCACTCTGGGAGGCCGAGGCGGATGGAGCACGCGGTCAC 5 AAAAAAAAAAAAAAAAAAGAACATGAGCTGCTCCGAAAACACATGCCCTCAGCACTGCATTCTGCACAATGGGACTGCTCCGAAAACGCAT 2 AAAAAAAAAAAAAAAAAAGAATAGCTACTTGTGCTCACTTTTGGTCTTCATTTGCATAGAATACCTTTTTCCACCCTTTTTCCTTGAGTTT 3 AAAAAAAAAAAAAAAAAAGAATAGCTACTTGTGCTCACTTTTGGTTTTCATTTGCATAGAATACCTTTTTCCACCCTTTTTCCTTGAGTTT 2 AAAAAAAAAAAAGGCCAGGCACGGTGGCTCACGCCTGTAATCCCACCACTTTGGGAGGCCAAGGTGGGCGGATCACGAGATCAGGAGATCG 2 AAAAAAAAAAAGAAAAATACTCTTGCTATTGTAATTATAATTTTGCAAAAACCAGTTGTTATTGATGCTGTGCTGTTCTATGCTGACAAGA 2 AAAAAAAAAAAGAAAAATACTCTTGCTATTGTAATTATAATTTTGTAAAAACCAGTTGTTATTGATGCTGTGCTGTTCTATGCTGACAAGA 4 AAAAAAAAAAGAAAAAAATAGACTATGCCTTTCTCTAAAATCCAGAGACAGGAAAGGAATCGTGTAGTGATGTGTGCAGGTGGCAGGATCA 4 AAAAAAAAAAGAAAAAAATAGACTATGCCTTTCTCTAAAATCCAGGGACAGGAAAGGAATCGTGTAGTGATGTGTGCAGGTGGCAGGATCA 4 AAAAAAAAAAGGGGAGTTATTAATTTTATTAATCACATGATGGAGGCTTTCTGGGGAGAACAGGGTGGCTCCCAAGCAGGTCCAAGCATGG 3 AAAAAAAAAAGGGGAGTTATTAATTTTATTAATCACATGATGGAGTCTTTCTGGGGAGAACAGGGTGGCTCCCAAGCAGGTCCAAGCATGG 2 AAAAAAAAAATTAAATATTTCTTTAGTATCATTTCAATGGGCTTTCAGGAGCAGAGATAAGATGTCTTAATCAGAGATGCACCGTACAACT 2 AAAAAAAAAATTAAATATTTCTTTAGTATCATTTCAATGGGCTTTGAGGAGCAGAGATAAGATGTCTTAATCAGAGATGCACCGTACAACT 3 AAAAAAAGTACAGGCCAGTATTTCTGATGAATATAAATATAAAAAAATCCTCCAAAAACACTCGCATACCCAATCCAATAGCACATCAAAA 2 AAAAAAAGTACAGGCCAGTATTTCTGATGAATATAAATATAAAAATATCCTCCAAAAACACTCGCATACCCAATCCAATAGCACATCAAAA 4 AAAAAAAGTTATCACTACTATCATAAGAGACTGATATATATGAAACCAAAGAGTAAGTAGCGAAAGAACTAAAATTCTGAAAATATTTTTA 3 AAAAAAAGTTATCACTACTATCATAAGAGACTGATATATATGAAATCAAAGAGTAAGTAGCGAAAGAACTAAAATTCTGAAAATATTTTTA 2 . Based on my result from manual search, expected results.txt should be:

expected result.txt

AAAAAAAAAAAAAAAAAAAGGTCAGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTCTGGGAGGCCGAGGCGGATGGAGCACGCGGTCAC 10 AAAAAAAAAAAAAAAAAAAGGTCAGGTGCGGTGGCTCACGCCTGTGATCCCAGCACTCTGGGAGGCCGAGGCGGATGGAGCACGCGGTCAC 12 AAAAAAAAAAAAAAAAAAGAACATGAGCTGCTCCGAAAACACATGCCCTCAGCACTGCATTCTGCACAATGGGACTGCTCCGAAAACGCAT 12 AAAAAAAAAAAAAAAAAAGAATAGCTACTTGTGCTCACTTTTGGTCTTCATTTGCATAGAATACCTTTTTCCACCCTTTTTCCTTGAGTTT 10 AAAAAAAAAAAAAAAAAAGAATAGCTACTTGTGCTCACTTTTGGTTTTCATTTGCATAGAATACCTTTTTCCACCCTTTTTCCTTGAGTTT 12 AAAAAAAAAAAAGGCCAGGCACGGTGGCTCACGCCTGTAATCCCACCACTTTGGGAGGCCAAGGTGGGCGGATCACGAGATCAGGAGATCG 10 AAAAAAAAAAAGAAAAATACTCTTGCTATTGTAATTATAATTTTGCAAAAACCAGTTGTTATTGATGCTGTGCTGTTCTATGCTGACAAGA 12 AAAAAAAAAAAGAAAAATACTCTTGCTATTGTAATTATAATTTTGTAAAAACCAGTTGTTATTGATGCTGTGCTGTTCTATGCTGACAAGA 10 AAAAAAAAAAGAAAAAAATAGACTATGCCTTTCTCTAAAATCCAGAGACAGGAAAGGAATCGTGTAGTGATGTGTGCAGGTGGCAGGATCA 10 AAAAAAAAAAGAAAAAAATAGACTATGCCTTTCTCTAAAATCCAGGGACAGGAAAGGAATCGTGTAGTGATGTGTGCAGGTGGCAGGATCA 12 AAAAAAAAAAGGGGAGTTATTAATTTTATTAATCACATGATGGAGGCTTTCTGGGGAGAACAGGGTGGCTCCCAAGCAGGTCCAAGCATGG 12 AAAAAAAAAAGGGGAGTTATTAATTTTATTAATCACATGATGGAGTCTTTCTGGGGAGAACAGGGTGGCTCCCAAGCAGGTCCAAGCATGG 10 AAAAAAAAAATTAAATATTTCTTTAGTATCATTTCAATGGGCTTTCAGGAGCAGAGATAAGATGTCTTAATCAGAGATGCACCGTACAACT 10 AAAAAAAAAATTAAATATTTCTTTAGTATCATTTCAATGGGCTTTGAGGAGCAGAGATAAGATGTCTTAATCAGAGATGCACCGTACAACT 12 AAAAAAAGTACAGGCCAGTATTTCTGATGAATATAAATATAAAAAAATCCTCCAAAAACACTCGCATACCCAATCCAATAGCACATCAAAA 12 AAAAAAAGTACAGGCCAGTATTTCTGATGAATATAAATATAAAAATATCCTCCAAAAACACTCGCATACCCAATCCAATAGCACATCAAAA 10 AAAAAAAGTTATCACTACTATCATAAGAGACTGATATATATGAAACCAAAGAGTAAGTAGCGAAAGAACTAAAATTCTGAAAATATTTTTA 10 AAAAAAAGTTATCACTACTATCATAAGAGACTGATATATATGAAATCAAAGAGTAAGTAGCGAAAGAACTAAAATTCTGAAAATATTTTTA 12 AAAAAAAAAAAAAAAAAAGAACATGAGCTGCTCCGAAAACACATGTCCTCAGCACTGCATTCTGCACAATGGGACTGCTCCGAAAACGCAT 10 AAAAAAAAAAAAGGCCAGGCACGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCGGATCACGAGATCAGGAGATCG 10

Do you know why the result may be unexpected? thanks much, Zhenzhen


From: marekkokot notifications@github.com Sent: Monday, July 13, 2020 4:57 AM To: refresh-bio/KMC KMC@noreply.github.com Cc: zhenzhenyang-psu yangzhenzhen1988@gmail.com; Comment comment@noreply.github.com Subject: Re: [refresh-bio/KMC] how to get counts of selected kmers (#43)

Hi, you are right, it must be in fastq/fasta format. I attach an example (I assume Linux) how you may achieve what you need. example.ziphttps://github.com/refresh-bio/KMC/files/4911300/example.zip

there is run.py which performs what is needed. First, it counts k-mers of input.fa, the result is stored in kmcdb (it is your huge database equivalent). In a file kmers-of-interests.txt there are k-mers you are looking for. After running ./run.py there will be result.txt which contains k-mers of interests (if present in kmcbd) with counters from kmcdb. If performance is your concernt it may be still more adequate to use KMC API CheckKmer method. If your database is so huge it does not fit into memory (although I think it should for millions of k-mers), you may open a database in listing mode and check each k-mer against set of k-mers of interests. In fact, it is hard to predict which method is faster: like in my example or using KMC API.

Let me know if it helps.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/refresh-bio/KMC/issues/43#issuecomment-657421353, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACFCJW2YI5L4QQZOQZA3KCTR3LD5ZANCNFSM4EGOZJTA.

marekkokot commented 4 years ago

Hi, wow, really long k-mers. Well, I don't know why the result is incorrect :( Are your input fastq files publicly available? If yes could you point me how to download it and how you count k-mers. If not, then maybe you are able to reproduce this behavior on some toy example. I am afraid without the possibility to reproduce I am not able to help.

zhenzhenyang-psu commented 4 years ago

Hello, Here is how I downloaded the data:

wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb/m64012_190920_173625.Q20.fastq wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb/m64012_190921_234837.Q20.fastq

I combined the two fastq files and converted them to fasta while running your python script.

and here is the content of the kmers-of-interests.txt:

AAAAAAAAAAAAAAAAAAAGGTCAGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTCTGGGAGGCCGAGGCGGATGGAGCACGCGGTCAC AAAAAAAAAAAAAAAAAAAGGTCAGGTGCGGTGGCTCACGCCTGTGATCCCAGCACTCTGGGAGGCCGAGGCGGATGGAGCACGCGGTCAC AAAAAAAAAAAAAAAAAAGAACATGAGCTGCTCCGAAAACACATGCCCTCAGCACTGCATTCTGCACAATGGGACTGCTCCGAAAACGCAT AAAAAAAAAAAAAAAAAAGAATAGCTACTTGTGCTCACTTTTGGTCTTCATTTGCATAGAATACCTTTTTCCACCCTTTTTCCTTGAGTTT AAAAAAAAAAAAAAAAAAGAATAGCTACTTGTGCTCACTTTTGGTTTTCATTTGCATAGAATACCTTTTTCCACCCTTTTTCCTTGAGTTT AAAAAAAAAAAAGGCCAGGCACGGTGGCTCACGCCTGTAATCCCACCACTTTGGGAGGCCAAGGTGGGCGGATCACGAGATCAGGAGATCG AAAAAAAAAAAGAAAAATACTCTTGCTATTGTAATTATAATTTTGCAAAAACCAGTTGTTATTGATGCTGTGCTGTTCTATGCTGACAAGA AAAAAAAAAAAGAAAAATACTCTTGCTATTGTAATTATAATTTTGTAAAAACCAGTTGTTATTGATGCTGTGCTGTTCTATGCTGACAAGA AAAAAAAAAAGAAAAAAATAGACTATGCCTTTCTCTAAAATCCAGAGACAGGAAAGGAATCGTGTAGTGATGTGTGCAGGTGGCAGGATCA AAAAAAAAAAGAAAAAAATAGACTATGCCTTTCTCTAAAATCCAGGGACAGGAAAGGAATCGTGTAGTGATGTGTGCAGGTGGCAGGATCA AAAAAAAAAAGGGGAGTTATTAATTTTATTAATCACATGATGGAGGCTTTCTGGGGAGAACAGGGTGGCTCCCAAGCAGGTCCAAGCATGG AAAAAAAAAAGGGGAGTTATTAATTTTATTAATCACATGATGGAGTCTTTCTGGGGAGAACAGGGTGGCTCCCAAGCAGGTCCAAGCATGG AAAAAAAAAATTAAATATTTCTTTAGTATCATTTCAATGGGCTTTCAGGAGCAGAGATAAGATGTCTTAATCAGAGATGCACCGTACAACT AAAAAAAAAATTAAATATTTCTTTAGTATCATTTCAATGGGCTTTGAGGAGCAGAGATAAGATGTCTTAATCAGAGATGCACCGTACAACT AAAAAAAGTACAGGCCAGTATTTCTGATGAATATAAATATAAAAAAATCCTCCAAAAACACTCGCATACCCAATCCAATAGCACATCAAAA AAAAAAAGTACAGGCCAGTATTTCTGATGAATATAAATATAAAAATATCCTCCAAAAACACTCGCATACCCAATCCAATAGCACATCAAAA AAAAAAAGTTATCACTACTATCATAAGAGACTGATATATATGAAACCAAAGAGTAAGTAGCGAAAGAACTAAAATTCTGAAAATATTTTTA AAAAAAAGTTATCACTACTATCATAAGAGACTGATATATATGAAATCAAAGAGTAAGTAGCGAAAGAACTAAAATTCTGAAAATATTTTTA AAAAAAAAAAAAAAAAAAGAACATGAGCTGCTCCGAAAACACATGTCCTCAGCACTGCATTCTGCACAATGGGACTGCTCCGAAAACGCAT AAAAAAAAAAAAGGCCAGGCACGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCGGATCACGAGATCAGGAGATCG

thanks for taking a look. I really appreciate your help. Zhenzhen


From: marekkokot notifications@github.com Sent: Monday, July 13, 2020 12:29 PM To: refresh-bio/KMC KMC@noreply.github.com Cc: zhenzhenyang-psu yangzhenzhen1988@gmail.com; Comment comment@noreply.github.com Subject: Re: [refresh-bio/KMC] how to get counts of selected kmers (#43)

Hi, wow, really long k-mers. Well, I don't know why the result is incorrect :( Are your input fastq files publicly available? If yes could you point me how to download it and how you count k-mers. If not, then maybe you are able to reproduce this behavior on some toy example. I am afraid without the possibility to reproduce I am not able to help.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/refresh-bio/KMC/issues/43#issuecomment-657660905, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACFCJW3YOZOPRUWLXU2ZGK3R3MY6ZANCNFSM4EGOZJTA.

marekkokot commented 4 years ago

Hi,

Ok I am downloading the data. I am not sure why you converted and merged those files into one fasta? Could you also give me exact command line that you used for k-mer confing. Was that just ./kmc -k91 -fa file.fa kmcdb . or maybe you have specified some other parameters?

zhenzhenyang-psu commented 4 years ago

Ah, I thought the input only requires fasta format, maybe fastq files also work? ./kmc -k91 -fa file.fa kmcdb .

Yeap, I used the script you sent by just changing the kmer length

run_command("/public/home/yangzhzh/tools_zz/KMC_extract/example/kmc -t30 -k91 -fa hifi.15kb.2files.comb.Q20.fasta kmcdb .")

with open("kmers-of-interests.fa", "w") as out:

with open("kmers-of-interests.txt") as f:

    for kmer in f:

        kmer = kmer.strip()

        out.write(">\n")

        out.write(kmer)

        out.write("\n")

run_command("/public/home/yangzhzh/tools_zz/KMC_extract/example/kmc -k91 -ci1 -fa kmers-of-interests.fa kmers-of-interests .")

run_command("/public/home/yangzhzh/tools_zz/KMC_extract/example/kmc_tools simple kmcdb kmers-of-interests intersect common-kmers -ocleft")

run_command("/public/home/yangzhzh/tools_zz/KMC_extract/example/kmc_tools transform common-kmers dump result.txt")

thanks a lot! Zhenzhen


From: marekkokot notifications@github.com Sent: Monday, July 13, 2020 1:06 PM To: refresh-bio/KMC KMC@noreply.github.com Cc: zhenzhenyang-psu yangzhenzhen1988@gmail.com; Comment comment@noreply.github.com Subject: Re: [refresh-bio/KMC] how to get counts of selected kmers (#43)

Hi,

Ok I am downloading the data. I am not sure why you converted and merged those files into one fasta? Could you also give me exact command line that you used for k-mer confing. Was that just ./kmc -k91 -fa file.fa kmcdb . or maybe you have specified some other parameters?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/refresh-bio/KMC/issues/43#issuecomment-657679684, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACFCJWZTFOI2DH5SXEOPSHTR3M5KJANCNFSM4EGOZJTA.

zhenzhenyang-psu commented 4 years ago

hello, sorry to bother, is there a way to extract the read id for kmer sequences of interest?

zhenzhenyang-psu commented 4 years ago

hello there, does the database keep track of the read id and very left position of Kmers of interest?

marekkokot commented 4 years ago

Hello,

sorry, there were some other important things to do :( I am afraid I will not be able to take a look at why the results are incorrect at any time soon. I will be back at the end of the month. Cause I may forget about this issue, it would be appreciated if you could remind me.

There is no read id neither position in a read, just k-mers and counts as in case of most k-mer counters.

zhenzhenyang-psu commented 4 years ago

no worries at all. You are already doing me a great favor and I appreciate it much. Thanks for letting me know. I will remind you by the end of this month. best, zhenzhen

zhenzhenyang-psu commented 4 years ago

HI Marekkokot, I was wondering if there is a way to identify reads that contain exactly the same kmer sequences of interest from a number of input fastq reads. I have, for instance a million kmer sequences with which I want to search against the fastq reads, I know that if using a naive approach, one input fastq read will have to take 1 million search and I cannot imagine the time it would take if searching through the whole fastq dataset which contains a million fastq reads. Could building a kmer database using the fastq reads make it easier for the search of kmers of interest? sorry to bother and looking forward to your thoughts on this question. best, Zhenzhen

droeatumn commented 4 years ago

Zhenzhen, bbduk.sh may be helpful for you. See the "outm" stream. https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

There are probably other tools too.

zhenzhenyang-psu commented 4 years ago

thanks very much, I will take a look:). Really appreciate it.