Open zhangrengang opened 2 years ago
Hi,
KMC will just count k-mers for one sample. If you need counts for m samples you need to run KMC m times.
To get what you want you could:
kmc_tools
Im afraid it would not be as efficient as you wish.
We could consider extending KMC database format to store multiple counters per k-mer and extend kmc_tools
to do step 3 from the above description probably more efficiently than just using KMC API (in this case also probably step 2 could be skipped).
The main problem is that it may still be not efficient enough, so we would need to start with some estimations and know something more about the data (is it sequencing, chromosomes?), what is the number of samples, etc. The second problem is that I am currently involved in some other work (partially KMC related) and I am afraid it may be hard to find a time to implement such an extension. We may consider this in the future.
For now you may try to use the steps I described above. If needed I may suggest something - just ask here.
On the other hand, if you want to compare samples maybe what you really need is our other tool, kmer-db (https://github.com/refresh-bio/kmer-db).
Best, Marek
Hi @marekkokot , thanks for your reply. kmer-db is not what I need, because it do not produce count for each kmer and each sample.
For chromosomes, I have implemented step 3 in a python script (https://github.com/zhangrengang/SubPhaser) which reads all kmer count (excluding count < 3) into RAM and forms the count matrix. Its efficiency is acceptable (RAM-consuming and slow but acceptable) with a 1TB-RAM computer for 21 chromosomes in wheat genome (14 Gb in total).
For sequencing data (80 samples × 10Gb data), recently I implement step 3 by merging the sorted dump files with command sort -m
. It is RAM-efficient but quite slow.
So it may be better to do step 3 or both steps 2 and 3 with naive KMC modules.
How do you get sorted dumps? Do you use sort
command on textual k-mer representation (from kmc_dump
or kmc_tools
) or do you use kmc_tools transform <input_db> sort <output_db>
. The second one should be much faster.
Instead of sort -m
for merging it could be better to write a simple C++ application utilizing KMC API. There would be some buffering per each database opened in the listening mode (I don't remember buffer sizes right now, they could be probably quite easily adjusted though), but maybe it would not be a problem. If you don't like C++ there is a python wrapper for the KMC API (but it is much much much slower than native C++ code).
I have the algorithm ready in my head (it's rather simple though), but I'm afraid I don't have time to implement it right now :-( Maybe you will be able to implement it yourself. In any doubts ask here.
Yes, firstly I use sort
command (very slow) but finally I find kmc_tools transform <input_db> dump -s <output>
. Is kmc_tools transform <input_db> dump -s <output>
similar with kmc_tools transform <input_db> sort <output_db>
?
I am not able to code with C++ now and it may be too difficult to me. I am just using Python which is very simple. If you can implement it in the future, I can wait it. I just use sort -m
for the moment.
Hi, yes it is similar, the dump
will produce a textual representation.
I got interested in this and implemented what you need, but outside this repo.
It is almost untested and undocumented and probably many improvements are possible, but this may be still better than what you currently have.
Here is the repo: https://github.com/marekkokot/kmc-matrixer
Take a look at run.sh
If something is not working you may ask, but I will be probably too busy in the next couple of weeks to do something big on this topic. If this works, please let me know if it is faster (or/and more memory frugal) in comparison with your current implementation.
Thank you very much. I will test it and let you know the results.
Hi @marekkokot , I have tested it with a small dataset (6 samples × 10Gb data per sample). kmc-matrixer
costs ~30 min and ~ 500 Mb MEM. It is much much faster than my implementation with sort -m
. And it is more memory frugal. sort -m
(sort -k 1 --parallel=20 -S 1G -m
) is not very slow but I have to convert its output to matrix with a python script which is very very slow.
It is what I need. Thanks again!
Great! Maybe in the future, we will build this functionality into kmc_tools
. This could work faster due to multithreading and some optimizations, but it is really a lot of work, so I'm glad that simple implementation is fine for you.
It is fine enough for me. Thanks!
Just wanted to say that I also find this useful and would be glad to see improvement and integration into kmc_tools. Thanks!
where m is number of samples or individuals and n is number of kmer species, and each row indicates the counts of a kmer across mutiple samples.
I aim to compare the difference between samples for each kmer (i.e., to identify different kmers) with this count table. Do any way to do this with fast speed and efficient memory cost?