How to generate multi-samples kmer count table?

zhangrengang commented 2 years ago

Hi, I want a count table (m× n) like this:	# kmer seq	sample1	sample2
AAA...	1	2	...
AAC...	0	5	...
AAG...	4	2	...

where m is number of samples or individuals and n is number of kmer species, and each row indicates the counts of a kmer across mutiple samples.

I aim to compare the difference between samples for each kmer (i.e., to identify different kmers) with this count table. Do any way to do this with fast speed and efficient memory cost?

marekkokot commented 2 years ago

Hi,

KMC will just count k-mers for one sample. If you need counts for m samples you need to run KMC m times.

To get what you want you could:

Run kmc to count k-mers for each sample
Sort each result using kmc_tools
Write script/program utilizing KMC API to consume m sorted KMC databases and produce what you need.

Im afraid it would not be as efficient as you wish.

We could consider extending KMC database format to store multiple counters per k-mer and extend kmc_tools to do step 3 from the above description probably more efficiently than just using KMC API (in this case also probably step 2 could be skipped).

The main problem is that it may still be not efficient enough, so we would need to start with some estimations and know something more about the data (is it sequencing, chromosomes?), what is the number of samples, etc. The second problem is that I am currently involved in some other work (partially KMC related) and I am afraid it may be hard to find a time to implement such an extension. We may consider this in the future.

For now you may try to use the steps I described above. If needed I may suggest something - just ask here.

On the other hand, if you want to compare samples maybe what you really need is our other tool, kmer-db (https://github.com/refresh-bio/kmer-db).

Best, Marek

zhangrengang commented 2 years ago

Hi @marekkokot , thanks for your reply. kmer-db is not what I need, because it do not produce count for each kmer and each sample.

For chromosomes, I have implemented step 3 in a python script (https://github.com/zhangrengang/SubPhaser) which reads all kmer count (excluding count < 3) into RAM and forms the count matrix. Its efficiency is acceptable (RAM-consuming and slow but acceptable) with a 1TB-RAM computer for 21 chromosomes in wheat genome (14 Gb in total).

For sequencing data (80 samples × 10Gb data), recently I implement step 3 by merging the sorted dump files with command sort -m. It is RAM-efficient but quite slow.

So it may be better to do step 3 or both steps 2 and 3 with naive KMC modules.

marekkokot commented 2 years ago

How do you get sorted dumps? Do you use sort command on textual k-mer representation (from kmc_dump or kmc_tools) or do you use kmc_tools transform <input_db> sort <output_db>. The second one should be much faster.

Instead of sort -m for merging it could be better to write a simple C++ application utilizing KMC API. There would be some buffering per each database opened in the listening mode (I don't remember buffer sizes right now, they could be probably quite easily adjusted though), but maybe it would not be a problem. If you don't like C++ there is a python wrapper for the KMC API (but it is much much much slower than native C++ code).

I have the algorithm ready in my head (it's rather simple though), but I'm afraid I don't have time to implement it right now :-( Maybe you will be able to implement it yourself. In any doubts ask here.

zhangrengang commented 2 years ago

Yes, firstly I use sort command (very slow) but finally I find kmc_tools transform <input_db> dump -s <output>. Is kmc_tools transform <input_db> dump -s <output> similar with kmc_tools transform <input_db> sort <output_db>?

I am not able to code with C++ now and it may be too difficult to me. I am just using Python which is very simple. If you can implement it in the future, I can wait it. I just use sort -m for the moment.

marekkokot commented 2 years ago

Hi, yes it is similar, the dump will produce a textual representation. I got interested in this and implemented what you need, but outside this repo. It is almost untested and undocumented and probably many improvements are possible, but this may be still better than what you currently have. Here is the repo: https://github.com/marekkokot/kmc-matrixer

Take a look at run.sh

If something is not working you may ask, but I will be probably too busy in the next couple of weeks to do something big on this topic. If this works, please let me know if it is faster (or/and more memory frugal) in comparison with your current implementation.

zhangrengang commented 2 years ago

Thank you very much. I will test it and let you know the results.

zhangrengang commented 2 years ago

Hi @marekkokot , I have tested it with a small dataset (6 samples × 10Gb data per sample). kmc-matrixer costs ~30 min and ~ 500 Mb MEM. It is much much faster than my implementation with sort -m. And it is more memory frugal. sort -m (sort -k 1 --parallel=20 -S 1G -m) is not very slow but I have to convert its output to matrix with a python script which is very very slow.

It is what I need. Thanks again!

marekkokot commented 2 years ago

Great! Maybe in the future, we will build this functionality into kmc_tools. This could work faster due to multithreading and some optimizations, but it is really a lot of work, so I'm glad that simple implementation is fine for you.

zhangrengang commented 2 years ago

It is fine enough for me. Thanks!

soungalo commented 2 years ago

Just wanted to say that I also find this useful and would be glad to see improvement and integration into kmc_tools. Thanks!

refresh-bio / KMC

How to generate multi-samples kmer count table? #185