refresh-bio / KMC

Fast and frugal disk based k-mer counter

kmc_tools simple intersect & union, not considering reverse complement? #227

Open zhaotao1987 opened 5 months ago

zhaotao1987 commented 5 months ago

Hello, in my project I would like to count kmers for only one side of the reads (read_1.fq.gz, for example). Afterwards, I would like to compare the kmer.dump files across samples and calculate shared kmers relative to total unique kmers. For this, considering the reverse complement is important, but it seems that kmc_tools intersect does not consider reverse-complemented kmers between samples? Or is the only way to count kmers for both reads of each sample and then compare? Thanks very much.

marekkokot commented 5 months ago

Hi,

I don't fully understand. In general, kmc_tools is unaware of strands (at least as far as I remember). It just takes k-mers as they are in the KMC database. Now, when you run kmc, you may count either:

zhaotao1987 commented 5 months ago

Hi Marek,

Thank you for your prompt response; much appreciated. Great to learn that by default, KMC already considers reverse complement (RC).

I'm currently improvising an approach and would like to share it with you to get your input on its viability. I'm using resequencing data from 300 samples, covering various species/subspecies within Malus. My goal is to explore the phylogenetic relationships among these samples based on kmer similarities, starting with a kmer similarity matrix for all samples. The steps are kmer counting for each sample and kmer comparison for every pair of samples ( kmc -> kmc_dump -> kmer_comparison.py ). To save time and space, I'm considering using only the R1 reads for each sample.
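For reference, the comparison step for one pair of samples could look roughly like the sketch below. It is not the attached script. It assumes each dump line has the form `KMER<TAB>COUNT`, and it reads "shared kmers relative to total unique kmers" as the Jaccard index (intersection over union); `load_kmers` and `jaccard` are hypothetical names used only for illustration.

```python
from typing import Iterable, Set


def load_kmers(lines: Iterable[str]) -> Set[str]:
    """Collect the distinct kmers from dump lines (assumed format: 'KMER<TAB>COUNT')."""
    kmers = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            kmers.add(fields[0])
    return kmers


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared distinct kmers divided by total distinct kmers across both samples."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Tiny made-up inputs standing in for two dump files.
sample_a = load_kmers(["ACGT\t5", "AAAA\t2", "CCGG\t3"])
sample_b = load_kmers(["ACGT\t7", "CCGG\t2", "TTAA\t4"])
similarity = jaccard(sample_a, sample_b)
print(similarity)
```

With real data, an open file handle can be passed to `load_kmers` in place of the list, since it only iterates over lines.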

So, from what you said, if I used the -b mode, RC kmers from different samples might be treated as distinct kmers, introducing bias into the results. To clarify, for each comparison I'm calculating shared (unique) kmers relative to total (unique) kmers. I'm pleased to learn that the software considers both the original kmer and its RC by default. (Oh, maybe that is still problematic... since the lower of the two is chosen, could the choice vary across samples?) It seems to me that the kmc_tools simple intersect function does not consider RC?
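A minimal sketch of the "lower of the two" rule, assuming the canonical form is the lexicographically smaller of a kmer and its reverse complement. The helper names are hypothetical, and this is plain Python, not KMC's code. Under that assumption the choice depends only on the sequence itself, so a kmer and its RC map to the same representative no matter which sample or strand they come from.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of the kmer and its reverse complement."""
    rc = reverse_complement(kmer)
    return min(kmer, rc)


# The same locus read from opposite strands in two samples.
seen_in_sample_1 = "TTTGC"
seen_in_sample_2 = reverse_complement(seen_in_sample_1)  # "GCAAA"

# Both observations collapse to one representative.
print(canonical(seen_in_sample_1) == canonical(seen_in_sample_2))
```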

Moving on to another aspect of my project, I have a question about appropriate values for -ci and -cx when comparing shared kmers between samples. I think -ci should be set to 2, to keep all shared kmers and discard error-prone kmers that occur only once. As for -cx, I've used 1000, but I'd appreciate your thoughts on this choice. The kmer size is 21.
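To make the intended effect of these thresholds concrete, here is a small sketch that applies the same kind of cutoff to already-dumped (kmer, count) pairs. It assumes both bounds are inclusive, i.e. a kmer is kept when `ci <= count <= cx`. The function name and the toy data are hypothetical, and this is a post-hoc filter in Python, not a KMC call.

```python
from typing import Iterable, Set, Tuple


def filter_by_count(pairs: Iterable[Tuple[str, int]], ci: int = 2, cx: int = 1000) -> Set[str]:
    """Keep kmers whose count lies in [ci, cx] (inclusive bounds are an assumption)."""
    return {kmer for kmer, count in pairs if ci <= count <= cx}


# Toy counts: a singleton (likely error), two ordinary kmers, one highly repetitive kmer.
counts = [("ACGTA", 1), ("CCGGA", 2), ("TTAAC", 37), ("AAAAA", 5000)]
kept = filter_by_count(counts, ci=2, cx=1000)
print(sorted(kept))
```

With ci=2 the singleton is dropped, and with cx=1000 the highly repetitive kmer is dropped as well, so only the kmers in between contribute to the shared/total ratio.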

Thank you once again for your assistance. BTW, I am not using kmc_tools intersect and union to calculate the ratio, since they generate unnecessarily large files; I use a Python script for this instead.

zhaotao1987 commented 5 months ago

kmer_comparison.1.zip