refresh-bio / KMC

Fast and frugal disk based k-mer counter

confusion with results interpretation #177

Open Aannaw opened 2 years ago

Aannaw commented 2 years ago

I want to assess my genome assembly. I ran kmc (with -ci0) on the Illumina reads and on the assembly, which gave two independent KMC databases. I then used kmc_tools analyze to create a matrix of shared k-mers between the two databases and visualized the matrix with spectra.py. The 0x line is nearly absent, and the 1x line has a single peak with no heterozygous peak. Is this correct? Is it due to low heterozygosity? Looking forward to your reply.

marekkokot commented 2 years ago

Hi,

-ci0 does not make sense; it will work the same as -ci1. KMC will not report k-mers that are absent from the input. There is some explanation in #163. Does it explain what you observe?
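To illustrate the point about -ci0, here is a toy sketch in Python. It is not KMC's implementation: `count_kmers` and `apply_min_count` are hypothetical helpers, and the counting is non-canonical for simplicity. A counter built from the input only ever holds k-mers that occur at least once, so a lower cutoff of 0 keeps exactly the same k-mers as a cutoff of 1.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer that occurs in seq (toy, non-canonical counting)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def apply_min_count(counts: Counter, ci: int) -> dict:
    """Keep only k-mers whose count is at least ci (a lower cutoff)."""
    return {kmer: c for kmer, c in counts.items() if c >= ci}


counts = count_kmers("ACGTACGTAC", k=4)

# Every stored k-mer was observed at least once, so both cutoffs agree.
same = apply_min_count(counts, 0) == apply_min_count(counts, 1)
print(same)
```

A k-mer with count 0 can only show up when two databases are compared, for example a k-mer present in the reads but missing from the assembly.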

Aannaw commented 2 years ago

Thanks for your reply. I used kmc_tools to create the matrix between the two KMC databases of the Illumina reads and the assembly. I got lines for 0x, 1x, 2x and so on. Can you tell me what they mean? Thanks very much. [Image 1]

marekkokot commented 2 years ago

Hi,

I am not sure how you create the matrix. Could you give me the full command lines of using kmc, kmc_tools and how you generate the plot?

Aannaw commented 2 years ago

Hi, I followed the instructions at https://github.com/dfguan/KMC. kmc kmc1

marekkokot commented 2 years ago

Hi, hmm, this is a modification of KMC, so maybe you should ask the author, @dfguan? I am not sure what the analyse operation does. Nevertheless, this fork looks interesting.

dfguan commented 2 years ago

Hi, the x axis represents the k-mer count in the reads, the y axis is the number of distinct k-mers with that count, and the lines with different colors correspond to the k-mer count in the assembly. I use this plot to check how clean our primary assembly is. You may refer to the KAT plots for details (https://kat.readthedocs.io/en/latest/walkthrough.html). Best, Dengfeng.
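The plot described above can be mimicked with a small sketch. This assumes plain Python dictionaries instead of KMC databases; `count_kmers` and `spectra_matrix` are hypothetical names, not functions from KMC or the fork. Each row of the matrix is one colored line (the copy number in the assembly), each column is a k-mer count in the reads, and each cell is the number of distinct k-mers.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """Count k-mers over a list of sequences (toy, non-canonical)."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def spectra_matrix(read_counts, asm_counts, max_asm=3):
    """matrix[a][r] = number of distinct k-mers seen r times in the reads
    and a times in the assembly (a is capped at max_asm)."""
    matrix = defaultdict(Counter)
    for kmer in set(read_counts) | set(asm_counts):
        r = read_counts.get(kmer, 0)               # x axis: count in reads
        a = min(asm_counts.get(kmer, 0), max_asm)  # line: count in assembly
        matrix[a][r] += 1                          # y axis: distinct k-mers
    return matrix


reads = ["ACGTA", "ACGTA", "GGGA"]  # toy reads
assembly = ["ACGTA"]                # toy assembly

matrix = spectra_matrix(count_kmers(reads, 3), count_kmers(assembly, 3))
for a in sorted(matrix):
    print(f"{a}x:", dict(matrix[a]))
```

In this toy input the 0x row holds the two k-mers from the read `GGGA`, which are present in the reads but missing from the assembly, and the 1x row holds the three k-mers that occur twice in the reads and once in the assembly.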

Aannaw commented 2 years ago

Hi @dfguan, thanks for your reply. As you say, the lines with different colors refer to the number of occurrences of each k-mer in the assembly. But my plot is quite different from your example plot. My 1x line has no heterozygous peak. I ran kmc with -ci0, but my 0x line is nearly absent. Maybe this is due to the low heterozygosity of my genome? If so, does the low peak on the 0x line of my plot mean the assembly needs purging? Can you help me interpret my plot of k-mers from the assembly and the Illumina reads? I cannot tell whether my assembly is clean or needs purge_dups.