voutcn / megahit

Ultra-fast and memory-efficient (meta-)genome assembler
http://www.ncbi.nlm.nih.gov/pubmed/25609793
GNU General Public License v3.0
585 stars 134 forks

Paired-end reads and --min-count #325

Open ZeweiSong opened 2 years ago

ZeweiSong commented 2 years ago

Hi,

I'm not sure how MEGAHIT counts k-mers in the overlapping region of paired reads. Does MEGAHIT count each of these k-mers once or twice?

I've noticed that if we simply count the k-mers of both reads in each pair, even occurrence values are far more common than odd ones, since k-mers in the overlapping region are counted twice. So I'm wondering whether this conflicts with --min-count.

Here is what I got when counting both paired-end reads versus only one end:

Paired-end:

Occurrence  Count
2           2383138535
3           65657
4           1018993593
5           15223
6           503925867
7           9070
8           252284832
9           6233
10          138395953
11          4262
12          84423102
13          3609
14          56102368
15          2743
16          39759176
17          2333

Single:

Occurrence  Count
2           1018993188
3           503931449
4           252289500
5           138399405
6           84426110
7           56104644
8           39761115
9           29617241
10          22881602
11          18179282
12          14838954
13          12345445
14          10467986
15          8965741
16          7751902
17          6773112
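To make the doubling concrete, here is a small Python sketch of naive canonical k-mer counting on one overlapping read pair. It is only an illustration of the counting described above, not MEGAHIT's actual implementation; the fragment sequence, the read length of 10, and k = 4 are made-up toy values, and the helper names (`revcomp`, `canonical_kmers`, `count_kmers`) are hypothetical.

```python
from collections import Counter

# Toy illustration only: this is NOT how MEGAHIT counts k-mers internally.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield each k-mer of seq in canonical form (min of k-mer and its reverse complement)."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, revcomp(kmer))


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, treating every read independently."""
    counts = Counter()
    for read in reads:
        counts.update(canonical_kmers(read, k))
    return counts


# Hypothetical 14 bp fragment sequenced from both ends with 10 bp reads,
# so the two mates overlap by 6 bp.
fragment = "ACGGTCATTGCAGA"
read_len, k = 10, 4
r1 = fragment[:read_len]             # forward mate
r2 = revcomp(fragment[-read_len:])   # reverse mate, reported as reverse complement

paired = count_kmers([r1, r2], k)
single = count_kmers([r1], k)

# Histogram: occurrence value -> number of distinct k-mers with that occurrence.
paired_hist = dict(Counter(paired.values()))
single_hist = dict(Counter(single.values()))
print("paired:", paired_hist)
print("single:", single_hist)
```

In this toy case the three k-mers that lie entirely inside the overlap get an occurrence of 2 from a single fragment, while the same k-mers have an occurrence of 1 when only R1 is counted. The numbers above show the same pattern at scale: the paired-end count at occurrence 4 (1018993593) is close to the single-end count at occurrence 2 (1018993188).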

Thanks!

Zewei

GabeAl commented 2 years ago

Hi Zewei,

Great to see you here!

I've noticed this as well -- I think the purpose of min-count is to discard spurious sequences (errors). So a perfect overlap of the R1 and R2 reads does indeed give twice the confidence that the underlying sequence (the k-mer) is correct ("solid"). Odd counts are less likely because they imply that one read of a pair has an error in the overlap, which is rarer (hence the much smaller numbers of k-mers with odd counts). The better-supported k-mer is therefore retained in these cases.
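As a rough sketch of that filtering idea (an assumption about the general approach, not MEGAHIT's code), a minimum-count filter simply keeps k-mers whose occurrence reaches a threshold. The function name `solid_kmers`, the example k-mers, and the threshold of 2 are all hypothetical values chosen for illustration.

```python
from collections import Counter


def solid_kmers(counts, min_count):
    """Return the set of k-mers whose occurrence is at least min_count."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Hypothetical counts from one read pair: two k-mers seen in both mates
# (occurrence 2) and one k-mer seen in only one mate, e.g. due to a
# sequencing error in the overlap (occurrence 1).
counts = Counter({"ATGA": 2, "AATG": 2, "ATTC": 1})

solid = solid_kmers(counts, min_count=2)
print(sorted(solid))
```

With a threshold of 2, the k-mers confirmed by both mates are kept and the one supported by a single mate is dropped, which is the behavior described above.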

I'm curious what you mean by a conflict, since this distribution is exactly what I would expect from a k-mer error filtering method (e.g., paired-end reads of high quality would each provide 2 counts of confidence for a k-mer). But if I'm misunderstanding the purpose of min-count, I am also curious to know if this is the intended behavior.