Paired-end reads and --min-count

voutcn / megahit

Ultra-fast and memory-efficient (meta-)genome assembler

GNU General Public License v3.0

Hi,

I'm not sure how megahit counts k-mers in the overlapping region of paired reads. Does megahit count each of these k-mers once or twice?

I've noticed that if we simply count the k-mers of both reads in a pair, even occurrence counts are far more common than odd ones, because k-mers in the overlapping region are counted twice. So I'm wondering whether this conflicts with --min-count.
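To make the doubling concrete, here is a minimal Python sketch on a toy read pair. It is not megahit's code. It assumes k-mers are counted in canonical form (a k-mer and its reverse complement are treated as the same k-mer) and that the mates are in forward-reverse orientation. The fragment, read length, k, and all function names are made up for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield each k-mer of seq as the smaller of itself and its reverse complement."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def count_kmers(reads, k):
    """Count canonical k-mers over a list of reads."""
    counts = Counter()
    for read in reads:
        counts.update(canonical_kmers(read, k))
    return counts


# Toy 16 bp fragment (hypothetical), sequenced from both ends with 12 bp reads,
# so the two mates overlap by 8 bp and share four 5-mers.
fragment = "AACCGTAGGTCTATTG"
read_len, k = 12, 5
r1 = fragment[:read_len]
r2 = reverse_complement(fragment[-read_len:])  # mate read from the opposite strand

single = count_kmers([r1], k)      # R1 only
paired = count_kmers([r1, r2], k)  # R1 and R2 together

# Histogram of (occurrence, number of distinct k-mers)
print(sorted(Counter(single.values()).items()))
print(sorted(Counter(paired.values()).items()))
```

With R1 alone, all eight k-mers occur once. With both mates, the four k-mers in the overlap occur twice and the eight outside it occur once, so a single error-free fragment already produces an even count in the overlap.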

Here is what I got when counting both reads of each pair versus only one end:

Paired-end:

Occurrence  Count
2           2383138535
3           65657
4           1018993593
5           15223
6           503925867
7           9070
8           252284832
9           6233
10          138395953
11          4262
12          84423102
13          3609
14          56102368
15          2743
16          39759176
17          2333

Single:

Occurrence  Count
2           1018993188
3           503931449
4           252289500
5           138399405
6           84426110
7           56104644
8           39761115
9           29617241
10          22881602
11          18179282
12          14838954
13          12345445
14          10467986
15          8965741
16          7751902
17          6773112
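One way to read the two tables side by side is to compare the paired-end count at occurrence 2n with the single-end count at occurrence n. The short Python sketch below does that arithmetic on the numbers copied from this post. The variable names are illustrative, and the post does not say which tool produced the counts.

```python
# Histograms copied from the post: occurrence -> number of distinct k-mers.
paired = {
    2: 2383138535, 3: 65657, 4: 1018993593, 5: 15223,
    6: 503925867, 7: 9070, 8: 252284832, 9: 6233,
    10: 138395953, 11: 4262, 12: 84423102, 13: 3609,
    14: 56102368, 15: 2743, 16: 39759176, 17: 2333,
}
single = {
    2: 1018993188, 3: 503931449, 4: 252289500, 5: 138399405,
    6: 84426110, 7: 56104644, 8: 39761115, 9: 29617241,
    10: 22881602, 11: 18179282, 12: 14838954, 13: 12345445,
    14: 10467986, 15: 8965741, 16: 7751902, 17: 6773112,
}

# Ratio of paired-end k-mers at occurrence 2n to single-end k-mers at occurrence n.
ratios = {n: paired[2 * n] / single[n] for n in sorted(single) if 2 * n in paired}
for n, ratio in ratios.items():
    print(f"paired[{2 * n}] / single[{n}] = {ratio:.6f}")

# Share of paired-end k-mers (within the listed range) that have an odd count.
odd_total = sum(count for occ, count in paired.items() if occ % 2 == 1)
odd_share = odd_total / sum(paired.values())
print(f"odd share = {odd_share:.2e}")
```

For n = 2 to 8 every ratio is within 0.01% of 1, so the even rows of the paired-end table line up almost exactly with the single-end table at half the occurrence. The odd rows make up a tiny share of the total.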

Thanks!

Zewei

Hi Zewei,

Great to see you here!

I've noticed this as well. I think the purpose of --min-count is to discard spurious sequences (errors), so a perfect overlap of the R1 and R2 reads does indeed give twice the confidence that the underlying sequence (the k-mer) is correct ("solid"). Odd counts are less likely because they imply that one read of a pair has an error in the overlap, which is rarer (hence the much smaller numbers of k-mers with odd counts). The better-supported k-mer is therefore the one retained in these cases.
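As a rough illustration of that reasoning, here is a small Python sketch of a count threshold applied to k-mer counts. It is not megahit's implementation. The function name `solid_kmers`, the example k-mers, and the threshold of 2 are assumptions chosen for the example.

```python
from collections import Counter


def solid_kmers(counts, min_count=2):
    """Return the k-mers whose count reaches the threshold (hypothetical helper)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Hypothetical counts from one read pair:
counts = Counter({
    "AGGTC": 2,  # in the overlap, read identically by R1 and R2
    "AACCG": 1,  # outside the overlap, covered by one mate only
    "AGGTA": 1,  # in the overlap, but one mate has a sequencing error here
})

print(sorted(solid_kmers(counts, min_count=2)))
```

Under these assumptions only the k-mer confirmed by both mates passes the threshold. The k-mer containing an error in one mate stays at count 1 and is dropped, along with the k-mer that only one mate covered.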

I'm curious what you mean by a conflict, since this distribution is exactly what I would expect from a k-mer error-filtering method (e.g., each high-quality read pair would provide 2 counts of confidence for a k-mer in its overlap). But if I'm misunderstanding the purpose of --min-count, I'd also like to know whether this is the intended behavior.

Paired-end reads and --min-count #325