Duplicated reads handling

carlosmag commented 4 years ago

Hello!

Is the tool able to detect PCR duplicates or should we remove them previously? Thanks in advance.

hsinnan75 commented 4 years ago

To be honest, I did not test MapCaller on datasets with PCR duplicates. I am not sure how MapCaller will perform when there are PCR duplicates. It depends on whether the variants occur in the duplicated reads or not. However, MapCaller can be run in somatic mutation detection mode, which is more sensitive to alleles with low frequencies (as low as 0.02). You may run MapCaller with the argument "-somatic" and set a larger allele depth by "-ad" (default is 10). In doing so, I think MapCaller can detect variants in PCR duplication regions. If you find the result does not meet your expectation, please let me know. Thank you!

tseemann commented 4 years ago

By PCR duplicate you mean a two paired-reads with the same exact (R1,R2) sequence?

I don't think the algorithm and data structure MapCaller uses to achieve its speed can handle detection of PCR duplicate reads, because the coordinate information of each read alignment is discarded after the nucleotide frequences are incremented?

hsinnan75 commented 4 years ago

I ever added a data structure to keep track of coordinate information. I used a bit to indicate if a coordinate has been a starting position of a read alignment before. That is if v(P)=true meaning position P has been a starting position of a read alignment. Therefore, we could discard those alignments whose v(P) are true to remove PCR duplicates. However, I did not have any datasets with PCR duplicates to test this idea. But it is doable in MapCaller.

carlosmag commented 4 years ago

It would be really nice to have an option to discard duplicate alignments! I have been using sambamba markdup to do so.

duplicatesbadforvariantcalling

adapted from Variant calling workshop following GATK "Best Practices" for DNA-Sequencing, pages 40-41 (Robert Bukowski, Qi Sun and Minghui Wang ; Bioinformatics Facility of the Institute of Biotechnology, Cornell University)

If you want to perform some testing, Mycobacterium tuberculosis genome ERR2653232 has a 12% duplication rate, according to fastp. Reference genome used: MTBC inferred ancestral

hsinnan75 commented 4 years ago

Thank you! I'll add this feature and use the data set to test.

tseemann commented 4 years ago

Having the same start coordinate is not sufficient? it needs R1 and R2 to be jointly the same as R1' and R2' for PE reads.

For SE reads, if the read length is 100bp, and you sequence > 100x depth, then you expect lots of "duplicates". For PE this is less likely.

P.S. I have been using samtools markdup (not the old samtools rmdup).

hsinnan75 commented 4 years ago

I uploaded a new version (0.9.9.a) to filter out PCR-duplicates. You may specify the maximal number of read alignments with the same start coordinate. I tested this feature on the data set of ERR2653232. Here are the results of two cases.

//without filtering 919(snp); 81(ins); 65(del); 0(trans); 0(inversion) // with filtering 883(snp); 70(ins); 53(del); 0(trans); 0(inversion)

MapCaller produced less number of variants with PCR-duplicates filtering. I agreed a PCR duplicate implies we can find the case of R1 and R2 to be jointly the same as R1' and R2' for PE reads. However, it requires more space to maintain a data structure for that purpose. Therefore, I set a threshold to limit the maximal number of read alignment with the same start coordinate. I will do more tests to see if this strategy works for most PCR-duplicate cases.

Please let me know if you have any suggestions. Thanks.

carlosmag commented 4 years ago

Thanks for the quick feedback!

For this improvement to be more informative, it would be nice to have an INFO field reporting the number of read alignments with the same start coordinate. In that way, one could filter the VCF file instead of running MapCaller with distinct tresholds...

If you need more datasets, here is reported what seems to be a tumour genome with >30% duplication rate.

carlosmag commented 4 years ago

Thanks for including DUP field in VCF output!

However, distinct thresholds of -dup lead to radically different reported DP, AD and DUP metrics.

Default threshold: MapCaller -ploidy 1 -i MTB_anc.fasta -f ERR2653232_1.fastq -f2 ERR2653232_1.fastq -vcf output_varsdefault.vcf ; sed 's/\x0/T/g' output_varsdefault.vcf | grep -v '^#' | head -3

MTB_anc 2532 . C T 30 PASS DP=352;AD=352;DUP=5;AF=1.000;GT=1|1;TYPE=SUBSTITUTE MTB_anc 8040 . G A 30 PASS DP=333;AD=328;DUP=5;AF=0.985;GT=1|1;TYPE=SUBSTITUTE MTB_anc 9143 . C T 30 PASS DP=350;AD=350;DUP=5;AF=1.000;GT=1|1;TYPE=SUBSTITUTE

-dup 2: MapCaller -ploidy 1 -i MTB_anc.fasta -f ERR2653232_1.fastq -f2 ERR2653232_1.fastq -dup 2 -vcf output_varsaddefaultdup2.vcf ; sed 's/\x0/T/g' output_varsaddefaultdup2.vcf | grep -v '^#' | head -3

MTB_anc 2532 . C T 30 PASS DP=162;AD=162;DUP=2;AF=1.000;GT=1|1;TYPE=SUBSTITUTE MTB_anc 8040 . G A 30 PASS DP=154;AD=151;DUP=2;AF=0.981;GT=1|1;TYPE=SUBSTITUTE MTB_anc 9143 . C T 30 PASS DP=163;AD=163;DUP=2;AF=1.000;GT=1|1;TYPE=SUBSTITUTE

ERR2653232 Reference genome

Is this a bug or am I missing something? Thank you!

hsinnan75 commented 4 years ago

The argument -dup is used to specify the maximal number of read alignments allowed for each position. "-dup 2" means MapCaller will only use the first two alignments with the same start position to infer variants. Therefore, DP and AD might be different when using different -dup. Since the read length of ERR2653232 is 151, if there is no any read duplicates, the average DP should be 151 when the average DUP is 1. If read length is L and sequencing coverage is N, the average dup is N/L. I use -dup to prevent PCR-duplicates from dominating the variant calling. The theory is based on PCR-duplates will produce multiple read alignments with the same start position. However, it is possible two paired-end reads will produce alignments with the same start position at one end, but different start position at the other end. Thus the default dup is 5 to allow some read alignments of such cases. It is the most strict when dup is set to 1, which means no other read alignments with the same start position is allowed. All the statistical numbers are based on read alignments that are used for variant calling. That's why you see different DP and AD when you applying different dup values. It is not a bug.

carlosmag commented 4 years ago

Thanks for the detailed explanation! Maybe it will be more clear if there was metric that records total depth when filtering for duplicates, secondary alignments (#24) and so on. In that way, it would be possible to know how many reads were discarded (total depth - considered depth). Pilon does that, by the way. What do you think?

hsinnan75 / MapCaller

Duplicated reads handling #23