mpileup read-pair overlap detection introduces strand-bias

samtools / bcftools

mpileup read-pair overlap detection introduces strand-bias #1058

Open NagaComBio opened 5 years ago

NagaComBio commented 5 years ago

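When mpileup detects that the two reads of a pair overlap, it adds their base qualities together, assigns the sum to the forward read, and sets the base quality of the reverse read to Q0. This introduces a strand bias, because the overlapping bases are then effectively counted only on the forward strand.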
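To make the concern concrete, here is a toy model of the behaviour as described in this comment. It is only a sketch, not htslib's code: the function name `merge_overlap_to_forward` and the example qualities are made up for illustration, and it only covers the case where both reads report the same base.

```python
def merge_overlap_to_forward(fwd_qual, rev_qual):
    """Toy model (hypothetical helper, not an htslib function).

    Mimics the behaviour described above: the forward read receives the
    summed base quality and the reverse read is set to Q0.
    """
    return fwd_qual + rev_qual, 0


# Made-up (forward, reverse) base qualities for three overlapping pairs.
pairs = [(30, 30), (35, 20), (25, 40)]
adjusted = [merge_overlap_to_forward(f, r) for f, r in pairs]

# Count how many observations still carry a usable (non-zero) quality per strand.
fwd_support = sum(1 for f, _ in adjusted if f > 0)
rev_support = sum(1 for _, r in adjusted if r > 0)
print(fwd_support, rev_support)  # prints: 3 0
```

In this toy setting every overlapping base ends up supporting the forward strand only, which is the bias being reported.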

If I have understood this behavior correctly, is there any plan to assign the combined quality to a randomly chosen read of the pair instead, or is there a workaround within mpileup? I haven't found any discussion of this issue, or of a way to make the choice random, which I assume would not result in a strand bias.

Currently, we run mpileup with -x and use a downstream script that retains the read with the higher-quality base. This works well and does not introduce the bias. However, it would be better if the read-pair overlap handling were done within mpileup, so that we could use the metrics from the INFO column.

pd3 commented 5 years ago

Could you please check whether this tentative fix in htslib helps and behaves well with your data? https://github.com/pd3/htslib/commit/e8ba1364aa93d90f387f6e6a99d402960570ee49

NagaComBio commented 5 years ago

Thank you for the tentative fix, Petr. I have tested this commit on 30 exomes, but the pseudo-random fix hasn't removed the false bias we see in somatic variants.

The pseudo-random version in our downstream script uses the following algorithm: if the two bases differ, retain the read with the higher-quality base; if they are the same, retain the first read according to the coordinate sorting.
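The rule above can be sketched roughly as follows. This is not the actual downstream script: the `OverlapBase` record and `pick_read` function are hypothetical names, and the tie-break for differing bases with equal quality (falling back to coordinate order) is an assumption, since the description does not cover that case.

```python
from typing import NamedTuple


class OverlapBase(NamedTuple):
    """One read's observation at an overlapping position (hypothetical record)."""
    read_name: str
    start: int        # leftmost aligned position, used for coordinate order
    is_reverse: bool
    base: str
    qual: int


def pick_read(a: OverlapBase, b: OverlapBase) -> OverlapBase:
    """Return the single observation to keep for an overlapping read pair."""
    if a.base != b.base and a.qual != b.qual:
        # Bases disagree: keep the read with the higher base quality.
        return a if a.qual > b.qual else b
    # Bases agree (or, by assumption, qualities tie): keep the read that
    # comes first in coordinate order.
    return a if a.start <= b.start else b


fwd = OverlapBase("pair1", 100, False, "A", 30)
rev = OverlapBase("pair1", 150, True, "C", 37)
print(pick_read(fwd, rev).is_reverse)  # prints: True
```

The discarded read would then be ignored at that position, so each fragment contributes one observation.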

pd3 commented 5 years ago

That's the same algorithm, only in pd3/htslib@e8ba136 we alternate between the reads. Any chance you could make a slice of your bam available for testing to see what your reads look like?
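One possible reading of "alternate between the reads" is a simple toggle that flips after each overlapping pair. The sketch below is only an illustration of that idea, not the code in the commit; `alternate_keep` is a hypothetical name, and it assumes the pairs have already been reduced to (first, second) observations whose bases agree.

```python
def alternate_keep(pairs):
    """Keep the first read of one pair, the second read of the next, and so on.

    Hypothetical illustration only: `pairs` is an iterable of
    (first_read, second_read) observations with agreeing bases.
    """
    keep_first = True
    kept = []
    for first, second in pairs:
        kept.append(first if keep_first else second)
        keep_first = not keep_first  # flip the choice for the next pair
    return kept


print(alternate_keep([("f1", "r1"), ("f2", "r2"), ("f3", "r3"), ("f4", "r4")]))
# prints: ['f1', 'r2', 'f3', 'r4']
```

Under this scheme roughly half of the retained observations come from each read of the pair, provided the pairs are processed in sequence.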

NagaComBio commented 5 years ago

Yes, it is almost the same algorithm. We use pysam in the bias analysis script, and I think we should have a patch similar to the one above. I am working on it and will update here soon.