samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

Significant difference between IDV and AD when calling certain RNA-seq indels with mpileup #2277

Open luyh-xp opened 2 months ago

luyh-xp commented 2 months ago

I'm using mpileup to compute the mutation frequencies of a list of known variant sites in RNA-seq data. For a small number of indels, the AD numbers in the output are much lower than what I'm seeing in IGV. For instance, when I run the following command:

bcftools mpileup -r 11:69587265 -f <GRCm38 fasta> --annotate FORMAT/AD,FORMAT/DP,INFO/AD -F 0.001 --max-depth 10000 --max-idepth 10000 -Q20 -x -A --no-BAQ --tandem-qual 10000 <RNA-seq bam> | grep -v "^#"

Here is the output I'm getting:

11 69587265 . CA CAA 0 . INDEL;IDV=77;IMF=0.616;DP=124;AD=47,2;I16=24,23,1,1,1880,75200,80,3200,872,17236,40,800,1057,25169,46,1066;QS=0.908467,0.0915332;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 0,106,55:49:47,2
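To make the discrepancy in the record above explicit, here is a minimal Python sketch that splits an INFO string into key/value pairs and compares IDV with the ALT allele depth. The helper name `parse_info` is made up for illustration and is not part of bcftools; the INFO string is a shortened copy of the one in the output above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys (e.g. INDEL) map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


# Shortened INFO field taken from the mpileup record quoted above.
info = parse_info("INDEL;IDV=77;IMF=0.616;DP=124;AD=47,2")

idv = int(info["IDV"])                                   # raw count of reads supporting the indel
ref_ad, alt_ad = (int(x) for x in info["AD"].split(","))  # per-allele depths: REF, ALT

print(f"IDV={idv} REF_AD={ref_ad} ALT_AD={alt_ad} unaccounted={idv - alt_ad}")
```

For this record, 75 of the 77 indel-supporting reads counted by IDV do not appear in the ALT AD.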

And here is how this site looks in IGV: (IGV screenshot attached)

I am aware that some low-quality reads are filtered during variant calling, so the reported AD will be lower than the raw IDV. However, upon manual examination, most of the reads seem to be high-quality and correctly aligned. A drop from 77 to 2 seems counter-intuitive to me, and I wonder whether this is the expected behavior of mpileup.

(One thing I noticed is that all but 2 of the read pairs containing the insertion are spliced, and 2 happens to be the same as the ALT AD reported.)
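The observation about spliced reads can be checked directly from a SAM file. The following is a rough Python sketch, not bcftools code: it counts reads whose CIGAR contains an insertion (`I`) and splits them by whether the CIGAR also contains a reference skip (`N`). The function names and the example records are hypothetical. Note that it counts individual reads rather than read pairs, and it counts an insertion anywhere in the read rather than at one specific position.

```python
import re

# One CIGAR element: a length followed by a single operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_ops(cigar):
    """Return the CIGAR string as a list of (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def tally_insertion_reads(sam_lines):
    """Count insertion-bearing reads, split by presence of a ref skip (N)."""
    counts = {"spliced": 0, "unspliced": 0}
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:               # CIGAR is the 6th SAM column
            continue
        ops = {op for _, op in cigar_ops(fields[5])}
        if "I" not in ops:
            continue
        counts["spliced" if "N" in ops else "unspliced"] += 1
    return counts


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\t11\t100\t60\t20M1I29M\t*\t0\t0\t*\t*",
    "r2\t0\t11\t100\t60\t20M1I10M500N19M\t*\t0\t0\t*\t*",
    "r3\t0\t11\t100\t60\t50M\t*\t0\t0\t*\t*",
]
print(tally_insertion_reads(example))
```

Passing an open file handle for a real SAM file instead of `example` should work the same way, since the function only iterates over lines.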

jkbonfield commented 2 months ago

What version of bcftools is this? The latest one has a revised indel caller with --indels-cns which may do a better job.

luyh-xp commented 2 months ago

@jkbonfield Thank you for the suggestion. I'm using the latest bcftools 1.20. I tried rerunning the command with --indels-cns but unfortunately it doesn't resolve this issue. Somehow the REF AD number is also off this time:

11 69587265 . CA CAA 0 . INDEL;IDV=2;IMF=0.016129;DP=124;AD=0,2;I16=0,0,1,1,0,0,80,3200,0,0,40,800,0,0,46,1066;QS=0,1;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 40,6,0:2:0,2

jkbonfield commented 2 months ago

Is there test data available?

Also sorry I don't use IGV and don't understand the display. What are the shaded boxes showing? Why don't all alignments have shading? Is it soft-clipping for example? (In which case the reads are tiny.) Seeing the SAM file would be helpful.

luyh-xp commented 2 months ago

Attachment: Indel_AD_GRCm38.sam.txt

@jkbonfield I have created an example SAM file containing only reads overlapping this exon. I can confirm that the same command as above produces the same result, i.e. IDV=77 but ALT AD=2. Thanks a lot for looking into this!

jkbonfield commented 2 months ago

So the --indels-cns code explicitly filters out reads with reference skips in the CIGAR string.

https://github.com/samtools/bcftools/blob/develop/bam2bcf_edlib.c#L1567-L1573

Commenting out those lines gives me this:

11  69587265    .   CA  CAA 0   .   INDEL;IDV=77;IMF=0.631148;DP=122;AD=43,77;I16=21,22,38,39,1720,68800,3080,123200,826,16418,1540,30800,969,23179,1824,44106;QS=0.349662,0.650338;VDB=0.0434579;SGB=-0.693147;RPBZ=-0.826049;MQBZ=1.83725;MQSBZ=0;BQBZ=0;SCBZ=-1.9937;MQ0F=0  PL:DP:AD    139,0,96:120:43,77

Putting it through bcftools call gives me QUAL of 106 and GT 0/1, matching the most likely PL information from above.
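As a small illustration of "most likely PL": for a diploid sample at a biallelic site, the VCF convention orders the three PL values as 0/0, 0/1, 1/1, and the smallest value marks the most likely genotype. The Python sketch below only reads off that minimum. It is not how bcftools call works internally (the caller also applies priors and computes QUAL), and the function name is made up.

```python
def most_likely_genotype(pl):
    """Return the genotype with the smallest PL for a diploid, biallelic site."""
    genotypes = ["0/0", "0/1", "1/1"]   # VCF ordering of PL values for two alleles
    if len(pl) != len(genotypes):
        raise ValueError("expected exactly three PL values")
    return genotypes[pl.index(min(pl))]


# PL values copied from the three mpileup records in this thread.
print(most_likely_genotype([139, 0, 96]))   # with the ref-skip filter commented out
print(most_likely_genotype([0, 106, 55]))   # original default output
print(most_likely_genotype([40, 6, 0]))     # original --indels-cns output
```

On the PL values from the record with the filter removed, this picks 0/1, consistent with the GT reported by bcftools call.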

My question to myself now is why I ever added those lines to filter out ref skips?

jkbonfield commented 2 months ago

In answer to my own question - it wasn't me who added those. They were duplicated (by request) from bam2bcf_indel.c, which itself was moved from Samtools.

They appeared here in 2011 with no real explanation other than to fix a bug. What bug I have no idea.

The same code exists in bam2bcf_indel.c, which is what you get if you don't use --indels-cns.

It's clearly code that is long overdue for reevaluation. It's quite possible that whatever broke bcftools calling when reads had ref-skips in them is no longer an issue. Removing the filter certainly seemed to give us the correct answer here anyway.

luyh-xp commented 2 months ago

Thank you so much for figuring this out! I agree that reads with Ns in CIGARs shouldn't be simply filtered out like that since they are abundant in RNA-seq libraries.

So for now the solution is to comment out bam2bcf_edlib.c:1567-1573 and bam2bcf_indel.c:853-859 and re-compile bcftools?

jkbonfield commented 2 months ago

That works. Alternatively, check out one of the two PRs above (the second makes it optional, so it is probably the one Petr will go with) and you won't need to edit the code yourself, although the changes are obviously minimal.

You also only need to change one of the two files. bam2bcf_edlib.c is for the --indels-cns mode, while bam2bcf_indel.c is the original and default mpileup code.