Open luyh-xp opened 2 months ago
What version of bcftools is this? The latest one has a revised indel caller with --indels-cns
which may do a better job.
@jkbonfield Thank you for the suggestion. I'm using the latest bcftools 1.20. I tried rerunning the command with --indels-cns but unfortunately it doesn't resolve this issue. Somehow the REF AD number is also off this time:
11 69587265 . CA CAA 0 . INDEL;IDV=2;IMF=0.016129;DP=124;AD=0,2;I16=0,0,1,1,0,0,80,3200,0,0,40,800,0,0,46,1066;QS=0,1;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 40,6,0:2:0,2
Is there test data available?
Also sorry I don't use IGV and don't understand the display. What are the shaded boxes showing? Why don't all alignments have shading? Is it soft-clipping for example? (In which case the reads are tiny.) Seeing the SAM file would be helpful.
Indel_AD_GRCm38.sam.txt @jkbonfield I have created an example sam file containing only reads overlapping with this exon. Can confirm that same command as above produces the same result i.e. IDV=77 but ALT AD = 2. Thanks a lot for looking into this!
So the --indels-cns
code has an explicit bit of code to filter out reads with skips in the CIGAR string.
https://github.com/samtools/bcftools/blob/develop/bam2bcf_edlib.c#L1567-L1573
Commenting out those lines gives me this:
11 69587265 . CA CAA 0 . INDEL;IDV=77;IMF=0.631148;DP=122;AD=43,77;I16=21,22,38,39,1720,68800,3080,123200,826,16418,1540,30800,969,23179,1824,44106;QS=0.349662,0.650338;VDB=0.0434579;SGB=-0.693147;RPBZ=-0.826049;MQBZ=1.83725;MQSBZ=0;BQBZ=0;SCBZ=-1.9937;MQ0F=0 PL:DP:AD 139,0,96:120:43,77
Putting it through bcftools call gives me QUAL of 106 and GT 0/1, matching the most likely PL information from above.
My question to myself now is why I ever added those lines to filter out ref skips?
In answer to my own question - it wasn't me who added those. They were duplicated (by request) from bam2bcf_indel.c, which itself was moved from Samtools.
They appeared here in 2011 with no real explanation other than to fix a bug. What bug I have no idea.
The same code exists in bam2bcf_indel.c, which is what you get if you don't use --indels-cns
.
It's clearly code that is long overdue for reevaluation. It's quite possible whatever it was that broke bcftools calling when reads had ref-skips in them is no longer affected. It certainly seemed to give us the correct answer anyway.
Thank you so much for figuring this out! I agree that reads with Ns in CIGARs shouldn't be simply filtered out like that since they are abundant in RNA-seq libraries.
So for now the solution is to comment out bam2bcf_edlib.c:1567-1573 and bam2bcf_indel.c:853-859 and re-compile bcftools?
That works, or check out one of the two PRs above (the second makes it optional so is probably the one Petr will go with) and you don't need to edit the code, but the changes are minimal obviously.
You only need to change one bit too. bam2bcf_edlib.c is for the --indels-cns
mode while bam2bcf_indel.c is the original and default mpileup code.
I'm using mpileup to compute the mutation frequencies of a list of known variant sites in RNA-seq. For a small number of indels the AD numbers in the output are much lower than what I'm seeing in IGV. For instance, when I run the following command:
bcftools mpileup -r 11:69587265 -f <GRCm38 fasta> --annotate FORMAT/AD,FORMAT/DP,INFO/AD -F 0.001 --max-depth 10000 --max-idepth 10000 -Q20 -x -A --no-BAQ --tandem-qual 10000 <RNA-seq bam> | grep -v "^#"
Here is the output I'm getting:
11 69587265 . CA CAA 0 . INDEL;IDV=77;IMF=0.616;DP=124;AD=47,2;I16=24,23,1,1,1880,75200,80,3200,872,17236,40,800,1057,25169,46,1066;QS=0.908467,0.0915332;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 0,106,55:49:47,2
And here is how this site looks like in IGV:
I am aware that some low quality reads are filtered during variant calling and the reported AD number will be lower than the raw IDV. However upon manual examination most of the reads seem to be high-quality and correctly aligned. Such a huge drop from 77 to 2 seems counter-intuitive to me and I wonder if this is the expected behavior of mpileup.
(One thing I notice is that most of the read pairs containing the insertion are spliced except 2, which happens to be the same as the AD reported)