samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

Apparent de novo deletion is not recognized as such with --indels-cns #2113

Closed pd3 closed 6 months ago

pd3 commented 7 months ago

Apparent de novo deletion is not recognized as such with mpileup --indels-cns because alternate AD counts in parents are thought to be bigger than zero.

In the test case, the top sample is the proband, followed by the parents, and their AD counts are reported to be

chr:101  ATTAT,A    23,16   26,2    32,3

[pileup screenshot omitted]

The test data is part of https://github.com/pd3/mpileup-tests/ and the commands to reproduce the problem are

cd mpileup-tests
./run-tests.pl -r trio-ad-counts.bam

or, manually,

cd mpileup-tests/dat
bcftools mpileup --indels-cns -a AD -t chr:101 -f indels/trio-ad-counts.fa indels/trio-ad-counts.bam | bcftools call -m | bcftools view -i 'N_ALT>0' | bcftools norm -f indels/trio-ad-counts.fa | bcftools query -f '%CHROM:%POS %REF %ALT [ %AD]\n'
pd3 commented 7 months ago

Another test case ([pileup screenshot omitted]):

cd mpileup-tests
./run-tests.pl -r trio-ad-counts.2.bam

or

cd mpileup-tests/dat
bcftools mpileup --indels-cns -a AD -t chr:101 -f indels/trio-ad-counts.2.fa indels/trio-ad-counts.2.bam | bcftools call -m | bcftools view -i 'N_ALT>0' | bcftools norm -f indels/trio-ad-counts.2.fa | bcftools query -f '%CHROM:%POS %REF %ALT [ %AD]\n'
jkbonfield commented 7 months ago

I don't understand what's wrong here (top example; I've yet to look at the second). Can you tell me what output you're expecting? As far as I can see it's working.

$ bcftools mpileup --indels-cns -a AD -t chr:101 -f indels/trio-ad-counts.fa indels/trio-ad-counts.bam 2>/dev/null|bcftools call -vm - 2>/dev/null | bcftools query -f '%CHROM:%POS %REF %ALT [ %GT %AD]\n'
chr:101 ATTATTTATT ATTATT  0/1 23,16 0/0 26,2 0/0 32,3

0/1 and AD 23,16 is correct as far as I can see. Arguably 26,2 and 32,3 should be 26,0 and 32,0, but it's still close enough. indels-2.0 gives me "0/1 24,17 0/0 39,0 0/0 43,0", which is overly optimistic, as some of the reads assigned to REF could also align to ALT because they don't extend into the unique sequence needed to anchor them correctly. Develop gives "0/1 14,16 0/0 28,0 0/0 35,0", which is better, but is missing a few on REF for the first sample.

pd3 commented 7 months ago

> I don't understand what's wrong here (top example, I've yet to look at the second). Can you tell me what output you're expecting? As far as I can see it's working.

I am not sure how else to word the issue to make it clearer. In the parents there are no reads supporting the variant, yet non-zero ALT counts are reported. Unfortunately, "close enough" is not good enough when asking whether a variant is de novo or mosaic.

I also tested --indels-2.0 and the default, but let's not clutter the issue with observations that are not relevant to the problem at hand.

jkbonfield commented 7 months ago

I think it's unreasonable to require the AD fields to be perfect, and indels-cns is considerably closer to the optimal scores than either develop or indels-2.0. I'd argue that whatever tool is assuming them to be perfect is in error.

Having looked at your second example, this is definitely true, as there are in fact alignments that support the deletion in the parents. I've no doubt those alignments come down to sequencing error or replication problems in the library, as slight stutter is common in STRs. However, the consensus call for the parents is a very strong 0/0 with a high PL score. Is that not enough to determine it as de novo? Otherwise it feels like we're using double standards in our evaluation.

Agreed it would be nice for the first example to not call the small number of ALT alignments, but fundamentally this is a problem of downstream evaluation.

jkbonfield commented 7 months ago

An example of the 1st vs 3rd sample in the 2nd dataset:

[pileup screenshot omitted]

There is clearly at least 1 alignment in the bottom sample which aligns to the ALT of the 1st sample better than it does to REF.

Similarly, 1st vs 2nd sample in dataset 2:

[pileup screenshot omitted]

It's impossible to say if these low level variations in the parent sample are just sequencing errors or real mosaic variations. I think labelling these as erroneous would be wrong and we'd cause many other real variants to be missed.

That said, I do agree the numbers of ALT seem a bit high (especially the first example) and it needs investigating, but I still think this is a problem of downstream analysis being flawed.

Edit: as before, please don't rely on a crude pictorial representation from IGV for fine analysis of what's going on. It simply obscures too much of the real issue and there's no replacement for getting in there and tinkering with alignments manually to do a human assessment.

jkbonfield commented 7 months ago

I enabled the debugging and renamed the reads so I could distinguish them (I was highly confused for a while until I noticed the same read names occur in 3 different read-groups), and I see the assignments with the wrong allele are basically 0 quality because they align to multiple alleles. I.e.:

FIN read0-rg1   1       -4      0
FIN read1-rg1   1       -4      0
FIN read2-rg1   1       -4      0
FIN read3-rg1   1       -4      0
FIN read4-rg1   1       -4      0
FIN read6-rg1   0       0       100
FIN read3-rg1   0       0       100
FIN read7-rg1   0       0       100
FIN read8-rg1   0       0       100
...
FIN read0-rg2   1       -4      0
FIN read1-rg2   1       -4      0
FIN read3-rg2   1       -4      0
FIN read6-rg2   0       0       60
FIN read7-rg2   0       0       100
FIN read8-rg2   0       0       100
FIN read9-rg2   0       0       100
FIN read11-rg2  0       0       100

That was the first data set.

Those reports come from this code:

#ifdef ALIGN_DEBUG                                                              
            fprintf(stderr, "FIN %s\t%d\t%d\t%d\n",                             
                    bam_get_qname(p->b), (p->aux>>16)&0x3f,                     
                    bca->indel_types[(p->aux>>16)&0x3f], p->aux&0xff);          
#endif                                                                          

I'm unsure if there is a way to filter the AD assignment to only include figures where p->aux&0xff is above a minimum score (SeqQ? IndelQ? I forget the bit-packing atm and will check tomorrow). I can't quite figure this out as it's saying rg1=3 alt and rg2=2 alt, whereas I see 5 and 3 respectively at qual 0. I thought we rejected those as "cannot assign", but maybe we're assigning equally to both?

It seems the logical option though. If we are placing absolute trust in our AD values, then we need a minimum score for AD assignment so we don't erroneously allocate alignments when they're marginal. I thought --ambig-reads drop was meant to discard these, but it clearly needs further debugging.

Anyway, it appears that fundamentally it's on the correct track, but with a fly in the ointment somewhere. Will look tomorrow.

pd3 commented 7 months ago

> this is a problem of downstream analysis being flawed

Thank you for your opinion. I am well aware of what genotype was called, what the PL values are, of the ambiguity of indel reads, etc. Nonetheless, this issue was opened to record the AD counting inaccuracy. It feels a bit presumptuous to assume the analysis, which you know nothing about, is flawed.

> don't rely on a crude pictorial representation from IGV for fine analysis of what's going on

I am well aware. Unfortunately, it is still the best tool out there for practical use.

Back to the topic: there is the --ambig-reads option, which is supposed to control AD counting for reads with ambiguous alignment. It is unfortunate that the ALT counts in the parents are reported to be unexpectedly high, given that ambiguous reads were supposed not to be included in the AD counts.

jkbonfield commented 7 months ago

I can see why it's miscounting things. The ADF/ADR assignment is strictly on base quality, using min_baseQ as the deciding point. For indels, base quality is sometimes p->aux&0xff and sometimes bam_get_qual(p->b)[p->qpos]. It was always this way too, so this isn't a new invention.

I think when p->aux&0xff is zero we shouldn't be reassigning the quality to increase it, but I'm unsure whether that breaks accuracy or has other consequences, so I'd need to evaluate the impact. If it does, we can just take an earlier copy and use that in the AD assignment later on instead of q.

Minimally though, this fixes test case 1. It does not fix test case 2, which nothing will fix as the results are correct to have non-zero ALT AD.

diff --git a/bam2bcf.c b/bam2bcf.c
index 85ae34ba..561b1d24 100644
--- a/bam2bcf.c
+++ b/bam2bcf.c
@@ -314,7 +314,7 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t
             if (bca->edlib) {
                 if (indel_in_sample) {
                     seqQ = q = (p->aux & 0xff); // mp2 + builtin indel-bias
-                } else {
+                } else if (p->aux & 0xff) {
                     // An indel in another sample, but not this.  So just use
                     // basic sequence confidences.
                     q = bam_get_qual(p->b)[p->qpos];