pd3 / mpileup-tests

Test cases for mpileup variant calling
MIT License

Findings from analysing my new bcftools mpileup caller #3

Open jkbonfield opened 6 months ago

jkbonfield commented 6 months ago

Hello,

In going through all the tests here I've found numerous mistakes in this test corpus. Many "expected" values show only one allele of a two-allele variant. This is mostly true for the CHM1/CHM13 calls, where we already have the SynDip truth set, which is (usually) correct. It feels like the "truth" here was simply copied from whatever one specific version of bcftools once produced, rather than being curated. As such I think it's not particularly useful, as any improvement in accuracy shows up as a higher failure rate in this corpus. I did, however, find one SynDip case where I disagreed with the published truth set.

I spent several days manually going through this lot. In case it's useful to you, I may as well share the findings. Sorry it's quite long, but I've tried to be as detailed as possible in analysing each failed test. The ones absent from this list are ones where I mostly agree (bar differing AD numbers, which is a different problem with the tests).

CHM1_CHM13_2.45x-1-3306513.fa
Truth set:
3306513 .       GCCCC   G,GCC   1/2
3306524 .       C       A   0/1

Data:
GCCCC    1 (ref)   called 8
G**CC    9     called 8
G***C    5     called 12
G****    1
?       28+

SNP correlation confirms indel.
Bulk of     C->A changes are G**CC (9/10)
Bulk of non C->A are         G***C (5/8 of spanning)

Number confirming REF is tiny.

REAL MISS: we call GCCC G 0/1.

Incorrect assignment of reads to REF when they're not spanning the
poly-C region.

8 x -2 winner
HK3T7CCXX160124:3:2109:3853:48881  scores -2/0   -3/28  +0/84      56      60
HK2WYCCXX160124:8:1201:17401:61134 scores -2/39  -3/52  +0/78      25      60
HK2WYCCXX160124:7:1108:18385:7497  scores -2/14  -3/21  +0/35      14      60
HK3MJCCXX160124:6:2119:16204:11101 scores -2/23  -3/46  +0/92      45      60
HK35WCCXX160124:1:2106:8278:72403  scores -2/56  -3/63  +0/77      13      60
HK35WCCXX160124:8:2212:2118:71295  scores -2/56  -3/63  +0/77      13      60
HK3T7CCXX160124:4:1202:4117:42903  scores -2/56  -3/63  +0/77      13      60
HK2WYCCXX160124:5:1122:17858:38825 scores -2/42  -3/49  +0/63      13      60

6 x -3 winner
HK3MJCCXX160124:8:2104:8785:27468  scores -3/156 -2/182 +0/234     46      80
HK2WYCCXX160124:8:2107:24160:49777 scores -3/21  -2/28  +0/35      13      80
HK2WYCCXX160124:4:1224:28199:15144 scores -3/84  -2/91  +0/105     13      80
HK35WCCXX160124:1:2103:18659:43343 scores -3/209 -2/228 +0/266     32      80
HK3T7CCXX160124:4:2217:14428:69766 scores -3/40  -2/60  +0/100     38      80
HK2WYCCXX160124:8:2103:18974:40970 scores -3/49  -2/56  +0/70      13      80

68 x 0 winner <--- problem cases
HK3MJCCXX160124:6:1103:25976:8974  scores +0/100 -3/125 -2/150     45      80
HK35WCCXX160124:1:1107:18314:17395 scores +0/116 -3/145 -2/145     52      80
HK2WYCCXX160124:7:1106:19908:23952 scores +0/27  -2/81  -3/108     60      60
HK3MJCCXX160124:8:1210:22912:17026 scores +0/7   -3/14  -2/21      14      80
HK35WCCXX160124:7:2121:22851:64773 scores +0/36  -3/90  -2/90      80      80
HK35WCCXX160124:5:2206:22556:16164 scores +0/24  -3/60  -2/60      66      80
HK3T7CCXX160124:3:2120:11414:5018  scores +0/19  -3/38  -2/57      36      80
HK3T7CCXX160124:3:1119:13839:20155 scores +0/14  -3/28  -2/28      26      80

Example case of HK3MJCCXX160124:6:1103:25976:8974:

HK3MJCCXX160124:6:1103:25976:8974
GGGGACTTTGACTTGCTGCCTCGTCCGAGCATGAGCGGAAGGCACAGAGGGGCCCTCGGGAATCTCCATGCACTTTTCTG
GGGGACTTTGACTTGCTGCCTCGTCCGAGCATGAGCGGAAGGCACAGAGGGGCCCTCGGGAAGCTCCATGCACTTCTCTG
                                                              x            x    
GGGGTGCGCCCC-CCCCCCCCCAC
GGGGTCCGCCCCACCCCCCCCCCC
     x      +         x 
type -3 7d / 1e HK3MJCCXX160124:6:1103:25976:8974

GGGGACTTTGACTTGCTGCCTCGTCCGAGCATGAGCGGAAGGCACAGAGGGGCCCTCGGGAATCTCCATGCACTTTTCTG
GGGGACTTTGACTTGCTGCCTCGTCCGAGCATGAGCGGAAGGCACAGAGGGGCCCTCGGGAAGCTCCATGCACTTCTCTG
                                                              x            x    
GGGGTGCGCCCC-CCCCCCCCCCC
GGGGTCCGCCCCACCCCCCCCCCC
     x      +           
type 0 64 / 18  HK3MJCCXX160124:6:1103:25976:8974

REF (type 0) scores better than type -3 or -2, so we assign the read as REF.
However that's just luck and the read doesn't really span the poly-C.  We
could use the STR locations previously identified to heavily reduce seqQ for
such reads, as they're less informative.

We already attempt this in the "iscore" calculation, but it could be
improved to consider proximity to the read end instead of a generalised
STR-within-region measure.

-----------------------------------------------------------------------------

CHM1-CHM13-2.45x/CHM1_CHM13_2.45x-1-4046152.fa
Truth set:
4046152 .       CATATATATATATATATATATATATATATAT C,CATATATATATATATAT 1/2

Data:
CATATATATATATATATATATATATATATAT 0
CATATATATATATATAT       1
C               3
Don't span          13

REAL MISS: an issue of reads not spanning the poly-AT?
Also TEST IS WRONG, but the data simply cannot be called reliably anyway.

Can get the answer with mpileup -m1 | call -P0.1, but it'll have a
huge FP rate elsewhere.
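
Something along these lines is what I mean (file names are placeholders, and
the other options just mirror the invisible-left-clips command further down):

    # Allow indel candidates from a single gapped read (-m1) and use a looser
    # prior in the caller (-P0.1).  Illustrative file names only.
    bcftools mpileup --indels-cns -a AD -m1 -f ref.fa reads.bam |
        bcftools call -m -P0.1 |
        bcftools view -i 'N_ALT>0'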

I'm not sure what this test is attempting to demonstrate.

-----------------------------------------------------------------------------
CHM1-CHM13-2.45x/CHM1_CHM13_2.45x-1-4196536.bam

4196536 .       ATCAACATGGGAGTGGGTTAG   A  0/1
Calls del correctly, but also calls A->G due to contamination?

TEST IS WRONG

-----------------------------------------------------------------------------
4582929 .       CATCTATCTATCT   C,CATCTATCT 1/2
-1 ATCT and -3 ATCT.  Ref match ones don't span.

-1  2
-3  2
Everything else isn't spanning

TEST IS WRONG

-----------------------------------------------------------------------------
5339195 .       TTGTGTG T,TTG  1/2

T        18
TTG  17

Easy call.
TEST IS WRONG

-----------------------------------------------------------------------------
5559788 .       G       GTAGA,GTAGATAGATAGA 1/2   Wrong?

TAGATAGA      8 (we call 9)
TAGA         14 (we call 11)
REF       0 (we call 9)

REAL MISS: we call 0/1 instead of 1/2.
Indels-2.0 gets this one correct, albeit with incorrect AD.

In a TAGA repeat, I see evidence of +1 and +2 copies, but not the +3
quoted in the SynDip truth file.

TEST IS WRONG: it claims the truth is +2 copies when it's a mixture.

-----------------------------------------------------------------------------
5576604 .       AACACAC A,AACAC 1/2

A   22 (call 18)
AACAC   7  (call 9)

TEST IS WRONG

-----------------------------------------------------------------------------
5833384 .       CAAATAAAT       C,CAAAT 1/2
C       10   call 10
CAAAT       13   call 13

We call this correctly, and it's trivial.
TEST IS WRONG

-----------------------------------------------------------------------------
6087279 .       AGTGTGTGTGTGTGTGT       A,AGTGTGTGTGTGT 1/2

AGTGTGTGTGTGTGTGT   0
AGTGTGTGTGTGT       7, call 24
A           7, call 9

We call too many in AD as most don't span, but GT is correct.

TEST IS WRONG, one allele only

-----------------------------------------------------------------------------
6793250 .       T       TCACACACACACA,TCACACACACACACACA 1/2

TCACACACACACA       14, call 20
TCACACACACACACACA   14, call 13

TEST IS WRONG, one allele only

-----------------------------------------------------------------------------
7602898 .       AGTGTGTGTGTGT   A,AGTTTGTGTGTGTGTGTGTGTGT 1/2

AGTGTGTGTGTGT
A               27, call 22
AGTTTGTGTGTGTGTGTGTGTGT     13, call 9

Needs initial "TTT" motif to separate well for AD assignment.

TEST IS WRONG, one allele only

-----------------------------------------------------------------------------
7951596 .       A       AAAAGAAAG,AAAAGAAAGAAAG 1/2   wrong
AAAAG       8, call 7
AAAAGAAAG   8, call 8
AAAAGAAAGAAAG   0

TEST IS WRONG, but so is truth set?

-----------------------------------------------------------------------------
8018929 .       TG      T,TGGG 1/2

        manual  indels-cns  indels-2.0   dev
T       13      15           4           12
TG       0       0          21           11
TGGG     5       6           3            5

Correct, we call both.  Indels 2.0 and dev both majorly overcall ref
as all the unassigned reads that don't verify either candidate allele
get included into REF instead of being nullified.

TEST IS WRONG: it lists the +2G only and misses -1G del.

-----------------------------------------------------------------------------
8302740 .       CTTTATTTATTTA   C,CTTTATTTA 1/2

TEST IS WRONG, one allele only

-----------------------------------------------------------------------------
8315433 .       C       CT,CTTCT 1/2
8315519 .       T       TCTCTCTCC 1/1

Data looks like
+ TTCT
+TTTCT

REAL MISS: we call C CTTTCT 1/1 ??
But could be +T and a correlated change to the TCTC repeat later...

NB at --indel-size 30 we call C CTTTCT,CTTCT.  This does match what the data
looks like, but there are multiple possible representations given the later
CT repeat too.

TEST IS WRONG, one allele only

-----------------------------------------------------------------------------
8558792 .       T       TACACAC,TATACACACACAC 1/2

REAL MISS: we call T TACACAC (rescued with --indel-size 40)
TEST IS WRONG: it calls the other allele TATACACACACAC

Lots of short reads that don't span.  Nasty one.  Needs good attention
to clipping and STR boundaries to avoid miscalling.

T
T------ACACAC  14 span, +4 without the T
TATACACACACAC  4 span, +9 more short but with T

-----------------------------------------------------------------------------
9936903 .       TAC     T,TACACACACACACAC 1/2

Everything calls correct (dev, indels-2.0, indels-cns).

TEST IS WRONG, one allele only

TACACACACACACAC    11, call 13
T          17, call 17

-----------------------------------------------------------------------------
Complex:
821325  .       ACAGT   A 0/1

Contains a large amount of contamination, with 4 alleles.
We also call extra variants from that contamination.

(End of SynDip truth set)
=============================================================================

indels/mosaic1.[123].bam
1.1.bam has a het C CA insertion (called by mpileup, not call)
1.2.bam has CA C (called by mpileup, not call)
1.3.bam has CA C (called by mpileup, not call)

As a mosaic, we don't call CA C,CAA

Data is almost non-existent.  I don't understand.
Eg 1.3 has just 1 read with the insertion.

-----------------------------------------------------------------------------

indels/noisy-softclips.bam
Evidence for needing sorting first:
        expected: chr1:74 CAAAAAA C,CAAAA 8.52199 1/2:26,2,5
        found:    chr1:74 CAAAAAA CAAAA,C 11.1474  0/0:8,23,3

-----------------------------------------------------------------------------
bcftools mpileup --indels-cns -a AD -t chr4:85 \
    -f indels/invisible-left-clips.fa indels/invisible-left-clips.bam |
  bcftools call -m |
  bcftools view -i 'N_ALT>0' |
  bcftools norm -f indels/invisible-left-clips.fa |
  bcftools query -f '%CHROM:%POS %REF %ALT [ %GT:%AD]\n'

No call.
Should be CAAA C,CA

Implies 2 or 3 "A"s removed.  I see no evidence for this in the data.

My counts
CAAA 2
CAA  3
CA   2
C    2

It just looks like the usual variability in poly-A with very low counts of
any call type.  Indels-2.0 calls with QUAL 103, which seems optimistic.  Dev
and indels-cns don't call.

call -P0.1 makes dev call this (QUAL 31), but not indels-cns.
(It does at -P 0.9, for Q9)

I think no-call is correct here unless we have other data types
demonstrating the real call.

-----------------------------------------------------------------------------
invisible-left-clips.2.bam

Similarly nigh-on impossible to call. I don't know the origin of this data,
so I don't know what the true value is.

The claim is GACA G, which I find no evidence of.

-----------------------------------------------------------------------------
indels/shouldnt-be-called.bam

A total mess.  Lots of it shouldn't be called, but the one in the test file
isn't called with indels-cns.

-----------------------------------------------------------------------------
indels/homopolymer-seqQ.png

Claimed as shouldn't be called. Not called.

Probably a false call too: deep data and only a low amount of poly-A variability.

-----------------------------------------------------------------------------
indels/trio-puzzling.png

Deep data as it's a trio.

Doesn't look that puzzling!
Probable misalignment / contamination.  But if we take the deep data
as truth:

RG0: A AGAGT 0/1
RG1: A AGAGT 0/1
RG2: A AGAGT 0/1

So no difference between trio members.  Probably all 3 have the same
systematic error.

Raw mpileup (minus call) generates 3 PL:ADs of:

5,0,100:57,18   0,36,139:55,12  0,169,172:93,8

So it thinks sample1 is a mix and samples 2/3 are more likely REF.
Looking at depths I see:

       REF    ALT   ALT%
rg0    80     27    25
rg1    83     15    15
rg2   136      9     6

AD values are a bit out, eg 80/27 vs 57/18 or 136/9 vs 93/8.  However the
evidence is still the same: rg0 has a much higher proportion of ALT than rg1
or rg2, so it seems logical that rg0 is called and rg1/2 may not be,
depending on options.

indels-2.0 is adamant and dead certain on PL values, while indels-cns
is more ambivalent about rg0/1 and only certain (REF) on rg2.

Do we know the actual truth here anyway?  My guess is all 3 are false,
rather than all 3 being true.

-----------------------------------------------------------------------------

indels/sars-cov-depth.bam

HOM DEL, depth 32.

All reads end close to the deletion, so may trigger suspicion in
callers.  4 start very close to deletion too and/or have big
soft-clips.  Still, I'd expect AD of 0,28 at least.

Cured with mpileup -A, which gives AD 0,29 (the reads lack the PROPER_PAIR
flag).  Also should use "--ns UNMAP,SECONDARY,QCFAIL" for this data.
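
i.e. something along these lines (the reference path is illustrative; the
flags are the ones mentioned above):

    # -A counts orphan reads (those lacking PROPER_PAIR); --ns drops only the
    # genuinely unusable alignments.  Reference path is a placeholder.
    bcftools mpileup --indels-cns -a AD -A --ns UNMAP,SECONDARY,QCFAIL \
        -f ref.fa indels/sars-cov-depth.bam | bcftools call -m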

-----------------------------------------------------------------------------

pacbio/pacbio-puzzling.bam

Ref  43 (called 30)
+1C  11 (called 13)

+1C all have low conf 1-2bp further on, unlike ref. => FP
Shouldn't call, but we give it qual 13 so it's marginal anyway.

Could consider a lower "-o" param.
Can also consider more aggressive seqQ reduction in --poly-mqual.  Eg:

    seqQ   += MIN(qavg/20,  min_q - qavg/5); // was qavg/10

Both need to be considered in view of FP vs FN however.

The other potential fix is correlation of minimum run-quality between
alleles.  Eg allele 1 has average min_q of 10 while allele 2 has
average min_q of 50 => probable error in separating into two alleles.

jkbonfield commented 6 months ago

I'll also add that after several days of analysing this, I still believe fundamentally this isn't a useful strategy to take. Hence I'm happy to report my findings to you so you can act on them as you wish, but I am not willing to put time and effort into making PRs because it doesn't fit my own personal testing methodology.

In attempting to fix specific issues found here, I nearly always made the overall performance of my branch worse. Focusing on a few specific examples is therefore detrimental.

It's useful to know where the tool is working and where it isn't, but what works and what doesn't changes every time we modify the software. I've found it much easier to take the following strategy.

  1. Call a known truth set with current bcftools (eg HG002). Compare using bcftools isec to get the correct, false-positive and false-negative sets. You can also split into alleles to handle the GT cases, or compare GT separately, to ensure the alleles are correctly recorded. (A rough sketch of this workflow, with illustrative commands, follows the list.)

  2. Do the same with a modified bcftools.

  3. Then compare specific types of failure from 1 and 2 above. Eg do a bcftools isec on old-FP.vcf and new-FP.vcf to see which false positives we've fixed and which new ones we've now made. Similarly for false negatives.

  4. For any comparison in 3 above, subdivide into type of variant. Eg are insertions more error prone than deletions? Are we improving overall but at the detrimental cost of GT 1/2 calls? Etc. Basically we can profile the changes and get an overall view of what's going on, with statistically significant numbers rather than 1-2 specific tests designed to demonstrate a specific issue. This avoids over-correcting to a small hand-crafted training set.

  5. From any stage above, we can plot false positive vs false negative against a specific QUAL>=$x filter. This permits us to see how the trade-off works, as well as where the baseline starts (QUAL>=0) for maximum sensitivity.

  6. Finally, given most of the published HG002 data is quite deep, use samtools view -s to produce shallower copies of the same data. This allows us to then verify the performance on both deep and shallow data, to see if we've introduced more biases there.
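
As a rough sketch of steps 1-3 and 5-6 (commands and file names are
illustrative only; truth.vcf.gz is the indexed benchmark VCF, old/ and new/
hold the two bcftools builds being compared):

    # 1/2. Call against a known truth set with both builds and intersect with
    #      the truth to split each call set into FN / FP / shared records.
    for build in old new; do
        $build/bcftools mpileup -a AD -f ref.fa HG002.bam |
            $build/bcftools call -m -Oz -o $build.vcf.gz
        bcftools index $build.vcf.gz
        bcftools isec -p isec_$build -Oz truth.vcf.gz $build.vcf.gz
        # isec_$build/0000.vcf.gz = truth only (FN), 0001.vcf.gz = calls only
        # (FP), 0002/0003.vcf.gz = records common to both.
    done

    # 3. Which FPs did the new build fix, and which new ones did it introduce?
    bcftools isec -p fp_diff isec_old/0001.vcf.gz isec_new/0001.vcf.gz

    # 5. FP vs TP as the QUAL>=$x filter is tightened (FN = truth total - TP).
    for q in 0 10 20 30 50; do
        fp=$(bcftools view -H -i "QUAL>=$q" isec_new/0001.vcf.gz | wc -l)
        tp=$(bcftools view -H -i "QUAL>=$q" isec_new/0003.vcf.gz | wc -l)
        echo "QUAL>=$q FP=$fp TP=$tp"
    done

    # 6. Shallower copy of the same data for re-testing at lower depth.
    samtools view -s 0.25 -b -o HG002.25pct.bam HG002.bam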

I've spent weeks, probably months overall, tinkering with bcftools and I've found the above to be vastly more useful than attempting to maintain a curated set of interesting cases. Indeed I can recreate any number of interesting cases simply using the above procedure, with the added benefit of (mostly) knowing what the right answer is.

(What I haven't done is analyse all the HG00* samples to produce a small population for joint analysis.)

pd3 commented 6 months ago

Thank you for looking at this.

It feels like the "truth" here was simply copied from whatever one specific version of bcftools once produced, rather than being curated.

Yes, that is exactly how it was produced, and it should not be a surprise - we discussed it offline, several times. Most of the tests in the test suite were not curated.

TEST IS WRONG

Thank you for curating these tests; I will go through your findings. It would have been much more useful if you had created a pull request. The free-format text will require another person to go through everything again.

I still believe fundamentally this isn't a useful strategy to take

I know it is simpler to focus on a simple metric without considering the impact of the changes on specific indel classes. But I still believe the opposite: both the overall stats and the individual cases are needed.

jkbonfield commented 6 months ago

As I said above, I won't be doing a PR as I don't have the time to turn all my observations into edits. The test framework also needs a substantial overhaul to be useful, given it flags minor things like subtle AD variations as errors. Secondly, I simply disagree with this strategy.

I know it is simpler to focus on a simple metric without considering the impact of the changes on specific indel classes. But I still believe the opposite: both the overall stats and the individual cases are needed.

Please reread what I said above. I'm not saying that I'm only looking at overall impact and not studying specific cases. I'm saying that I produce such test cases on-the-fly using bcftools isec, so there is no need to keep your own set. Place trust in the truth sets and the community. It is less error-prone than us eyeballing things, given they are working with vast amounts of data on a huge plethora of platforms, plus it's simply less time consuming, meaning we can evaluate more things.

pd3 commented 6 months ago

It lacks transparency. It is undocumented. It is more time consuming. It is a good complementary approach, but less reproducible. Any change would require the developer to download your datasets and do ad hoc test selection. The suite is not meant to be eyeballed, etc.

We discussed this many times. I see no point of bringing it up again.

jkbonfield commented 6 months ago

I'm not trying to argue against the relevance of such tests, simply saying that I personally don't find them useful. You do, and that's just fine. Hence I reported the problems I found.

My comment above was simply my explanation for why I didn't make a PR. However, the information you need to edit the file, should you wish, is in my original post, which is why I made it.

pd3 commented 5 months ago

Unfortunately the comments are not that helpful. They require someone to go through everything again and repeat all the work you've already done.

I checked, and many of the cases you highlighted as wrong were not curated yet. They were added to document a case and still need a detailed look to decide what output we want to see. The script run-tests.pl has the option

-c, --curated-only      Run only curated tests, i.e. tests with comments

which allows such tests to be disregarded.
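
For example (adjust to however you normally invoke the script):

    perl run-tests.pl --curated-only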

Some of the comments seem to misunderstand or neglect the existing curating comments. For example, this one:

Doesn't look that puzzling!

seems to be a reaction to the name of the test. However, the curating comment hints at the puzzle being that, with the settings used, it was called in the proband only, even though it is invisible in the parents. Therefore it was falsely marked as a de novo variant in subsequent analysis, which was the problem: either it should be called in all samples (with low quality), or not at all. Note that the test is marked as "FIXME", meaning the test case is intended to highlight a problem that must be addressed in future revisions of the caller.

FIXME: an insertion is called in one sample only and insert reads in other samples are invisible because accompanying mismatches are too rare to make it into the consensus

Thank you for the effort you put into this but, alas, it is not useful as is. I'll keep it open for future reference.

jkbonfield commented 5 months ago

I hadn't realised there were curated and non-curated tests. Maybe this ought to be tagged automatically in the output rather than the test just failing because something differs.

Also, for the non-curated ones that came from a known truth set (such as SynDip), we'd be best off simply trusting that the truth set is correct rather than populating the initial expected variant from the output of an old copy of bcftools, which works against us when evaluating improvements.

jkbonfield commented 5 months ago

Doesn't look that puzzling!

seems to be a reaction to the name of the test. However, the curating comment hints at the puzzle being that, with the settings used, it was called in the proband only, even though it is invisible in the parents. Therefore it was falsely marked as a de novo variant in subsequent analysis, which was the problem: either it should be called in all samples (with low quality), or not at all. Note that the test is marked as "FIXME", meaning the test case is intended to highlight a problem that must be addressed in future revisions of the caller.

I think the issue here is the nature of the data. I see what you're saying, but it's not how I view this problem.

I went through and counted the variant depth. Quoting from above:

       REF    ALT   ALT%
rg0    80     27    25
rg1    83     15    15
rg2   136      9     6

So we see that the combined depth varies a lot, from 98 to 145, and the allele frequency as a percentage is also highly variable. It's easy to see why, naively, it's called in rg0 and not in rg2 (or indeed apparently rg1, but that one is marginal). However this has nothing to do with the trio nature of the samples, so focusing on that is, I feel, possibly barking up the wrong tree.

The assertion that it should be consistent implies that the offspring should just be a sampling from both parents in equal proportion. However that's only true if randomness doesn't skew things either way and, more importantly, if the variation is actually real. I strongly suspect in this case it's down to contamination, and the high-depth rg2 therefore has a lower proportion of contaminant alignments. As such this isn't a puzzling trio, but a test for contamination screening. Trying to fix detection of novel variants is best done by downgrading suspect alignments or through phasing analysis.

However I don't know for sure as I've no idea where this data came from or what the real evidence is. It just has that feel about it.