jkbonfield opened 6 months ago
I'll also add that after several days of analysing this, I still believe fundamentally this isn't a useful strategy to take. Hence I'm happy to report my findings to you so you can act on them as you wish, but I am not willing to put time and effort into making PRs because it doesn't fit my own personal testing methodology.
In attempting to fix the specific issues found here, I nearly always made the overall performance of my branch worse. Focusing on a few specific examples is therefore detrimental.
It's useful to know where the tool is working and where it isn't, but what works and what doesn't changes every time we modify the software. I've found it much easier to take the following strategy:

1. Call a known truth-set sample (e.g. HG002) with the current bcftools. Compare against the truth set using bcftools isec to get the correct, false-positive and false-negative sets. You can also split multi-allelic records into separate alleles to handle the GT cases, or compare GT separately to ensure the alleles are correctly recorded.
2. Do the same with a modified bcftools.
3. Then compare specific types of failure from 1 and 2 above. E.g. do a bcftools isec on old-FP.vcf and new-FP.vcf to see which false positives we've fixed and which new ones we've now introduced. Similarly for false negatives.
4. For any comparison in 3 above, subdivide by type of variant. E.g. are insertions more error prone than deletions? Are we improving overall but at the detrimental cost of GT 1/2 calls? Etc. Basically we can profile the changes and get an overall view of what's going on, with statistically significant numbers rather than 1-2 specific tests designed to demonstrate a specific issue. This avoids over-correcting to a small hand-crafted training set.
5. From any stage above, we can plot false positives vs false negatives against a specific QUAL>=$x filter. This lets us see how the trade-off works, as well as where the baseline starts (QUAL>=0) for maximum sensitivity.
6. Finally, given most of the published HG002 data is quite deep, use samtools view -s to produce shallower copies of the same data. This lets us verify the performance on both deep and shallow data, to see if we've introduced new biases there. (A rough shell sketch of this whole workflow follows the list.)
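To make that concrete, here is roughly what I mean as a shell sketch. The file names, the truth-set VCF, the two binaries (bcftools.old, bcftools.new) and the subsampling fraction are all placeholders rather than a recipe; only the standard bcftools/samtools options are fixed.

```sh
# 1) + 2) call the same alignments with the old and the new bcftools
bcftools mpileup -f ref.fa HG002.bam | ./bcftools.old call -mv -Oz -o old.vcf.gz
bcftools mpileup -f ref.fa HG002.bam | ./bcftools.new call -mv -Oz -o new.vcf.gz

# optionally split multi-allelic records so individual alleles can be compared
for v in old new; do
    bcftools norm -m -any -Oz -o $v.split.vcf.gz $v.vcf.gz
    bcftools index $v.split.vcf.gz
done

# per-caller FP/FN sets against the (bgzipped + indexed) truth set:
#   0000 = private to the call set (FP), 0001 = private to truth (FN),
#   0002/0003 = shared records (TP), taken from the call set / truth respectively
bcftools isec -p isec_old -Oz old.split.vcf.gz truth.vcf.gz
bcftools isec -p isec_new -Oz new.split.vcf.gz truth.vcf.gz
bcftools index isec_old/0000.vcf.gz
bcftools index isec_new/0000.vcf.gz

# 3) which false positives has the new caller fixed, and which has it introduced?
bcftools isec -p fp_cmp -Oz isec_old/0000.vcf.gz isec_new/0000.vcf.gz
#   fp_cmp/0000.vcf.gz = old FPs no longer called, fp_cmp/0001.vcf.gz = new FPs

# 4) profile by variant class, e.g. how many of the newly introduced FPs are indels
bcftools view -H -v indels fp_cmp/0001.vcf.gz | wc -l

# 5) FP/TP counts as a function of a QUAL cut-off (FN = truth total minus TP)
for q in 0 5 10 20 30 50; do
    fp=$(bcftools view -H -i "QUAL>=$q" isec_new/0000.vcf.gz | wc -l)
    tp=$(bcftools view -H -i "QUAL>=$q" isec_new/0002.vcf.gz | wc -l)
    echo "QUAL>=$q FP=$fp TP=$tp"
done

# 6) make a shallower copy and repeat; -s INT.FRAC = seed 42, keep 10% of templates
samtools view -s 42.1 -b -o HG002.10pct.bam HG002.bam
```

Each of the intermediate VCFs produced along the way is itself a ready-made set of "interesting cases", which is what I mean below by recreating such cases on the fly.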
I've spent weeks, probably months overall, tinkering with bcftools and I've found the above to be vastly more useful than attempting to maintain a curated set of interesting cases. Indeed I can recreate any number of interesting cases simply by using the above procedure, with the added benefit of (mostly) knowing what the right answer is.
(What I haven't done is analyse all the HG00* samples to produce a small population for joint analysis.)
Thank you for looking at this.
> It feels like the "truth" here was simply copied from whatever one specific version of bcftools once produced, rather than being curated.
Yes, that is exactly how it was produced, and it should not be a surprise - we discussed it offline, several times. Most of the tests in the test suite were not curated.
> TEST IS WRONG
Thank you for curating these tests, I will go through your findings. It would have been much more useful if you had created a pull request. The free-format text will require another person to go through everything again.
> I still believe fundamentally this isn't a useful strategy to take
I know it is simpler to focus on a single metric without worrying about the impact of the changes on specific indel classes. But I still believe the opposite: both the overall stats and the individual cases are needed.
As I said above, I won't be doing a PR as I don't have the time to turn all my observations into edits. The test framework needs a substantial overhaul to be useful, given it flags minor things like subtle AD variations as errors. Secondly, I simply disagree with this strategy.
> I know it is simpler to focus on a single metric without worrying about the impact of the changes on specific indel classes. But I still believe the opposite: both the overall stats and the individual cases are needed.
Please reread what I said above. I'm not saying that I'm only looking at overall impact and not studying specific cases. I'm saying that I produce such test cases on the fly using bcftools isec, so there is no need to keep your own set. Place trust in the truth sets and the community: it is less error prone than us eyeballing things, given they are working with vast amounts of data on a huge plethora of platforms, plus it's simply less time-consuming, meaning we can evaluate more things.
It lacks transparency. It is undocumented. It is more time-consuming. It is a good complementary approach, but less reproducible. Any change would require the developer to download your datasets and do ad hoc test selection. The suite is not meant to be eyeballed, etc.
We discussed this many times. I see no point in bringing it up again.
I'm not trying to argue the relevance of such tests, simply saying that I personally don't find them useful. You do, and that's just fine. Hence I reported the problems I found.
My comment above was simply my explanation for why I didn't make a PR. However, the information you need to edit the file, should you wish to, is in my original post, which is why I made it.
Unfortunately the comments are not that helpful. They require someone to go through everything again and repeat all the work you've already done.
I checked, and many of the cases you highlighted as wrong were not curated yet. They were added to document a case and need a detailed look to decide what output we want to see. The script run-tests.pl has the option

    -c, --curated-only    Run only curated tests, i.e. tests with comments

which allows such tests to be disregarded.
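For example, something along the lines of the following should restrict a run to the curated cases only (the exact path to the script depends on the checkout; this is just a sketch):

```sh
# run only the curated tests, i.e. those carrying a curating comment
./run-tests.pl --curated-only
```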
Some of the comments seem to misunderstand or overlook the existing curating comments. For example, this one

> Doesn't look that puzzling!

seems to be a reaction to the name of the test. However, the curating comment hints that the puzzling part is that, with the settings used, the variant was called in the proband only, even though it is invisible in the parents. It was therefore falsely marked as a de novo variant in subsequent analysis, and that was the problem: either it should be called in all samples (with low quality), or not at all. Note that the test is marked as "FIXME", meaning the test case is intended to highlight a problem that must be addressed in future revisions of the caller.
> FIXME: an insertion is called in one sample only and insert reads in other samples are invisible because accompanying mismatches are too rare to make it into the consensus
Thank you for the effort you put into this but, alas, it is not useful as is. I'll keep it open for future reference.
I hadn't realised there were curated and non-curated tests. Maybe this ought to be automatically tagged in the output rather than just failing because something differs.
Also, for the non-curated ones that came from a known truth set (such as SynDip), we'd be better off simply trusting that the truth set is correct, rather than populating the initial expected output from an old copy of bcftools, which works against us when evaluating improvements.
> Doesn't look that puzzling!
>
> seems to be a reaction to the name of the test. However, the curating comment hints that the puzzling part is that, with the settings used, the variant was called in the proband only, even though it is invisible in the parents. It was therefore falsely marked as a de novo variant in subsequent analysis, and that was the problem: either it should be called in all samples (with low quality), or not at all. Note that the test is marked as "FIXME", meaning the test case is intended to highlight a problem that must be addressed in future revisions of the caller.
I think the issue here is the nature of the data. I see what you're saying, but it's not how I view this problem.
I went through and counted the variant depth. Quoting from above:
    sample   REF  ALT  ALT%
    rg0       80   27    25
    rg1       83   15    15
    rg2      136    9     6
So we see that the combined depth varies a lot, from 98 to 145, and the allele frequency as a percentage is also highly variable. It's easy to see why, naively, it's called in rg0 and not in rg2 (or indeed, apparently, rg1, though that one is marginal). However this has nothing to do with the trio nature of the samples, so I feel that angle is possibly barking up the wrong tree.
The assertion that it should be consistent implies that the offspring should just be a sampling from both parents in equal proportion. However that's only true if randomness doesn't skew things either way and, more importantly, if the variation is actually real. I strongly suspect in this case it's down to contamination, and the high-depth rg2 therefore has a lower proportion of contaminant alignments. As such this isn't a puzzling trio, but a test for contamination screening. Trying to fix detection of novel variants here is better done by downgrading suspect alignments or through phasing analysis.
However I don't know for sure as I've no idea where this data came from and what the real evidence is. It just has that feel about it.
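(As an aside: if the calls carry FORMAT/AD, a table like the one above can be pulled out mechanically rather than counted by hand, although the caller's AD may of course differ from a manual pileup count. A sketch, with the region and file name as placeholders:)

```sh
# print SAMPLE, REF depth, ALT depth and ALT% per sample at the site in question
bcftools query -r chr20:1234567 -f '[%SAMPLE\t%AD\n]' calls.vcf.gz |
  awk -F'\t' '{ split($2, ad, ","); tot = ad[1] + ad[2];
                printf "%s\t%d\t%d\t%.0f\n", $1, ad[1], ad[2], (tot ? 100*ad[2]/tot : 0) }'
```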
Hello,
In going through all the tests here I've found numerous mistakes in this test corpus. Many "expected" values show only one allele of a two-allele variant. This is mostly true for the CHM1/CHM13 calls, where we already have the SynDip truth set, which is (usually) correct. It feels like the "truth" here was simply copied from whatever one specific version of bcftools once produced, rather than being curated. As such I think it's not particularly useful, since any improvement in accuracy is reflected as a higher failure rate in this corpus. I did, however, find one SynDip case where I disagreed with the published truth set.
I spent several days manually going through this lot. In case it's useful to you, I may as well share the findings. Sorry it's quite long, but I've tried to be as detailed as possible in analysing each failed test. The ones absent from this list are ones where I mostly agree (bar differing AD numbers, which is a different problem with the tests).