Closed — mbhall88 closed this issue 3 years ago
There are a number of things I need to clarify before I can progress with filtering.

- How do I calculate depth for a variant in a pandora VCF? This stems from the fact that I don't fully understand what all the different coverage-related FORMAT fields mean (i.e. `MEAN_FWD_COVG`, `MED_FWD_COVG` or `SUM_FWD_COVG`), as the description in the VCF header for each doesn't really provide much information.
- Which varifier precision/recall definition (binary, edit distance, fraction) should we use? This is more related to #49, but it informs the choice of filters.
- Also, which fields do we want to filter on? I know we have done filtering for the paper, but I want to clarify it here.
- In order to use the GAP filter I need to rerun pandora with the `--min_kmer_covg` flag, I think. However, I'm not sure what value Leandro used for this - or if we want to use that filter here.

Leandro can answer these best in terms of how you get this data from the VCF and how he did it. Mean forward coverage means the mean of the coverages of the minimizing kmers on the allele, in the forward direction, etc. I would use edit distance. We will definitely need to use gt-conf, even if imperfect.
- How do I calculate depth for a variant in a pandora VCF? This stems from the fact that I don't fully understand what all the different coverage-related FORMAT fields mean (i.e. `MEAN_FWD_COVG`, `MED_FWD_COVG` or `SUM_FWD_COVG`), as the description in the VCF header for each doesn't really provide much information.
PS: I wrote this before reading Zam's answer, sorry. Zam explains correctly and succinctly, my answer is longer, but maybe conveys the same message...
This was very clear to me ~6 months ago, but I think I still remember it well. Let's walk through the code, as you might spot something weird. Coverage statistics are computed here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/include/sampleinfo.h#L180-L215
It is basically summarising the forward/reverse coverages of an allele (`allele_to_forward_coverages` and `allele_to_reverse_coverages`) with mean, sum or median.
These `allele_to_forward_coverages` and `allele_to_reverse_coverages` are set here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/localPRG.cpp#L1588-L1590
based on coverages computed by this function: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/localPRG.cpp#L1451-L1527
My understanding of this function is that it gets the minimizer kmer coverages along the kmer path for each allele. As we don't compute per-base coverage, this is an approximation of the per-base coverage based on the indexed minimizers.
So, `MEAN_FWD_COVG` is an approximation of the mean coverage on the allele in the forward strand, based on the coverage of the minimizers. If you want the mean coverage on the allele, you have to sum `MEAN_FWD_COVG` and `MEAN_REV_COVG`. This also applies to `SUM_FWD_COVG` and `SUM_REV_COVG`, but not to `MED_FWD_COVG` nor `MED_REV_COVG`.
This part of the code was refactored by me, which is why I understand it a little better. Feel free to ask for any additional stats; it would not be hard to add them.
- Min depth seemed to be effective in the paper. But I need to wait for the resolution of question 1 before I can do this.
Yeah, in the paper we consider the coverage/depth of a variant for a sample to be `MEAN_FWD_COVG + MEAN_REV_COVG` (https://github.com/leoisl/pandora1_paper/blob/9ef34e8b5bbab6e08bc168b7d44e50d0a82d3b60/evaluate/vcf.py#L113-L115). It is indeed one of the most effective filters. But we should be cautious: these coverage statistics, along with the genotype confidence, are not normalised. So applying a depth filter of 10 will have different effects on a 50x coverage dataset and a 100x coverage dataset. Although I think you are probably subsampling with rasusa, this problem still happens if you are subsampling to 100x but have datasets with less than 100x coverage. Solving this will be done after the paper (but there are lots of other issues to solve after the paper too; we need to prioritise them).
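The depth definition above can be sketched in a few lines. This is a minimal illustration, not code from the paper repo: the FORMAT keys are pandora's, but the parsing helper is hypothetical.

```python
# Depth of a variant for one sample, following the definition above:
# MEAN_FWD_COVG + MEAN_REV_COVG. The helper just pairs up a VCF record's
# FORMAT column with one sample column and sums the two mean coverages.

def variant_depth(format_col: str, sample_col: str) -> int:
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    return int(fields["MEAN_FWD_COVG"]) + int(fields["MEAN_REV_COVG"])

# Trimmed example columns from a pandora-style VCF record:
fmt = "GT:MEAN_FWD_COVG:MEAN_REV_COVG:GT_CONF"
sample = "1:6:9:42.7"
print(variant_depth(fmt, sample))  # → 15
```

A min-depth filter then simply drops records where this sum falls below the chosen threshold.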
- Strand bias - which is also related to knowing the correct way to calculate depth here.
This is the strand bias filter we are applying in the paper: https://github.com/leoisl/pandora1_paper/blob/master/evaluate/strand_bias_filter.py . It does not look that effective, but it seems it could be effective on nanopore data, so it is worth trying on your data!
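For intuition, a strand-bias check along these lines can be sketched as follows (the linked `strand_bias_filter.py` may differ in detail; the function name and the 1% default are illustrative):

```python
# Flag a call when too small a fraction of its coverage comes from one
# strand - e.g. a variant supported almost entirely by forward reads is
# suspicious. This is a hedged sketch, not the paper's exact filter.

def fails_strand_bias(fwd_covg: int, rev_covg: int, min_fraction: float = 0.01) -> bool:
    total = fwd_covg + rev_covg
    if total == 0:
        return True  # no coverage at all: nothing supports the call
    return min(fwd_covg, rev_covg) / total < min_fraction

print(fails_strand_bias(100, 0))  # → True (all reads on one strand)
print(fails_strand_bias(60, 40))  # → False (well balanced)
```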
- Do we also want to filter on GT_CONF? I guess this could be dangerous given we have a wide range of coverages in the samples?
Yeah, this is worrying... we have a GT_CONF_PERCENTILER WIP, which would normalise the GT_CONFs. It was actually finished, but the results we got with it were way worse than before, so there is something wrong with it... it seems to be a hard-to-find bug...
- In order to use the GAP filter I need to rerun pandora with the `--min_kmer_covg` flag, I think. However, I'm not sure what value Leandro used for this - or if we want to use that filter here.
I don't actually see this option here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/compare_main.cpp#L40-L72

I guess `min_kmer_covg` is set automatically here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/compare_main.cpp#L437-L438 , which is 1/10 of the expected depth of the first sample. This is important because it assumes all samples have ~the same coverage. If the first sample is 100x, then `min_kmer_covg` will be around 10, and it will affect samples with coverages different from 100x differently. We actually solved this issue in the gt conf percentiler branch by calculating the `min_kmer_covg` for each sample, but that branch is still not stable...
You can still try GT_CONF and see what it does as a first pass. The bias is predictable - any given threshold will be harsher on lower covg samples. I had this "bug" in cortex for ten years (still there, not addressed except via minos). Min_kmer_covg should just be 1/10 of the depth of the current sample. A horrible workaround is to process the lowest covg sample first. How exactly is this used?
Actually by "You can still try GT_CONF and see what it does as a first pass." - i mean please do use it and see empirically what we see.
Min_kmer_covg should just be 1/10 of the depth of the current sample. Horrible workaround is to process lowest covg sample first. How exactly is this used?
min_kmer_covg
is the lower bound saying that if kmers have less than this coverage, then it is actually not present (to account for sequencing errors, I think). This is an easy fix, actually I think just a cherry pick in one of the commits of the gt conf percentiler branch should solve this. The fix basically stores an array, one min_kmer_covg
for each sample, instead of storing a single value for min_kmer_covg
. Should we try this?
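The proposed fix amounts to something like the following (an illustrative Python sketch, not pandora's actual C++; it assumes `min_kmer_covg` is 1/10 of a sample's expected depth, per Zam's comment above):

```python
# Per-sample min_kmer_covg instead of a single value taken from the first
# sample: each sample gets a threshold derived from its own expected depth.
# The function name and the floor of 1 are assumptions for illustration.

def per_sample_min_kmer_covg(expected_depths: list[float]) -> list[int]:
    return [max(1, round(depth / 10)) for depth in expected_depths]

print(per_sample_min_kmer_covg([100.0, 50.0, 8.0]))  # → [10, 5, 1]
```

With a single shared value of 10 (from a 100x first sample), the 8x sample would have almost all of its kmers treated as absent; the per-sample version avoids that.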
IDK what the impact of this fix on the results would be... I had implemented some fixes in the past that totally made sense to me, but then they changed the results considerably... but that was a bunch of 4-5 small fixes; this is an isolated one... maybe we should try it... This is why I am pushing towards easy and reproducible results: to catch these things more easily and quickly.
This is also a small fix. If we are thinking about the bigger issue, which is making pandora work on samples with different depths, there are a bunch more issues to fix...
That single thing is clearly the right thing to do. Leandro, please don't do it until after the paper (obviously Michael, feel free). That would fix a bug affecting multiple samples with different depths. Leandro, you don't need to guess in advance how it will affect the results.
The only other thing I'm aware of is wanting to implement GCP to allow normalised filter thresholds across samples. There are plenty of biases/effects due to coverage - e.g. de novo performance is coverage dependent, so it will perform differently for low/high coverage samples. Filtering with GT_CONF rather than GCP would be harsher on low covg samples, but still valid. De novo has lower recall on low covg samples, but we're not stressed about that.
The VCFs I am working with are from `map`.
Ok, so digging into the `min_kmer_covg` thing a little further: it was a possible CLI param for `map`; however, it was hidden from the `--help` menu. I traced it through to the localPRG class and realised it was unused anyway. So I have removed it in https://github.com/rmcolq/pandora/pull/224/commits/2e7cbbef070f3087669dbf33a6c25892db35a316
Regarding everything else, I will start playing around with the filters we have talked about on the samples I have truth assemblies for, and update here with how it goes.
This is trialling a couple of filters. I haven't tried all filters yet, as I think the results are bad enough that we may need to reconsider either trying new PRG densities or digging into pandora's code.
filter codes:
The compass calls are just SNPs.
Precision may not be all that bad with a bit of work on the filters, but the recall is bad.
One thing I did notice when looking through the precision VCF is a lot of calls with the varifier result `FP_REF_PROBE_BETTER_MATCH`. As an example, here are the counts for sample mada_125 (the sample with the worst precision):
```
Counter({'TP': 1049,
         'FP': 31,
         'Partial_TP': 106,
         'FP_REF_PROBE_BETTER_MATCH': 253,
         'FP_PROBE_UNMAPPED': 5})
```
although this doesn't directly relate to recall I guess.
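As a rough back-of-the-envelope on those counts (this simply lumps all FP-flavoured categories together and ignores how varifier actually weights `Partial_TP`):

```python
from collections import Counter

# Counts for mada_125 as reported above
counts = Counter({'TP': 1049, 'FP': 31, 'Partial_TP': 106,
                  'FP_REF_PROBE_BETTER_MATCH': 253, 'FP_PROBE_UNMAPPED': 5})

fp_like = (counts['FP'] + counts['FP_REF_PROBE_BETTER_MATCH']
           + counts['FP_PROBE_UNMAPPED'])
total = sum(counts.values())
print(fp_like, total)  # → 289 1444
```

So roughly a fifth of the calls are FP-like, and `FP_REF_PROBE_BETTER_MATCH` accounts for the large majority of them.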
I'll have to dig into the varifier output to see if I can figure out how to approach trying to improve recall...
This could easily be due to the issue Martin described on the varifier channel, in particular if the ref-probe-better-match things are all in regions with clustered variants.
This could easily be due to the issue Martin described on the varifier channel, in particular if the ref-probe-better-match things are all in regions with clustered variants.
But we don't see this problem with either compass or bcftools
...
To clarify: I am mostly referring to recall here
I have finished the filters parameter sweep. This only explores filters relating to the resulting pandora VCF.
Filter definitions:
For pure VCF filters, at least for SNPs only, it seems as though strand bias 1% and a minimum genotype confidence of 25 is the best choice.
Looks good. If you added d3, you might find that strand bias 1% and genotype confidence <something lower than 25> give as good precision with less loss of recall.
Interesting, I'll take a look at that today
Looks good. If you added d3, you might find that strand bias 1% and genotype confidence <something lower than 25> give as good precision with less loss of recall.
Indeed, as usual, you are onto it.
Let me know what you think about this, but it might be useful for us to start looking at F-scores as a way of integrating precision and recall?
Of particular interest for us would be F-beta as it allows us to weight one of precision or recall as more important than the other.
With that in mind, here are some plots
F1 score gives equal weight to precision and recall - it is the harmonic mean of the two.
This is much more of interest to us, as we value precision more highly. If we set beta to 0.5, then we value precision twice as much as recall.
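Concretely, the score being described is the standard F-beta, which reduces to F1 (the harmonic mean) at beta = 1:

```python
# F-beta combines precision and recall; beta < 1 weights precision more
# heavily (beta = 0.5 values precision twice as much as recall).

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.9, 0.6, beta=1.0), 4))  # F1 → 0.72
print(round(f_beta(0.9, 0.6, beta=0.5), 4))  # F0.5 → 0.8182
```

Note how the same precision/recall pair scores higher under beta = 0.5, because the metric forgives the lower recall when precision is strong.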
Some statistics for the different metrics for pandora filters:

```
# SNPs only
e0.08E0.08g25s1 has the highest median precision of 0.99906279
E0.08none has the highest median recall of 0.8436644
d3s1g10 has the highest median f1 of 0.901754416729267
d3s1g15 has the highest median fbeta of 0.953842695494108

# all variants
e0.08E0.08g25s1I20 has the highest median precision of 0.94015748
E0.08none has the highest median recall of 0.8867745
E0.08g25s1I20 has the highest median f1 of 0.84287854813337
e0.08E0.08g25s1I20 has the highest median fbeta of 0.8816711888137843
```
And for completeness, we can have a look at F-beta's relationship with coverage. The differently styled lines are the different filter/error rate parameters.
I am leaning towards `d3s1g15`. Although, with the default error rates, this does much better for SNPs than for all variants. But if we set the error rates to 0.08, it does much better for all variants than it does for SNPs...
Final filters can be seen at #62
There are two types of filtered VCF that we want to produce here:
One of them is additionally restricted by variant type/size using `bcftools`.

Question for @iqbal-lab: What size indels do we want to restrict to?