mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.

Filter pandora map VCFs #48

Closed · mbhall88 closed this 3 years ago

mbhall88 commented 4 years ago

There are two types of filtered VCF that we want to produce here:

  1. Remove "large" things. Effectively we want to restrict to indels of a certain size.
  2. Remove non-SNPs, as this VCF will be used in the concordance/distance comparison with compass and bcftools.

Question for @iqbal-lab : What size indels do we want to restrict to?

mbhall88 commented 3 years ago

There are a number of things I need to clarify before I can progress with filtering.

  1. How do I calculate depth for a variant in a pandora VCF? This stems from the fact that I don't fully understand what all the different coverage-related FORMAT fields mean (e.g. MEAN_FWD_COVG, MED_FWD_COVG, or SUM_FWD_COVG), as the description in the VCF header for each doesn't really provide much information.
  2. Which varifier precision/recall definition (binary, edit distance, fraction) should we use? This is more related to #49, but it informs the choice of filters.

Also, which fields do we want to filter on? I know we have done filtering for the paper but want to clarify it here.

iqbal-lab commented 3 years ago

Leandro can answer these best, in terms of how you get this data from the VCF and how he did it. Mean forward coverage means the mean of the coverages of the minimizing kmers on the allele, in the forward direction, etc. I would use edit distance. We will definitely need to use gt-conf, even if imperfect.

leoisl commented 3 years ago
  • How do I calculate depth for a variant in a pandora VCF? This stems from the fact that I don't fully understand what all the different coverage-related FORMAT fields mean (e.g. MEAN_FWD_COVG, MED_FWD_COVG, or SUM_FWD_COVG), as the description in the VCF header for each doesn't really provide much information.

PS: I wrote this before reading Zam's answer, sorry. Zam explains it correctly and succinctly; my answer is longer, but maybe it conveys the same message...

This was very clear to me ~6 months ago, but I think I still remember it well. Let's walk through the code, as you might spot something weird. Coverage statistics are computed here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/include/sampleinfo.h#L180-L215

It is basically summarising the forward/reverse coverages of an allele (allele_to_forward_coverages and allele_to_reverse_coverages) with mean, sum or median. These allele_to_forward_coverages and allele_to_reverse_coverages are set here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/localPRG.cpp#L1588-L1590

based on coverages computed by this function: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/localPRG.cpp#L1451-L1527

My understanding of this function is that it gets the minimizer kmer coverages along the kmer path for each allele. As we don't compute per-base coverage, this is an approximation of the per-base coverage based on the indexed minimizers.

So, MEAN_FWD_COVG is an approximation of the mean coverage of the allele on the forward strand, based on the coverage of the minimizers. If you want the mean coverage on the allele, you have to sum MEAN_FWD_COVG and MEAN_REV_COVG. This also applies to SUM_FWD_COVG and SUM_REV_COVG, but not to MED_FWD_COVG or MED_REV_COVG (medians from the two strands cannot simply be added).

I refactored this part of the code, which is why I understand it a little better. Feel free to ask for any additional stats; it would not be hard to add them.
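To make that concrete, here is a minimal sketch (not the pipeline's actual code) of pulling the MEAN_FWD_COVG + MEAN_REV_COVG depth per sample out of a pandora VCF with pysam. Whether these FORMAT fields come back as scalars or as one value per allele is an assumption the sketch guards against:

```python
import pysam

vcf = pysam.VariantFile("pandora.vcf")  # hypothetical filename
for record in vcf:
    for sample, call in record.samples.items():
        gt = call["GT"]
        # fall back to allele 0 if the genotype is missing
        allele = gt[0] if gt and gt[0] is not None else 0
        fwd = call["MEAN_FWD_COVG"]
        rev = call["MEAN_REV_COVG"]
        # handle both scalar and per-allele tuple representations of the field
        fwd = fwd[allele] if isinstance(fwd, tuple) else fwd
        rev = rev[allele] if isinstance(rev, tuple) else rev
        depth = fwd + rev  # approximate allele depth from minimizer coverages
        print(record.pos, sample, depth)
```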

  • Min depth seemed to be effective in the paper. But need to wait for the resolution of 1. before I can do this.

Yeah, in the paper we consider the coverage/depth of a variant for a sample to be MEAN_FWD_COVG + MEAN_REV_COVG (https://github.com/leoisl/pandora1_paper/blob/9ef34e8b5bbab6e08bc168b7d44e50d0a82d3b60/evaluate/vcf.py#L113-L115). It is indeed one of the most effective filters. But we should be cautious that these coverage statistics, along with the genotype confidence, are not normalised, so applying a depth filter of 10 will have different effects on a 50x coverage dataset and a 100x coverage dataset. Although I think you are probably subsampling with rasusa, this problem still happens if you subsample to 100x but some datasets have less than 100x coverage. Solving this will be done after the paper (but there are lots of other issues to solve after the paper too; we need to prioritise them).
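Purely as an illustration of that caveat (this is not what the paper's filter does), one could express the depth cutoff as a fraction of each sample's expected coverage so the threshold scales across datasets:

```python
def passes_min_depth(depth: float, expected_covg: float, min_fraction: float = 0.1) -> bool:
    """Relative depth filter: require the variant's depth to be at least
    `min_fraction` of the sample's expected coverage, so a 50x and a 100x
    sample are filtered comparably (values are illustrative)."""
    return depth >= min_fraction * expected_covg

# An absolute cutoff of 10 is 10% of a 100x sample but 20% of a 50x sample;
# a relative cutoff keeps the proportion the same for both.
assert passes_min_depth(depth=12, expected_covg=100)    # 12 >= 10
assert not passes_min_depth(depth=4, expected_covg=50)  # 4 < 5
```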

  • Strand bias - which is also related to knowing the correct way to calculate depth here.

This is the strand bias filter we are applying in the paper: https://github.com/leoisl/pandora1_paper/blob/master/evaluate/strand_bias_filter.py. It does not look that effective, though it seems it could be effective on Nanopore data, so it is worth trying on your data!
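For reference, a minimal sketch of what a filter of this kind does (the linked strand_bias_filter.py is the authoritative version; this just illustrates the idea of failing calls whose coverage comes almost entirely from one strand):

```python
def fails_strand_bias(fwd_covg: float, rev_covg: float, min_fraction: float = 0.01) -> bool:
    """Fail a call if the proportion of coverage on either strand is below
    `min_fraction` (0.01 corresponds to a 1% strand bias filter)."""
    total = fwd_covg + rev_covg
    if total == 0:
        return True  # no coverage at all; treat as failing
    return min(fwd_covg, rev_covg) / total < min_fraction

assert fails_strand_bias(200, 1)      # ~0.5% of coverage is on the reverse strand
assert not fails_strand_bias(30, 25)  # roughly balanced strands
```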

  • Do we also want to filter on GT_CONF? I guess this could be dangerous given we have a wide range of coverages in the samples?

Yeah, this is worrying... we have a GT_CONF_PERCENTILER work in progress, which would normalise the GT_CONFs. It was actually finished, but the results we got with it were way worse than before, so there is something wrong with it... it seems to be a hard-to-find bug...

  • In order to use the GAP filter I need to rerun pandora with the --min_kmer_covg flag I think. However, I'm not sure what value Leandro used for this - or if we want to use that filter here.

I don't actually see this option here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/compare_main.cpp#L40-L72

min_kmer_covg, I guess, is set automatically here: https://github.com/rmcolq/pandora/blob/25d03288d6692e23c386ff29d63eaba328ce9579/src/compare_main.cpp#L437-L438, which is 1/10 of the expected depth of the first sample. This matters because it assumes all samples have roughly the same coverage: if the first sample is 100x, then min_kmer_covg will be around 10, and it will affect samples with coverages different from 100x differently. We actually solved this issue in the gt conf percentiler branch by calculating min_kmer_covg for each sample, but that branch is still not stable...
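Spelling out the bias with made-up numbers (an illustration of the behaviour, not pandora code):

```python
# Hypothetical expected depths for two samples in a compare run
expected_depth = {"sample_A": 100, "sample_B": 40}

# Current behaviour: a single threshold derived from the first sample only
min_kmer_covg = next(iter(expected_depth.values())) / 10  # -> 10.0
# sample_A is filtered at 10% of its depth, but sample_B at 25% of its depth

# The gt conf percentiler branch's fix: one threshold per sample
per_sample_min_kmer_covg = {s: d / 10 for s, d in expected_depth.items()}
# -> {"sample_A": 10.0, "sample_B": 4.0}
```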

iqbal-lab commented 3 years ago

You can still try GT_CONF and see what it does as a first pass. The bias is predictable - any given threshold will be harsher on lower covg samples. I had this "bug" in cortex for ten years (still there, not addressed except via minos). Min_kmer_covg should just be 1/10 of the depth of the current sample. A horrible workaround is to process the lowest covg sample first. How exactly is this used?

iqbal-lab commented 3 years ago

Actually, by "You can still try GT_CONF and see what it does as a first pass" I mean: please do use it and see empirically what we get.

leoisl commented 3 years ago

Min_kmer_covg should just be 1/10 of the depth of the current sample. A horrible workaround is to process the lowest covg sample first. How exactly is this used?

min_kmer_covg is the lower bound saying that if a kmer has less than this coverage, then it is considered not actually present (to account for sequencing errors, I think). This is an easy fix; actually, I think just cherry-picking one of the commits from the gt conf percentiler branch should solve it. The fix basically stores an array with one min_kmer_covg per sample, instead of storing a single value. Should we try this?

I don't know what the impact of this fix on the results would be... I have implemented fixes in the past that totally made sense to me, but they changed the results considerably... but that was a bunch of 4-5 small fixes, and this is an isolated one... maybe we should try it. This is why I am pushing towards easy and reproducible results: to catch these things more easily and quickly.

This is also a small fix. If we are thinking about the bigger issue, which is making pandora work on samples with different depths, there are a bunch more issues to fix...

iqbal-lab commented 3 years ago

That single thing is clearly the right thing to do. Leandro, please don't do it until after the paper (obviously Michael, feel free). It would fix a bug affecting multiple samples with different depths. Leandro, you don't need to guess in advance how it will affect results.

The only other thing I'm aware of is wanting to implement GCP to allow normalised filter thresholds across samples. There are plenty of biases/effects due to coverage - e.g. de novo performance is coverage dependent, so it will perform differently for low/high coverage samples. Filtering with GT_CONF rather than GCP would be harsher on low covg samples, but still valid. De novo has lower recall on low covg samples, but we're not stressed about that.

mbhall88 commented 3 years ago

The VCFs I am working with are from map.

Ok, so digging into the min_kmer_covg thing a little further: it was a possible CLI param for map; however, it was hidden from the --help menu. I traced it through to the localPRG class and realised it was unused anyway, so I have removed it in https://github.com/rmcolq/pandora/pull/224/commits/2e7cbbef070f3087669dbf33a6c25892db35a316

Regarding everything else, I will start playing around with the filters we have talked about on the samples I have truth assemblies for, and will update here with how they go.

mbhall88 commented 3 years ago

Varifier results from pandora VCFs (all variants - not just SNPs)

This is trialling a couple of filters. I haven't tried all filters yet, as I think the results are bad enough that we may need to either try new PRG densities or dig into pandora's code.

filter codes:

The compass calls are just SNPs.

Recall

[recall plot]

Precision

[precision plot]

Precision may not be all that bad with a bit of work on the filters, but the recall is bad.

One thing I did notice when looking through the precision VCF is a lot of calls with the varifier result FP_REF_PROBE_BETTER_MATCH. As an example, here are the counts for sample mada_125 (the sample with the worst precision):

    Counter({'TP': 1049,
             'FP': 31,
             'Partial_TP': 106,
             'FP_REF_PROBE_BETTER_MATCH': 253,
             'FP_PROBE_UNMAPPED': 5})

although this doesn't directly relate to recall I guess.

I'll have to dig into the varifier output to see if I can figure out some info on how to approach trying to improve recall...

iqbal-lab commented 3 years ago

This could easily be due to the issue Martin described on the varifier channel, in particular if the ref-probe-better-match things are all in regions with clustered variants.

mbhall88 commented 3 years ago

This could easily be due to the issue Martin described on the varifier channel, in particular if the ref-probe-better-match things are all in regions with clustered variants.

But we don't see this problem with either compass or bcftools...

To clarify: I am mostly referring to recall here

mbhall88 commented 3 years ago

I have finished the filter parameter sweep. This only explores filters relating to the resulting pandora VCF.

Filter definitions:

Recall - SNPs only

[plot]

Precision - SNPs only

[plot]

Recall - all variants

[plot]

Precision - all variants

[plot]

Conclusion

For pure VCF filters, at least for SNPs only, it seems as though a strand bias of 1% and a minimum genotype confidence of 25 is the best choice.

iqbal-lab commented 3 years ago

Looks good. If you added d3, you might find that strand bias 1% and genotype confidence <something lower than 25> gives as good precision with less loss of recall.

mbhall88 commented 3 years ago

Interesting, I'll take a look at that today

mbhall88 commented 3 years ago

Looks good. If you added d3, you might find that strand bias 1% and genotype confidence <something lower than 25> gives as good precision with less loss of recall.

Indeed, as usual, you are onto it.

[plots with d3 added]


Let me know what you think about this, but it might be useful for us to start looking at F-scores as a way of integrating precision and recall?

Of particular interest for us would be F-beta, as it allows us to weight either precision or recall as more important than the other.

With that in mind, here are some plots

F1 score

F1 score gives equal weight to precision and recall - it is the harmonic mean of the two.

[F1 score plots]

F-beta score

This is much more of interest to us, as we value precision more highly. If we set beta to 0.5, then we value precision twice as much as recall.
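For reference, the formula behind both scores (the numbers in the example are made up):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta = 1 is the usual F1 (harmonic mean of precision and
    recall); beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.99, 0.80))            # F1   ~ 0.885
print(f_beta(0.99, 0.80, beta=0.5))  # F0.5 ~ 0.945 (precision weighted twice as much)
```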

[F-beta score plots]

Some statistics for the different metrics for pandora filters

    # SNPs only
    e0.08E0.08g25s1 has the highest median precision of 0.99906279
    E0.08none has the highest median recall of 0.8436644
    d3s1g10 has the highest median f1 of 0.901754416729267
    d3s1g15 has the highest median fbeta of 0.953842695494108

    # all variants
    e0.08E0.08g25s1I20 has the highest median precision of 0.94015748
    E0.08none has the highest median recall of 0.8867745
    E0.08g25s1I20 has the highest median f1 of 0.84287854813337
    e0.08E0.08g25s1I20 has the highest median fbeta of 0.8816711888137843

And for completeness, we can have a look at F-beta's relationship with coverage. The differently styled lines are for the different filter/error rate parameters.

[F-beta vs. coverage plots]


Conclusion

I am leaning towards d3s1g15, although with the default error rates this does much better for SNPs than for all variants. But if we set the error rates to 0.08, it does much better for all variants than it does for SNPs...

mbhall88 commented 3 years ago

Final filters can be seen at #62