mbhall88 closed this issue 4 years ago.
Proposed filters:
- Min coverage (5x?)
- Strand bias: 20% of reads must come from each strand
@iqbal-lab should we also throw in a QUAL filter? See https://samtools.github.io/bcftools/howtos/variant-calling.html
When I look at some stats on one particular VCF, there are quite a few SNPs below this level...
Suck it and see. Worked in pandora E. coli, but not very discriminating; very many calls with identical QUAL.
For the coverage, would you prefer me to do this based on expected coverage? i.e., I make an initial pass through the VCF and collect the values of the DP field, then filter any variant with DP < (t * m) or DP > (u * m), where m is the median (mean?) DP, t is some scaling factor (0.2?), and u is some maximum threshold (2.0?). Or do you think a fixed threshold is better?
> Suck it and see. Worked in pandora E. coli, but not very discriminating; very many calls with identical QUAL.
Not really sure what you mean by this?
We have a pretty wide range of QUAL with this sample I have been using.
Example
```
# QUAL [2]id [3]Quality [4]number of SNPs [5]number of transitions (1st ALT) [6]number of transversions (1st ALT) [7]number of indels
QUAL 0 3 1 0 1 0
QUAL 0 4 11 8 3 0
QUAL 0 5 11 4 7 0
QUAL 0 6 6 2 4 0
QUAL 0 7 8 3 5 0
QUAL 0 8 15 6 9 0
QUAL 0 9 4 2 2 0
QUAL 0 10 8 5 3 0
QUAL 0 11 3 3 0 0
QUAL 0 12 7 3 4 0
QUAL 0 13 5 3 2 0
QUAL 0 14 6 4 2 0
QUAL 0 15 6 4 2 0
QUAL 0 16 6 2 4 0
QUAL 0 17 9 5 4 0
QUAL 0 18 10 6 4 0
QUAL 0 19 6 3 3 0
QUAL 0 20 6 4 2 0
QUAL 0 21 6 3 3 0
...
QUAL 0 194 6 3 3 0
QUAL 0 195 3 2 1 0
QUAL 0 196 5 3 2 0
QUAL 0 197 3 3 0 0
QUAL 0 198 10 4 6 0
QUAL 0 199 4 3 1 0
QUAL 0 200 3 1 2 0
QUAL 0 201 3 1 2 0
QUAL 0 202 2 1 1 0
QUAL 0 203 1 0 1 0
QUAL 0 204 4 2 2 0
QUAL 0 205 4 4 0 0
QUAL 0 206 3 2 1 0
QUAL 0 207 6 3 3 0
QUAL 0 208 4 3 1 0
QUAL 0 209 2 2 0 0
QUAL 0 210 5 1 4 0
QUAL 0 211 5 2 3 0
QUAL 0 212 2 2 0 0
QUAL 0 213 7 3 4 0
QUAL 0 214 4 4 0 0
QUAL 0 215 6 3 3 0
QUAL 0 216 6 2 4 0
QUAL 0 217 8 3 5 0
QUAL 0 218 8 5 3 0
QUAL 0 219 3 2 1 0
QUAL 0 220 3 2 1 0
QUAL 0 221 3 3 0 0
QUAL 0 222 4 2 2 0
QUAL 0 223 8 4 4 0
QUAL 0 224 4 3 1 0
QUAL 0 225 465 305 160 0
QUAL 0 226 4 2 2 0
QUAL 0 227 4 3 1 0
QUAL 0 228 332 207 125 0
```
You understand me correctly, but the data seemed very different with E. coli Illumina.
> You understand me correctly, but the data seemed very different with E. coli Illumina.
I don't know if you're saying to do it or not? That's what I am confused about... (not sure if I am missing some vital British subtext?)
I'm saying feel free to try it. I can't tell in advance if it will help. "Suck it and see" meant try it and evaluate the results.
Not sure if you missed https://github.com/mbhall88/head_to_head_pipeline/issues/39#issuecomment-662849030 but are you happy with making coverage filter dynamic rather than a fixed threshold?
I vote dynamic. Especially as we are likely to rerun on subsampled reads.
Here are results from the first sweep of filter params.
Context: this is with sample mada_112 at full coverage (150x) and all default bcftools parameters. An explanation of the filters, as per the CLI menu:
```
-d, --min-depth FLOAT      Minimum depth as a fraction of the
                           expected (median) depth. This filter has ID:
                           ld. Set to 0 to disable [default: 0.2]
-D, --max-depth FLOAT      Maximum depth as a fraction of the expected
                           (median) depth. This filter has ID: hd. Set
                           to 0 to disable [default: 2.0]
-s, --min-strand-bias INT  Filter a variant if either strand has less
                           than INT% of the (high-quality) depth on the
                           called allele (DP4). This filter has ID: sb.
                           Set to 0 to disable [default: 25]
-q, --min-qual FLOAT       Filter a variant if QUAL is less than FLOAT.
                           This filter has ID: lq. Set to 0 to disable
                           [default: 0.0]
```
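For reference, the strand-bias rule described in the help text can be sketched from the bcftools DP4 tag (high-quality ref-forward, ref-reverse, alt-forward, alt-reverse read counts). This is an illustration of the rule, not the tool's actual code:

```python
def strand_bias_ok(dp4, alt_called, min_pct=25):
    """Check the called allele's per-strand support against min_pct.

    dp4: (ref_fwd, ref_rev, alt_fwd, alt_rev) high-quality read counts,
    as in the bcftools DP4 INFO tag. The variant passes if each strand
    carries at least min_pct% of the called allele's depth.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    fwd, rev = (alt_fwd, alt_rev) if alt_called else (ref_fwd, ref_rev)
    total = fwd + rev
    if total == 0:
        return False  # no high-quality support for the called allele
    return min(fwd, rev) / total * 100 >= min_pct
```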
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.999975% | 99.999779% |
| min-depth=0.1 | 100.000000% | 100.000000% | 99.999902% | 99.999779% |
| max-depth=2.0 | 100.000000% | 100.000000% | 99.999975% | 99.999779% |
| min-qual=20.0 | 100.000000% | 100.000000% | 99.999705% | 99.999902% |
| min-strand-bias=20 | 96.675900% | 100.000000% | 97.531206% | 99.999874% |
| all_filters | 96.675900% | 100.000000% | 97.530960% | 99.999950% |
And for the same data, but the 60x subsampled version
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
| min-depth=0.1 | 100.000000% | 99.861496% | 99.999091% | 99.999730% |
| max-depth=2.0 | 100.000000% | 99.861496% | 99.976139% | 99.999730% |
| min-qual=20.0 | 99.722992% | 100.000000% | 99.998747% | 99.999853% |
| min-strand-bias=20 | 96.121884% | 99.855908% | 96.761672% | 99.999924% |
| all_filters | 95.844875% | 100.000000% | 96.739114% | 99.999949% |
It seems from these that strand bias at 20% hurts us on both metrics. Minimum depth of 0.1 doesn't seem to offer any benefit over no filter, and maximum depth of 2.0 doesn't offer an advantage in either dataset. Minimum QUAL of 20 does seem to have quite a positive impact on concordance, at the cost of call rate...
Results for a different sample, R23887, which has 143x coverage.
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-depth=0.1 | 100.000000% | 100.000000% | 99.993675% | 99.999828% |
| max-depth=2.0 | 100.000000% | 100.000000% | 99.990943% | 99.999828% |
| min-qual=20.0 | 100.000000% | 100.000000% | 99.996087% | 99.999951% |
| min-strand-bias=20 | 96.883396% | 100.000000% | 97.726728% | 99.999824% |
| all_filters | 96.883396% | 100.000000% | 97.719173% | 99.999950% |
R23887 with some different params
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-depth=0.05 | 100.000000% | 100.000000% | 99.993700% | 99.999828% |
| max-depth=2.2 | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-qual=15.0 | 100.000000% | 100.000000% | 99.996210% | 99.999951% |
| min-strand-bias=10 | 99.408920% | 100.000000% | 99.448115% | 99.999827% |
| all_filters | 99.408920% | 100.000000% | 99.446319% | 99.999951% |
mada_112 60x with some different params
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
| min-depth=0.15 | 100.000000% | 99.861496% | 99.998182% | 99.999730% |
| max-depth=2.5 | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
| min-qual=15.0 | 99.861496% | 100.000000% | 99.998943% | 99.999853% |
| min-strand-bias=15 | 97.783934% | 99.858357% | 98.158142% | 99.999925% |
| all_filters | 97.645429% | 100.000000% | 98.157577% | 99.999950% |
Last one for the day. R23887 with different params
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-depth=0.2 | 100.000000% | 100.000000% | 99.992371% | 99.999828% |
| max-depth=2.1 | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-qual=30.0 | 100.000000% | 100.000000% | 82.714386% | 99.999970% |
| min-strand-bias=5 | 99.623858% | 100.000000% | 99.820813% | 99.999827% |
| all_filters | 99.623858% | 100.000000% | 82.598004% | 99.999970% |
I haven't really seen any evidence in these tables that any filter other than minimum quality is useful... Happy to try other values/filters if you like, @iqbal-lab?
Min qual seems sufficient.
I think the value of filters might be more obvious at 15x or 30x. It's quite common for Mtb to end up with low depth, e.g. see how many we discarded.
Seems fine for now. We will need to return to the question of how the tools do at lower depth.
Ok, so I have run the same parameters as in the initial result tables on a sample, mada_131, that has 32x nanopore coverage (median 28x from pileup).
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 99.918167% | 98.525799% | 99.989171% | 99.998942% |
| min-depth=0.1 | 99.918167% | 98.525799% | 99.989048% | 99.998942% |
| max-depth=2.0 | 99.836334% | 98.524590% | 99.707278% | 99.998939% |
| min-qual=20.0 | 97.545008% | 99.664430% | 99.984200% | 99.999705% |
| min-strand-bias=20 | 86.824877% | 98.963242% | 91.408123% | 99.999623% |
| all_filters | 85.188216% | 99.903939% | 91.133269% | 99.999919% |
> I think the value of filters might be more obvious at 15x or 30x. It's quite common for Mtb to end up with low depth, e.g. see how many we discarded.
As always, you're spot on it seems.
In conclusion, I will run all samples through the call-filter-concordance pipeline (using just min QUAL filter) and then look at the plot of call rate vs. concordance. We can play with filter params from there and see how it affects the whole dataset.
I have been using varifier [#42] to try and hone the filters. Here are "checkpoint" plots for recall and precision with various filters.
A note on the filter "keys" used on the X-axes below:
These plots are purely SNPs at the moment.
There is a noticeable outlier in the precision plots: mada_102. The assembly for this sample looks good. In collaboration with Martin, we found that most of the FPs are due to unmapped probes. Martin compared the assembly against H37Rv and some Illumina mapping, and found that these unmapped probes cluster in a few regions. These regions are (correctly) not in the assembly (compared to H37Rv), but for some reason bcftools is making calls on the nanopore data in these regions of H37Rv.
From these basic filters, it looks as though the q60 filter gives us the best balance of recall and precision (with more emphasis on improving precision).
This is not the complete picture though. I have been digging into the varifier precision VCFs and think there might be some bcftools INFO tags we can filter on to remove even more FPs. Further analysis of this will follow.
**bcftools filters based on INFO metrics**

- BQB = Mann-Whitney U test of base quality bias (bigger is better)
- MQB = Mann-Whitney U test of mapping quality bias
- RPB = Mann-Whitney U test of read position bias
- SGB = Segregation-based metric*
- VDB = Variant distance bias for filtering splice-site artifacts in RNA-seq^
I have gone through the varifier precision VCF and pulled out the values for these INFO fields, along with whether the variant is considered a TP or FP.
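For what it's worth, pulling these INFO tags out of a VCF record can be sketched with plain string handling (a simplified illustration; the actual analysis would more likely use a proper VCF library such as pysam):

```python
def info_field(vcf_line, tag):
    """Extract one INFO tag (e.g. BQB, RPB, VDB) from a VCF record line.

    Returns None if the tag is absent; bcftools only emits these tags
    when the underlying test could be computed for the site.
    """
    info = vcf_line.rstrip("\n").split("\t")[7]  # column 8 is INFO
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == tag:
            return float(value)
    return None
```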
From this plot, if we were to filter BQB at 0.2, we would remove a lot of FPs, but we would remove a lot of TPs at the same time.
Setting this filter to <0.2 would remove a very small number of both FPs and TPs.
If we set this filter to <0.05 (each bar/bin is 0.01), then we would remove many more FPs than TPs.
If we set this filter to >-0.50, then we would remove slightly more FPs than TPs.
Note: this is a zoomed-in subsection of the X-axis.
If we set this filter to <0.002, we remove a lot more FPs than TPs.
I will try filtering with the current min. QUAL of 60. In addition, I will try out RPB<0.05, SGB>-0.5, and VDB<0.002.
Based on the investigation above, the results of judging the filtering using [#42] are below, along with a suggested final filter combination.
A reminder on the filter "keys" used on the X-axes below:
These plots are purely SNPs at the moment.
The extra filters investigated above make an appreciable difference to all samples, and a huge difference for the outlier sample.
I would suggest moving forward with the q60 r0.05 G-0.5 V0.002 filter combination.
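As a sketch, the suggested q60 r0.05 G-0.5 V0.002 combination amounts to the following per-record check. How missing INFO tags should be handled is an assumption here (bcftools only emits these tags when the test can be computed, so a missing tag is treated as passing):

```python
def passes_final_filters(qual, rpb, sgb, vdb):
    """Suggested combination from above: q60 r0.05 G-0.5 V0.002.

    A record fails if QUAL < 60, RPB < 0.05, SGB > -0.5, or
    VDB < 0.002. rpb/sgb/vdb may be None when the tag is absent.
    """
    if qual < 60:
        return False
    if rpb is not None and rpb < 0.05:
        return False
    if sgb is not None and sgb > -0.5:
        return False
    if vdb is not None and vdb < 0.002:
        return False
    return True
```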
@iqbal-lab are there standard filters you would like to apply? If not, how do you suggest we go about determining what filters to use?