mbhall88 closed this issue 4 years ago.
Proposed filters:
- Min coverage (5x?)
- Strand bias: 20% of reads must come from each strand
@iqbal-lab should we also throw in a QUAL filter? See https://samtools.github.io/bcftools/howtos/variant-calling.html
When I look at some stats on one particular VCF, there are quite a few SNPs below this level...
Suck it and see. Worked in pandora E. coli, but not very discriminating; very many calls with identical QUAL.
For the coverage, would you prefer me to do this based on expected coverage? i.e., I make an initial pass through the VCF and collect the values of the DP field, then filter any variant with DP < (t * m) or DP > (u * m), where m is the median (mean?) DP, t is some scaling factor (0.2?), and u is some maximum threshold (2.0?). Or do you think a fixed threshold is better?
> Suck it and see. Worked in pandora E. coli, but not very discriminating; very many calls with identical QUAL.
Not really sure what you mean by this?
We have a pretty wide range of QUAL with this sample I have been using.
Example
```
# QUAL [2]id [3]Quality [4]number of SNPs [5]number of transitions (1st ALT) [6]number of transversions (1st ALT) [7]number of indels
QUAL 0 3 1 0 1 0
QUAL 0 4 11 8 3 0
QUAL 0 5 11 4 7 0
QUAL 0 6 6 2 4 0
QUAL 0 7 8 3 5 0
QUAL 0 8 15 6 9 0
QUAL 0 9 4 2 2 0
QUAL 0 10 8 5 3 0
QUAL 0 11 3 3 0 0
QUAL 0 12 7 3 4 0
QUAL 0 13 5 3 2 0
QUAL 0 14 6 4 2 0
QUAL 0 15 6 4 2 0
QUAL 0 16 6 2 4 0
QUAL 0 17 9 5 4 0
QUAL 0 18 10 6 4 0
QUAL 0 19 6 3 3 0
QUAL 0 20 6 4 2 0
QUAL 0 21 6 3 3 0
...
QUAL 0 194 6 3 3 0
QUAL 0 195 3 2 1 0
QUAL 0 196 5 3 2 0
QUAL 0 197 3 3 0 0
QUAL 0 198 10 4 6 0
QUAL 0 199 4 3 1 0
QUAL 0 200 3 1 2 0
QUAL 0 201 3 1 2 0
QUAL 0 202 2 1 1 0
QUAL 0 203 1 0 1 0
QUAL 0 204 4 2 2 0
QUAL 0 205 4 4 0 0
QUAL 0 206 3 2 1 0
QUAL 0 207 6 3 3 0
QUAL 0 208 4 3 1 0
QUAL 0 209 2 2 0 0
QUAL 0 210 5 1 4 0
QUAL 0 211 5 2 3 0
QUAL 0 212 2 2 0 0
QUAL 0 213 7 3 4 0
QUAL 0 214 4 4 0 0
QUAL 0 215 6 3 3 0
QUAL 0 216 6 2 4 0
QUAL 0 217 8 3 5 0
QUAL 0 218 8 5 3 0
QUAL 0 219 3 2 1 0
QUAL 0 220 3 2 1 0
QUAL 0 221 3 3 0 0
QUAL 0 222 4 2 2 0
QUAL 0 223 8 4 4 0
QUAL 0 224 4 3 1 0
QUAL 0 225 465 305 160 0
QUAL 0 226 4 2 2 0
QUAL 0 227 4 3 1 0
QUAL 0 228 332 207 125 0
```
You understand me correctly, but the data seemed very different with E. coli Illumina.
> You understand me correctly, but the data seemed very different with E. coli Illumina.
I don't know if you're saying to do it or not? That's what I am confused about... (not sure if I am missing some vital British subtext?)
I'm saying feel free to try it. I can't tell in advance if it will help. "Suck it and see" meant try it and evaluate the results.
Not sure if you missed https://github.com/mbhall88/head_to_head_pipeline/issues/39#issuecomment-662849030 but are you happy with making coverage filter dynamic rather than a fixed threshold?
I vote dynamic. Especially as we are likely to rerun on subsampled reads.
Here are results from the first sweep of filter params.
Context: this is with sample mada_112 at full coverage (150x) and all default bcftools parameters. An explanation of the filters, as per the CLI menu:
```
-d, --min-depth FLOAT      Minimum depth as a fraction of the
                           expected (median) depth. This filter has ID:
                           ld. Set to 0 to disable [default: 0.2]
-D, --max-depth FLOAT      Maximum depth as a fraction of the expected
                           (median) depth. This filter has ID: hd. Set
                           to 0 to disable [default: 2.0]
-s, --min-strand-bias INT  Filter a variant if either strand has less
                           than INT% of the (high-quality) depth on the
                           called allele (DP4). This filter has ID: sb.
                           Set to 0 to disable [default: 25]
-q, --min-qual FLOAT       Filter a variant if QUAL is less than FLOAT.
                           This filter has ID: lq. Set to 0 to disable
                           [default: 0.0]
```
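For reference, the strand-bias rule described in the help text can be sketched from the bcftools DP4 tag (high-quality ref-forward, ref-reverse, alt-forward, alt-reverse read counts). This is an illustration of the rule, not the tool's actual code:

```python
def strand_bias_ok(dp4, alt_called, min_pct=25):
    """Check the called allele's per-strand support against min_pct.

    dp4: (ref_fwd, ref_rev, alt_fwd, alt_rev) high-quality read counts,
    as in the bcftools DP4 INFO tag. The variant passes if each strand
    carries at least min_pct% of the called allele's depth.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    fwd, rev = (alt_fwd, alt_rev) if alt_called else (ref_fwd, ref_rev)
    total = fwd + rev
    if total == 0:
        return False  # no high-quality support for the called allele
    return min(fwd, rev) / total * 100 >= min_pct
```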
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.999975% | 99.999779% |
| min-depth=0.1 | 100.000000% | 100.000000% | 99.999902% | 99.999779% |
| max-depth=2.0 | 100.000000% | 100.000000% | 99.999975% | 99.999779% |
| min-qual=20.0 | 100.000000% | 100.000000% | 99.999705% | 99.999902% |
| min-strand-bias=20 | 96.675900% | 100.000000% | 97.531206% | 99.999874% |
| all_filters | 96.675900% | 100.000000% | 97.530960% | 99.999950% |
And for the same data, but the 60x subsampled version
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
| min-depth=0.1 | 100.000000% | 99.861496% | 99.999091% | 99.999730% |
| max-depth=2.0 | 100.000000% | 99.861496% | 99.976139% | 99.999730% |
| min-qual=20.0 | 99.722992% | 100.000000% | 99.998747% | 99.999853% |
| min-strand-bias=20 | 96.121884% | 99.855908% | 96.761672% | 99.999924% |
| all_filters | 95.844875% | 100.000000% | 96.739114% | 99.999949% |
It seems from these that strand bias at 20% hurts us on both metrics. Minimum depth of 0.1 doesn't seem to offer any benefit over no filter, and maximum depth of 2.0 doesn't offer an advantage in either dataset. Minimum QUAL of 20 does seem to have quite a positive impact on concordance, at the cost of call rate...
Results for a different sample, R23887, which has 143x coverage.
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-depth=0.1 | 100.000000% | 100.000000% | 99.993675% | 99.999828% |
| max-depth=2.0 | 100.000000% | 100.000000% | 99.990943% | 99.999828% |
| min-qual=20.0 | 100.000000% | 100.000000% | 99.996087% | 99.999951% |
| min-strand-bias=20 | 96.883396% | 100.000000% | 97.726728% | 99.999824% |
| all_filters | 96.883396% | 100.000000% | 97.719173% | 99.999950% |
R23887 with some different params
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-depth=0.05 | 100.000000% | 100.000000% | 99.993700% | 99.999828% |
| max-depth=2.2 | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-qual=15.0 | 100.000000% | 100.000000% | 99.996210% | 99.999951% |
| min-strand-bias=10 | 99.408920% | 100.000000% | 99.448115% | 99.999827% |
| all_filters | 99.408920% | 100.000000% | 99.446319% | 99.999951% |
mada_112 60x with some different params
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
| min-depth=0.15 | 100.000000% | 99.861496% | 99.998182% | 99.999730% |
| max-depth=2.5 | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
| min-qual=15.0 | 99.861496% | 100.000000% | 99.998943% | 99.999853% |
| min-strand-bias=15 | 97.783934% | 99.858357% | 98.158142% | 99.999925% |
| all_filters | 97.645429% | 100.000000% | 98.157577% | 99.999950% |
Last one for the day. R23887 with different params
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-depth=0.2 | 100.000000% | 100.000000% | 99.992371% | 99.999828% |
| max-depth=2.1 | 100.000000% | 100.000000% | 99.996653% | 99.999828% |
| min-qual=30.0 | 100.000000% | 100.000000% | 82.714386% | 99.999970% |
| min-strand-bias=5 | 99.623858% | 100.000000% | 99.820813% | 99.999827% |
| all_filters | 99.623858% | 100.000000% | 82.598004% | 99.999970% |
I haven't really seen any evidence in these tables that any filter other than minimum quality is useful... Happy to try other values/filters if you like, @iqbal-lab?
Min qual seems sufficient.
I think the value of filters might be more obvious at 15x or 30x. It's quite common for Mtb to end up with low depth, e.g. see how many we discarded.
Seems fine for now. We will need to return to the question of how the tools do at lower depth.
Ok, so I have run the same parameters as in the initial result tables on a sample, mada_131, that has 32x nanopore coverage (median 28x from pileup).
| filter | call_rate | concordance | gw_call_rate | gw_concordance |
|---|---|---|---|---|
| no_filter | 99.918167% | 98.525799% | 99.989171% | 99.998942% |
| min-depth=0.1 | 99.918167% | 98.525799% | 99.989048% | 99.998942% |
| max-depth=2.0 | 99.836334% | 98.524590% | 99.707278% | 99.998939% |
| min-qual=20.0 | 97.545008% | 99.664430% | 99.984200% | 99.999705% |
| min-strand-bias=20 | 86.824877% | 98.963242% | 91.408123% | 99.999623% |
| all_filters | 85.188216% | 99.903939% | 91.133269% | 99.999919% |
> I think the value of filters might be more obvious at 15x or 30x. It's quite common for Mtb to end up with low depth, e.g. see how many we discarded.
As always, you're spot on it seems.
In conclusion, I will run all samples through the call-filter-concordance pipeline (using just min QUAL filter) and then look at the plot of call rate vs. concordance. We can play with filter params from there and see how it affects the whole dataset.
I have been using varifier [#42] to try and hone the filters. Here are "checkpoint" plots for recall and precision with various filters.
A note on the filter "keys" used on the X-axes below:
These plots are purely SNPs at the moment.
There is a noticeable outlier in the precision plots: mada_102. The assembly for this sample looks good. In collaboration with Martin, we found that most of the FPs are due to unmapped probes. Martin compared the assembly against H37Rv and some Illumina mapping, and found that these unmapped probes cluster in a few regions. These regions are (correctly) not in the assembly (compared to H37Rv), but for some reason bcftools is making calls on the nanopore data in these regions of H37Rv.
From these basic filters, it looks as though the q60 filter gives us the best balance of recall and precision (with more emphasis on improving precision).
This is not the complete picture though. I have been digging into the varifier precision VCFs and think there might be some bcftools INFO tags we can filter on to remove even more FPs. Further analysis of this will follow.
**bcftools filters based on INFO metrics**

- BQB = Mann-Whitney U test of base quality bias (bigger is better)
- MQB = Mann-Whitney U test of mapping quality bias
- RPB = Mann-Whitney U test of read position bias
- SGB = Segregation-based metric*
- VDB = Variant distance bias for filtering splice-site artifacts in RNA-seq^
I have gone through the varifier precision VCF and pulled out the values for these INFO fields, along with whether the variant is considered a TP or FP.
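For what it's worth, pulling these INFO tags out of a VCF record can be sketched with plain string handling (a simplified illustration; the actual analysis would more likely use a proper VCF library such as pysam):

```python
def info_field(vcf_line, tag):
    """Extract one INFO tag (e.g. BQB, RPB, VDB) from a VCF record line.

    Returns None if the tag is absent; bcftools only emits these tags
    when the underlying test could be computed for the site.
    """
    info = vcf_line.rstrip("\n").split("\t")[7]  # column 8 is INFO
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == tag:
            return float(value)
    return None
```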
From this plot, if we were to filter BQB at 0.2, we would remove a lot of FPs, but we would remove a lot of TPs at the same time.
Setting this filter to <0.2 would remove a very small number of both FPs and TPs.
If we set this filter to <0.05 (each bar/bin is 0.01), then we would remove many more FPs than TPs.
If we set this filter to >-0.50, then we would remove slightly more FPs than TPs.
Note: this is a zoomed-in subsection of the X-axis.
If we set this filter to <0.002, we remove a lot more FPs than TPs.
I will try filtering with the current min. QUAL of 60. In addition, I will try out RPB<0.05, SGB>-0.5, and VDB<0.002.
Based on the investigation above, the results of judging the filtering using [#42] are below, along with a suggested final filter combination.
A reminder on the filter "keys" used on the X-axes below:
These plots are purely SNPs at the moment.
The extra filters investigated above make an appreciable difference to all samples, and a huge difference for the outlier sample.
I would suggest moving forward with the q60 r0.05 G-0.5 V0.002 filter combination.
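As a sketch, the suggested q60 r0.05 G-0.5 V0.002 combination amounts to the following per-record check. How missing INFO tags should be handled is an assumption here (bcftools only emits these tags when the test can be computed, so a missing tag is treated as passing):

```python
def passes_final_filters(qual, rpb, sgb, vdb):
    """Suggested combination from above: q60 r0.05 G-0.5 V0.002.

    A record fails if QUAL < 60, RPB < 0.05, SGB > -0.5, or
    VDB < 0.002. rpb/sgb/vdb may be None when the tag is absent.
    """
    if qual < 60:
        return False
    if rpb is not None and rpb < 0.05:
        return False
    if sgb is not None and sgb > -0.5:
        return False
    if vdb is not None and vdb < 0.002:
        return False
    return True
```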
@iqbal-lab are there standard filters you would like to apply? If not, how do you suggest we go about determining what filters to use?