Closed mbhall88 closed 4 years ago
Code for investigation can be found on branch concordance-outliers.
For each of the outliers, calculate the concordance of the nanopore calls to the Illumina calls of all other samples.
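As a rough illustration of that comparison, here is a minimal sketch. It is not the code on the branch: the `Calls` representation, the function names, and the definition of concordance used here (fraction of agreeing calls over positions that are non-null in both call sets) are all assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical representation: position -> "REF", "ALT", or "NULL"
Calls = Dict[int, str]


def concordance(query: Calls, truth: Calls) -> Optional[float]:
    """Fraction of positions, non-null in both call sets, where the calls agree."""
    shared = [
        pos
        for pos in query.keys() & truth.keys()
        if query[pos] != "NULL" and truth[pos] != "NULL"
    ]
    if not shared:
        return None  # nothing to compare
    agree = sum(query[pos] == truth[pos] for pos in shared)
    return agree / len(shared)


def best_match(query: Calls, truths: Dict[str, Calls]) -> Optional[str]:
    """Name of the truth sample with the highest concordance to the query."""
    scores = {name: concordance(query, calls) for name, calls in truths.items()}
    scores = {name: s for name, s in scores.items() if s is not None}
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy data only, not real calls from this project
nanopore = {1: "ALT", 2: "REF", 3: "ALT", 4: "NULL"}
illumina = {
    "sample_a": {1: "REF", 2: "REF", 3: "REF", 4: "ALT"},
    "sample_b": {1: "ALT", 2: "REF", 3: "ALT", 4: "REF"},
}
print(best_match(nanopore, illumina))
```

If an outlier's nanopore calls match a different sample's Illumina calls better than its own, that is the signal worth following up.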
This sample has perfect ALT concordance (1.0) and genome-wide concordance of 0.9999983 with mada_1-50. There were no false refs, 7 false alts, and 4 false nulls.
This sample had best concordance with mada_2-46.
This sample had best concordance with mada_1-37.
The best concordance for these samples was actually with mada samples so it is unlikely they are sample swaps.
mada_2-46 and mada_1-50 seem extremely likely to be sample swaps. In addition to having the best concordance with each other, they were sequenced in the same nanopore run, madagascar_tb_aug_4. According to the spreadsheet, mada_2-46 had barcode 10 and mada_1-50 had barcode 09, so I guess these were mixed up.
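The "best concordance with each other" pattern can be checked mechanically. The sketch below assumes a hypothetical mapping from each nanopore sample to the Illumina sample it matched best; the function name and the placeholder sample names are illustrative, not taken from the branch.

```python
from typing import Dict, List, Tuple


def reciprocal_best_matches(best: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where a's best match is b and b's best match is a."""
    pairs = set()
    for a, b in best.items():
        # Skip samples that match themselves; require the match to be mutual
        if a != b and best.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Placeholder names, not real results
best = {
    "sample_A": "sample_B",
    "sample_B": "sample_A",
    "sample_C": "sample_D",
    "sample_D": "sample_D",
}
print(reciprocal_best_matches(best))
```

A reciprocal pair on the same run with adjacent barcodes is much stronger evidence of a swap than a one-directional best match.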
mada_1-34's best match was not on the same nanopore run so it would seem unlikely this is a sample swap. In addition, the nanopore data attached to mada_1-37 failed QC.
I will contact Simon and Sylvianne and make sure they're ok with me switching mada_2-46 and mada_1-50 barcode numbers.
The remainder I will exclude from the paper?
From this plot of call rate against concordance, we can see there are 5 samples that are clear outliers when it comes to concordance. These 5 samples are:
I looked at a random selection of VCF positions where these samples made incorrect calls and could not see any discernible metric causing this low concordance.
For the following plots, I extracted all positions (for each of the 5 samples) where bcftools made an incorrect (false) call and compared those to positions where it made a correct (true) call.
Strand bias
There does not appear to be any enrichment for strand bias in the false calls.
VCF QUAL score
Note: the false calls have a higher QUAL due to the fact that most did not have an ALT, hence the QUAL calculation is slightly different, yielding a higher QUAL score. See here for elaboration on how the calculation is different.
The main takeaway here is that if we were to raise the minimum QUAL filtering cutoff to 30, we would remove the smaller peak in the false calls. This would, of course, be at the expense of a drop in call rate. @iqbal-lab do you think this is an acceptable change to the filtering?
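To make the trade-off concrete, here is a small sketch of how one might tabulate what a given QUAL cutoff costs in retained calls and buys in removed false calls. The input format (a list of `(qual, is_correct)` pairs) and the function name are assumptions for illustration, and the numbers are toy values rather than data from these samples.

```python
from typing import List, Tuple


def effect_of_qual_cutoff(
    calls: List[Tuple[float, bool]], cutoff: float
) -> Tuple[float, int]:
    """Return (fraction of calls retained, false calls remaining) at a QUAL cutoff."""
    kept = [(qual, ok) for qual, ok in calls if qual >= cutoff]
    retained = len(kept) / len(calls) if calls else 0.0
    false_kept = sum(1 for _, ok in kept if not ok)
    return retained, false_kept


# Toy values: (QUAL, call was correct)
toy_calls = [(10, False), (25, False), (28, True), (35, True), (50, True), (60, False)]
retained, false_kept = effect_of_qual_cutoff(toy_calls, cutoff=30)
print(f"retained={retained:.2f}, false calls remaining={false_kept}")
```

Running this over a range of cutoffs would show how quickly the call rate falls relative to the number of false calls removed.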
Median depth
Strangely, the false calls seem to have higher depth. I'm not sure there is much that can be done here.
Conclusion
There does not appear to be any simple filter-based fix for the low concordance of these samples, and none of the metrics above explains it.
Next thing
Plot the concordance for each of these samples against all other samples to see if they may be sample swaps. For instance, if mada_2-46 has better concordance with a different sample, it is quite likely it has been swapped and we may need to exclude these samples.