Investigate concordance outliers

mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.

GNU General Public License v3.0

[Figure: concordance-outliers]

From this plot of call rate against concordance, we can see there are 5 samples that are clear outliers when it comes to concordance. These 5 samples are:

mada_2-46
R27006
R13303
mada_1-34
mada_1-50

I looked at a random selection of VCF positions where these samples made incorrect calls and could not see any single metric that explains the low concordance.

For the following plots I extracted all positions (for each of the 5 samples) where bcftools made an incorrect (false) call and compared those to positions where it made a correct (true) call.
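As a rough illustration of the true/false split described above, here is a minimal sketch. It assumes each call set is a plain dict mapping position to a genotype label ("REF", "ALT" or "NULL"); that encoding and the `split_calls` helper are hypothetical and not the pipeline's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: position -> "REF", "ALT" or "NULL".
Calls = Dict[int, str]


def split_calls(truth: Calls, query: Calls) -> Tuple[List[int], List[int]]:
    """Return (true_positions, false_positions) over positions present in both call sets."""
    true_pos: List[int] = []
    false_pos: List[int] = []
    for pos in sorted(truth.keys() & query.keys()):
        if query[pos] == truth[pos]:
            true_pos.append(pos)   # query call agrees with the truth call
        else:
            false_pos.append(pos)  # query call disagrees with the truth call
    return true_pos, false_pos


# Toy data only; these positions and genotypes are made up for the example.
truth = {10: "REF", 20: "ALT", 30: "REF", 40: "ALT"}
query = {10: "REF", 20: "REF", 30: "REF", 40: "NULL", 50: "ALT"}
true_pos, false_pos = split_calls(truth, query)
print(true_pos, false_pos)
```

Per-position metrics (strand bias, QUAL, depth) can then be collected separately for the two position lists and plotted side by side.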

Strand bias

There does not appear to be any enrichment for strand bias in the false calls.

VCF QUAL score

Note: the false calls have a higher QUAL because most of them do not have an ALT, and QUAL is calculated slightly differently in that case, yielding a higher score. See here for elaboration on how the calculation is different.
The main takeaway here is that if we were to raise the minimum QUAL filtering cutoff to 30, we would remove the smaller peak in the false calls. This would, of course, be at the expense of a drop in call rate. @iqbal-lab do you think this is an acceptable change to the filtering?
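To make the trade-off concrete, here is a small sketch of how call rate and concordance could be recomputed at different QUAL cutoffs. The `Call` record and both definitions (call rate as the fraction of calls kept, concordance as the fraction of kept calls that are correct) are simplifying assumptions for illustration, not the pipeline's exact definitions.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    qual: float    # VCF QUAL score of the call
    correct: bool  # True if the call agrees with the truth genotype


def rate_and_concordance(calls: List[Call], min_qual: float) -> Tuple[float, float]:
    """Return (call_rate, concordance) after dropping calls with QUAL below min_qual."""
    kept = [c for c in calls if c.qual >= min_qual]
    call_rate = len(kept) / len(calls) if calls else 0.0
    concordance = sum(c.correct for c in kept) / len(kept) if kept else 0.0
    return call_rate, concordance


# Toy data only; these QUAL values are made up for the example.
calls = [
    Call(25, False), Call(28, False), Call(26, True), Call(35, True),
    Call(60, True), Call(70, True), Call(80, False), Call(90, True),
]
for cutoff in (0, 30):
    rate, conc = rate_and_concordance(calls, cutoff)
    print(f"min QUAL {cutoff}: call rate {rate:.3f}, concordance {conc:.3f}")
```

Running the same function over a range of cutoffs would show how much call rate is given up for each gain in concordance.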

Median depth

Strangely, the false calls seem to have higher depth. I'm not sure there is much that can be done here.

Conclusion

There does not appear to be any simple change to the filters that would explain or fix the low concordance of these samples.

Next thing

Plot the concordance for each of these samples against all other samples to see if they may be sample swaps. For instance, if mada_2-46 has better concordance with a different sample, it is quite likely it has been swapped and we may need to exclude these samples.
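The all-vs-all check could be sketched roughly as follows. This assumes call sets are dicts of position to genotype label and uses a simplified concordance (fraction of shared positions with identical calls); the helper names and the toy samples are hypothetical.

```python
from typing import Dict

# Hypothetical encoding: position -> "REF", "ALT" or "NULL".
Calls = Dict[int, str]


def concordance(truth: Calls, query: Calls) -> float:
    """Fraction of shared positions where the two call sets agree."""
    shared = truth.keys() & query.keys()
    if not shared:
        return 0.0
    return sum(truth[p] == query[p] for p in shared) / len(shared)


def best_match(query: Calls, truths: Dict[str, Calls]) -> str:
    """Name of the truth sample with the highest concordance to the query."""
    return max(truths, key=lambda name: concordance(truths[name], query))


# Toy data only: two samples whose Nanopore calls match each other's Illumina calls.
illumina = {
    "sample_A": {1: "ALT", 2: "REF", 3: "ALT"},
    "sample_B": {1: "REF", 2: "ALT", 3: "REF"},
}
nanopore = {
    "sample_A": {1: "REF", 2: "ALT", 3: "REF"},
    "sample_B": {1: "ALT", 2: "REF", 3: "ALT"},
}
best = {name: best_match(calls, illumina) for name, calls in nanopore.items()}
# A sample whose best match is not itself is a swap candidate.
suspects = [name for name, match in best.items() if match != name]
print(best, suspects)
```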

Results

mada_2-46

This sample has perfect ALT concordance (1.0) and genome-wide concordance of 0.9999983 with mada_1-50. There were no false refs, 7 false alts, and 4 false nulls.

mada_1-50

This sample had best concordance with mada_2-46.

mada_1-34

This sample had best concordance with mada_1-37.

R13303 and R27006

The best concordance for these samples was actually with mada samples, so it is unlikely they are sample swaps.

Conclusion

mada_2-46 and mada_1-50 seem extremely likely to be sample swaps. In addition to having the best concordance with each other, they were sequenced in the same nanopore run, madagascar_tb_aug_4. According to the spreadsheet, mada_2-46 had barcode 10 and mada_1-50 had barcode 09 - so I suspect the barcodes were mixed up.
mada_1-34's best match was not on the same nanopore run so it would seem unlikely this is a sample swap. In addition, the nanopore data attached to mada_1-37 failed QC.
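The reasoning above (reciprocal best match plus same nanopore run) can be written down as a small check. The best-match pairs and the madagascar_tb_aug_4 run come from the results above; the run names for mada_1-34 and mada_1-37 are placeholders, since their actual runs are not stated here, and `reciprocal_swaps` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def reciprocal_swaps(best: Dict[str, str], run_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pairs that are each other's best match and were sequenced in the same run."""
    pairs: List[Tuple[str, str]] = []
    for a, b in best.items():
        same_run = run_of.get(a) is not None and run_of.get(a) == run_of.get(b)
        # a < b reports each reciprocal pair only once
        if a < b and best.get(b) == a and same_run:
            pairs.append((a, b))
    return pairs


# Best matches as reported in the results above.
best = {
    "mada_2-46": "mada_1-50",
    "mada_1-50": "mada_2-46",
    "mada_1-34": "mada_1-37",
}
run_of = {
    "mada_2-46": "madagascar_tb_aug_4",
    "mada_1-50": "madagascar_tb_aug_4",
    "mada_1-34": "placeholder_run_a",  # placeholder, real run not given here
    "mada_1-37": "placeholder_run_b",  # placeholder, real run not given here
}
swaps = reciprocal_swaps(best, run_of)
print(swaps)
```

Requiring both conditions keeps a one-directional best match such as mada_1-34 out of the swap list.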

I will contact Simon and Sylvianne and make sure they're ok with me switching mada_2-46 and mada_1-50 barcode numbers.

Should I exclude the remaining samples from the paper?