jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Excluding hypervariable sites? #182

Closed (hyanwong closed 1 year ago)

hyanwong commented 1 year ago

I just saw https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 (from Nicola De Maio & colleagues, FWIW). They say:

we propose masking sites that appear to be highly homoplasic and have no phylogenetic signal and/or low prevalence – these can be recurrent artefacts, or otherwise hypermutable low-fitness sites that might similarly cause phylogenetic noise. A current list of these is: 187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700. We provide technical details of how these sites were identified below, however please note that all lists of sites outlined here are a work in progress, and might be affected by many choices made in the preliminary phylogenetic steps.

It might be worth seeing (a) whether we identify these (or others) as hypervariable, and (b) whether excluding them gives better results (e.g. fewer false positives).
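For concreteness, here is a minimal sketch of what checking and masking these sites could look like. It assumes positions are 1-based genome coordinates and that sequences are plain strings. The helper names `mask_sites` and `missing_from_exclusion` are hypothetical and are not part of sc2ts.

```python
# Sites listed in the virological.org post quoted above.
# Assumption: these are 1-based positions on the reference genome.
HOMOPLASIC_SITES = [
    187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074,
    13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144,
    26461, 26681, 28077, 28826, 28854, 29700,
]


def mask_sites(sequence, sites, mask_char="N"):
    """Return a copy of `sequence` with the given 1-based positions masked.

    Hypothetical helper for illustration only. Positions that fall
    outside the sequence are silently ignored.
    """
    masked = list(sequence)
    for pos in sites:
        if 1 <= pos <= len(masked):
            masked[pos - 1] = mask_char
    return "".join(masked)


def missing_from_exclusion(sites, excluded):
    """Return the sorted positions in `sites` that are not in `excluded`.

    Hypothetical helper: `excluded` stands in for whatever set of
    positions an inference pipeline already drops.
    """
    excluded = set(excluded)
    return sorted(s for s in sites if s not in excluded)
```

Calling `missing_from_exclusion(HOMOPLASIC_SITES, already_excluded)` with an existing exclusion list would show which of the 25 sites, if any, are not yet masked.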

hyanwong commented 1 year ago

Oh, doh, we do this already.

In addition, we exclude 481 problematic sites flagged as prone to sequencing errors or as highly homoplasic entirely (https://github.com/W-L/ProblematicSites_SARS-CoV2/, accessed 2022-09-22)

Sorry.