clustering consensus - Githubissues

Hi Kirston

Clustering only really makes sense if you presume most of the sequences to not be hypermutated (and more specifically, most of the sequences within each cluster to not be hypermutated; i.e., phylogenetic signal should be stronger than suspected hypermutation signal).

Better than a consensus sequence is an actual ancestral sequence presumed not to be hypermutated (i.e. from a replicating virus). The closer the potentially hypermutated sequence is evolutionarily to the presumably non-hypermutated sequence, the better, as this reduces the "noise" and homes in on the signal. But even matching sequences to canonical strain representatives from GenBank (ideally frequently referenced sequences, which are less likely to themselves bear patterns of low-level hypermutation) is a pretty good bet. The noise can be a little harder to filter out sometimes with more divergent sequences/strains, but this approach still has the advantage of being tied to actual virus sequences (consensus is a fallacy).
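
To make the "noise" versus "signal" point concrete, here is a rough sketch of comparing a query against a real reference sequence. It is only an illustration, not how hyperfreq does its analysis: the function name `count_g_to_a` is made up, the two sequences are assumed to be already aligned to the same length, and it ignores the dinucleotide context that a proper hypermutation test would take into account.

```python
def count_g_to_a(reference, query):
    """Return (g_to_a, other) mismatch counts for two aligned sequences.

    Hypothetical helper for illustration only; gap positions are skipped.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")

    g_to_a = 0
    other = 0
    for ref_base, query_base in zip(reference.upper(), query.upper()):
        # Matching positions and gaps carry no mutation information here.
        if ref_base == query_base or "-" in (ref_base, query_base):
            continue
        if ref_base == "G" and query_base == "A":
            g_to_a += 1  # candidate hypermutation signal
        else:
            other += 1   # background divergence, i.e. the "noise"
    return g_to_a, other


if __name__ == "__main__":
    # Toy sequences, not real data.
    print(count_g_to_a("GGATGGCA", "AGATAGCT"))
```

The more divergent the reference is from the query, the larger the `other` count tends to get, which is the noise that becomes harder to filter out.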

I wouldn't filter out the more obviously hypermutated sequences from the cluster in order to get a "better" consensus. This is really doctoring the results; you're biasing yourself towards "findings" if you do that. Better to stick to either the unamended cluster consensus or real reference sequences.
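
If you do go with a cluster consensus, a minimal sketch of an unamended one is a plain per-column majority vote over every sequence in the cluster, with nothing dropped beforehand. The function name `majority_consensus` is hypothetical, the input is assumed to be an already aligned cluster, and this is not taken from the hyperfreq code.

```python
from collections import Counter


def majority_consensus(aligned):
    """Per-column majority-rule consensus over ALL sequences in a cluster.

    Hypothetical helper for illustration. No sequence is filtered out first.
    Ties go to the base seen first in the column (Counter ordering).
    """
    if not aligned:
        raise ValueError("cluster is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        # Most frequent character in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    # Toy cluster, not real data.
    cluster = ["GGAT", "AGAT", "AGAT"]
    print(majority_consensus(cluster))
```

In the toy cluster, dropping the two sequences that look G-to-A mutated would flip the first consensus base from A to G, and those same two sequences would then be scored as mutated against it. That is the circularity to avoid.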

matsengrp / hyperfreq

clustering consensus #26