matsengrp / hyperfreq

Bayesian tool for APOBEC hypermutation analysis
GNU General Public License v3.0

clustering consensus #26

Open KMBarton opened 8 years ago

KMBarton commented 8 years ago

Hi,

I am having some trouble using hyperfreq with some samples in which a large percentage of the sequences are hypermutated. I assume it is because of the consensus sequences that they are compared to. Do you by chance have an example of how to generate a consensus based on clustering as you discuss in the wiki page? Also, would it be possible to run a rough hypermut analysis and remove all the suspected hypermuts and generate a consensus from that to compare my sequences to? Thank you for taking the time to design this great tool.

Kind regards, Kirston

metasoarous commented 8 years ago

Hi Kirston

Clustering only really makes sense if you presume most of the sequences to not be hypermutated (and more specifically, most of the sequences within each cluster to not be hypermutated; i.e., phylogenetic signal should be stronger than suspected hypermutation signal).
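For the clustering question, here is a minimal sketch of a per-cluster majority-rule consensus in plain Python. It assumes the sequences are already aligned to equal length and that the cluster assignments come from some other tool. The names `column_consensus` and `cluster_consensuses` and the two input dictionaries are hypothetical illustrations, not part of hyperfreq.

```python
from collections import Counter, defaultdict


def column_consensus(seqs):
    """Majority-rule consensus of equal-length, already-aligned sequences."""
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*seqs):
        counts = Counter(base.upper() for base in column)
        # Take the most frequent character in this column; on a tie,
        # the character seen first in the column wins.
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


def cluster_consensuses(aligned, clusters):
    """Build one consensus per cluster.

    aligned:  dict mapping sequence name -> aligned sequence (assumed input)
    clusters: dict mapping sequence name -> cluster id (assumed input)
    """
    members = defaultdict(list)
    for name, seq in aligned.items():
        members[clusters[name]].append(seq)
    return {cid: column_consensus(seqs) for cid, seqs in members.items()}


# Toy data: sequence "c" carries G-to-A changes but is outvoted in cluster 0.
aligned = {"a": "GGGA", "b": "GGGA", "c": "AGAA", "d": "TTTT", "e": "TTTT"}
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
result = cluster_consensuses(aligned, clusters)
```

Note that this only gives a sensible reference when the non-hypermutated sequences are the majority within each cluster, which is exactly the caveat above.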

Better than a consensus sequence is an actual ancestral sequence presumed not to be hypermutated (i.e. from a replicating virus). The closer the potentially hypermutated sequence is evolutionarily to the presumably non-hypermutated sequence, the better, as this reduces the "noise" and homes in on the signal. But even matching sequences to canonical strain representatives from GenBank (ideally frequently referenced sequences, which are less likely to themselves be suspected of bearing patterns of low-level hypermutation) is a pretty good bet. The noise can be a little harder to filter out sometimes with more divergent sequences/strains, but this approach still has the advantage of being tied to actual virus sequences (consensus is a fallacy).
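As a rough illustration of matching each sequence to its nearest real reference, here is a small Python sketch. It assumes the query and the references are already in one alignment, and it uses a plain Hamming distance that skips gap positions, which is a simplification rather than a proper evolutionary distance. The function names and the toy reference names are hypothetical.

```python
def hamming(a, b):
    """Count mismatches between two aligned sequences, skipping gap positions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x != "-" and y != "-"
    )


def closest_reference(query, references):
    """Return the name of the reference with the fewest mismatches to query.

    references: dict mapping reference name -> aligned sequence (assumed input)
    """
    return min(references, key=lambda name: hamming(query, references[name]))


# Toy data; the reference names are placeholders, not real accessions.
references = {"refA": "GGGGAAAA", "refB": "GGTTAATT"}
best = closest_reference("GAGGAAAA", references)
```

Each sequence would then be compared against the reference chosen for it, rather than against a single consensus for the whole data set.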

I wouldn't filter out the more obviously hypermutated sequences from the cluster in order to get a "better" consensus. This is really doctoring the results. You're biasing yourself towards "findings" if you do that. Better to stick to either an unamended cluster consensus or real reference sequences.