matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0

feature suggestion: weights for reference sequences #342

Open nhoffman opened 9 years ago

nhoffman commented 9 years ago

This is a half-baked thought at this point, but it has occurred to me that a lot of information is lost when selecting representative reference sequences to include in a reference package. Consider the case where the observed biological diversity for a species consists of many identical or very closely related reference sequences plus a small number of more divergent ones. We would most likely select only one representative of the most prevalent variant for the reference package, so pplacer has no way to know which of the reference sequences are more "authoritative" when performing classification.

I wonder if there is some way to represent the prevalence of each reference sequence among all candidate reference sequences as a weight, and whether the taxonomic assignment could be informed by these weights. Whether it would matter is of course another question... I could imagine that it might help mitigate classification artifacts caused by including "outlier" reference sequences in the reference package.
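
A rough sketch of the weighting idea, to make the suggestion concrete. This is not anything pplacer or guppy does today, and every name here (`prevalence_weights`, `weighted_taxon_scores`, the `ref_A`/`ref_B` labels, the cluster sizes and placement probabilities) is hypothetical and made up for illustration. It assumes each representative's weight is simply its share of all candidate sequences, and that the weight multiplies a per-reference placement probability before summing by taxon.

```python
from collections import defaultdict


def prevalence_weights(cluster_sizes):
    """Weight each representative by its share of all candidate sequences.

    cluster_sizes maps a representative's name to the number of candidate
    reference sequences it stands in for (itself included).
    """
    total = sum(cluster_sizes.values())
    return {rep: n / total for rep, n in cluster_sizes.items()}


def weighted_taxon_scores(placement_probs, taxon_of, weights):
    """Scale per-reference placement probabilities by weight, sum by taxon,
    and renormalize so the taxon scores add up to 1."""
    scores = defaultdict(float)
    for ref, prob in placement_probs.items():
        scores[taxon_of[ref]] += prob * weights[ref]
    norm = sum(scores.values())
    if norm == 0:
        return {}
    return {taxon: s / norm for taxon, s in scores.items()}


# Hypothetical data: ref_A stands in for 98 near-identical candidates,
# ref_B is a lone divergent "outlier" carrying a different species label.
cluster_sizes = {"ref_A": 98, "ref_B": 2}
taxon_of = {"ref_A": "species_X", "ref_B": "species_Y"}

# A query that places equally well next to either reference.
placement_probs = {"ref_A": 0.5, "ref_B": 0.5}

weights = prevalence_weights(cluster_sizes)
scores = weighted_taxon_scores(placement_probs, taxon_of, weights)
for taxon, score in sorted(scores.items()):
    print(f"{taxon}: {score:.2f}")
```

Without weights the query above is a tie between the two species; with prevalence weights the assignment leans heavily toward the species backed by many candidate sequences. Whether a simple multiplicative weight is the right way to combine prevalence with placement likelihoods is an open question.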