Adding AA info to LBI - Githubissues

I got around to trying one of the methods of combining amino acid and local branching info that we mentioned in the paper, with somewhat interesting results. I can think of two ways to do this: either add some locality to the consensus calculation, or add AA info to the lb calculation. I chose to do the latter: starting with the nucleotide tree, I set the length of each edge to the amino acid Hamming distance between the sequences of its two nodes. I then calculated lbi and lbr on this "AA tree".
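To make the edge relabeling concrete, here is a minimal sketch of the idea, not the actual code used for these results. It assumes the tree topology is given as a hypothetical `parent_of` dict (child name to parent name) and that every node, including inferred internal nodes, already has an aligned amino acid sequence in a hypothetical `aa_seqs` dict.

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def aa_edge_lengths(parent_of, aa_seqs):
    """Return {child: AA Hamming distance to its parent} for every non-root node.

    parent_of: {child_name: parent_name} (hypothetical representation of the
               nucleotide tree's topology, which is left unchanged)
    aa_seqs:   {node_name: aligned amino acid sequence} (assumed to be available
               for internal nodes as well as leaves)
    """
    return {child: hamming(aa_seqs[parent], aa_seqs[child])
            for child, parent in parent_of.items()}


# Toy example: the edge above "a" carries only synonymous mutations, so its AA length is 0.
parent_of = {"a": "root", "b": "root", "c": "a"}
aa_seqs = {"root": "MKVLA", "a": "MKVLA", "b": "MRVLA", "c": "MKILG"}
aa_lengths = aa_edge_lengths(parent_of, aa_seqs)
```

Note that edges with only synonymous mutations end up with length zero, which is the sense in which the AA tree ignores synonymous mutations.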

This "aa-lbi" generally does much better than plain "nuc-lbi" (and never does worse). And in a lot of cases it does significantly better than aa-cdist, which is something that no other metric managed. Here are some plots from the paper, with aa-lbi added in pink:

But, and this is a big but, like the nuc version, aa-lbi performs very poorly with low selection strength:

I think this plot encapsulates a lot of what we know about the various metrics:

having AA info (i.e. ignoring synonymous mutations) is very important
at low selection strength the consensus approach is much better than the local branching approach
at high selection strength the local branching approach is a bit better than the consensus approach
for choosing among (rather than within) families, the consensus approach is better

In practice, I don't think it makes sense to recommend aa-lbi any time soon. The low-selection-strength regime is super important, and I don't think there's any method of measuring selection strength that's nearly accurate enough to figure out if you're on the left or right of that graph. But this does show that there's significant information missing from aa-cdist (which we couldn't prove in the paper), so there is room for improvement with a more sophisticated approach.

The performance of aa-lbr, on the other hand, is mostly identical to that of nuc-lbr. I think this could be because nuc-lbr can use a much larger tau than nuc-lbi (see paper for reasons), so it might already be able to "see past" all the synonymous mutation noise, in which case removing that noise makes less of a difference.

The tau optimization behavior of aa-lbi and aa-lbr is also quite different from that of their nuc analogues. Mostly, they are much less sensitive to changes in tau. This might be because removing the useless synonymous mutations makes it less important how far you're looking into the tree, but I'm not very convinced by this explanation.
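For anyone who wants to poke at the tau dependence, here is a rough sketch of an unnormalized LBI calculation, assuming the usual definition of LBI as the tree length around a node discounted exponentially with distance at scale tau. It is an illustration under that assumption, not the implementation behind the plots above; the `children` and `length` dicts are hypothetical inputs, and `length` could hold either nucleotide or AA edge lengths.

```python
import math


def lbi(children, length, root, tau):
    """Unnormalized local branching index for every node.

    children: {node: [child, ...]} (leaves may be omitted)
    length:   {node: length of the edge above that node} (root may be omitted)
    """
    # Depth-first traversal so that every parent appears before its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # "up" message: discounted tree length in the subtree below a node,
    # including the node's own edge, as seen from the top of that edge.
    up = {}
    for node in reversed(order):  # children before parents
        decay = math.exp(-length.get(node, 0.0) / tau)
        below = sum(up[c] for c in children.get(node, []))
        up[node] = decay * below + tau * (1.0 - decay)

    # "down" message: discounted tree length outside a node's subtree,
    # including the node's own edge, as seen from the node itself.
    down = {root: 0.0}
    for node in order:  # parents before children
        kids = children.get(node, [])
        total_up = sum(up[c] for c in kids)
        for c in kids:
            decay = math.exp(-length[c] / tau)
            down[c] = decay * (down[node] + total_up - up[c]) + tau * (1.0 - decay)

    return {n: down[n] + sum(up[c] for c in children.get(n, [])) for n in order}


# Toy tree with AA-style integer edge lengths, scanned over a few values of tau.
children = {"root": ["a", "b"], "a": ["c"]}
length = {"a": 0, "b": 1, "c": 2}
scores = {tau: lbi(children, length, "root", tau) for tau in (0.5, 2.0, 8.0)}
```

As tau gets very large the discounting disappears and every node's value approaches the total tree length, so all the discrimination between nodes comes from intermediate values of tau.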

psathyrella / selection-metric-comments

Adding AA info to LBI #7