Issues with COAR

1. The alignment step doesn't always do a good job of handling cases where the true and inferred lineages don't have neatly corresponding sequences. For instance, here it aligns two nodes with Hamming distance 20: [image]. Here are the two nodes in the true tree: [image], and in the inferred tree: [image]. To me, the inferred tree is clearly not claiming that these two nodes are equivalent; rather, it just has an extra node near the root. One misalignment like this, however, will completely dominate the COAR calculation, since correctly aligned seqs are only ever off by a couple of bases.
2. Since the lineages from most leaves come together near the root, errors in sequence inference near the root are counted many times, which does not seem intuitive: if I incorrectly infer one mutation near the root, I don't think the impact of that mistake should necessarily scale with the number of leaves. For instance, here the naive sequence is off by 4, and it's counted in the calculation for every leaf's lineage: [image]
3. I don't think that total sequence length is the correct denominator (max penalty). In any given tree, the most that we can be wrong seems to scale with the total tree depth or the number of mutations, rather than with total sequence length. Using one of those instead would also give COAR values that are nearer to 1, whereas now COAR is like 0.0003, and having lots of leading zeros in plots is always confusing.
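A toy sketch of the second and third points, not the actual COAR code: it assumes the true and inferred lineages are already paired node by node (so it skips the alignment step entirely), and both denominators are hypothetical stand-ins for the "max penalty". All function names are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lineage_penalty(true_lineage, inferred_lineage):
    """Summed Hamming distance over node pairs (assumes nodes are pre-aligned)."""
    return sum(hamming(t, i) for t, i in zip(true_lineage, inferred_lineage))


def seq_length_denominator(true_lineage):
    """Hypothetical max penalty: every base wrong at every node."""
    return len(true_lineage[0]) * len(true_lineage)


def n_mutations_denominator(true_lineage):
    """Hypothetical max penalty: number of mutations along the true lineage."""
    return sum(hamming(p, c) for p, c in zip(true_lineage, true_lineage[1:]))


def mean_normalized_penalty(true_lineages, inferred_lineages, denominator):
    """Average over leaves of (lineage penalty / denominator)."""
    scores = [
        lineage_penalty(t, i) / denominator(t)
        for t, i in zip(true_lineages, inferred_lineages)
    ]
    return sum(scores) / len(scores)


# Toy tree: one root, five leaves, each leaf 4 mutations away from the root.
root = "A" * 400
bad_root = "C" + root[1:]  # inferred root with a single wrong base
leaves = [root[:4 * k] + "GGGG" + root[4 * k + 4:] for k in range(5)]

true_lineages = [[root, leaf] for leaf in leaves]
inferred_lineages = [[bad_root, leaf] for leaf in leaves]  # leaves are observed

# How many times the one root error enters the sum (once per leaf lineage).
root_error_count = sum(
    hamming(t[0], i[0]) for t, i in zip(true_lineages, inferred_lineages)
)
by_length = mean_normalized_penalty(
    true_lineages, inferred_lineages, seq_length_denominator
)
by_mutations = mean_normalized_penalty(
    true_lineages, inferred_lineages, n_mutations_denominator
)
print(root_error_count, by_length, by_mutations)
```

In this toy tree the single wrong base in the root is counted five times, once per leaf. Normalizing by sequence length gives a value of about 0.00125, while normalizing by the number of true mutations gives about 0.25, which is the kind of scale difference described in the third point.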

While 3. is potentially worth implementing, 1. and 2. are more inherent to the metric, and they make me more reluctant to rely on COAR as a final metric.

My guess is that what we want COAR to do is measure the accuracy of the order of inferred mutations from the root. But I think that in practice, just looking at the handful of inferred ancestral sequences doesn't really do this. I think we could compare the order of inferred and true mutations (even without keeping track of the full ordered list of mutations in simulation), but I'm not sure if it's worthwhile.
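One way such an order comparison might look, as a rough sketch only: it recovers the mutations on each branch from the sequences along a root-to-leaf lineage, then computes the fraction of mutation pairs whose relative order agrees between the true and inferred lineages. Nothing here is part of partis or of the existing COAR code; the function names and the pairwise-agreement score are assumptions chosen for illustration.

```python
from itertools import combinations


def mutation_depths(lineage):
    """Map each mutation (position, new base) to the branch index where it first appears.

    `lineage` is a list of equal-length sequences ordered from root to leaf.
    """
    depths = {}
    for step, (parent, child) in enumerate(zip(lineage, lineage[1:])):
        for pos, (p, c) in enumerate(zip(parent, child)):
            if p != c:
                depths.setdefault((pos, c), step)
    return depths


def order_agreement(true_lineage, inferred_lineage):
    """Fraction of shared mutation pairs whose order agrees in both lineages.

    Pairs that fall on the same branch in either lineage are skipped, since
    that lineage makes no claim about their relative order.
    Returns None if no pair is ordered in both lineages.
    """
    t = mutation_depths(true_lineage)
    i = mutation_depths(inferred_lineage)
    shared = sorted(set(t) & set(i))
    agree = total = 0
    for m1, m2 in combinations(shared, 2):
        dt = t[m1] - t[m2]
        di = i[m1] - i[m2]
        if dt == 0 or di == 0:
            continue  # tie in at least one lineage: no order to compare
        total += 1
        agree += (dt > 0) == (di > 0)
    return agree / total if total else None


# The two lineages have different numbers of nodes; no node alignment is needed.
true_lineage = ["AAAA", "CAAA", "CGAA", "CGTA"]
inferred_lineage = ["AAAA", "AGAA", "CGTA"]
agreement = order_agreement(true_lineage, inferred_lineage)
print(agreement)
```

Because this compares mutations rather than nodes, an extra or missing node in the inferred lineage doesn't force a bad node-to-node alignment. It also only needs the sequences at the nodes of each lineage, so mutations on the same branch stay unordered rather than requiring the full ordered list from simulation.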

psathyrella / partis

Issues with COAR #326