psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Issues with COAR #326

Open psathyrella opened 7 months ago

psathyrella commented 7 months ago
  1. The alignment step doesn't always do a good job of handling cases where the true and inferred lineages don't have neatly corresponding sequences. For instance, here it aligns two nodes with hamming distance 20: [image: coar-weirdness]. Here are the two nodes in the true tree: [image: coar-weirdness-true-tree], and in the inferred tree: [image: coar-weirdness-inf-tree]. To me, the inferred tree is clearly not claiming that these two nodes are equivalent; rather, it just has an extra node near the root. One misalignment like this, however, will completely dominate the COAR calculation, since correctly aligned seqs are only ever off by a couple of bases.

  2. Since the lineages from most leaves come together near the root, errors in sequence inference near the root are counted many times, which does not seem intuitive: if I incorrectly infer one mutation near the root, I don't think that the impact of that mistake should necessarily scale with the number of leaves N. For instance, here the naive sequence is off by 4, and it's counted in the calculation for every leaf's lineage: [image: coar-counting]

  3. I don't think that total sequence length is the correct denominator (max penalty). In any given tree, the most that we can be wrong seems to scale with the total tree depth or the number of mutations, rather than with total sequence length. Using tree depth or number of mutations would also result in COAR values that are nearer to 1, whereas now COAR is around 0.0003, and having lots of leading zeros in plots is always confusing.
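
To make the denominator point in 3. concrete, here is a rough toy sketch. It assumes a single three-node lineage whose true and inferred nodes already correspond one to one, so no alignment step is involved. The helper names (`mutate`, `normalized_penalty`) and both denominators are illustrative assumptions, not the definition from the paper and not code from partis.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mutate(seq, positions, base="C"):
    """Return a copy of seq with the given positions set to base (hypothetical helper)."""
    chars = list(seq)
    for pos in positions:
        chars[pos] = base
    return "".join(chars)


def normalized_penalty(true_lineage, inferred_lineage, denominator):
    """Summed hamming distance over position-matched nodes, divided by a chosen max penalty."""
    total = sum(hamming(t, i) for t, i in zip(true_lineage, inferred_lineage))
    return total / denominator


seq_len = 400
naive = "A" * seq_len
true_lineage = [naive, mutate(naive, range(5)), mutate(naive, range(10))]
# Inferred lineage misses one mutation in the intermediate node.
inferred_lineage = [naive, mutate(naive, range(4)), mutate(naive, range(10))]

# Assumed length-based max penalty: every base of every node could be wrong.
length_based = normalized_penalty(
    true_lineage, inferred_lineage, len(true_lineage) * seq_len
)
# Assumed mutation-based max penalty: number of mutations from naive to leaf.
n_mutations = hamming(true_lineage[0], true_lineage[-1])
mutation_based = normalized_penalty(true_lineage, inferred_lineage, n_mutations)

print(f"length-based: {length_based:.5f}, mutation-based: {mutation_based:.2f}")
```

With the same single-base error, the length-based value has several leading zeros while the mutation-based value sits on a scale where 1 means "as wrong as this lineage allows".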

While 3. is potentially worth implementing, 1. and 2. are more inherent and just make me more reluctant to rely on COAR as a final metric.
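
A toy sketch of why 1. and 2. matter numerically. It assumes a star-like tree in which every leaf's lineage is just naive then leaf, with the per-node errors given directly rather than computed from sequences. The function names, node labels and numbers are made up for illustration (the 4 and the 20 echo the examples above, the rest is arbitrary).

```python
def per_leaf_total(lineages, node_error):
    """Sum node errors along every root-to-leaf lineage, so shared nodes count once per leaf."""
    return sum(node_error[node] for lineage in lineages for node in lineage)


def per_node_total(lineages, node_error):
    """Sum node errors counting each distinct node once, however many lineages pass through it."""
    unique_nodes = {node for lineage in lineages for node in lineage}
    return sum(node_error[node] for node in unique_nodes)


n_leaves = 50
# The naive sequence is off by 4 bases; every leaf is inferred perfectly.
node_error = {"naive": 4}
node_error.update({f"leaf-{i}": 0 for i in range(n_leaves)})
lineages = [["naive", f"leaf-{i}"] for i in range(n_leaves)]

leaf_weighted = per_leaf_total(lineages, node_error)
node_weighted = per_node_total(lineages, node_error)
print(f"per-leaf total: {leaf_weighted}, per-node total: {node_weighted}")

# Point 1: hamming distances for the aligned node pairs of one lineage, where
# one pair is a misalignment (distance 20) and the others are off by 0 to 2.
aligned_pair_distances = [1, 2, 20, 0]
misaligned_share = max(aligned_pair_distances) / sum(aligned_pair_distances)
print(f"share of penalty from the misaligned pair: {misaligned_share:.2f}")
```

In this toy case the same 4-base error contributes 200 when summed per leaf and 4 when each node is counted once, and the single misaligned pair accounts for most of that lineage's penalty.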

My guess is that what we want COAR to do is measure the accuracy of the order of inferred mutations from the root. But I think that in practice just looking at the handful of inferred ancestral sequences doesn't really do this. I think we could compare the order of inferred and true mutations (even without keeping track of the full ordered list of mutations in simulation), but I'm not sure if it's worthwhile.
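
One way the mutation-order comparison could look, as a minimal sketch only. It assumes each lineage is a root-to-leaf list of equal-length sequences, reads mutations off consecutive nodes, and scores the fraction of mutation pairs whose relative order agrees between the true and inferred lineages (a pairwise-concordance measure that I'm swapping in for illustration, not something from the COAR paper). `mutation_steps` and `order_concordance` are hypothetical names.

```python
from itertools import combinations


def mutation_steps(lineage):
    """Map each (position, new_base) mutation to the lineage step where it first appears."""
    steps = {}
    for step, (parent, child) in enumerate(zip(lineage, lineage[1:])):
        for pos, (p, c) in enumerate(zip(parent, child)):
            if p != c:
                steps.setdefault((pos, c), step)
    return steps


def order_concordance(true_lineage, inferred_lineage):
    """Fraction of shared mutation pairs ordered the same way in both lineages.

    Pairs that occur on the same step in either lineage are ties and are skipped.
    Returns None if no pair can be compared.
    """
    true_steps = mutation_steps(true_lineage)
    inf_steps = mutation_steps(inferred_lineage)
    shared = sorted(set(true_steps) & set(inf_steps))
    concordant = comparable = 0
    for m1, m2 in combinations(shared, 2):
        d_true = true_steps[m1] - true_steps[m2]
        d_inf = inf_steps[m1] - inf_steps[m2]
        if d_true == 0 or d_inf == 0:
            continue  # order is undetermined in at least one lineage
        comparable += 1
        concordant += (d_true > 0) == (d_inf > 0)
    return concordant / comparable if comparable else None


true_lineage = ["AAAA", "CAAA", "CCAA", "CCCA"]
# Same mutations, but the first two are inferred in the opposite order.
inferred_lineage = ["AAAA", "ACAA", "CCAA", "CCCA"]
score = order_concordance(true_lineage, inferred_lineage)
print(f"order concordance: {score:.3f}")
```

Since this only uses the node sequences along each lineage, the two lineages don't need the same number of nodes, and an extra inferred node near the root just adds ties rather than a large penalty.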

psathyrella commented 7 months ago

Attaching the COAR definition: Davidsen and Matsen 2018 - coar-defn.pdf