matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

obey isotype switching order #63

Closed willdumm closed 2 years ago

willdumm commented 2 years ago

Explicitly avoid collapsed trees like the one on the left:

image

(from https://www.frontiersin.org/articles/10.3389/fimmu.2018.02451/full)

willdumm commented 2 years ago

Although we can fix the tree that's claimed to violate switching order (as we discussed), the likelihood calculation for that tree should change, because the ancestral node labeled 'A' indicated by the red arrow will become unobserved, and the observed 'A' will become an extra leaf. Because of this, it seems like we should add observed isotypes and infer ancestral isotypes before branching process parameter fitting, and only collapse edges between nodes with the same sequence and isotype. @WSDeWitt if we do this, would we need to change anything about the tree likelihood calculation, to remove the assumption that all edges represent mutations?
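To make the proposed collapse rule concrete, here is a minimal sketch that merges a child into its parent only when both carry the same (sequence, isotype) pair. It is not gctree code: the dict-based tree layout and the name `collapse_identical` are hypothetical, and abundance bookkeeping is left out.

```python
def collapse_identical(children, label, root):
    """Merge each child into its parent when both carry the same label.

    children: dict node -> list of child nodes (hypothetical layout)
    label: dict node -> (sequence, isotype) tuple
    Returns a new children dict containing only the surviving nodes.
    """
    collapsed = {root: []}

    def visit(node, survivor):
        # `survivor` is the node of the collapsed tree that `node` was merged into
        # (or `node` itself if it was not merged).
        for child in children.get(node, []):
            if label[child] == label[survivor]:
                # Same sequence and isotype: absorb the child, keep its descendants.
                visit(child, survivor)
            else:
                # Sequence or isotype differs: the edge stays in the collapsed tree.
                collapsed[survivor].append(child)
                collapsed[child] = []
                visit(child, child)

    visit(root, root)
    return collapsed


# Toy example: x matches the root and is absorbed; y has the same sequence
# but a different isotype, so it stays a separate node.
toy_children = {"r": ["x", "y"], "x": ["z"]}
toy_label = {
    "r": ("AAA", "IgM"),
    "x": ("AAA", "IgM"),
    "y": ("AAA", "IgG1"),
    "z": ("AAT", "IgM"),
}
toy_collapsed = collapse_identical(toy_children, toy_label, "r")
```

In this sketch an edge can survive with no sequence change at all (r to y above), which is exactly the case the question about the likelihood's "all edges represent mutations" assumption is getting at.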

Also consider

willdumm commented 2 years ago

https://colab.research.google.com/drive/1L8l4FuBHMME65CaqStTU1PfrtOZvyV6b?usp=sharing#scrollTo=S1GkM5YFPJbX

matsen commented 2 years ago

I don't have much to say other than 👍 ! Happy to discuss anytime.

wsdewitt commented 2 years ago

This all sounds right to me; however, I'm not sure that tinkering with the branching process is the happy path. I don't think the single q mutation parameter is sufficient for modeling both mutation and isotype switch rates. We would instead need to model isotype switching events as a multi-type branching process add-on, with a multi-dimensional offspring distribution matrix that is consistent with the isotype order DAG (and brings a bunch more parameters to infer). This would run in parallel to our infinite-type branching process on genotypes, giving us tuple-valued states (genotype, isotype).

Do we want to go this route? If not, an alternative (hack) is to add isotype info post hoc: explode genotypes by isotype partitions, use your modified Sankoff approach to assign parsimonious ancestral isotypes, then re-collapse on genotype+isotype.
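For reference, a generic Sankoff pass with an irreversibility constraint might look like the sketch below. This is not the modified Sankoff implementation mentioned above, just an illustration under stated assumptions: the isotype order list is a shortened example, the cost scheme (0 for no switch, 1 for any forward switch, infinite for a reversal) is one possible choice, and the dict-based tree and function names are hypothetical.

```python
import math

# Illustrative switching order (an assumption); substitute the order for your locus.
ISOTYPE_ORDER = ["IgM", "IgG3", "IgG1", "IgA1"]


def switch_cost(parent, child, order=ISOTYPE_ORDER):
    """Zero for no switch, one for a forward switch, infinite for a reversal."""
    i, j = order.index(parent), order.index(child)
    if j < i:
        return math.inf
    return 0 if i == j else 1


def sankoff_isotypes(children, leaf_isotype, root, order=ISOTYPE_ORDER):
    """Assign a minimum-cost isotype to every node of a rooted tree.

    children: dict node -> list of child nodes (leaves may be absent)
    leaf_isotype: dict leaf -> observed isotype
    Returns (assignment dict, total cost).
    """
    cost = {}  # cost[node][s] = best cost of the subtree below node, given state s

    def up(node):
        kids = children.get(node, [])
        if not kids:
            # Leaves are fixed to their observed isotype.
            cost[node] = {
                s: 0 if s == leaf_isotype[node] else math.inf for s in order
            }
            return
        for k in kids:
            up(k)
        cost[node] = {
            s: sum(
                min(switch_cost(s, t, order) + cost[k][t] for t in order)
                for k in kids
            )
            for s in order
        }

    def down(node, state, out):
        # Traceback: pick the cheapest child state given the parent's state.
        out[node] = state
        for k in children.get(node, []):
            best = min(order, key=lambda t: switch_cost(state, t, order) + cost[k][t])
            down(k, best, out)

    up(root)
    root_state = min(order, key=lambda s: cost[root][s])
    assignment = {}
    down(root, root_state, assignment)
    return assignment, cost[root][root_state]


# Toy example: one clade with IgG1 and IgA1 leaves, plus an IgM leaf.
toy_tree = {"root": ["a", "b"], "a": ["l1", "l2"]}
toy_obs = {"l1": "IgG1", "l2": "IgA1", "b": "IgM"}
toy_assignment, toy_cost = sankoff_isotypes(toy_tree, toy_obs, "root")
```

An observed genotype that sits at an internal node would first need to be "exploded" into an extra leaf child, as described above, so that its isotype enters as a leaf observation; re-collapsing on (genotype, isotype) would then follow this pass.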