willdumm closed 2 years ago
Although we can fix the tree that's claimed to violate switching order (as we discussed), the likelihood calculation for that tree should change, because the ancestral node labeled 'A' indicated by the red arrow will become unobserved, and the observed 'A' will become an extra leaf. Because of this, it seems like we should add observed isotypes and infer ancestral isotypes before branching process parameter fitting, and only collapse edges between nodes with the same sequence and isotype. @WSDeWitt if we do this, would we need to change anything about the tree likelihood calculation, to remove the assumption that all edges represent mutations?
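To make the proposed collapsing rule concrete, here is a minimal sketch, assuming a toy `Node` class (hypothetical, not the project's tree type) with `sequence`, `isotype`, and `abundance` fields. An edge is collapsed only when parent and child agree on both sequence and isotype, so an observed cell with the same sequence but a different isotype stays as its own node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a tree node; field names are assumptions.
    sequence: str
    isotype: str
    abundance: int = 0  # 0 means the node is unobserved
    children: list = field(default_factory=list)


def collapse(node: Node) -> Node:
    """Merge a child into its parent only if sequence AND isotype match."""
    kept = []
    for child in node.children:
        collapse(child)  # after this, no grandchild shares the child's state
        if (child.sequence, child.isotype) == (node.sequence, node.isotype):
            node.abundance += child.abundance  # absorb the observed counts
            kept.extend(child.children)        # adopt the grandchildren
        else:
            kept.append(child)                 # the edge carries a change
    node.children = kept
    return node


# Same sequence throughout; only the second child differs, and only by isotype.
root = Node("AAA", "IgM", children=[
    Node("AAA", "IgM", abundance=2),
    Node("AAA", "IgA1", abundance=1),
])
collapse(root)
print(root.abundance, [(c.sequence, c.isotype) for c in root.children])
```

Note that the surviving edge in this example represents an isotype switch with no mutation, which is exactly the case the likelihood question above is about.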
Also consider phylip_parse.disambiguate so it can be reused for disambiguating ancestral isotypes.
I don't have much to say other than 👍! Happy to discuss anytime.
This all sounds right to me; however, I'm not sure that tinkering with the branching process is the happy path. I don't think the single q mutation parameter is sufficient for modeling both mutation and isotype switch rates. We would instead need to model isotype switching events as a multi-type branching process add-on, with a multi-dimensional offspring distribution matrix that concords with the isotype order DAG (and adds a bunch more parameters to infer). This would be parallel to our infinite type branching process on genotypes, giving us tuple-valued states (genotype, isotype).
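As a rough illustration of how the parameter count grows, the sketch below builds a mask of permitted switches. It assumes a linear switching order (the `ORDER` list is illustrative, and a linear order is a simplification of a general DAG) and assumes one free parameter per permitted directed switch, which is only one possible parameterization.

```python
# Illustrative linear class-switch order (an assumption; real data may use a
# different or longer list, and a general DAG need not be linear).
ORDER = ["IgM", "IgG3", "IgG1", "IgA1"]


def allowed_switches(order):
    """mask[i][j] is True if a parent of isotype i may have offspring of isotype j."""
    n = len(order)
    return [[j >= i for j in range(n)] for i in range(n)]


mask = allowed_switches(ORDER)
n = len(ORDER)

# Count off-diagonal permitted switches: each would need its own rate
# under the one-parameter-per-switch assumption.
n_switch_params = sum(mask[i][j] for i in range(n) for j in range(n) if i != j)
print(n_switch_params)
```

With k isotypes in a linear order this gives k(k-1)/2 switch parameters on top of q.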
Do we want to go this route? If not, an alternative (hack) is to add isotype info post hoc, exploding genotypes by isotype partitions, and using your modified Sankoff approach to assign parsimonious ancestral isotypes, then re-collapsing on genotype+isotype.
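For reference, the parsimonious ancestral isotype step could look roughly like the sketch below. This is a generic Sankoff pass with an irreversible switching cost, not necessarily the modified approach referred to above; the `Node` class, the `ORDER` list, and the unit cost per switch are all assumptions made for illustration.

```python
import math

# Illustrative linear class-switch order (an assumption).
ORDER = ["IgM", "IgG3", "IgG1", "IgA1"]
INF = math.inf


def switch_cost(parent: int, child: int) -> float:
    """0 for no switch, 1 for a forward switch, infinite for a reversal."""
    if child == parent:
        return 0.0
    return 1.0 if child > parent else INF


class Node:
    # Hypothetical tree node; isotype=None marks an unobserved node.
    def __init__(self, isotype=None, children=()):
        self.isotype = isotype
        self.children = list(children)
        self.cost = None


def sankoff_up(node):
    """Bottom-up pass: cost[i] = best subtree cost if node has isotype i."""
    n = len(ORDER)
    if node.isotype is None:
        cost = [0.0] * n
    else:  # observed nodes (leaves or internal) are pinned to their isotype
        cost = [0.0 if ORDER[i] == node.isotype else INF for i in range(n)]
    for child in node.children:
        child_cost = sankoff_up(child)
        for i in range(n):
            cost[i] += min(switch_cost(i, j) + child_cost[j] for j in range(n))
    node.cost = cost
    return cost


def sankoff_down(node, parent_state=None):
    """Top-down pass: pick a minimum-cost isotype given the parent's choice."""
    n = len(ORDER)
    if parent_state is None:
        state = min(range(n), key=lambda i: node.cost[i])
    else:
        state = min(range(n), key=lambda j: switch_cost(parent_state, j) + node.cost[j])
    node.isotype = ORDER[state]  # ties resolve to the earliest isotype
    for child in node.children:
        sankoff_down(child, state)


leaf_m, leaf_g, leaf_a = Node("IgM"), Node("IgG1"), Node("IgA1")
inner = Node(children=[leaf_g, leaf_a])
root = Node(children=[leaf_m, inner])
sankoff_up(root)
sankoff_down(root)
print(min(root.cost), root.isotype, inner.isotype)
```

Because observed nodes are pinned during the upward pass, an observed internal node constrains its descendants as well as its ancestors. Ties between equally parsimonious assignments are broken arbitrarily here, so a real implementation would need a deliberate policy for them.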
Explicitly avoid collapsed trees like the one on the left:
(from https://www.frontiersin.org/articles/10.3389/fimmu.2018.02451/full)