Open jeromekelleher opened 6 days ago
Hmm, that's a bit concerning isn't it. But you are right about the biallelic state being hopefully robust. I guess we are happy that balletic with missing-data (i.e. states -1, 0, and 1) is still OK, and that we are robust to 1 being the ancestral allele and 0 being the derived?
Yes, I hope so. It does make me wonder if we should push forward the simple changes for allowing non-zero ancestral state and just jettison that complexity
If it's not too much work to get non-zero AS working in the matching, then I guess so? This is presumably orthogonal to using the state polarity for ancestor building in tsinfer?
I've spent the day tracking down a bug in sc2ts, which is pointing towards being a problem with tsinfer's multi-allele handling (and allele index rotating to deal with non-zero ancestral state). Based on my experimental HMM branch https://github.com/tskit-dev/tsinfer/pull/959 the version of sc2ts at e5fd80369b2e0813c2e46b1810980bb640054ed6 was loading data into a tsinfer SampleData instance, running some matches and getting the results out by overriding the tsinfer.SampleMatcher class.
Specifically, what seems to be happening is that when matching strain ERR4207042 against the 2020-03-11 ARG tsinfer is missing that this sample has a T at this site not a G (ancestral state). The haplotype we send in looks OK, but we match to a node that has the ancestral state without reporting a mutation in the output. It's not clear to me how it happened, other than I'm guessing there's some problem with the ancestral state allele rotation code.
I say this because sc2ts now uses the low-level components directly, bypassing the SampleData class entirely and is correctly making these matches (which is how I spotted the problem). So I don't think it's a problem with the actual HMM, more some issue with the high-level wrapping code.
I don't have time to create a reproducible example I'm afraid.
I don't know whether we need to do anything with this pre release - I guess we don't support matching on multiallelic sites anyway, and we're pretty sure the biallelic case is working fine, right?