tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Bug in multi-allele matching? #977

Open jeromekelleher opened 6 days ago

jeromekelleher commented 6 days ago

I've spent the day tracking down a bug in sc2ts, which is pointing towards being a problem with tsinfer's multi-allele handling (and allele index rotating to deal with non-zero ancestral state). Based on my experimental HMM branch https://github.com/tskit-dev/tsinfer/pull/959 the version of sc2ts at e5fd80369b2e0813c2e46b1810980bb640054ed6 was loading data into a tsinfer SampleData instance, running some matches and getting the results out by overriding the tsinfer.SampleMatcher class.

Specifically, what seems to be happening is that when matching strain ERR4207042 against the 2020-03-11 ARG tsinfer is missing that this sample has a T at this site not a G (ancestral state). The haplotype we send in looks OK, but we match to a node that has the ancestral state without reporting a mutation in the output. It's not clear to me how it happened, other than I'm guessing there's some problem with the ancestral state allele rotation code.

I say this because sc2ts now uses the low-level components directly, bypassing the SampleData class entirely and is correctly making these matches (which is how I spotted the problem). So I don't think it's a problem with the actual HMM, more some issue with the high-level wrapping code.

I don't have time to create a reproducible example I'm afraid.

I don't know whether we need to do anything with this pre release - I guess we don't support matching on multiallelic sites anyway, and we're pretty sure the biallelic case is working fine, right?

hyanwong commented 6 days ago

Hmm, that's a bit concerning isn't it. But you are right about the biallelic state being hopefully robust. I guess we are happy that balletic with missing-data (i.e. states -1, 0, and 1) is still OK, and that we are robust to 1 being the ancestral allele and 0 being the derived?

jeromekelleher commented 6 days ago

Yes, I hope so. It does make me wonder if we should push forward the simple changes for allowing non-zero ancestral state and just jettison that complexity

hyanwong commented 6 days ago

If it's not too much work to get non-zero AS working in the matching, then I guess so? This is presumably orthogonal to using the state polarity for ancestor building in tsinfer?