Open hyanwong opened 2 years ago
Pinging @stsmall as this may be of interest for AG1000 data - we could use the stdpopsim AnoGam model (not fully documented yet) or the more documented Aedes one. I wonder how many triallelic sites they tend to generate?
good question @hyanwong -- I wonder how many sites multiallelic sites the AnoGam model produces. We know from the empirical Ag1kg data that ~1/3 of sites are multiallelic, but we ignored those for demographic inference
Thanks @andrewkern - of the multiallelic sites, what's the lowest freq allele? Is it the case that most of the time there's one of the 3 alleles at a vanishingly rare freq?
yeah good question. i'll defer to @stsmall on this as he has his hands on the data right now.
To pick this back up. The majority of multiallelic sites are singletons. [Ref, Alt, Alt1, Alt2] percent singleton Alt1 : 0.5881852164112996 percent singleton Alt2 : 0.7144287770714822 percent singleton Alt1 & Alt2 : 0.2824007940955862
Oh, well that's easy then, right? Just mark the allele that's a singleton as missing for the moment. I can't see any reason why not to do this.
As described briefly in https://github.com/tskit-dev/tsinfer/issues/228#issuecomment-734315735, it should be possible to use triallieic and tetraallelic sites for inference, rather than (as at present) simply placing the alleles using parsimony. The key is that we don't know which of the derived states is derived from the ancestral, and which (if any) are derived from other states, so we simply treat some of the allelic states as "missing data". There are some organisms in which there are a lot of triallelic sites (i.e. ones with high mutation rates), so I wonder if this is worth turning into some sort of formal preprocessing step, and testing if it works. Here's how it would go:
We could test the effectiveness of this by simulation quite easily.