tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Pipeline or function to allow triallelic sites for inference? #670

Open hyanwong opened 2 years ago

hyanwong commented 2 years ago

As described briefly in https://github.com/tskit-dev/tsinfer/issues/228#issuecomment-734315735, it should be possible to use triallieic and tetraallelic sites for inference, rather than (as at present) simply placing the alleles using parsimony. The key is that we don't know which of the derived states is derived from the ancestral, and which (if any) are derived from other states, so we simply treat some of the allelic states as "missing data". There are some organisms in which there are a lot of triallelic sites (i.e. ones with high mutation rates), so I wonder if this is worth turning into some sort of formal preprocessing step, and testing if it works. Here's how it would go:

  1. Assume we have large numbers of trialleleic or tetrallelelic site in which the ancestral state is known
  2. For each site independently, take one of the non ancestral states as the one on which to base the inference (probably the non-ancestral state with the highest frequency?)
  3. Run through the sample data file through a pipeline which creates a new sample data file which marks all other states as missing. This allows these states to be placed either as mutations from the ancestral state or as mutations from the derived state
  4. Do the inference
  5. Map the original site data onto the tree (possibly using parsimony?) to add in the extra mutations necessary.

We could test the effectiveness of this by simulation quite easily.

hyanwong commented 2 years ago

Pinging @stsmall as this may be of interest for AG1000 data - we could use the stdpopsim AnoGam model (not fully documented yet) or the more documented Aedes one. I wonder how many triallelic sites they tend to generate?

andrewkern commented 2 years ago

good question @hyanwong -- I wonder how many sites multiallelic sites the AnoGam model produces. We know from the empirical Ag1kg data that ~1/3 of sites are multiallelic, but we ignored those for demographic inference

hyanwong commented 2 years ago

Thanks @andrewkern - of the multiallelic sites, what's the lowest freq allele? Is it the case that most of the time there's one of the 3 alleles at a vanishingly rare freq?

andrewkern commented 2 years ago

yeah good question. i'll defer to @stsmall on this as he has his hands on the data right now.

stsmall commented 2 years ago

To pick this back up. The majority of multiallelic sites are singletons. [Ref, Alt, Alt1, Alt2] percent singleton Alt1 : 0.5881852164112996 percent singleton Alt2 : 0.7144287770714822 percent singleton Alt1 & Alt2 : 0.2824007940955862

hyanwong commented 2 years ago

Oh, well that's easy then, right? Just mark the allele that's a singleton as missing for the moment. I can't see any reason why not to do this.