tskit-dev / tskit

Population-scale genomics
MIT License

Account for multiallelics in emission probabilities #2804

Open astheeggeggs opened 11 months ago

astheeggeggs commented 11 months ago

As it stands, tskit's implementation does not allow emission probabilities to differ according to the number of alleles at a site.

There are two choices here, both of which should be supported.

  1. Let the mismatch probability be a linear function of the number of alleles. This is the default when a scalar (mu, say) is passed. Each of the a-1 mismatching alleles at a site then has probability mu, so the total mismatch probability grows linearly as (a-1)*mu (where a is the number of alleles at the site).
  2. Let the mismatch probability at each site be passed explicitly by the user. This is the default when a vector of length m (the number of sites) is passed. Again, the mismatch probability is split equally between the possible alleles (but not scaled up to account for the number of alleles), each with probability mu_i/(a-1), where mu_i is the mismatch probability at site i and a is the number of alleles at site i.

This is encoded in lshmm here:

https://github.com/astheeggeggs/lshmm/blob/792c74bd9474deef55418354fcb4b86ab9c19338/lshmm/api.py#L168C8-L168C8

with warnings thrown if the user doesn't conform to the defaults.
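
For illustration, the two options above can be sketched roughly as follows. This is not the lshmm or tskit code: the function name `emission_probs`, the `[mismatch, match]` column layout, and the choice to set the match probability so that each site's probabilities sum to one are all assumptions made for the example. It also assumes every site has at least two alleles.

```python
import numpy as np


def emission_probs(mu, num_alleles):
    """Hypothetical helper: per-site [mismatch, match] emission probabilities.

    mu          -- scalar (option 1) or length-m vector (option 2)
    num_alleles -- length-m vector, number of distinct alleles at each site
    """
    a = np.asarray(num_alleles, dtype=float)
    if np.any(a < 2):
        # Assumption for this sketch: monomorphic sites are not handled.
        raise ValueError("every site must have at least two alleles")
    m = a.shape[0]
    e = np.zeros((m, 2))
    if np.ndim(mu) == 0:
        # Option 1: each mismatching allele has probability mu, so the
        # total mismatch probability is (a - 1) * mu.
        e[:, 0] = mu
        e[:, 1] = 1.0 - (a - 1) * mu
    else:
        # Option 2: mu_i is the total mismatch probability at site i,
        # split equally between the (a - 1) mismatching alleles.
        mu = np.asarray(mu, dtype=float)
        if mu.shape != (m,):
            raise ValueError("mu must be a scalar or have one entry per site")
        e[:, 0] = mu / (a - 1)
        e[:, 1] = 1.0 - mu
    return e


e_scalar = emission_probs(0.01, [2, 3, 4])
e_vector = emission_probs([0.01, 0.02, 0.03], [2, 3, 4])
```

In both cases, `e[:, 0] * (a - 1) + e[:, 1]` equals one at every site; the options differ only in whether the user-supplied value is interpreted per allele or per site.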

szhan commented 1 week ago

This is scaling the mutation rates per site according to the number of distinct alleles, right? I think it is done now. The current approach is to look at the reference and query haplotypes and count the number of unique alleles at each site.
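
A minimal sketch of that counting step, for illustration only. The function name `count_alleles_per_site` and the array layout (sites in rows, reference haplotypes in columns) are assumptions, not the actual implementation, and missing data is not handled.

```python
import numpy as np


def count_alleles_per_site(ref_panel, query):
    """Hypothetical helper: distinct alleles per site across panel and query.

    ref_panel -- (m, n) integer array, m sites by n reference haplotypes
    query     -- length-m integer array, one query haplotype
    """
    ref_panel = np.asarray(ref_panel)
    query = np.asarray(query)
    # Append the query as an extra column so each row holds every allele
    # observed at that site.
    combined = np.column_stack([ref_panel, query])
    return np.array([np.unique(row).size for row in combined])


ref = [[0, 0, 1],
       [0, 1, 2],
       [0, 0, 0]]
counts = count_alleles_per_site(ref, [1, 3, 0])
```

Here the second site has alleles 0, 1 and 2 in the panel and 3 in the query, so it counts four distinct alleles, while the third site is monomorphic.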