More flexible interpretation of mutation rate when getting emission probability when using `_tskit.lshmm`

tskit-dev / tskit

Population-scale genomics

MIT License

147 stars 70 forks source link

More flexible interpretation of mutation rate when getting emission probability when using `_tskit.lshmm` #2891

Open szhan opened 6 months ago

szhan commented 6 months ago

Currently, the emission probability for calculating the forward, backward, and Viterbi matrices is defined as follows:

p_e = mu;
if (is_match) {
    p_e = 1 - (num_alleles - 1) * mu;
}

This implies that an allele at a site mutates to another allowed allele (num_alleles - 1) uniformly at random at a user-specified mutation rate, mu. Changing the formula for p_e to p_e = 1 - mu upon allele match should allow for more flexible (general?) interpretation of the mutation rate.

Relevant lines are the following: https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/c/tskit/haplotype_matching.c#L1055 https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/c/tskit/haplotype_matching.c#L1106 https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/c/tskit/haplotype_matching.c#L1295

szhan commented 5 months ago

Shouldn't the probability of match and probability of mismatch sum to 1? If that is the case, then I suspect there is a bug.

In this code snippet:

p_e = mu;
if (is_match) {
    p_e = 1 - (num_alleles - 1) * mu;
}

Probability of mismatch = mu Probability of match = 1 - (num_alleles - 1) * mu

Their sum is not 1 when num_alleles > 2.

Probability of match should be (1 - mu), unless probability of mismatch is (num_alleles - 1) * mu.

EDIT: Please disregard the above. I didn't realise it's the Rosen & Paten (2019) formulation, which rescales the mutation rate based on the number of alleles at the site. So, not a bug.

jeromekelleher commented 5 months ago

I agree it is a weird parameterisation, and doesn't belong in the core HMM. We should make this type of rescaling optional at the Python level.

szhan commented 4 months ago

I'm just looking at how computation of emission probability is done in test_haplotype_matching.py. It seems like it is going with a slightly different interpretation of mu (when there is no scaling based on the number of alleles)?

if is_match:
    p_e = 1 - mu
else:
    p_e = mu / (n_alleles - 1)

https://github.com/tskit-dev/tskit/blob/2dae133fd8c3414d2f9b5073b964e37b4f96e032/python/tests/test_haplotype_matching.py#L382

jeromekelleher commented 4 months ago

Maybe mu has been rescaled somewhere else? There may well be inconsistencies here, though