szhan / onekg_analysis

Evaluation of genotype imputation methods using the unified genealogy dataset
MIT License

Explore patterns of switches in HMM copying paths #19

Closed szhan closed 1 year ago

szhan commented 1 year ago

Related to #18

szhan commented 1 year ago

I'm comparing the parent node ids of the paths obtained under different parameters:

These histograms show the number of differences in parent node ids for each sample. They show that we don't get the same paths despite identical MMR.
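
A minimal sketch of how such a per-sample count could be computed, assuming each copying path is a sequence with one parent node id per site and that both runs cover the same sites. The function name and the toy paths are hypothetical, not taken from the analysis code.

```python
from collections import Counter


def count_path_differences(path_a, path_b):
    """Count the sites at which two copying paths name different parent nodes."""
    if len(path_a) != len(path_b):
        raise ValueError("paths must cover the same number of sites")
    return sum(a != b for a, b in zip(path_a, path_b))


# Toy paths (made up): 3 samples, 5 sites, one parent node id per site.
paths_run1 = [[0, 0, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 0, 0, 4]]
paths_run2 = [[0, 0, 1, 1, 1], [2, 2, 2, 5, 5], [3, 0, 0, 0, 1]]

# One count per sample; these counts are what a histogram would bin.
diffs = [count_path_differences(a, b) for a, b in zip(paths_run1, paths_run2)]
histogram = Counter(diffs)
print(diffs, dict(histogram))
```

A count of 0 means the two runs produced the same path for that sample.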

[Image: Screen Shot 2023-06-29 at 7 55 15 AM]

szhan commented 1 year ago

The x-axis label is wrong above. It should be "Number of parent node id differences".

szhan commented 1 year ago

The MMR isn't simply equal to rho / mu, so I've been thinking about this wrong. Using the parameter combinations above, one shouldn't expect to get the same paths even though rho / mu is maintained.

szhan commented 1 year ago

h/t Duncan Palmer. rho and mu should be related as follows, where n is the size of the reference sample set and k is the MMR.

import math


def compute_rho(mu, n, k):
    """Return rho given mu, the reference panel size n, and k."""
    a = n * mu**k
    b = (1 - mu)**k + (n - 1) * mu**k
    return a / b


def compute_k(mu, rho, n):
    """Invert compute_rho: return k given mu, rho, and n."""
    a = math.log((rho / n) / (1 - rho + rho / n))
    b = math.log(mu / (1 - mu))
    return a / b
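
As a quick check of the two functions above, the sketch below repeats them and evaluates k for two parameter pairs that share the same rho / mu. The parameter values are hypothetical, chosen only for illustration; they are not the combinations used in the earlier runs.

```python
import math


def compute_rho(mu, n, k):
    a = n * mu**k
    b = (1 - mu)**k + (n - 1) * mu**k
    return a / b


def compute_k(mu, rho, n):
    a = math.log((rho / n) / (1 - rho + rho / n))
    b = math.log(mu / (1 - mu))
    return a / b


n = 1400  # size of the reference sample set

# Two hypothetical parameter pairs, both with rho / mu == 1.
k_a = compute_k(mu=1e-2, rho=1e-2, n=n)
k_b = compute_k(mu=1e-4, rho=1e-4, n=n)
print(k_a, k_b)  # the two values of k differ

# Round trip: feeding k back into compute_rho should recover rho.
rho_back = compute_rho(mu=1e-2, n=n, k=k_a)
print(rho_back)
```

Under these definitions, holding rho / mu fixed does not hold k fixed, which fits the observation that the paths differed.
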
szhan commented 1 year ago

I did another run setting mu to 1e-07 and 1e-08 and letting rho be defined by the function above, with k set to 1 (which is MMR) and n set to 1,400 (which is the size of the reference sample set used in the previous runs). Some paths still differ, but less so.
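
For reference, a small sketch of the rho values this setup implies. It repeats compute_rho from the earlier comment so that it runs on its own; the loop and variable names are illustrative only.

```python
def compute_rho(mu, n, k):
    a = n * mu**k
    b = (1 - mu)**k + (n - 1) * mu**k
    return a / b


n = 1400  # size of the reference sample set
k = 1

# rho implied by each of the two mu values used in this run.
rhos = {mu: compute_rho(mu, n, k) for mu in (1e-7, 1e-8)}
for mu, rho in rhos.items():
    print(f"mu={mu:g} rho={rho:.6g} rho/mu={rho / mu:.2f}")
```

With k = 1 the formula reduces to rho = n * mu / (1 + (n - 2) * mu), so for these small values of mu the ratio rho / mu is close to n rather than to 1.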

[Image: Screen Shot 2023-06-29 at 1 38 11 PM]

szhan commented 1 year ago

Also, see #21

szhan commented 1 year ago

Results of the experiments so far are consistent with numerical instability in the algorithm.

szhan commented 1 year ago

Next step is to compare sample paths obtained using tskit.lshmm and Duncan's lshmm.