pockerman / hidden_markov_modeling


probability distribution for the states #17

Open colinveal opened 4 years ago

colinveal commented 4 years ago

We need to think about how we model the expected distribution for each state:

i.e. normal read depth can be modelled as normal, Poisson, or negative binomial. With high enough read depth it approximates a normal distribution, except that it can't take negative values. Previously we were performing a negative binomial transformation to give a normal distribution; that was so we could use the distribution of the difference between two normal distributions. However, we could model the read depth directly as negative binomial.
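A minimal sketch of the negative binomial option (illustrative only: the simulated counts and the method-of-moments fit below are assumptions, not project code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical per-window read-depth counts for the normal-copy-number state
depths = rng.negative_binomial(n=20, p=0.4, size=5000)

mean, var = depths.mean(), depths.var()

# normal approximation: fine at high depth, but puts mass on negative values
normal_fit = stats.norm(loc=mean, scale=np.sqrt(var))

# negative binomial by method of moments: r = mean^2 / (var - mean), p = mean / var
# (only valid when var > mean, i.e. the counts are over-dispersed)
r = mean ** 2 / (var - mean)
nb_fit = stats.nbinom(n=r, p=mean / var)

# emission probability of an observed depth under each candidate model
x = 25
print("P(x | normal) ~", normal_fit.pdf(x))
print("P(x | negative binomial) =", nb_fit.pmf(x))
```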

Duplications and single-copy deletions will be similar but with different parameters.

Two-copy deletions will be near uniform at 0.

Alternatively, we could use the distribution for the normal copy number as the only distribution and base the probabilities on the distance from its mean, i.e. > mean and low p = high probability of duplication.
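A minimal sketch of that alternative, assuming the normal-copy-number depth is roughly Gaussian with made-up parameters:

```python
from scipy import stats

normal_cn = stats.norm(loc=30.0, scale=6.0)  # assumed normal-copy-number fit

def assess(depth, alpha=0.05):
    """Place a window's depth relative to the normal-copy-number distribution."""
    if depth > normal_cn.mean() and normal_cn.sf(depth) < alpha:
        return "above"   # far above the mean -> duplication more likely
    if depth < normal_cn.mean() and normal_cn.cdf(depth) < alpha:
        return "below"   # far below the mean -> deletion more likely
    return "normal"

print(assess(45.0), assess(2.0), assess(31.0))  # above below normal
```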

pockerman commented 4 years ago

You mean in terms of the observations, if I understand correctly... in other words, how to model the emission probabilities for each state?

pockerman commented 4 years ago

How far astray does the following approach sound: cluster the observations into as many clusters as there are states, then fit a distribution to each cluster, and use that fitted distribution as the probability distribution for the corresponding state in the HMM?
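A rough sketch of how that could look, assuming sklearn's KMeans and one Gaussian per cluster as a first pass (the simulated depths and cluster count are placeholders):

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# hypothetical window depths drawn from three regimes (deletion / normal / duplication)
depths = np.concatenate([
    rng.poisson(3, 300),
    rng.poisson(30, 1000),
    rng.poisson(60, 200),
]).astype(float).reshape(-1, 1)

n_states = 3
labels = KMeans(n_clusters=n_states, n_init=10, random_state=0).fit_predict(depths)

# fit one distribution per cluster and use it as that state's emission density
emissions = []
for k in range(n_states):
    cluster = depths[labels == k].ravel()
    emissions.append(stats.norm(loc=cluster.mean(), scale=cluster.std(ddof=1)))

# emission probability of a new observation under each candidate state
x = 28.0
print([round(e.pdf(x), 4) for e in emissions])
```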

colinveal commented 4 years ago

Sure, any starting point to get the model working will be good; we can always change the distributions later. We could keep it even simpler and base it on the normal copy number as the only distribution, with each window assessed against that, i.e. probably normal, probably above normal, or probably below normal, and then combine the two sets of probabilities to calculate the likelihood of each state, e.g. sig below and sig below = 0.90 deletion, 0.05 TUF, 0.04 normal, 0.01 dup; normal and normal = 0.75 normal, 0.10 deletion, 0.10 dup, 0.05 TUF, etc.
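A very rough sketch of the combination step; what the two sets of assessments correspond to is left open here, and the table entries simply reuse the numbers quoted above as placeholders:

```python
# state order: (deletion, TUF, normal, duplication)
STATE_PROBS = {
    ("below", "below"):   (0.90, 0.05, 0.04, 0.01),
    ("normal", "normal"): (0.10, 0.05, 0.75, 0.10),
    # ... remaining combinations to be filled in as we settle on them
}

def state_likelihoods(assessment_1, assessment_2):
    """Map a pair of window assessments to per-state probabilities."""
    return STATE_PROBS.get((assessment_1, assessment_2))

print(state_likelihoods("below", "below"))    # deletion-heavy
print(state_likelihoods("normal", "normal"))  # normal-heavy
```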

pockerman commented 4 years ago

OK, cool. I will start looking into the clustering approach and see what we get. I will add sklearn to our requirements to use its clustering algorithms, although we may have to implement others ourselves.