jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License
3.29k stars 590 forks source link

Question: Can i build a HMM, where the distribution depends on a value of a specific position #1080

Open LauraSkak opened 4 months ago

LauraSkak commented 4 months ago

Hi

This is not a bug, but a question. I'm in the search for a software that can help me build a HMM that can identify imprinted regions using data produced with dorado and modkit from Oxford Nanopore Technologies.

For each position and each haplotype i have a count modified, count unmodified and a total valid coverage count. In my opinion, there is three/four states; both haplotypes are modified, both haplotypes are unmodified and one haplotype is modified and the other is not. My theory is that i can build a HMM where the emission probability of a given position depends on the coverage of that position. No sure if i should use a binomial or beta distribution here. Is this possible with pomegranate?

Thanks in advance for any help you can provide... I'm a bit lost as to how i should attack this.

Laura

jmschrei commented 4 months ago

Hi Laura

It is possible to use a variety of distributions with pomegranate. Binomial distributions are already implemented (https://github.com/jmschrei/pomegranate/blob/master/pomegranate/distributions/bernoulli.py) but these assume a fixed number of total trials. However, you can easily make your own distributions using the existing ones as a template. You just need to fill in the few functions that are there and they can be dropped into any existing model.

My understanding of your specific problem is that you modified and unmodified counts and want to get a probability for each hidden state given both the ratio of these counts and their total value, because higher total counts means higher accuracy, right? Modeling just the ratios is easy -- you can use a beta distribution since the range is restricted to (0, 1) (you'd have to implement that yourself unfortunately) or a Gaussian distribution if you want to approximate it for now. If you want to include both the ratio and the counts you might be better off using binomial distributions but, as I mentioned above, you'd probably need to modify the code to handle variable numbers of total counts.

Sorry that I couldn't be of more help.

LauraSkak commented 4 months ago

Thank you for your quick response!

Yeah, thats the problem, that the probability of modification in each state should be constant but the emission probability should then depend on how many reads are mapped to that position and how many of those reads are classed as modified.

I'm still quite new in the field, so figuring out how i implement my own distributions is a big ask for me atm... I'm currently trying to implement something myself from scratch, so we'll see how that goes.

Again, thanks for the help!

jmschrei commented 4 months ago

I understand, good luck!

If you're new to the field I'd suggest starting with a more basic model, like using Gaussian emissions, and seeing how good you can get that model before running into issues. Then, you'd already have a good baseline set up to demonstrate why the more complicated emissions are necessary. Hopefully by that point you'd be more comfortable with the setting and modeling and more able to write your own emissions and test them in a setting you know already works fairly well (by replacing your Gaussian emissions with them).