jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License
3.35k stars, 589 forks

Updating Bayesian networks by removing discrete distribution items #664

Closed ralfne closed 4 years ago

ralfne commented 4 years ago

Hi, I am working with a program that gives me gene usage patterns in the form of a Bayesian network. To analyse the data further, I am reading this into a pomegranate Bayesian network. There is one piece of functionality that would be very useful for me, and which may or may not be present in pomegranate already (I have only a beginner's knowledge of Bayesian networks and pomegranate; sorry if the answer to my question is obvious). I would like to be able to reduce the gene usage information so as to eliminate the alleles. (In case these terms are unfamiliar: distinct alleles are minor variations of the same gene, and are specified by a '*' suffix after the gene name, e.g. A2*1.) To explain what I mean, please consider two genes of type GeneA (A1 and A2), and three genes of type GeneB (B1, B2 and B3). There are two alleles for A2, and two alleles for B1. Thus we have:

GeneA: A1*1, A2*1, A2*2

GeneB: B1*1, B1*2, B2*1, B3*1

The expression of these genes is linked as follows:

GeneA->GeneB

The probabilities entered into the Bayesian network are as follows:

GeneA (Discrete distribution)

A1*1 0.5 A2*1 0.2 A2*2 0.3

GeneB (Conditional distribution)

B1*1|A1*1 0.3 B1*2|A1*1 0.1 B2*1|A1*1 0.1 B3*1|A1*1 0.5

B1*1|A2*1 0.6 B1*2|A2*1 0.2 B2*1|A2*1 0.0 B3*1|A2*1 0.2

B1*1|A2*2 0.0 B1*2|A2*2 0.7 B2*1|A2*2 0.1 B3*1|A2*2 0.2

Sometimes when analysing such data, it makes sense to disregard the alleles. That is, I would like to work with A1 and A2 only, where A2 now comprises A2*1 and A2*2. Likewise, I would like to reduce the GeneBs to B1, B2 and B3, where B1 comprises B1*1 and B1*2. Preferably, all combinations thereof should be supported.

This would produce the following distributions:

-Eliminating allele level for GeneA:

GeneA (Discrete distribution)

A1 0.5 A2 0.5

GeneB (Conditional distribution)

B1*1|A1 0.3 B1*2|A1 0.1 B2*1|A1 0.1 B3*1|A1 0.5

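B1*1|A2 0.24 B1*2|A2 0.5 B2*1|A2 0.06 B3*1|A2 0.2

(The A2 row is the mixture of the A2*1 and A2*2 rows, weighted by each allele's share of A2, i.e. 0.2/0.5 = 0.4 and 0.3/0.5 = 0.6, so that the row sums to 1.)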
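A minimal sketch of this parent-side collapse in plain Python, not tied to any pomegranate API. It assumes the distributions are held as ordinary dicts and that the allele is everything after the first '*'; the names `strip_allele` and `collapse_parent` are hypothetical helpers introduced only for illustration. Merging parent alleles means adding their priors and mixing their conditional rows in proportion to those priors, so that each merged row remains a valid distribution.

```python
# Distributions from the example, stored as plain dicts (an assumption,
# not a pomegranate data structure).
gene_a = {"A1*1": 0.5, "A2*1": 0.2, "A2*2": 0.3}
gene_b_given_a = {
    "A1*1": {"B1*1": 0.3, "B1*2": 0.1, "B2*1": 0.1, "B3*1": 0.5},
    "A2*1": {"B1*1": 0.6, "B1*2": 0.2, "B2*1": 0.0, "B3*1": 0.2},
    "A2*2": {"B1*1": 0.0, "B1*2": 0.7, "B2*1": 0.1, "B3*1": 0.2},
}


def strip_allele(name):
    """Hypothetical helper: drop the allele suffix, e.g. 'A2*1' -> 'A2'."""
    return name.split("*")[0]


def collapse_parent(prior, cpt):
    """Merge parent alleles; returns (new_prior, new_cpt)."""
    new_prior = {}
    joint = {}  # joint[gene][child] = P(parent gene, child)
    for parent, p_parent in prior.items():
        gene = strip_allele(parent)
        new_prior[gene] = new_prior.get(gene, 0.0) + p_parent
        row = joint.setdefault(gene, {})
        for child, p_child in cpt[parent].items():
            row[child] = row.get(child, 0.0) + p_parent * p_child
    # Divide the joint mass by the merged prior to recover P(child | gene).
    # Assumes every merged parent has non-zero prior probability.
    new_cpt = {
        gene: {child: mass / new_prior[gene] for child, mass in row.items()}
        for gene, row in joint.items()
    }
    return new_prior, new_cpt


a_collapsed, b_given_a_collapsed = collapse_parent(gene_a, gene_b_given_a)
```

With the numbers above, the merged A2 row comes out at approximately 0.24, 0.5, 0.06 and 0.2, and the A1 row is unchanged.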

-Eliminating allele level for GeneB:

GeneA (Discrete distribution)

A1*1 0.5 A2*1 0.2 A2*2 0.3

GeneB (Conditional distribution)

B1|A1*1 0.4 B2|A1*1 0.1 B3|A1*1 0.5

B1|A2*1 0.8 B2|A2*1 0.0 B3|A2*1 0.2

B1|A2*2 0.7 B2|A2*2 0.1 B3|A2*2 0.2
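Collapsing the child side is simpler, since it only requires summing probabilities within each conditional row. A minimal sketch in plain Python, assuming the conditional table is held as a dict of dicts; `strip_allele` and `collapse_child` are hypothetical names used only for illustration and are not part of pomegranate.

```python
# Conditional table from the example: outer key is the GeneA value,
# inner dict is P(GeneB | GeneA). Plain dicts are an assumption.
gene_b_given_a = {
    "A1*1": {"B1*1": 0.3, "B1*2": 0.1, "B2*1": 0.1, "B3*1": 0.5},
    "A2*1": {"B1*1": 0.6, "B1*2": 0.2, "B2*1": 0.0, "B3*1": 0.2},
    "A2*2": {"B1*1": 0.0, "B1*2": 0.7, "B2*1": 0.1, "B3*1": 0.2},
}


def strip_allele(name):
    """Hypothetical helper: drop the allele suffix, e.g. 'B1*2' -> 'B1'."""
    return name.split("*")[0]


def collapse_child(cpt):
    """Merge child alleles by summing their probabilities within each row."""
    new_cpt = {}
    for parent, row in cpt.items():
        new_row = new_cpt.setdefault(parent, {})
        for child, p in row.items():
            gene = strip_allele(child)
            new_row[gene] = new_row.get(gene, 0.0) + p
    return new_cpt


b_collapsed = collapse_child(gene_b_given_a)
```

No reweighting is needed here because the parent values, and therefore the conditioning events, are unchanged.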

-Eliminating allele level for GeneA and GeneB:

GeneA (Discrete distribution)

A1 0.5 A2 0.5

GeneB (Conditional distribution)

B1|A1 0.4 B2|A1 0.1 B3|A1 0.5

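B1|A2 0.74 B2|A2 0.06 B3|A2 0.2

(Here the A2*1 and A2*2 rows are again weighted by 0.2/0.5 = 0.4 and 0.3/0.5 = 0.6 rather than averaged equally.)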

Thanks for a great program, and for taking the time to look into this!

jmschrei commented 4 years ago

Howdy

Unfortunately, nothing is supported right now for doing that. You would probably have to go in and collapse the distributions manually.
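One way such a manual collapse could be sketched, outside of pomegranate and using only plain Python dicts: build the joint distribution P(GeneA, GeneB), relabel each value, sum the mass that lands on the same relabelled pair, and renormalise. The function `collapse_network`, its `relabel_parent` and `relabel_child` parameters, and the dict layout are all assumptions made for illustration; none of them are pomegranate API.

```python
from collections import defaultdict


def strip_allele(name):
    """Hypothetical helper: drop the allele suffix, e.g. 'A2*1' -> 'A2'."""
    return name.split("*")[0]


def identity(name):
    return name


def collapse_network(prior, cpt, relabel_parent=identity, relabel_child=identity):
    """Collapse a two-node network parent -> child.

    prior: {parent_value: P(parent)}
    cpt:   {parent_value: {child_value: P(child | parent)}}
    Returns (new_prior, new_cpt) over the relabelled values.
    """
    new_prior = defaultdict(float)
    joint = defaultdict(float)
    for parent, p_parent in prior.items():
        new_parent = relabel_parent(parent)
        new_prior[new_parent] += p_parent
        for child, p_child in cpt[parent].items():
            joint[(new_parent, relabel_child(child))] += p_parent * p_child

    new_cpt = defaultdict(dict)
    for (parent, child), mass in joint.items():
        # Skip parents with zero total probability to avoid dividing by zero.
        if new_prior[parent] > 0:
            new_cpt[parent][child] = mass / new_prior[parent]
    return dict(new_prior), dict(new_cpt)


gene_a = {"A1*1": 0.5, "A2*1": 0.2, "A2*2": 0.3}
gene_b_given_a = {
    "A1*1": {"B1*1": 0.3, "B1*2": 0.1, "B2*1": 0.1, "B3*1": 0.5},
    "A2*1": {"B1*1": 0.6, "B1*2": 0.2, "B2*1": 0.0, "B3*1": 0.2},
    "A2*2": {"B1*1": 0.0, "B1*2": 0.7, "B2*1": 0.1, "B3*1": 0.2},
}

# Collapse alleles on both genes; pass only one relabel function to
# collapse a single side.
new_a, new_b_given_a = collapse_network(
    gene_a, gene_b_given_a,
    relabel_parent=strip_allele,
    relabel_child=strip_allele,
)
```

The resulting dicts could then be used to build new distribution objects by hand; in pomegranate releases before 1.0 that would presumably mean a `DiscreteDistribution` for GeneA and a `ConditionalProbabilityTable` for GeneB, though the exact constructor arguments should be checked against the installed version. For networks with more than two nodes the same idea applies per edge, but each table that mentions a collapsed variable would need to be rebuilt.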