jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

Is there any way to add nodes and edges to an existing Bayesian network? #738

Closed: AdirthaBorgohain closed this issue 4 years ago

AdirthaBorgohain commented 4 years ago

I have a network that I created from a pre-defined structure using the from_structure method. Is there any way to add new nodes and edges to this network?

My network is initially created in the following manner:

from pomegranate import BayesianNetwork

data = [[1, 0, 0],
        [1, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]

network_structure = ((), (0,), (0,))
network = BayesianNetwork.from_structure(data, structure=network_structure,
                                         name='Random Network',
                                         state_names=["A", "B", "C"])

Say I want to add a node "D" which is a parent of "C". The data that I will now fit to the network will be of the form:

new_data = [[1, 0, 0, 0],
            [1, 0, 1, 1],
            [1, 0, 1, 1],
            [0, 0, 0, 1]]

Any known way to do this in pomegranate?

jmschrei commented 4 years ago

The easiest way is just to run the from_structure method again on the extended data. Learning the parameters of a network given a known structure should be super fast.
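For your example, that would look something like this (a sketch, assuming D is a root node that becomes an additional parent of C):

# Re-learn all parameters from the 4-column data; column 3 (D) is a
# root node and an extra parent of column 2 (C).
new_structure = ((), (0,), (0, 3), ())
network = BayesianNetwork.from_structure(new_data, structure=new_structure,
                                         name='Random Network',
                                         state_names=["A", "B", "C", "D"])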

AdirthaBorgohain commented 4 years ago

Say I have a network whose parameters have already been learned on a large amount of data. I then want to add a new node with an edge connecting it to an existing node in the network. After adding the node, can I fit the new data without forgetting what the network learned in the past? The new data will have an extra column for the new node, and the data on which the model was initially trained did not have values for this newly added node. Is that possible? In the example I gave above, what if the values of data and new_data are:

data =[[1,0,0],
       [1,0,1],
       ...
       ...
       [1,0,1],
       [0,0,0]]

new_data =[[1,1,0,1],
           [0,0,0,1],
           ...
           ...
           [1,1,1,0],
           [0,1,0,1]]

The new data does not contain the same instances with just one added column; it's completely new data.

jmschrei commented 4 years ago

If you want to progressively train components of the model, it's likely easiest to fit each of the states individually and then build a Bayesian network object from the trained components. In your case, I'd start by fitting a discrete distribution to the first column and conditional probability tables to the second and third columns (each conditioned on the first column). Then, if you have another data set, you can fit a conditional probability table to the fourth column with whatever structure you'd like. Here's some code that I haven't bug-tested but should be a good start.

from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

X1 = ...  # your first massive data set of shape (n, 3)
X2 = ...  # some other data set of shape (m, 4)

# Fit the original variables from X1: A is a root, while B and C are
# each conditioned on A (the last column passed is treated as the child)
d1 = DiscreteDistribution.from_samples(X1[:, 0])
d2 = ConditionalProbabilityTable.from_samples(X1[:, [0, 1]])
d3 = ConditionalProbabilityTable.from_samples(X1[:, [0, 2]])

# Now the new variable, conditioned on the other three
d4 = ConditionalProbabilityTable.from_samples(X2)

# Build the model from the pre-trained components
n1, n2, n3, n4 = Node(d1), Node(d2), Node(d3), Node(d4)

model = BayesianNetwork()
model.add_nodes(n1, n2, n3, n4)
model.add_edge(n1, n2)
model.add_edge(n1, n3)
model.add_edge(n1, n4)
model.add_edge(n2, n4)
model.add_edge(n3, n4)
model.bake()
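Once it bakes, you can sanity-check it with a query, e.g. asking for the posterior over the unobserved variables given the first two (again, untested; None marks a value you haven't observed):

print(model.predict_proba([[1, 0, None, None]]))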

If you already have a network object whose parameters have been trained, you can extract the distributions with nodes = [node.distribution for node in model.states] and then build a new Bayesian network as above.

I think (but haven't tested) that you could also just call add_node and add_edge on the original Bayesian network object, as long as you re-bake it afterwards. Then you have to set the frozen property to True for each of the distributions you don't want to update, e.g.

for node in model.states: 
    node.distribution.frozen = True

and that will prevent them from updating when you use the fit step.
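Putting that together, an untested sketch (reusing d1, n1, model, and new_data from above; d5 and n5 are hypothetical names for the new variable) might look like:

# Freeze the trained distributions first, so that fit() only updates
# the new node added below
for node in model.states:
    node.distribution.frozen = True

# A dummy CPT for the new variable D conditioned on A; the values
# are placeholders that fit() will overwrite
d5 = ConditionalProbabilityTable(
    [[0, 0, 0.5], [0, 1, 0.5],
     [1, 0, 0.5], [1, 1, 0.5]], [d1])
n5 = Node(d5, name="D")

model.add_node(n5)
model.add_edge(n1, n5)   # edge A -> D
model.bake()             # rebuild the internal graph

# Assumes new_data's column order matches model.states after baking
model.fit(new_data)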

AdirthaBorgohain commented 4 years ago

Thank you for the detailed explanation. In the line d4 = ConditionalProbabilityTable.from_samples(X2), should I not add the parent nodes for d4? Also, how do I update the d1, d2, and d3 distributions with data from X2?

jmschrei commented 4 years ago

Yes, you should probably add in whoever the parents are for d2, d3, and d4.

Do you know all of your data in advance, or are you trying to adaptively grow the network in response to a stream of data that grows in the number of features? If you know all the data in advance, you can concatenate the examples together and insert missing values (np.nan) for the examples where a variable is missing. If you have a stream of data but know the total number of variables you'll encounter, you can iteratively run model.summarize on incoming batches with np.nan in place of the variables you haven't yet encountered, followed by model.from_summaries at the end.
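For instance (a rough sketch, assuming four total variables and reusing the model and data from above):

import numpy as np

# Pad an old 3-column batch with np.nan for the then-unobserved D,
# aggregate statistics from both batches, then update once
old_batch = np.array(data, dtype=float)
padded = np.hstack([old_batch, np.full((len(old_batch), 1), np.nan)])

model.summarize(padded)                           # old data, D missing
model.summarize(np.array(new_data, dtype=float))  # new 4-column data
model.from_summaries()                            # apply the update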

Fitting parameters given a known structure does not require inference and so is generally very flexible.

AdirthaBorgohain commented 4 years ago

Mine is a stream of data where I grow the network adaptively. If I understood correctly, is it necessary to at least know the total number of variables I will encounter? In my case, I was thinking of adding nodes as the need arises and then fitting the new incoming data containing values for all the nodes, including the new ones. This new data would update the learned parameters of the network.

jmschrei commented 4 years ago

In that situation I would recommend that you ignore the BayesianNetwork methods and work with the summarize method of the underlying distributions. The summarize method aggregates statistics until you call the from_summaries method, so if you have a stream of data you can keep calling summarize until you want to update the model parameters. You can do this separately for each distribution, selecting out the columns of data corresponding to each one. If you want to add a new distribution, it's easy to just start fitting it to the appropriate columns of the data. When you're done with your training process, you can put them all in a Bayesian network and perform inference.
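Concretely, something like this (a sketch, with stream_of_batches standing in for your data source and d1 through d4 defined as above):

# Each distribution summarizes only its own columns from every batch;
# nothing changes until from_summaries() is called
for batch in stream_of_batches:
    d1.summarize(batch[:, 0])
    d2.summarize(batch[:, [0, 1]])
    d3.summarize(batch[:, [0, 2]])
    d4.summarize(batch[:, [0, 1, 2, 3]])

for d in (d1, d2, d3, d4):
    d.from_summaries()   # apply the aggregated statistics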

AdirthaBorgohain commented 4 years ago

Will try out what you just said. Just one more question: how do we call summarize on a distribution (Discrete and Conditional) with new data? I could not find any relevant example. The documentation says we need to pass data of shape (n_samples, dimension), so I tried this:

d1 = DiscreteDistribution.summarize(X1[:, 0])

which throws an error: TypeError: descriptor 'summarize' requires a 'pomegranate.distributions.DiscreteDistribution.DiscreteDistribution' object.

jmschrei commented 4 years ago

The summarize and from_summaries methods aggregate statistics given an existing model. The idea is that, while these summaries are independent of the current parameters for basic distributions, they do depend on the current parameters for models like HMMs and GMMs. You should create a dummy version of the distribution with made-up values and then call summarize on that instance.
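For example (a sketch with made-up starting probabilities):

from pomegranate import DiscreteDistribution

# Dummy starting values; they get replaced once from_summaries() is
# called with the aggregated statistics
d1 = DiscreteDistribution({0: 0.5, 1: 0.5})
d1.summarize(X1[:, 0])   # works now that summarize is called on an instance
d1.from_summaries()      # parameters now reflect the summarized data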