The easiest way is just to run `from_structure` again. Learning the parameters of a network given a known structure should be very fast.
Say I have a network whose parameters were already learned on a large amount of data. Now I want to add a new node with an edge connecting it to an existing node in the network. After adding the node, can I fit the new data without forgetting what the network learned in the past? The new data will have an extra column for the new node; assume the data on which the model was initially trained did not contain values for the newly added node. I hope you understood what I meant to say. Is it possible? In the example I gave above, what if the values of data and new_data are:
```python
data = [[1, 0, 0],
        [1, 0, 1],
        ...
        ...
        [1, 0, 1],
        [0, 0, 0]]

new_data = [[1, 1, 0, 1],
            [0, 0, 0, 1],
            ...
            ...
            [1, 1, 1, 0],
            [0, 1, 0, 1]]
```
The new data does not have the same instances with just one added column; it's completely new data.
If you want to progressively train components of the model it's likely easiest to fit each one of the states initially and then build a Bayesian network object with the trained components. In your instance, I'd start off by fitting a discrete distribution to the first column and conditional probability tables to the second and third columns (conditioned on the first column). Then, if you have another data set, you can fit a conditional probability table to the fourth column with whatever structure you'd like. Here's some code that I haven't bug-tested but should be a good start.
```python
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

X1 = ...  # your first massive data set of shape (n, 3)
X2 = ...  # some other data set of shape (m, 4)

# Fit the original three variables from the first data set
d1 = DiscreteDistribution.from_samples(X1[:, 0])
d2 = ConditionalProbabilityTable.from_samples(X1[:, [0, 1]])
d3 = ConditionalProbabilityTable.from_samples(X1[:, [0, 2]])

# Now the new variable, fit from the second data set
d4 = ConditionalProbabilityTable.from_samples(X2)

# Build the model
n1, n2, n3, n4 = Node(d1), Node(d2), Node(d3), Node(d4)

model = BayesianNetwork()
model.add_nodes(n1, n2, n3, n4)
model.add_edge(n1, n2)
model.add_edge(n1, n3)
model.add_edge(n1, n4)
model.add_edge(n2, n4)
model.add_edge(n3, n4)
model.bake()
```
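As a sanity check on the structure above, the edges imply the factorization P(x1)·P(x2|x1)·P(x3|x1)·P(x4|x1,x2,x3), which you can evaluate by hand once the tables are known. A minimal sketch with made-up (hypothetical) uniform tables, independent of pomegranate:

```python
# Hypothetical parameter tables for four binary variables
p1 = {0: 0.4, 1: 0.6}                                            # P(x1)
p2 = {(x1, x2): 0.5 for x1 in (0, 1) for x2 in (0, 1)}           # P(x2 | x1)
p3 = {(x1, x3): 0.5 for x1 in (0, 1) for x3 in (0, 1)}           # P(x3 | x1)
p4 = {(x1, x2, x3, x4): 0.5                                      # P(x4 | x1, x2, x3)
      for x1 in (0, 1) for x2 in (0, 1)
      for x3 in (0, 1) for x4 in (0, 1)}

def joint(x1, x2, x3, x4):
    """Joint probability under the factorization implied by the edges above."""
    return p1[x1] * p2[(x1, x2)] * p3[(x1, x3)] * p4[(x1, x2, x3, x4)]

print(joint(1, 0, 1, 1))   # 0.6 * 0.5 * 0.5 * 0.5 = 0.075
```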
If you already have a network object whose parameters have been trained, you can extract the distributions with `distributions = [node.distribution for node in model.states]` and then do the above.
I think (but haven't tested) you could also just call `add_node` and `add_edge` on the original Bayesian network object, as long as you re-bake it. Then you have to set the `frozen` property to `True` for each of the distributions you don't want to update, e.g.

```python
for node in model.states:
    node.distribution.frozen = True
```

and that will prevent them from updating during the fit step.
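The effect of freezing can be sketched generically: a fit step that checks a `frozen` flag and leaves those parameters untouched. These are toy classes, not the pomegranate internals:

```python
class Dist:
    """Toy Bernoulli distribution with a frozen flag, for illustration only."""
    def __init__(self, p, frozen=False):
        self.p = p
        self.frozen = frozen

    def fit(self, data):
        if self.frozen:
            return                        # frozen: keep the pretrained parameter
        self.p = sum(data) / len(data)    # otherwise take the MLE from the batch

old = Dist(p=0.9, frozen=True)   # pretrained distribution, should not change
new = Dist(p=0.5)                # newly added node, should update

for d in (old, new):
    d.fit([1, 1, 0, 1])

print(old.p, new.p)   # 0.9 0.75
```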
Thank you for the detailed explanation. In the line `d4 = ConditionalProbabilityTable.from_samples(X2)`, should I not add the parent nodes for `d4`? Also, how do I update the `d1`, `d2`, and `d3` distributions with data from `X2`?
Yes, you should probably add in the parents for `d2`, `d3`, and `d4`.
Do you know all of your data in advance, or are you trying to adaptively grow the network in response to a stream of data that grows in the number of features? If you know all the data in advance, you can concatenate the examples together and add in missing values (`np.nan`) for the examples where that variable is missing. If you have a stream of data but know the total number of variables you'll encounter, you can iteratively run `model.summarize` on incoming batches of data, with `np.nan` put in for the variables you haven't yet encountered, followed by `model.from_summaries` at the end.
Fitting parameters given a known structure does not require inference and so is generally very flexible.
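The summarize/from_summaries pattern amounts to accumulating sufficient statistics (here, counts) across batches and only recomputing parameters at the end. A minimal pure-Python sketch of that idea, independent of pomegranate's own classes:

```python
import numpy as np

class StreamingDiscrete:
    """Toy analogue of a discrete distribution with summarize/from_summaries."""
    def __init__(self):
        self.counts = {}         # accumulated sufficient statistics
        self.probabilities = {}

    def summarize(self, column):
        # Aggregate counts from one batch; np.nan marks a missing value.
        for v in column:
            if isinstance(v, float) and np.isnan(v):
                continue
            self.counts[v] = self.counts.get(v, 0) + 1

    def from_summaries(self):
        # Turn the accumulated counts into parameters.
        total = sum(self.counts.values())
        self.probabilities = {k: c / total for k, c in self.counts.items()}

d = StreamingDiscrete()
d.summarize([0, 1, 1, np.nan])   # first batch, one missing value
d.summarize([1, 1, 0, 1])        # second batch
d.from_summaries()
print(d.probabilities)           # counts {0: 2, 1: 5} -> {0: 2/7, 1: 5/7}
```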
Mine is a stream of data where I grow the network adaptively. If I understood correctly, is it necessary to at least know the total number of variables I will encounter? In my case, I was thinking of adding nodes as the need arises and then fitting the new incoming data, which contains values for all the nodes including the new ones. This new data should update the learned parameters of the network.
In that situation I would recommend that you ignore the BayesianNetwork methods and work with the `summarize` method of the underlying distributions. The `summarize` method will aggregate statistics until you call the `from_summaries` method. If you have a stream of data, you should keep calling `summarize` until you want to update the model parameters. You can do this separately for each distribution, selecting out the columns of data corresponding to each one. If you want to add a new distribution, it's easy to just start fitting it to the appropriate columns of the data. When you're done with your training process, you can then put them all in a Bayesian network and perform inference.
Will try out what you just said. Just another question: how do we use `summarize` on a distribution (Discrete and Conditional) with new data? I could not find any relevant example. The documentation says we need to pass data of shape (n_samples, dimension). So I tried doing this:

```python
d1 = DiscreteDistribution.summarize(X1[:, 0])
```

which throws an error saying `TypeError: descriptor 'summarize' requires a 'pomegranate.distributions.DiscreteDistribution.DiscreteDistribution' object`.
The `summarize` and `from_summaries` methods are supposed to aggregate statistics given an existing model. The idea is that, while these summaries are independent of the current parameters for basic distributions, they do depend on the current parameters for models like HMMs and GMMs. You should create a dummy version of the distribution with made-up values and then call `summarize` on it.
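The error comes from calling `summarize` on the class itself rather than on an instance. The same failure mode reproduced in plain Python with a toy class (not the pomegranate internals, where the message comes from a Cython descriptor instead):

```python
class Distribution:
    def summarize(self, data):
        # Instance method: needs a distribution object to store statistics on
        self.total = sum(data)

# Calling through the class with no instance fails with a TypeError,
# analogous to the descriptor error above:
try:
    Distribution.summarize([1, 2, 3])
except TypeError as e:
    print("TypeError:", e)

# Creating a dummy instance first, then calling summarize, works:
d = Distribution()
d.summarize([1, 2, 3])
print(d.total)   # 6
```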
I have a network that I created from a pre-defined structure using the `from_structure` method. I want to know if there is any way to add new nodes and edges to the network. My network is initially created in the following manner:

Say I want to add a node "D" which is a parent of "C". The data which I will fit to the network now will be in the form of

Any known way to do this in pomegranate?