jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License
3.38k stars 590 forks source link

[Question] How to calculate marginals for Bayesian net in new pomegranate? (+input format for ConditionalCategorical) #1087

Closed ferencbartok closed 8 months ago

ferencbartok commented 8 months ago

I'm new to pomegranate and found mainly examples from the old version, unfortunately. There existed a method marginal() which should:

Return the marginal probabilities of each variable in the graph.

This is equivalent to a pass of belief propogation on a graph where no data has been given. This will calculate the probability of each variable being in each possible emission when nothing is known.

How should one calculate this in the new pomegranate? I'm trying something like this:

a = Categorical([[0.1, 0.85, 0.05]])
b = Categorical([[0.2, 0.8]])
probs = np.array([
        [
            [0.2, 0.3, 0.5], 
            [0.1, 0.6, 0.3] 
        ],
        [
            [0.4, 0.4, 0.2],
            [0.3, 0.2, 0.5] 
        ],
        [
            [0.1, 0.7, 0.2], 
            [0.6, 0.1, 0.3]
        ]
    ])
c = ConditionalCategorical(probs=probs)
model = BayesianNetwork([a, b, c], [(a, c), (b, c)])
X = torch.tensor([[-1, -1, -1]])
X_masked = torch.masked.MaskedTensor(X, mask=X >= 0)
result = model.predict(X_masked)
result = model.predict_proba(X_masked)

But I always get some error regarding shapes*. What am I doing wrong? I don't really understand the descriptions regarding the shapes and sizes of probs. In our Bayes net we have many nodes with different number of possible states so the shape varies. Could you please help me?

*RuntimeError: shape '[1, 1, 3]' is invalid for input of size 2

Thanks!

jmschrei commented 8 months ago

You need to wrap probs in one more dimension because it's able to handle multidimensional probabilities, where each feature has a different conditional probability distribution. So, like..

probs = np.array([[
        [
            [0.2, 0.3, 0.5], 
            [0.1, 0.6, 0.3] 
        ],
        [
            [0.4, 0.4, 0.2],
            [0.3, 0.2, 0.5] 
        ],
        [
            [0.1, 0.7, 0.2], 
            [0.6, 0.1, 0.3]
        ]
    ]])

I can reproduce the original error using the provided code and when I make that change and print out result I get the following:

[tensor([[0.1000, 0.8500, 0.0500]]), tensor([[0.2000, 0.8000]]), tensor([[0.3090, 0.2690, 0.4220]])]

ferencbartok commented 8 months ago

Thank you very much for the quick response! It's working. (I think I tried this as well at some point but back then I didn't have the edges set up in a proper order which caused the same/similar issue.) Now I'm having a problem with an example net of a,b,c,d, where edges are (a,d), (b,d), (c,d). From the code/doc:

probs: list of numpy.ndarray, torch.tensor or None, shape=(k, k), optional
        A list of conditional probabilities with one tensor for each feature
        in the data being modeled. Each tensor should have `k+1` dimensions 
        where `k` is the number of timesteps to condition on. Each dimension
        should span the number of keys in that dimension. For example, if
        specifying a univariate conditional categorical distribution where
        k=2, a valid tensor shape would be [(2, 3, 4)]. Default is None.

What do timestamps mean here? I'd like to create variables/nodes, which have parents and model it using ConditionalCategorical. Should I use something different?

The concrete example I'm having a problem with by the way: image And this is how my code currently creates the array:

[[[[0.621325 0.348595 0.03008 ]
   [0.24265  0.60695  0.1504  ]]
  [[0.5545   0.3703   0.0752  ]
   [0.109    0.515    0.376   ]]
  [[0.57925  0.42075  0.      ]
   [0.1585   0.8415   0.      ]]
  [[0.505    0.495    0.      ]
   [0.01     0.99     0.      ]]
  [[0.5792   0.34555  0.07525 ]
   [0.1585   0.4655   0.376   ]]
  [[0.505    0.307    0.188   ]
   [0.01     0.05     0.94    ]]]]

Or in a more readable format:

a = Categorical([[0.1, 0.85, 0.05]])
b = Categorical([[0.2, 0.8]])
c = Categorical([[0.8, 0.2]])
d_probs = np.array([[
        [
            [0.621325, 0.348595, 0.03008],
            [0.24265, 0.60695, 0.1504]
        ],
        [
            [0.5545, 0.3703, 0.0752],
            [0.109, 0.515, 0.376]
        ],
        [
            [0.57925, 0.42075, 0.0],
            [0.1585, 0.8415, 0.0]
        ],
        [
            [0.505, 0.495, 0.0],
            [0.01, 0.99, 0.0]
        ],
        [
            [0.5792, 0.34555, 0.07525],
            [0.1585, 0.4655, 0.376]
        ],
        [
            [0.505, 0.307, 0.188],
            [0.01, 0.05, 0.94]
        ]
    ]])
d = ConditionalCategorical(probs=d_probs)
d.name = 'd'
model = BayesianNetwork([b, a, d, c], [(a, d), (b, d), (c, d)])
X = torch.tensor([[-1, -1, -1, -1]])
X_masked = torch.masked.MaskedTensor(X, mask=X >= 0)
result2 = model.predict_proba(X_masked)

Based on your example (and some other examples in my network) I thought this kind of structure would work for all the nodes, but it seems I'm still missing something. Btw our goal is to migrate from SMILE to open-source, which is why I'm testing pomegranate.

ferencbartok commented 8 months ago

Oh, now I understand that the dimensions are dependent on the number of parents, so with the above example this solution works:

d_probs = np.array([[
        [
            [
                [0.621325, 0.348595, 0.03008],
                [0.24265, 0.60695, 0.1504]
            ],
            [
                [0.5545, 0.3703, 0.0752],
                [0.109, 0.515, 0.376]
            ],
        ],
        [
            [
                [0.57925, 0.42075, 0.0],
                [0.1585, 0.8415, 0.0]
            ],
            [
                [0.505, 0.495, 0.0],
                [0.01, 0.99, 0.0]
            ],
        ],
        [
            [
                [0.5792, 0.34555, 0.07525],
                [0.1585, 0.4655, 0.376]
            ],
            [
                [0.505, 0.307, 0.188],
                [0.01, 0.05, 0.94]
            ]
        ]
    ]])

As a last (bonus) question: doesn't pomegranate have some way of creating this array for us? I'm wondering if there could be some "reshape" function so that once we know the whole graph structure it could reshape the probabilities/distribution from a simple 1D array or similar. Or maybe this is what n_categories are for? edit: oh sorry, I can just use reshape like this for example:

shape = [1]
    for parent in node.parentIds:
        shape.append(node_id_to_node_dictionary[parent].stateCount)
    shape.append(node.stateCount)
    d_probs = np.array(node.probabilities).reshape(shape)