
Python Library for learning (Structure and Parameter), inference (Probabilistic and Causal), and simulations in Bayesian Networks.
https://pgmpy.org/

Marginalization of TabularCPD #1725

Open. LElgueddari opened this issue 8 months ago

LElgueddari commented 8 months ago

Mathematical definition

Hey, I'm new to the Bayesian network framework and I've just started to use your package (BTW, thanks for the amazing package). I was creating my own CPD and computed its marginalization, and the results I'm getting seem a bit strange. Here is an example. Suppose that I have two dice, a 4-sided die and a 6-sided die, and suppose that I use the first one 3/4 of the time. From what I understood about marginalization, if I want to know the marginal probability of getting a 1:

$p(X=1) = \sum_{y} p(X=1, Y=y) = \sum_{y} p(X=1 \mid Y=y)\, p(Y=y)$

where $Y$ is the random variable representing the die selection ($p(Y=\text{dice}_4) = 0.75$ and $p(Y=\text{dice}_6) = 0.25$) and $X$ is the random variable representing the outcome of rolling the selected die.

From my understanding, the marginal distribution of $X$ should be:

$p(X=x) = \frac{1}{4} \cdot \frac{3}{4} + \frac{1}{6} \cdot \frac{1}{4} = 0.2292$ for $x \in \{1, 2, 3, 4\}$

$p(X=x) = \frac{1}{6} \cdot \frac{1}{4} = 0.0417$ for $x \in \{5, 6\}$

Code to reproduce

import numpy as np
from pgmpy.factors.discrete import TabularCPD

# Prior over the die selection: P(Dice)
cpds_dice = TabularCPD(
    variable="Dice",
    variable_card=2,
    values=np.array([0.75, 0.25]).reshape(2, 1),
    state_names={"Dice": ["4-dice", "6-dice"]}
)
# Conditional distribution of the face given the die: P(Face | Dice)
cpds_face = TabularCPD(
    variable="Face",
    variable_card=6,
    values=[[1/4, 1/6], [1/4, 1/6], [1/4, 1/6], [1/4, 1/6], [0, 1/6], [0, 1/6]],
    evidence=["Dice"],
    evidence_card=[2],
    state_names={"Dice": ["4-dice", "6-dice"], "Face": ["1", "2", "3", "4", "5", "6"]}
)

print(cpds_face.marginalize(["Dice"], inplace=False))
print("True marginalization: ", np.dot(cpds_face.values, cpds_dice.values))

The output I'm getting:

+---------+-----------+
| Face(1) | 0.208333  |
+---------+-----------+
| Face(2) | 0.208333  |
+---------+-----------+
| Face(3) | 0.208333  |
+---------+-----------+
| Face(4) | 0.208333  |
+---------+-----------+
| Face(5) | 0.0833333 |
+---------+-----------+
| Face(6) | 0.0833333 |
+---------+-----------+
True marginalization:  [0.22916667 0.22916667 0.22916667 0.22916667 0.04166667 0.04166667]

From what I've understood from the code, the marginalize function simply sums over the given variables and normalizes the result. While that is correct for a joint distribution, it is not for a conditional one.
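
A quick check supports this reading (a minimal sketch in plain NumPy; vals just restates the P(Face | Dice) values from the CPD defined above):

import numpy as np

# Restating the P(Face | Dice) values from the CPD defined above.
vals = np.array([[1/4, 1/6]] * 4 + [[0, 1/6]] * 2)
# Summing out Dice and renormalizing reproduces marginalize()'s output,
# i.e. the conditional table is treated as if it were a joint distribution.
summed = vals.sum(axis=1)
print(summed / summed.sum())  # [0.2083, 0.2083, 0.2083, 0.2083, 0.0833, 0.0833]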

Did I miss something? How can I compute the true marginal distribution with the Bayesian network?

ankurankan commented 8 months ago

@LElgueddari Sorry for the late reply. In pgmpy, TabularCPDs are separate, independent objects and do not hold references to other defined CPDs. In this case, you are getting an incorrect answer because marginalizing cpds_face does not take the probability of the die selection into account; it essentially treats cpds_face as a joint distribution.

The two defined CPDs need to be connected, which is done in pgmpy by defining a network structure over the variables (a BayesianNetwork), after which the CPDs can be added to the model. Finally, to compute any distribution over the variables in the model, we need to run inference on it (in the example below, I have used VariableElimination).

import numpy as np
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Network structure: the chosen die determines the face distribution.
model = BayesianNetwork([('Dice', 'Face')])
cpds_dice = TabularCPD(
    variable="Dice", 
    variable_card=2,
    values=np.array([0.75, 0.25]).reshape(2, 1), 
    state_names={"Dice": ["4-dice", "6-dice"]}
)
cpds_face = TabularCPD(
    variable="Face", 
    variable_card=6, 
    values=[[1/4, 1/6], [1/4, 1/6], [1/4, 1/6], [1/4, 1/6], [0, 1/6], [0, 1/6], ],
    evidence=["Dice"],
    evidence_card=[2],
    state_names={"Dice": ["4-dice", "6-dice"], "Face": ["1", "2", "3", "4", "5", "6"]})

model.add_cpds(cpds_dice, cpds_face)

# Exact inference: computes P(Face) by summing out Dice, weighted by P(Dice).
infer = VariableElimination(model)
face_marginal = infer.query(['Face'])
print(face_marginal)

This should give you the expected result:

+---------+-------------+
| Face    |   phi(Face) |
+=========+=============+
| Face(1) |      0.2292 |
+---------+-------------+
| Face(2) |      0.2292 |
+---------+-------------+
| Face(3) |      0.2292 |
+---------+-------------+
| Face(4) |      0.2292 |
+---------+-------------+
| Face(5) |      0.0417 |
+---------+-------------+
| Face(6) |      0.0417 |
+---------+-------------+
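
As a usage note, a short sketch reusing the model and infer objects from above:

# Optional sanity check: verifies that the CPDs sum to 1 and are
# consistent with the network structure.
model.check_model()

# Conditional queries work the same way; fixing the die recovers the
# corresponding column of the original CPD.
print(infer.query(['Face'], evidence={'Dice': '4-dice'}))  # 0.25 for faces 1-4, 0 for 5-6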
LElgueddari commented 8 months ago

Thanks @ankurankan for the answer. In the examples given in the TabularCPD class (which stands for Tabular Conditional Probability Distribution, I guess), the documentation states that the inputs should be the conditional probability distribution, not the joint one. But if I follow the example, the marginalize function then treats the input as the joint probability and therefore gives me a wrong answer. Probably a way to fix this should be:

Thanks again for your answer.

ankurankan commented 8 months ago

@LElgueddari Sorry for the super late reply, and thanks for the suggestions for modifying the method. I agree that marginalizing over the conditional variables of a conditional distribution does not really make sense. I am no longer sure what my reasoning was when implementing the method. I think the current implementation implicitly assumes a uniform marginal distribution for the conditional variable and then marginalizes it out.
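
To make the implicit assumption concrete, here is a minimal sketch (reusing cpds_face and the imports from the example above): replacing P(Dice) with a uniform prior and running inference reproduces marginalize()'s output.

# Swap the dice prior for a uniform one and rerun inference.
uniform_model = BayesianNetwork([('Dice', 'Face')])
uniform_dice = TabularCPD(
    variable="Dice",
    variable_card=2,
    values=[[0.5], [0.5]],
    state_names={"Dice": ["4-dice", "6-dice"]}
)
uniform_model.add_cpds(uniform_dice, cpds_face)
print(VariableElimination(uniform_model).query(['Face']))
# Face(1..4): 0.2083, Face(5..6): 0.0833, matching cpds_face.marginalize(["Dice"], inplace=False)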

I think the best solution would be to simply remove the marginalize method from TabularCPD, as I do not see any meaningful interpretation of the operation. What do you think?