Bug: Error when using belief propagation's Query method with state names different from the state numbers

Jamesfox1 commented 3 years ago

Subject of the issue

We can't use Belief Propagation's query method with different state names (ie strings or numbers that aren't the state numbers)

Your environment

pgmpy commit: e52f836 or pgmpy version: 0.1.13
Python version: 3.8.8
Operating System: Ubuntu 20.04

Steps to reproduce

This example creates the error because the state_names don't correspond to the index of the variable's domain (which you refer to as its "state number").

Therefore, belief_propagation.query(variables=['J'], evidence={'A': 'a'}) raises an error. (it also produces an error if we gave TabularCPD final argument state_names={'A': [2, 1]} and asked a query of the form: belief_propagation.query(variables=['J'], evidence={'A': 2}) because again the state name ("2") don't correspond to the state number "0").

from pgmpy.models import BayesianModel
from pgmpy.inference import BeliefPropagation
from pgmpy.factors.discrete import TabularCPD

bayesian_model = BayesianModel([('A', 'B')])
cpd_a = TabularCPD('A', 2, [[0.5], [0.5]], state_names={'A': ['a', 'b']})
cpd_b = TabularCPD('B', 2, [[1, 0],
                    [0, 1]], ['A'], [2])
bayesian_model.add_cpds(cpd_a, cpd_b)
belief_propagation = BeliefPropagation(bayesian_model)
factor = belief_propagation.query(variables=['B'], evidence={'A': 'a'})
print(factor)

Expected behaviour

The version below works as expected because the state names [0,1] correspond to the indexes of variable A's domain:

from pgmpy.models import BayesianModel
from pgmpy.inference import BeliefPropagation
from pgmpy.factors.discrete import TabularCPD

bayesian_model = BayesianModel([('A', 'B')])
cpd_a = TabularCPD('A', 2, [[0.5], [0.5]], state_names={'A': [0, 1]})
cpd_b = TabularCPD('B', 2, [[1, 0],
                    [0, 1]], ['A'], [2])
bayesian_model.add_cpds(cpd_a, cpd_b)
belief_propagation = BeliefPropagation(bayesian_model)
factor = belief_propagation.query(variables=['B'], evidence={'A': 0})
print(factor)

Actual behaviour

The issue is arising from the reduce method of the DiscreteFactor class within DiscreteFactor.py:

In lines 510-515 it says that this is where you're converting the "state names to state number". However, by simply printing out the state_names just before line 510, you find that it's already converted the state_names to the state_numbers - therefore, this conversion doesn't do anything.

ankurankan commented 3 years ago

@Jamesfox1 The problem, in this case, is that the state_names for cpd_j isn't specified. Currently, the state names between the CPDs aren't shared, so cpd_a uses the state names [a, b] for A but since it's not specified in cpd_j it automatically assigns the state names for A to be [0, 1]. And hence when we try to reduce cpd_j on A: a, it can't identify the state name. A simple fix for now would be to also specify the state names for cpd_j. Something like:

cpd_j = TabularCPD('B', 2, [[1, 0], [0, 1]], ['A'], [2], state_names={'A': ['a', 'b'], 'B': ['c', 'd']})

RyanCarey commented 3 years ago

Would it make more sense for:

the state_names argument to accept a list (which could be called "domain") rather than a dictionary (since the values of the dictionary seem to always duplicate with the "variable" argument)
to automatically update a "state_names" dictionary in init, that it includes the list for each previous variable ?

(We're following this approach in pycid, the influence diagram library, built on pgmpy.)

ankurankan commented 3 years ago

I am a bit divided on this issue. Currently, the state_names require the dictionary for state names of both the variable and evidence variables. The problem is that if we relax this and only ask for variable's state names, it makes it dependent on other CPDs (as it will require parent CPDs to be defined as well), which results in a loss of modularity. But at the same time, it will make it easier for the users requiring less input from them.

My idea behind keeping things modular was to have separate objects for model structure and different parameterizations so that users can combine these in any way they want. So, it allows users to easily add/remove/replace CPDs, or modify network structure as they wish. Or if someone wants to add a new type of parameterization, it would be straightforward. It also helps users if they just want to do operations on a single / a bunch of CPDs without considering the model altogether.

pgmpy / pgmpy