jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

[BUG] Chow-Liu Gives Unexpected Structure #708

Closed: ndolev closed this issue 4 years ago

ndolev commented 4 years ago

Hi,

Given a matrix with two variables and 10 samples, where the two columns are never True at the same time:

import numpy as np

m = np.array([[False, False],
              [False, False],
              [False, False],
              [ True, False],
              [False, False],
              [False,  True],
              [False,  True],
              [ True, False],
              [False, False],
              [False, False]])

This call:

from pomegranate import BayesianNetwork as BN  # assumption: BN aliases the BayesianNetwork class

model0 = BN.from_samples(m, algorithm='chow-liu', n_jobs=1)

Generates the following structure:

((), (0,))

Even though m[:, 0] & m[:, 1] is always False.

Greedy will give the expected structure: ((), ()).

However, if I feed the matrix:

array([[False, False, False,  True],
       [ True, False, False, False],
       [ True, False, False, False],
       [ True, False, False, False],
       [False, False, False,  True],
       [False,  True, False, False],
       [False, False, False, False],
       [False,  True, False, False],
       [False, False,  True, False],
       [False, False, False,  True]])

Where every variable should be an orphan, exact, exact-dp, and greedy all make variable 0 a child of variable 3 (even though they are not connected).

What do you think?

jmschrei commented 4 years ago

The Chow-Liu algorithm finds the maximum-weight spanning tree over the pairwise mutual information in your data. The result is that every variable gets a parent (except for the root node), regardless of how tenuously two variables are connected.
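
To see why an edge appears even with zero co-occurrence: mutual exclusion is itself statistical dependence, so the empirical mutual information between the two columns is positive and the spanning tree keeps the edge. A minimal sketch, using sklearn's mutual_info_score as a stand-in for pomegranate's internal edge weights:

import numpy as np
from sklearn.metrics import mutual_info_score

# The two columns from the first matrix: never True together.
x0 = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0])
x1 = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# P(x0=1, x1=1) = 0 while P(x0=1) * P(x1=1) = 0.04, so the
# empirical mutual information is strictly positive.
print(mutual_info_score(x0, x1))  # ~0.05 nats > 0, so the edge is kept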

ndolev commented 4 years ago

So why do greedy, exact, and exact-dp all create connections where none exist?

jmschrei commented 4 years ago

Those algorithms estimate a structure based on the (usually noisy) data you give them. Sometimes edges are estimated to exist that you, having designed the data generation procedure, didn't intend to exist, because the noise resembles signal.
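
With only 10 samples this is easy to reproduce: fit the score-based algorithms on genuinely independent binary data and edges will often appear anyway. A quick sketch (the learned structure will vary with the random seed):

import numpy as np
from pomegranate import BayesianNetwork

rng = np.random.RandomState(0)

# Four genuinely independent binary variables, 10 samples: at this
# sample size, chance co-occurrence patterns can look like structure.
data = rng.rand(10, 4) < 0.2

for algorithm in ('greedy', 'exact', 'exact-dp'):
    model = BayesianNetwork.from_samples(data, algorithm=algorithm)
    print(algorithm, model.structure)  # edges may appear by chance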

ndolev commented 4 years ago

The rabbit hole gets deeper when you look at the underlying computation of the conditional probability table for the second matrix:

import numpy as np

X = np.array([[False, False, False,  True],
              [ True, False, False, False],
              [ True, False, False, False],
              [ True, False, False, False],
              [False, False, False,  True],
              [False,  True, False, False],
              [False, False, False, False],
              [False,  True, False, False],
              [False, False,  True, False],
              [False, False, False,  True]])

The CPT can be computed as follows:

from pomegranate import ConditionalProbabilityTable, DiscreteDistribution

cpt = ConditionalProbabilityTable.from_samples(
    X,
    parents=[DiscreteDistribution.from_samples(X[:, 0], weights=None, pseudocount=0)],
    weights=None,
    pseudocount=0,
)

Which yields:

False   False   False   False   0.25
False   False   False   True    0.75
False   False   True    False   1.0
False   False   True    True    0.0
False   True    False   False   1.0
False   True    False   True    0.0
False   True    True    False   0.5
False   True    True    True    0.5
True    False   False   False   1.0
True    False   False   True    0.0
True    False   True    False   0.5
True    False   True    True    0.5
True    True    False   False   0.5
True    True    False   True    0.5
True    True    True    False   0.5
True    True    True    True    0.5

I can't understand why the CPD would be 0.5 for True True True False, for example?

ghost commented 4 years ago

First of all, it's not the 'underlying computation' when you make a function call and check the result ;)

The second-to-last column in each row of your result is the assignment of the node itself; the columns before it are the assignments of the nodes it is conditioned on. The probabilities over all rows that share the same condition must sum to ONE.

Thus, the last two rows add up to one. In addition, by the looks of your input, you provided no data for this condition, and pomegranate assumes a uniform distribution for any condition that hasn't been observed.
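
A quick check against the array above confirms this (assuming, as the printed table suggests, that columns 0-2 act as the parents and column 3 as the child):

import numpy as np

X = np.array([[0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
              [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0],
              [0, 0, 1, 0], [0, 0, 0, 1]], dtype=bool)

# Every row has at most one True, so the parent assignment
# (True, True, True) is never observed in the data.
mask = X[:, 0] & X[:, 1] & X[:, 2]
print(mask.sum())  # 0 -> no counts for this condition, uniform fallback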

Could you please clarify what X is and the command to get the output?

ndolev commented 4 years ago

Hi Sebastian,

Thanks for your comment.

It's 'underlying' in the sense that ConditionalProbabilityTable is an internal pomegranate class used when learning from samples. :)

I edited the issue. X is the array I mentioned at the beginning of the message.

Why would it make sense to assume a uniform distribution in this case when you have the discrete marginal distributions (how very frequentist of you ;)? When you have no information for a condition, you could instead estimate the CPD from the marginal probabilities under an independence assumption, for example...

EDIT: Also, do you really have no information? You have the information that this never happens in 10 tries. If anything, one would think that a Bayesian statistician would take this to mean that you have some evidence that True True True False is rather rare...
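
As an aside, the pseudocount argument in the call above already offers one Bayesian-flavored middle ground. A sketch, reusing the X array from above (this is smoothing, not the marginal fallback suggested here):

from pomegranate import ConditionalProbabilityTable, DiscreteDistribution

# With a Laplace-style pseudocount of 1, observed conditions are
# smoothed toward uniform and unseen conditions stay uniform, so at
# least the prior is explicit rather than a silent fallback.
cpt_smoothed = ConditionalProbabilityTable.from_samples(
    X,
    parents=[DiscreteDistribution.from_samples(X[:, 0])],
    pseudocount=1,
)
print(cpt_smoothed)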

Cheers, Noah

jmschrei commented 4 years ago

Hi @ndolev

As @SebastianBelkner mentioned, pomegranate assumes a uniform distribution over child values when the parent combination isn't seen in the data. You are correct that one could use the marginal distribution of X instead of the uniform distribution. My reasoning for not using the marginal as the default is that the marginal distribution is derived from the other rows in the table, and it is conceptually simpler (to me) to use the uniform distribution when nothing is known. That being said, I think a reasonable case could also be made for using the marginal distribution.
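
A sketch of that alternative (a hypothetical helper, not pomegranate API): fall back to the child's empirical marginal, instead of uniform, for parent combinations that never appear in the data.

import numpy as np
from collections import Counter

def cpt_with_marginal_fallback(X):
    # Hypothetical helper: build P(child | parents) from samples,
    # using the child's marginal for unseen parent combinations.
    parents, child = X[:, :-1], X[:, -1]

    # Child marginal, used as the fallback distribution.
    n = len(child)
    marginal = {v: c / n for v, c in Counter(child.tolist()).items()}

    # Empirical conditionals for the parent combinations we did see.
    table = {}
    for row, c in zip(map(tuple, parents.tolist()), child.tolist()):
        table.setdefault(row, Counter())[c] += 1

    def query(parent_values, child_value):
        counts = table.get(tuple(parent_values))
        if counts is None:  # unseen condition: marginal fallback
            return marginal.get(child_value, 0.0)
        return counts[child_value] / sum(counts.values())

    return query

X = np.array([[0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
              [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0],
              [0, 0, 1, 0], [0, 0, 0, 1]], dtype=bool)

# The unseen condition (True, True, True) now returns the child's
# marginal P(child=False) = 0.7 instead of the uniform 0.5.
query = cpt_with_marginal_fallback(X)
print(query((True, True, True), False))  # 0.7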