pgmpy / pgmpy

Python Library for learning (Structure and Parameter), inference (Probabilistic and Causal), and simulations in Bayesian Networks.
https://pgmpy.org/
MIT License

Strange Behavior HillClimbSearch #1361

Closed ivanDonadello closed 3 years ago

ivanDonadello commented 3 years ago

Subject of the issue

I want to reproduce the example here

Steps to reproduce

import pandas as pd
import numpy as np
from pgmpy.estimators import HillClimbSearch, BicScore
data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
data['J'] = data['A'] * data['B']
est = HillClimbSearch(data, scoring_method=BicScore(data))
best_model = est.estimate()
best_model.edges()

Expected behaviour

[('B', 'J'), ('A', 'J')]

Actual behaviour

[('A', 'B'), ('J', 'A'), ('J', 'B')]

ankurankan commented 3 years ago

@ivanDonadello The API for HillClimbSearch was changed in the last release, and now it expects the scoring method to be specified in the estimate method. So, in the example that you have posted above, it uses the default K2Score. If you use BicScore, you should get the expected results:

In [16]: import pandas as pd
    ...: import numpy as np
    ...: from pgmpy.estimators import HillClimbSearch, BicScore
    ...: data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
    ...: data['J'] = data['A'] * data['B']
    ...: est = HillClimbSearch(data)
    ...: best_model = est.estimate(scoring_method=BicScore(data))
    ...: best_model.edges()
  0%|                                                                                                                                                 | 2/1000000 [00:00<68:50:16,  4.04it/s]
Out[16]: OutEdgeView([('A', 'J'), ('B', 'J')])

Also, if your code is based on an example from the pgmpy documentation, could you please point me to it (so I could update it)?

ivanDonadello commented 3 years ago

@ankurankan thanks for your answer. I use the Chow-Liu, TAN, and PC estimators in this way:

est = TreeSearch(data, root_node=states_[0])
model = est.estimate(estimator_type='chow-liu')
est = TreeSearch(data, root_node=states_[0])
model = est.estimate(estimator_type='tan', class_node=states_[1])
est = PC(data)
model = est.estimate()

Then I estimate the parameters:

model = BayesianModel(model.edges())
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts=0.1)

Finally, I query the model

model.predict(values, n_jobs=6)

My idea is to find the structure that best predicts the values given the evidence. Now I can also use HillClimbSearch.

ankurankan commented 3 years ago

@ivanDonadello A better method to verify your model structure could be to test the implied conditional independencies against the dataset. This will avoid any additional noise due to parameter learning. The R package dagitty has a neat function to do it easily: https://rdrr.io/cran/dagitty/man/localTests.html
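
To do something similar to dagitty's localTests directly in Python, here is a rough sketch (not a built-in pgmpy workflow), using best_model and data from the HillClimbSearch example at the top of this thread: take the local conditional independencies implied by the learned DAG and test each one against the data with a chi-squared CI test. It assumes your pgmpy version exposes local_independencies() on the learned model and the chi_square test in pgmpy.estimators.CITests; check both against your installed version.

from pgmpy.estimators.CITests import chi_square

def test_implied_independencies(model, data, significance_level=0.01):
    # one assertion per node: X _|_ (non-descendants \ parents) | parents
    results = []
    for assertion in model.local_independencies(list(model.nodes())).get_assertions():
        for x in assertion.event1:
            for y in assertion.event2:
                z = list(assertion.event3)
                # boolean=True -> True when the independence is NOT rejected at this level
                holds = chi_square(X=x, Y=y, Z=z, data=data, boolean=True,
                                   significance_level=significance_level)
                results.append((x, y, tuple(z), holds))
    return results

results = test_implied_independencies(best_model, data)
print(sum(r[-1] for r in results) / len(results))  # fraction of implied CIs consistent with the data

Large conditioning sets make the chi-squared test slow and underpowered, so this is most useful on small or sparse structures.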

ivanDonadello commented 3 years ago

Thank you very much!

MathijsPost commented 3 years ago

But can variable prediction be used as a measure of a model's accuracy? I had the same idea, but I cannot really find the reasoning behind variable prediction. It uses variable elimination if I'm not mistaken, but how does it pick the state of the missing variable? Does it always pick the state with the highest probability, or can the other states be chosen as well?

ankurankan commented 3 years ago

@MathijsPost Sorry for the late reply. Model prediction can certainly help: you can compare the accuracy (or some other metric) of the predictions. But a point to keep in mind is that in that case we are optimizing for a conditional distribution (P(Y | X)), which is what most general machine learning algorithms are built for. They usually optimize an error term computed from the difference between the true and predicted values. And if your goal is to learn a conditional distribution, I think a lot of machine learning methods would outperform a Bayesian network. But if your goal is to learn a causal structure, I would suggest doing model testing using statistical independence tests, as only the structure implies the causal relations.

Yes, the predict method does a MAP (Maximum a Posteriori) estimation, so it uses an inference algorithm to run a query with the given columns as evidence, computes a joint distribution over the missing columns, and then returns the combination of variable states which has the highest probability.
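
To make that concrete, here is a minimal sketch of what predict does for a single row, reusing the A/B/J example from the top of this thread. The evidence values and the fit call are only illustrative.

from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination

# build and fit a model on the learned structure (A -> J <- B)
model = BayesianModel(best_model.edges())
model.fit(data[["A", "B", "J"]])

infer = VariableElimination(model)
evidence = {"A": 2, "B": 3}  # the observed columns of one row
# MAP query: distribution over the missing column, then the most probable state
print(infer.map_query(variables=["J"], evidence=evidence))
# the full posterior behind that choice
print(infer.query(variables=["J"], evidence=evidence))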

MathijsPost commented 3 years ago

Thank you @ankurankan! What do you think is the best method to test statistical independence on a structure-learned Bayesian network? Is there a built-in function in pgmpy? I am looking for a slightly more intuitive metric than the BIC, K2 or BDeu scores.

ankurankan commented 3 years ago

@MathijsPost I don't know of any single metric to test the model structure. But a common way is to test the conditional independencies implied by the network structure using statistical tests on the dataset. pgmpy doesn't have a method to do it directly yet, but you can use the R package dagitty for it: https://rdrr.io/cran/dagitty/man/localTests.html. You can also have a look at this: https://currentprotocols.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cpz1.45

ankurankan commented 3 years ago

Also, @mbaak recently contributed this method: https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/inference/bn_inference.py for computing the likelihood of the data given the model structure and parameters. It can also be used as a metric.

mbaak commented 3 years ago

@MathijsPost With the log_probability(X) method now in https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/inference/bn_inference.py one can calculate the G-test statistic for how well the model fits the dataset (https://en.wikipedia.org/wiki/G-test). With a bit of work this can be compared against a chi-squared distribution, as in Pearson's chi-squared test, giving a p-value under the null hypothesis that the model describes the dataset - in essence a test of the model structure. I'd be happy to write an example of how to do this.
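
In the meantime, here is a rough sketch of that calculation (not a pgmpy built-in): turn the per-row model likelihoods into observed vs. expected counts per configuration and compare the G statistic against a chi-squared distribution. It assumes BayesianModelProbability.log_probability accepts a DataFrame of configurations and returns one log-probability per row, and it leaves the degrees of freedom (number of possible cells minus the number of free model parameters minus one) to the caller, since computing that correctly takes some care.

import numpy as np
from scipy import stats
from pgmpy.inference import BayesianModelProbability

def g_test(model, data, dof):
    n = len(data)
    # observed count of every configuration that actually occurs in the data
    counts = data.groupby(list(data.columns)).size().reset_index(name="observed")
    observed = counts["observed"].to_numpy()
    configs = counts[list(data.columns)]
    # expected count under the fitted model: N * P(configuration | model)
    expected = n * np.exp(BayesianModelProbability(model).log_probability(configs))
    g = 2.0 * np.sum(observed * np.log(observed / expected))
    return g, stats.chi2.sf(g, df=dof)  # G statistic and its null p-value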

MathijsPost commented 3 years ago

Hi @mbaak, that sounds very interesting! If it is not too much work, an example would be nice :)

mbaak commented 3 years ago

Hi @MathijsPost, I tried to make an example for you, but while doing this I ran into a pgmpy sampling issue that I just posted here: https://github.com/pgmpy/pgmpy/issues/1387. It needs to be resolved before I can make the example. Once it is, I will get back to this.

MathijsPost commented 3 years ago

Hi @mbaak ,

That's alright, take your time!

MathijsPost commented 3 years ago

@ankurankan , is there a way to make the prediction process more stochastic? Such that it not always picks the state with the highest probability but occasionally other states as well according to the distribution?

MathijsPost commented 3 years ago

@ankurankan what are your thoughts on the following:

I see that predict calls the map_query function. In the map_query I replaced:

argmax = np.argmax(final_distribution.values)

by:

from random import choices
# pick an index weighted by the (normalized) factor values instead of the argmax
population = list(np.arange(0, len(final_distribution.values)))
weights = list(final_distribution.values)
choice = choices(population, weights=weights)
argmax = choice[0]

This way it picks the index based on the weights. It is no longer officially a MAP query, but it can pick values with smaller probabilities as well.

ankurankan commented 3 years ago

@MathijsPost I would suggest adding something like this to the predict method, as modifying the MAP behavior to be stochastic would be incorrect. We could add a new argument to the predict method which the user can use to specify whether they want a stochastic prediction or a plain MAP one. What do you think?

MathijsPost commented 3 years ago

Hi @ankurankan, I think that sounds like a better solution! I agree that it is incorrect to modify the MAP behavior. Do you think that is a lot of work? For my application it is sometimes incorrect to always pick the state with the highest probability, but I suppose it depends on the application.

ankurankan commented 3 years ago

@MathijsPost It should be very straightforward. For getting stochastic results, we will just need to replace the map_query call in the predict method with query and then sample the predictions (something like what you suggest above) from the returned distribution.
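
For reference, a tiny sketch of that idea, reusing the infer and evidence objects from the sketch earlier in this thread (and assuming the factor returned by query exposes values and state_names, as DiscreteFactor does):

import numpy as np

posterior = infer.query(variables=["J"], evidence=evidence)  # instead of map_query
p = posterior.values.ravel()
p = p / p.sum()
idx = np.random.choice(len(p), p=p)     # sample a state index, not the argmax
print(posterior.state_names["J"][idx])  # map the index back to the state label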

ankurankan commented 3 years ago

@MathijsPost The stochastic predict should work now in the latest dev version.
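
Usage sketch, with the stochastic argument as it appears later in this thread; predict_data stands for a DataFrame holding the evidence columns of the rows you want to predict:

y_map = model.predict(predict_data, stochastic=False)      # MAP predictions
y_sampled = model.predict(predict_data, stochastic=True)   # sample each prediction from its posterior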

MathijsPost commented 3 years ago

@ankurankan Cool man! Thank you

MathijsPost commented 3 years ago

Hi @mbaak, was it possible to resolve this issue? It would be very helpful if you could provide me with an example for the G-test statistic. Thank you :)

MathijsPost commented 3 years ago

Hi @mbaak, what I tried yesterday was the following:

from pgmpy.inference import BayesianModelProbability
a = BayesianModelProbability(model)
a.score(dataset)

This resulted in the following error:

C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py:522: UserWarning: Found unknown state name. Trying to switch to using all state names as state numbers
  "Found unknown state name. Trying to switch to using all state names as state numbers"
C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\inference\bn_inference.py:142: RuntimeWarning: divide by zero encountered in log
  return np.log(probability_node)

MathijsPost commented 3 years ago

@ankurankan Hi! Thanks again for implementing stochastic predictions. I am comparing MAP and stochastic predictions on the same dataset; however, I ran into a small issue with the stochastic predictions. The MAP predictions work fine, but the stochastic predictions gave me the following error:

100%|██████████| 722/722 [8:02:07<00:00, 40.07s/it]  

C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py:462: RuntimeWarning: invalid value encountered in true_divide
  phi.values = phi.values / phi.values.sum()
Traceback (most recent call last):

  File "<ipython-input-4-c85949579e3b>", line 19, in <module>
    predictedvariables = predict_variables_fixed_input(day, predvariables, len(day), structurelearningmodel, True, dictionaries)

  File "D:\Documents\!AE\MSc\Thesis\testBN_v11v2.py", line 305, in predict_variables_fixed_input
    y_pred = model.predict(predict_data, stochastic=stoch, n_jobs=-1)

  File "C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\models\BayesianModel.py", line 599, in predict
    p = pred_values[i].sample(n=len(row))

  File "C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 813, in sample
    indexes = np.random.choice(range(len(p)), size=n, p=p)

  File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice

ValueError: probabilities contain NaN

I am going to look into this myself as well, just so you are aware. It might have to do with the NaN issue we discussed earlier.