Closed ivanDonadello closed 3 years ago
@ivanDonadello The API for HillClimbSearch was changed in the last release, and now it expects the scoring method to be specified in the estimate
method. So, in the example that you have posted above, it uses the default K2Score
. If you use BicScore
, you should get the expected results:
In [16]: import pandas as pd
...: import numpy as np
...: from pgmpy.estimators import HillClimbSearch, BicScore
...: data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
...: data['J'] = data['A'] * data['B']
...: est = HillClimbSearch(data)
...: best_model = est.estimate(scoring_method=BicScore(data))
...: best_model.edges()
0%| | 2/1000000 [00:00<68:50:16, 4.04it/s]
Out[16]: OutEdgeView([('A', 'J'), ('B', 'J')])
Also, if your code is based on an example from the pgmpy documentation, could you please point me to it (so I could update it)?
@ankurankan thanks for your answer. I use the chow-liu, tan and PC estimators in this way:
est = TreeSearch(data, root_node=states_[0])
model = est.estimate(estimator_type='chow-liu')
est = TreeSearch(data, root_node=states_[0])
model = est.estimate(estimator_type='tan', class_node=states_[1])
est = PC(data)
model = est.estimate()
Then I estimate the parameters:
model = BayesianModel(model.edges())
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts=0.1)
Finally, I query the model
model.predict(values, n_jobs=6)
My idea is to find a structure that best predicts the values of the evidence. Now I can also use HillClimbSearch.
@ivanDonadello A better method to verify your model structure could be to test the implied conditional independencies against the dataset. This will avoid any additional noise due to parameter learning. The R package dagitty
has a neat function to do it easily: https://rdrr.io/cran/dagitty/man/localTests.html
Thank you very much!
But can variable prediction be used as a measure to find the accuracy of a model? I had the same idea, but I cannot really find the reasoning behind variable prediction. It uses variable elimination if I'm not mistaken, but how does it pick the state of the missing variable? Does it always pick the state with the highest percentage? Or can the other states be chosen as well?
@MathijsPost Sorry for the late reply. Model prediction can certainly help where you can compare the accuracy (or some other metric) of the predictions. But a point to keep in mind is that in that case, we are optimizing for a conditional distribution (P(Y | X)) which is what most of the general machine learning algorithms are built for. They usually optimize an error term which is computed by the difference in true and predicted values. And if your goal is to learn a conditional distribution, I think a lot of the machine learning methods would outperform a Bayesian network. But if your goal is to learn a causal structure, I would suggest doing model testing using statistical independence tests as only the structure implies the causal relations.
Yes, the predict method does a MAP (Maximum a Posteriori) estimation, so it uses inference algorithm to do a query with the given columns as evidence, computes a joint distribution on the missing columns, and then returns the combination of variable states which has the highest probability.
Thank you @ankurankan ! What do you think is the best method to test statistical independence on a structure learned Bayesian Network? Is there a build-in function in pgmpy? I am looking for a slightly more intuitive metric other than the BIC, K2 or BDeu scores.
@MathijsPost I don't know of any single metric to test the model structure. But a common way is to test the conditional independence implied by the network structure using statistical tests in the dataset. pgmpy doesn't have a method to do it directly yet, but you can use the R package dagitty for it: https://rdrr.io/cran/dagitty/man/localTests.html. Can also have a look at this: https://currentprotocols.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cpz1.45
Also, @mbaak recently contributed this method: https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/inference/bn_inference.py for computing the likelihood of data given model structure with parameters. It can also be used as a metric.
@MathijsPost With the log_probability(X) method now in https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/inference/bn_inference.py one can calculate the G-test statistics of how well the model fits the dataset (https://en.wikipedia.org/wiki/G-test). This can be interpreted as Pearson's chi-squared test formula (with a bit of work), giving a null p-value of how well the model describes the dataset - in essence a test of the model structure. I'd be happy to write an example of how to do this.
Hi @mbaak , that sounds very interesting! If it is not too much to work, an example would be nice :)
Hi @MathijsPost, I tried to make an example for you but while doing this ran into a pgmpy sampling issue that I just posted here: https://github.com/pgmpy/pgmpy/issues/1387 This needs to be resolved in order to make the example. Once it is, I will get back to this.
Hi @mbaak ,
That's alright, take your time!
@MathijsPost Sorry for the late reply. Model prediction can certainly help where you can compare the accuracy (or some other metric) of the predictions. But a point to keep in mind is that in that case, we are optimizing for a conditional distribution (P(Y | X)) which is what most of the general machine learning algorithms are built for. They usually optimize an error term which is computed by the difference in true and predicted values. And if your goal is to learn a conditional distribution, I think a lot of the machine learning methods would outperform a Bayesian network. But if your goal is to learn a causal structure, I would suggest doing model testing using statistical independence tests as only the structure implies the causal relations.
Yes, the predict method does a MAP (Maximum a Posteriori) estimation, so it uses inference algorithm to do a query with the given columns as evidence, computes a joint distribution on the missing columns, and then returns the combination of variable states which has the highest probability.
@ankurankan , is there a way to make the prediction process more stochastic? Such that it not always picks the state with the highest probability but occasionally other states as well according to the distribution?
@ankurankan what are your thoughts on the following:
I see that predict calls the map_query
function. In the map_query
I replaced:
argmax = np.argmax(final_distribution.values)
by:
from random import choices
population = list(np.arange(0, len(final_distribution.values))
weights = list(final_distribution.values)
choice = choices(population, weights)
argmax = choice[0]
This way it picks the index based on the weights. It is no longer officially a MAP query, but it can pick values with smaller probabilities as well.
@MathijsPost I would suggest adding something like this in the predict
method as modifying the MAP behavior to stochastic would be incorrect. We could add a new argument to the predict
method which the user can use to specify whether they want a stochastic prediction or a simple MAP one. What do you think?
Hi @ankurankan I think that sounds like a better solution! I agree that it is incorrect to modify the MAP behavior. Do you think that is a lot of work? For my application it is sometimes incorrect to always pick the state with the highest probability but I suppose it depends on the application.
@MathijsPost It should be very straightforward. For getting stochastic results, we will just need to replace the map_query
call in the predict
method with query
and then just sample the predictions (something like what you suggest above) from the returned distribution.
@MathijsPost The stochastic predict
should work now in the latest dev version.
@ankurankan Cool man! Thankyou
Hi @MathijsPost, I tried to make an example for you but while doing this ran into a pgmpy sampling issue that I just posted here:
1387
This needs to be resolved in order to make the example. Once it is, I will get back to this.
Hi @mbaak , was it possible to resolve this issue? I would be very helpful if you could provide me an example for the G-test statistic. Thank you :)
Hi @MathijsPost, I tried to make an example for you but while doing this ran into a pgmpy sampling issue that I just posted here:
1387
This needs to be resolved in order to make the example. Once it is, I will get back to this.
Hi @mbaak , was it possible to resolve this issue? I would be very helpful if you could provide me an example for the G-test statistic. Thank you :)
Hi @mbaak, what I tried yesterday was the following:
from pgmpy.inference import BayesianModelProbability
a = BayesianModelProbability(model)
a.score(dataset)
This resulted in the following error:
C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py:522: UserWarning: Found unknown state name. Trying to switch to using all state names as state numbers
"Found unknown state name. Trying to switch to using all state names as state numbers"
C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\inference\bn_inference.py:142: RuntimeWarning: divide by zero encountered in log
return np.log(probability_node)
@ankurankan Hi! Thanks again for implementing stochastic predictions. I am comparing MAP and stochastic predictions on the same dataset however, I ran into a small issue for the stochastic predictions. The MAP predictions work fine, but the stochastic predictions gave me the following error:
100%|██████████| 722/722 [8:02:07<00:00, 40.07s/it]
C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py:462: RuntimeWarning: invalid value encountered in true_divide
phi.values = phi.values / phi.values.sum()
Traceback (most recent call last):
File "<ipython-input-4-c85949579e3b>", line 19, in <module>
predictedvariables = predict_variables_fixed_input(day, predvariables, len(day), structurelearningmodel, True, dictionaries)
File "D:\Documents\!AE\MSc\Thesis\testBN_v11v2.py", line 305, in predict_variables_fixed_input
y_pred = model.predict(predict_data, stochastic=stoch, n_jobs=-1)
File "C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\models\BayesianModel.py", line 599, in predict
p = pred_values[i].sample(n=len(row))
File "C:\Users\TUDelft SID\Anaconda3\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 813, in sample
indexes = np.random.choice(range(len(p)), size=n, p=p)
File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
I am going to look into this my self as well, just so you are aware. It might have to do with the nan
issue we discussed earlier.
Subject of the issue
I want to reproduce the example here
Your environment
Steps to reproduce
Expected behaviour
[('B', 'J'), ('A', 'J')]
Actual behaviour
[('A', 'B'), ('J', 'A'), ('J', 'B')]