BayesianModel prediction on new data

pgmpy / pgmpy_notebook

Short Tutorial to Probabilistic Graphical Models(PGM) and pgmpy

http://pgmpy.org/

MIT License


BayesianModel prediction on new data #42

Closed SrGrace closed 4 years ago

SrGrace commented 4 years ago

How do I predict on completely new data that BayesianModel() hasn't been trained on?

Example:

import numpy as np
import pandas as pd
from pgmpy.models import BayesianModel
values = pd.DataFrame(np.random.randint(low=0, high=2, size=(100, 3)),
                      columns=['A', 'B', 'C'])

model = BayesianModel([('A', 'B'), ('C', 'B')])
model.fit(values)

predict_data = pd.DataFrame({'A': [-1], 'C': [1]})
y_pred = model.predict(predict_data)
y_pred

KeyError: -1

SrGrace commented 4 years ago

@ankurankan, could you please help?

It's happening even when predicting on the same training corpus if I save the model with BIFWriter and then load it with BIFReader.

ankurankan commented 4 years ago

@SrGrace You can pass an additional state_names argument to the fit method which specifies all the possible states for the variables. This will automatically create states which don't exist in the data. Have a look at the documentation here: https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/models/BayesianModel.py#L489.

About BIFReader and BIFWriter: are the state names in your data int values? When reading back from a file, BIFReader has no way to tell whether a state name was originally int or str, so by default it always assumes the state names are str. So if you try to predict using int state names after reading, it will throw an error. If you are getting the error for some other reason, could you share your code so that I can reproduce it?
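The workaround this implies is to cast the new data to str so that its values match the str state names read back from the file. A minimal sketch using only pandas; the column values are illustrative, and the loaded model itself is assumed rather than shown:

```python
import pandas as pd

# New data with int values, as in the training DataFrame.
predict_data = pd.DataFrame({'A': [0, 1], 'C': [1, 0]})

# Cast every column to str so the values look like '0' and '1',
# matching state names that were read back from a BIF file as str.
predict_data_str = predict_data.astype(str)

print(predict_data_str['A'].tolist())
```

The cast DataFrame would then be passed to predict on the reloaded model in place of the int-valued one.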

SrGrace commented 4 years ago

@ankurankan thanks for replying!

Okay, I'm now typecasting the test df to str, and it works fine with BIFReader and BIFWriter. Still, is there any other way for BIFReader to get the data in its training data types? Even after specifying state_names as below (using the above example):

model = BayesianModel([('A', 'B'), ('C', 'B')])
model.fit(values, state_names={'A': list(set(values['A'])),  # [0, 1]
                               'B': list(set(values['B'])),
                               'C': list(set(values['C']))})

where the data type of all these columns in the training data is int, I still have to typecast the test data while predicting.

About the new data: -1 does not occur in the training corpus for column 'A', and I can't define it in state_names as {'A': [-1, 0, 1]} because that throws ValueError: Data contains unexpected states for variable 'A'. So I'm not sure how to handle completely new data.
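One way to diagnose errors like the KeyError and ValueError mentioned in this thread is to compare the values in each column against the declared states before calling fit or predict. A minimal pandas-only sketch; the helper name find_unexpected_states is hypothetical and not part of pgmpy:

```python
import pandas as pd

def find_unexpected_states(data, state_names):
    """Return {column: sorted values that are not listed in state_names[column]}."""
    unexpected = {}
    for column in data.columns:
        allowed = set(state_names.get(column, []))
        # tolist() converts numpy scalars to plain Python values.
        extra = set(data[column].tolist()) - allowed
        if extra:
            unexpected[column] = sorted(extra)
    return unexpected

# States as declared from the training data in the example above.
state_names = {'A': [0, 1], 'B': [0, 1], 'C': [0, 1]}
predict_data = pd.DataFrame({'A': [-1], 'C': [1]})

unexpected = find_unexpected_states(predict_data, state_names)
print(unexpected)  # {'A': [-1]}
```

An empty result means every value in the data is covered by the declared states; a non-empty one points at the columns whose state lists would need to be extended.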

ankurankan commented 4 years ago

@SrGrace I pushed an update to pgmpy yesterday (https://github.com/pgmpy/pgmpy/pull/1285) which now allows you to specify state_name_type for BIFReader.get_model. So, you can basically specify what type you want the read state names to be and it will automatically convert them.

About specifying the extra state, what version of pgmpy are you using? Because it works on my machine:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from pgmpy.models import BayesianModel
   ...: values = pd.DataFrame(np.random.randint(low=0, high=2, size=(100, 3)),
   ...:                       columns=['A', 'B', 'C'])
   ...:
   ...: model = BayesianModel([('A', 'B'), ('C', 'B')])
   ...: model.fit(values, state_names={'A': [-1, 0, 1], 'B': [0, 1], 'C': [0, 1]})

In [2]: model.get_cpds()
Out[2]:
[<TabularCPD representing P(A:3) at 0x7fbfefbfc198>,
 <TabularCPD representing P(B:2 | A:3, C:2) at 0x7fbff0542a20>,
 <TabularCPD representing P(C:2) at 0x7fbf7893a400>]

In [3]: model.cpds[0]
Out[3]: <TabularCPD representing P(A:3) at 0x7fbfefbfc198>

In [4]: print(model.cpds[0])
+-------+------+
| A(-1) | 0    |
+-------+------+
| A(0)  | 0.55 |
+-------+------+
| A(1)  | 0.45 |
+-------+------+
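The 0 in the A(-1) row is what one would expect if fit uses maximum-likelihood estimation (assumed here to be the default): a state that is declared but never observed gets zero probability. A rough pandas-only sketch of that calculation for the root node A; it is an illustration of the idea, not pgmpy's implementation, and the seed and variable names are arbitrary:

```python
import numpy as np
import pandas as pd

# Same shape of data as in the thread: 100 rows of 0/1 values.
rng = np.random.default_rng(0)
values = pd.DataFrame(rng.integers(low=0, high=2, size=(100, 3)),
                      columns=['A', 'B', 'C'])

# States declared for A, including -1, which never occurs in the data.
declared_states = [-1, 0, 1]

# Relative frequencies of the observed states, with unobserved
# declared states filled in as 0.
marginal_a = (values['A']
              .value_counts(normalize=True)
              .reindex(declared_states, fill_value=0.0))

print(marginal_a)
```

The observed states share all of the probability mass, so the extra state exists in the model but carries no weight learned from this data.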