pgmpy / pgmpy_notebook

Short Tutorial to Probabilistic Graphical Models(PGM) and pgmpy
http://pgmpy.org/
MIT License
368 stars 213 forks source link

KeyError in notebook "1. Intro..." #54

Open ansgar-t opened 2 years ago

ansgar-t commented 2 years ago

Depending on the random train/test split this code can give a key error:

for i in X_test_features:
    predicted_values.append(joint_prob[i[0], i[1]].idxmax())

Here's a possible alternative:

for i in X_test_features:
    key = (i[0], i[1])
    conditional_prob = joint_prob[key].idxmax() if key in joint_prob else 0.0
    predicted_values.append(conditional_prob)
ankurankan commented 2 years ago

@ansgar-t Yes, this seems to be an issue. Thanks for reporting it. But I think it would be better to assign None/Nan for keys that don't exist. As 0 is an actual class value which would lead to mis-classification. Something like: joint_prob[key].idxmax() if key in joint_prob else None. What do you think? and would you like to create a PR for it?

ansgar-t commented 2 years ago

you're right about the else case, of course. "0.0" doesn't make sense there.

looks like I thought of separating 2 steps:

... and then I didn't. :)

ansgar-t commented 2 years ago

having said that...

I think adding the following code to the preparation of joint_prob would be my preferred solution now:

# making sure, that the estimated joint probability is defined over the full domain, 
# using 0.0 for value combinations not seen in the data:

length_domain = range(16) # assuming length cannot exceed 15
width_domain = range(11) # assuming width cannot exceed 10
type_domain = range(3) 

full_index = pd.MultiIndex.from_product([length_domain, width_domain, type_domain])

joint_prob = joint_prob.reindex(full_index).fillna(0.0)
ankurankan commented 2 years ago

@ansgar-t Sorry for the super late reply. Yes, this solution also looks good. Would be great if you would open a PR with the fix :). Thanks