Open ansgar-t opened 2 years ago
@ansgar-t Yes, this seems to be an issue. Thanks for reporting it. But I think it would be better to assign None/NaN for keys that don't exist, since 0 is an actual class value and using it would lead to misclassification. Something like: joint_prob[key].idxmax() if key in joint_prob else None
What do you think? And would you like to create a PR for it?
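For illustration, the guard suggested above can be sketched as follows. This is only a minimal sketch: the toy data, the column names (length, width, type) and the helper predict_type are assumptions made up for the example, and the lookup is written with try/except KeyError rather than the `in` test, which should behave the same way for unseen keys.

```python
import pandas as pd

# Hypothetical toy training data; column names follow the ones used in this thread.
train = pd.DataFrame({
    "length": [7, 7, 8, 8],
    "width": [4, 4, 5, 5],
    "type": [0, 0, 1, 1],
})

# Empirical joint probability P(length, width, type) as a MultiIndex Series.
joint_prob = train.groupby(["length", "width", "type"]).size() / len(train)


def predict_type(length, width):
    """Return the most probable type, or None if (length, width) was never seen."""
    try:
        # Partial key: selects the sub-Series indexed by type.
        per_type = joint_prob.loc[(length, width)]
    except KeyError:
        # Unseen combination: report "no prediction" instead of class 0.
        return None
    return per_type.idxmax()
```

With this, predict_type(9, 9) yields None instead of raising, so an unseen combination is not silently counted as class 0.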
you're right about the else case, of course. "0.0" doesn't make sense there.
looks like I thought of separating 2 steps:
... and then I didn't. :)
having said that...
I think adding the following code to the preparation of joint_prob would be my preferred solution now:
# Make sure that the estimated joint probability is defined over the full domain,
# using 0.0 for value combinations not seen in the data:
length_domain = range(16)  # assuming length cannot exceed 15
width_domain = range(11)  # assuming width cannot exceed 10
type_domain = range(3)  # assuming the classes are 0, 1 and 2
full_index = pd.MultiIndex.from_product([length_domain, width_domain, type_domain])
joint_prob = joint_prob.reindex(full_index).fillna(0.0)
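As a rough end-to-end sketch of the reindexing idea above: the toy data and the helper predict_type are assumptions for illustration only, and the domain bounds are the ones assumed in the snippet. One thing worth noting is that after filling with 0.0, an unseen (length, width) pair has an all-zero row, and idxmax then returns the first label (0), so the sketch checks for zero total probability before predicting.

```python
import pandas as pd

# Hypothetical toy training data; column names follow the ones used in this thread.
train = pd.DataFrame({
    "length": [7, 7, 8, 8],
    "width": [4, 4, 5, 5],
    "type": [0, 0, 1, 1],
})
joint_prob = train.groupby(["length", "width", "type"]).size() / len(train)

# Full domain, using the bounds assumed above (length <= 15, width <= 10, 3 types).
full_index = pd.MultiIndex.from_product(
    [range(16), range(11), range(3)],
    names=["length", "width", "type"],
)
joint_prob_full = joint_prob.reindex(full_index).fillna(0.0)


def predict_type(length, width):
    """Most probable type, or None when the pair has zero estimated probability."""
    per_type = joint_prob_full.loc[(length, width)]
    if per_type.sum() == 0.0:
        # All-zero row: idxmax would return the first label, i.e. class 0.
        return None
    return per_type.idxmax()
```

Lookups such as joint_prob_full.loc[(9, 9)] no longer raise a KeyError, at the cost of a larger Series that grows with the product of the domain sizes.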
@ansgar-t Sorry for the super late reply. Yes, this solution also looks good. Would be great if you would open a PR with the fix :). Thanks
Depending on the random train/test split this code can give a key error:
Here's a possible alternative: