mckinsey / causalnex

A Python library that helps data scientists to infer causation rather than observing correlation.
http://causalnex.readthedocs.io/

[Bug]: Classification Model always predicting 0 #217

Open yc-um opened 10 months ago

yc-um commented 10 months ago

Contact Details

yhcho@umich.edu

Short description of the problem here.

Hi, I built a classification model based on the hybrid structure model learned from a dataset with 6 features and ~40k records, following the tutorial. When I evaluated the model on the test set, I got 0 precision and 0 recall for class 1. I tried constructing the model with different parameters, but that did not fix the issue: precision and recall for class 1 stay at 0, even though ~20% of the full dataset is class 1. What might be wrong here? I would appreciate any feedback or suggestions.

{'Target_0': {'precision': 0.8104828298476633, 'recall': 1.0, 'f1-score': 0.8953223046206502, 'support': 3139.0},
 'Target_1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 734.0},
 'accuracy': 0.8104828298476633,
 'macro avg': {'precision': 0.40524141492383164, 'recall': 0.5, 'f1-score': 0.4476611523103251, 'support': 3873.0},
 'weighted avg': {'precision': 0.6568824174778763, 'recall': 0.8104828298476633, 'f1-score': 0.7256433550746763, 'support': 3873.0}}
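For reference, these numbers are what a constant predictor that labels every test record as class 0 would produce. A minimal sketch that rebuilds the headline metrics from the two support counts in the report (3139 and 734); the function name `constant_zero_report` is hypothetical and not part of CausalNex:

```python
def constant_zero_report(n_class0: int, n_class1: int) -> dict:
    """Metrics for a classifier that labels every record as class 0."""
    total = n_class0 + n_class1
    # All records are predicted 0, so class-0 precision is the class-0 share
    precision0 = n_class0 / total
    # Every true class-0 record is recovered
    recall0 = 1.0
    f1_0 = 2 * precision0 * recall0 / (precision0 + recall0)
    return {
        "Target_0": {"precision": precision0, "recall": recall0, "f1-score": f1_0},
        # Class 1 is never predicted; its precision is reported as 0 by convention
        "Target_1": {"precision": 0.0, "recall": 0.0, "f1-score": 0.0},
        "accuracy": n_class0 / total,
    }


report = constant_zero_report(3139, 734)
print(report["accuracy"])
print(report["Target_0"]["f1-score"])
```

The accuracy and the class-0 F1 score agree with the values in the report, which suggests the model never outputs class 1 on this split rather than occasionally getting it wrong.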

CausalNex Version

0.12.1

Python Version

3.9.18

Relevant code snippet

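import pandas as pd
# Assumed import: the NOTEARS from_pandas used in the tutorial (the original snippet omits it)
from causalnex.structure.notears import from_pandas

sm = from_pandas(df, tabu_edges=[("Target","X1"),("Target","X2"),("Target","X3"),("Target","X4"),("Target","X5")], w_threshold=0.8)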
#MANUAL ADJUSTMENT OF CONNECTIONS
sm.add_edge("X1", "Target")
sm.add_edge("X2", "X3")

from causalnex.network import BayesianNetwork

bn = BayesianNetwork(sm)

discretised_data = df.copy()

columns_to_bin = ['X1', 'X2', 'X3', 'X4', 'X5']

# Bin input data
num_bins = 5
bin_range = (0, 10)

# Loop through the columns and apply pd.cut to create equal-width bins
for column in columns_to_bin:
    discretised_data[column] = pd.cut(discretised_data[column], bins=num_bins, labels=False, retbins=False, right=True, include_lowest=True)

# Split 90% train and 10% test
from sklearn.model_selection import train_test_split
train, test = train_test_split(discretised_data, train_size=0.9, test_size=0.1, random_state=7)

bn = bn.fit_node_states(discretised_data)
bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")

from causalnex.evaluation import classification_report
classification_report(bn, test, "Target")
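One way to narrow this down is to check whether any combination of Target's parent states has class 1 as its most frequent label in the training data. The sketch below assumes that Target has no children (the tabu edges forbid edges out of Target) and that prediction picks the most probable Target state given its parents; neither assumption is confirmed in this thread. The helper `majority_by_parent_state` and the toy rows are hypothetical, and `parents` should be replaced with the actual parents of Target in the learned structure:

```python
from collections import Counter, defaultdict


def majority_by_parent_state(rows, parents, target="Target"):
    """Return the most frequent target label for each combination of parent states."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[p] for p in parents)
        counts[key][row[target]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# Toy data: 19% of records are class 1, but class 0 is the majority in every X1 bin
rows = (
    [{"X1": 0, "Target": 0}] * 70
    + [{"X1": 0, "Target": 1}] * 10
    + [{"X1": 1, "Target": 0}] * 11
    + [{"X1": 1, "Target": 1}] * 9
)

majority = majority_by_parent_state(rows, parents=["X1"])
print(majority)
print(set(majority.values()))
```

With the real data, `rows` could be built with `train.to_dict("records")`. If the set of majority labels contains only 0, a most-probable-state prediction would return 0 for every record, regardless of the overall 20% share of class 1. In that case, finer or quantile-based binning (for example `pd.qcut`) or additional parents for Target might be worth trying.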

Relevant log output

No response

SarthakNikhal commented 9 months ago

Can I work on this issue?