uber / causalml

Uplift modeling and causal inference with machine learning algorithms

CausalRandomForestRegressor with causal_mse predicts to inf on data with nuisance #589

Open winston-zillow opened 1 year ago

winston-zillow commented 1 year ago

Describe the bug
After training the CausalRandomForestRegressor with the causal_mse criterion on data with nuisance, many of the predicted ITE values are inf.

To Reproduce
I changed the causal trees with synthetic data notebook to use data generated by simulate_nuisance_and_easy_treatment:

# y, X, w, tau, b, e = synthetic_data(mode=5, n=10000, p=20, sigma=5.0)
from causalml.dataset import simulate_nuisance_and_easy_treatment
y, X, w, tau, b, e = simulate_nuisance_and_easy_treatment(n=10000, p=20, sigma=5.0)

After training the CausalRandomForestRegressor with criterion causal_mse using the same code:

rforest2 = CausalRandomForestRegressor(criterion="causal_mse",
                                       min_samples_leaf=200,
                                       control_name=0,
                                       n_estimators=50,
                                       n_jobs=4)
rforest2.fit(X=df_train[feature_names].values,
             treatment=df_train['treatment'].values,
             y=df_train['outcome'].values
             )

many of the predicted ITE values are inf.

rf2_ite_pred = rforest2.predict(df_test[feature_names].values)
rf2_ite_pred[:100]
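A quick way to quantify how widespread the problem is would be to count the non-finite predictions. This is a numpy-only sketch; the array below is a hypothetical stand-in for the actual rf2_ite_pred output:

```python
import numpy as np

# Hypothetical stand-in for rf2_ite_pred from the snippet above
rf2_ite_pred = np.array([0.5, np.inf, -0.2, np.inf, 1.3, 0.8])

# Count how many predictions are infinite
n_inf = int(np.isinf(rf2_ite_pred).sum())
print(f"{n_inf} of {rf2_ite_pred.size} predictions are inf")  # 2 of 6
```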

This is the case even if I change the nuisance to something simpler:

    #b = (
    #       np.sin(np.pi * X[:, 0] * X[:, 1])
    #        + 2 * (X[:, 2] - 0.5) ** 2
    #        + X[:, 3]
    #        + 0.5 * X[:, 4]
    #)
    b = X[:, 3] + 2 * X[:, 4] + 3 * X[:, 1]

Expected behavior
The model should predict valid, finite values.


winston-zillow commented 1 year ago

Note: CausalRandomForestRegressor with standard_mse predicts fine on the same data.

winston-zillow commented 1 year ago

More debug info: one of the trained trees appears to be bad; its feature importances contain nan:

print(rforest2.estimators_[10].feature_importances_)
=> [ 0. nan  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]

After these trees are removed, the predictions are no longer inf, but some are still extreme negative values (around -3e+13).
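The tree-removal step described above could be sketched as filtering out any tree whose feature importances are not finite. The SimpleNamespace objects here are stand-ins for fitted trees; in practice the list would be rforest2.estimators_ (the ensemble follows scikit-learn's estimators_ convention):

```python
import numpy as np
from types import SimpleNamespace

# Stand-ins for fitted trees; in practice use rforest2.estimators_
trees = [
    SimpleNamespace(feature_importances_=np.array([0.6, 0.4])),
    SimpleNamespace(feature_importances_=np.array([0.0, np.nan])),  # bad tree
    SimpleNamespace(feature_importances_=np.array([0.3, 0.7])),
]

# Keep only trees whose feature importances are all finite
good_trees = [t for t in trees if np.isfinite(t.feature_importances_).all()]
print(len(good_trees))  # 2
```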

alexander-pv commented 1 year ago

Hi, thanks for the report. The issue has been fixed recently in https://github.com/uber/causalml/pull/583.

Please reinstall the package from source. You can also generate the desired type of synthetic data by changing the mode parameter:

y, X, w, tau, b, e = synthetic_data(mode=1, n=10000, p=20, sigma=5.0)

In causal_trees_with_synthetic_data.ipynb you will get the following result: (attached plot: synth_data_mode1)

winston-zillow commented 1 year ago

Thanks. Reinstalling from source fixes the problem!

winston-zillow commented 1 year ago

This still happens with my real-world data. Some predictions result in nan (rather than inf). Maybe there's still an issue?

alexander-pv commented 1 year ago

Hi. Could you please plot each tree from your fitted CausalRandomForestRegressor using plot_causal_tree in causalml.inference.tree.plot and attach the images? You could also attach a small dataset that reproduces the nan issue.

lemonlmn commented 1 month ago

Hi, I encountered the same nan issue when predicting with CausalRandomForestRegressor.

When using 'causal_mse', the nan ratio is around 10%. 'standard_mse' is better, but still produces around 2% nan.
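For reporting ratios like these, a numpy-only sketch can measure the nan fraction and summarize only the finite predictions; the array below is a hypothetical stand-in for the model's ITE output:

```python
import numpy as np

# Hypothetical ITE predictions standing in for the model output
ite_pred = np.array([0.4, np.nan, 1.1, -0.3, np.nan, 0.8, 0.2, 0.5, 0.9, 0.1])

nan_ratio = float(np.isnan(ite_pred).mean())      # fraction of nan predictions
finite_mean = float(np.nanmean(ite_pred))          # mean over finite values only
print(f"nan ratio: {nan_ratio:.0%}, mean ITE (finite only): {finite_mean:.4f}")
```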

lemonlmn commented 1 month ago

BTW, it seems plot_causal_tree only works for CausalTreeRegressor, not for CausalRandomForestRegressor.