matheusfacure / python-causality-handbook

Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html
MIT License
2.61k stars 456 forks source link

Chapter 22 implementation of cv_estimate() has a bug #320

Closed cgrinaldi closed 1 year ago

cgrinaldi commented 1 year ago

There is an issue on chapter 22, in the following code block:

from sklearn.model_selection import KFold

def cv_estimate(train_data, n_splits, model, model_params, X, y):
    cv = KFold(n_splits=n_splits)
    m = model(**model_params)

    models = []
    cv_pred = pd.Series(np.nan, index=train_data.index)
    for train, test in cv.split(train_data):
        m.fit(train_data[X].iloc[train], train_data[y].iloc[train])
        cv_pred.iloc[test] = m.predict(train_data[X].iloc[test])
        models += [m]

    return cv_pred, models

The model m is created outside of the for loop, which causes there to be a single model that is fit multiple times. This causes the resulting list of models to be a list of the same model, repeated.

This can be verified by seeing that models[0] == models[1] returns True (which it shouldn't be).

To fix this issue, m = model(**model_params) should be moved to the first line within the for loop.