matheusfacure / python-causality-handbook

Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html

Question on Chapter 11 about the Logistic regression: pd.get_dummies() #324

Closed Junlei-Chen closed 1 year ago

Junlei-Chen commented 1 year ago

First of all, thank you so much for putting together such a great resource!!! Very happy that I came across it.

I have a question/issue about the logistic regression code you used to estimate the propensity score, regarding the common convention of dropping the redundant first dummy:

The issue is in Chapter 11, in the following code:

categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False)  # categorical features converted to dummies
], axis=1)

print(data_with_categ.shape)

It should be:

categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=True)  # categorical features converted to dummies
], axis=1)

print(data_with_categ.shape)

I'm just very curious why we would not drop the first dummy here... Shouldn't it be dropped?

Thank you!

matheusfacure commented 1 year ago

That's a good question. It probably has something to do with the solver. The predictions from a model with linearly dependent columns are the same, yet it somehow converges faster. When I changed the code to your suggestion, it gave me multiple non-convergence warnings.

from sklearn.linear_model import LogisticRegression
import numpy as np

n = 1000

x = np.random.normal(0, 1, size=n)
y = (np.random.normal(x, 1, size=n) > 0)

# unpenalized logistic regression on a single feature
# (penalty="none" here; newer scikit-learn versions spell this penalty=None)
m1 = LogisticRegression(penalty="none")
p1 = m1.fit(x.reshape(-1, 1), y).predict_proba(x.reshape(-1, 1))[:, 1]

# same model, but with the feature duplicated, i.e. perfectly collinear columns
m2 = LogisticRegression(penalty="none")
p2 = m2.fit(np.vstack([x, x]).T, y).predict_proba(np.vstack([x, x]).T)[:, 1]

np.allclose(p1, p2, 1e-7)
>>> True

m1.n_iter_
>>> 8

m2.n_iter_
>>> 6
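
For completeness, here is a minimal sketch (not from the original thread) of the same comparison done directly with pd.get_dummies on a made-up categorical column. The column name, the random data, and penalty=None (which requires scikit-learn >= 1.2) are assumptions for illustration only, so the iteration counts and any convergence warnings will vary from run to run:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
n = 1000

# hypothetical categorical feature with 4 levels and an unrelated binary outcome
df = pd.DataFrame({"school_urbanicity": np.random.choice([0, 1, 2, 3], size=n)})
y = np.random.binomial(1, 0.5, size=n)

# full dummy encoding: the dummy columns sum to 1, so they are linearly dependent with the intercept
X_full = pd.get_dummies(df, columns=["school_urbanicity"], drop_first=False)
# encoding with the first dummy dropped
X_drop = pd.get_dummies(df, columns=["school_urbanicity"], drop_first=True)

# penalty=None means no regularization (older scikit-learn versions used penalty="none")
m_full = LogisticRegression(penalty=None, max_iter=1000).fit(X_full, y)
m_drop = LogisticRegression(penalty=None, max_iter=1000).fit(X_drop, y)

# the fitted probabilities should agree up to numerical tolerance
print(np.abs(m_full.predict_proba(X_full)[:, 1] - m_drop.predict_proba(X_drop)[:, 1]).max())

# but the number of solver iterations can differ between the two encodings
print(m_full.n_iter_, m_drop.n_iter_)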