matheusfacure / python-causality-handbook

Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html

Question on Chapter 11 about the Logistic regression: pd.get_dummies() #324

Closed Junlei-Chen closed 1 year ago

Junlei-Chen commented 1 year ago

First of all, thank you so much for putting together such a great resource!!! Very happy that I came across it.

I have a question/issue about the logistic regression code you used to estimate the propensity score, regarding the common convention of dropping the redundant first dummy:

The issue is in Chapter 11, in the following code:

categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False)  # categorical features converted to dummies
], axis=1)

print(data_with_categ.shape)

It should be:

categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=True)  # categorical features converted to dummies
], axis=1)

print(data_with_categ.shape)

I'm just very curious why we would not drop the first dummy here... Shouldn't it be dropped?

Thank you!

matheusfacure commented 1 year ago

That's a good question. It probably has something to do with the solver. The predictions from a model with linearly dependent columns are the same, yet it somehow converges faster. When I changed the code to your suggestion, it gave me multiple non-convergence warnings.

from sklearn.linear_model import LogisticRegression
import numpy as np

n = 1000

x = np.random.normal(0, 1, size=n)
y = (np.random.normal(x, 1, size=n) > 0)

# unpenalized logistic regression on a single feature
# (penalty="none" here; newer scikit-learn versions spell this penalty=None)
m1 = LogisticRegression(penalty="none")
p1 = m1.fit(x.reshape(-1, 1), y).predict_proba(x.reshape(-1, 1))[:, 1]

# same model, but with the feature duplicated, i.e. perfectly collinear columns
m2 = LogisticRegression(penalty="none")
p2 = m2.fit(np.vstack([x, x]).T, y).predict_proba(np.vstack([x, x]).T)[:, 1]

np.allclose(p1, p2, 1e-7)
>>> True

m1.n_iter_
>>> 8

m2.n_iter_
>>> 6
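
For completeness, here is a minimal sketch (not from the original thread) of the same comparison done directly with pd.get_dummies on a made-up categorical column. The column name, the random data, and penalty=None (which requires scikit-learn >= 1.2) are assumptions for illustration only, so the iteration counts and any convergence warnings will vary from run to run:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
n = 1000

# hypothetical categorical feature with 4 levels and an unrelated binary outcome
df = pd.DataFrame({"school_urbanicity": np.random.choice([0, 1, 2, 3], size=n)})
y = np.random.binomial(1, 0.5, size=n)

# full dummy encoding: the dummy columns sum to 1, so they are linearly dependent with the intercept
X_full = pd.get_dummies(df, columns=["school_urbanicity"], drop_first=False)
# encoding with the first dummy dropped
X_drop = pd.get_dummies(df, columns=["school_urbanicity"], drop_first=True)

# penalty=None means no regularization (older scikit-learn versions used penalty="none")
m_full = LogisticRegression(penalty=None, max_iter=1000).fit(X_full, y)
m_drop = LogisticRegression(penalty=None, max_iter=1000).fit(X_drop, y)

# the fitted probabilities should agree up to numerical tolerance
print(np.abs(m_full.predict_proba(X_full)[:, 1] - m_drop.predict_proba(X_drop)[:, 1]).max())

# but the number of solver iterations can differ between the two encodings
print(m_full.n_iter_, m_drop.n_iter_)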