Closed — Junlei-Chen closed this issue 1 year ago
That's a good question. It probably has to do with the solver. A model with linearly dependent columns gives the same predictions, but it somehow converges faster. When I changed the code as you suggested, it gave me multiple non-convergence warnings.
```python
from sklearn.linear_model import LogisticRegression
import numpy as np

n = 1000
x = np.random.normal(0, 1, size=n)
y = np.random.normal(x, 1, size=n) > 0

# unpenalized fit on a single column
# (in scikit-learn >= 1.2 this is spelled penalty=None)
m1 = LogisticRegression(penalty="none")
p1 = m1.fit(x.reshape(-1, 1), y).predict_proba(x.reshape(-1, 1))[:, 1]

# unpenalized fit on the same column duplicated
m2 = LogisticRegression(penalty="none")
p2 = m2.fit(np.vstack([x, x]).T, y).predict_proba(np.vstack([x, x]).T)[:, 1]

np.allclose(p1, p2, 1e-7)
>>> True
m1.n_iter_
>>> 8
m2.n_iter_
>>> 6
```
First of all, thank you so much for providing such a great resource!!! I'm very happy that I came across it.
I have a question/issue about the logistic regression code you used for the propensity score estimation, concerning the common convention of dropping the redundant first dummy:
There is an issue in chapter 11, in the following paragraph:

```python
categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False)  # categorical features converted to dummies
], axis=1)
print(data_with_categ.shape)
```
It should be:

```python
categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=True)  # categorical features converted to dummies
], axis=1)
print(data_with_categ.shape)
```
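For context, here is a toy sketch of what `drop_first` changes (the data is made up): with `drop_first=False`, the dummies for a category sum to one in every row, which makes them collinear with a model's intercept — the redundancy the convention is meant to remove.

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["a", "b", "c", "a"]})

full = pd.get_dummies(df, columns=["ethnicity"], drop_first=False)
reduced = pd.get_dummies(df, columns=["ethnicity"], drop_first=True)

print(full.columns.tolist())     # ['ethnicity_a', 'ethnicity_b', 'ethnicity_c']
print(reduced.columns.tolist())  # ['ethnicity_b', 'ethnicity_c']

# Every row of the full encoding sums to 1 -> collinear with an intercept.
print(full.sum(axis=1).tolist())
```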
Just very curious about why we would not drop the first dummy here... Shouldn't it be dropped?
Thank you!