py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/
Other
3.81k stars 714 forks source link

About DMLIV new release #388

Open federiconuta opened 3 years ago

federiconuta commented 3 years ago

Hi,

EDITED:
I had issues on store_final and inference. However I tried:

cate = DMLIV(model_Y_X =model_Y_X(), model_T_X =model_T_X(), model_T_XZ =model_T_XZ(),
             model_final=dmliv_model_effect(), featurizer=dmliv_featurizer(),cv=6,
             n_splits= 20)

and seems to work with store_final = cache_values. Further, I saw from the documentation that it can be specifiedinference= 'bootstrap'` when fitting. Two main issues emerge:

1) specifying inference= 'bootstrap' in cate.fit() takes forever to run. Also without specifying it, it takes a lot of time and the following error emerges: The final model has a nonzero intercept for at least one outcome; it will be subtracted, but consider fitting a model without an intercept if possible. I guess this is maybe due to the fact that the panel wrapper provided in previous issues does not exclude the first time dummy and individual dummy when performing:

n_products_per_group = 1
n_months = 47
n_groups = 11
n_products = n_groups * n_products_per_group
# p[i, j] = price of product i on month j
p = price
# p_minus[i, j] = average price of products other than i within the same group on month j
p_minus = med_price_atc
# q[i, j] = demand of product i on month j
q = demand
# X[i, j] = (one-hot-encoding of product i): si fa - 3 perchè sarebbe nprod-1+nmonth-1+group-1
X = np.zeros((n_products, n_months, n_products + n_months + n_groups - 2))

for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            index_i = n_products_per_group*g + i
            X[index_i, j, :(n_products)] = 1.0 * (np.arange(0, n_products) == (index_i))
            #INDICE GRUPPO
            X[index_i, j, (n_products):(n_products + n_groups - 1)] =  1.0 * (np.arange(1, n_groups) == g)
            #INDICE QUARTER
            X[index_i, j, (n_products + n_groups - 1):] = 1.0 * (np.arange(1, n_months) == j)

Could you please confirm that?

2) is it possible to use inference='bootstrap' with cate.refit_final() ?

3) I cannot import SubsetWrapper, where can I get it from using the latest release?

Thank you very much!

kbattocchi commented 3 years ago

A few notes.

  1. cv has replaced n_splits, so don't specify both (you should get a warning about this), right now I believe your n_splits setting is overriding cv, so you're using 20 folds, which seems excessive (it means that the three nuisance models will each be fit 20 times).
  2. The non-zero final model intercept indicates that your final model is not configured correctly; your final model should always be linear with no intercept. This is not related to the panel wrapper because that only affects the first stage models.
  3. DMLIV supports more efficient inference as long as you use a compatible final model (which you can enable with fit(..., inference='auto') or refit(..., inference='auto'), so I'd recommend that instead of using bootstrap in any case; besides that I think that bootstrap doesn't currently work correctly with refitting but even if it did you should probably prefer the supplied 'auto' inference.
  4. I believe that SubsetWrapper was never part of our supported API, but I don't think it has moved, it should still be in the prototypes directory.
federiconuta commented 3 years ago

I see and corrected all as you kindly suggested. Just when refitting as follows:

lasso_dmliv_model_effect = lambda: SubsetWrapper(LassoCV(cv=[(fold00, fold11), (fold11, fold00)],n_alphas = 5000,fit_intercept=False,max_iter=10000),
                                               feature_inds)
refitted = cate.refit_final(lasso_dmliv_model_effect(), dmliv_featurizer(), inference= 'auto')

the following error appears:

refit_final() got multiple values for argument 'inference'

My guess is that somewhere inference is set and inference='auto' overrode it somehow. The I tried redefining explicitly cate.model_final:

cate.model_final = lambda: SubsetWrapper(LassoCV(cv=[(fold00, fold11), (fold11, fold00)],n_alphas = 5000,fit_intercept=False,max_iter=10000),
                                               feature_inds)
refitted = cate.refit_final(inference ='auto') 

which however raises an error of missing fit function. If I then define:

cate.model_final=SubsetWrapper(LassoCV(cv=[(fold00, fold11), (fold11, fold00)],n_alphas = 5000,fit_intercept=False,max_iter=10000),
                                               feature_inds)

the following error message appears:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 11 is different from 78)

The main trade-off I am facing is: I can use the elder version for refit_final() but this dos not allow for inference. The new version allows for inference but seems to be unable to support SubsetWrapper. My guess is that refitting the final on a subgroup of Xs causes problems of dimensionality. Is there a way to overcome this? (i.e. allowing for SubsetWrapper AND inference)? Comparing the two .py files it seems that the elder refit_final has a part:

if not self.stored_final_data:
            raise AttributeError(
                "Estimator is not yet fit with store_data=True")
        self.model_effect.model_effect = model_effect
        self.model_effect.featurizer = featurizer
        self.model_effect.fit(self.res_y, self.res_t, self.X)
        return self

that the recent version does not have.

kbattocchi commented 3 years ago

The correct way to call refit is to reassign the final model first, as you are doing in your second/third examples.

Are you able to fit (rather than refit) with that final model? I'd expect refit to work with any model that works for fit, but it's possible you've got some sort of indexing issue that would cause problems with both.

federiconuta commented 3 years ago

Are you able to fit (rather than refit) with that final model? I'd expect refit to work with any model that works for fit, but it's possible you've got some sort of indexing issue that would cause problems with both.

Thank you for the reply. So I am able to use .fit if I do not call lambda, i.e.:

cate.model_final=SubsetWrapper(LassoCV(cv=[(fold00, fold11), (fold11, fold00)],n_alphas = 5000,fit_intercept=False,max_iter=10000),
                                               feature_inds)
cate.model_final.fit() #can be called

The same error:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 11 is different from 78)

appears if I call cate.fit(Y, T[:,0], X=XW, Z=Z, cache_values=True, inference='auto') after performing the change of final model which however is not what I would like to do.

however as far as I could see the problems for me are two:

kbattocchi commented 3 years ago

Sorry, I meant your second test, where you call cate.fit(Y=..., T=..., X=..., etc.), not cate.final_model.fit. This indicates that your final model isn't set up correctly, because refit and fit both do the same thing to fit the final model, it's just that refit reuses the first stage nuisances instead of retraining the first stage models and recomputing them.

It's possible that it only appeared to work correctly in the past; if calling cate.fit gives you that error then there is a fundamental problem; you could also try passing that model to the DMLIV initializer instead of setting it after the fact, and I think you'll see the same behavior, which again indicates that the model is not configured properly for your data.

federiconuta commented 3 years ago

@kbattocchi Ok. Just to be clearer on the sequentiality:

1) first I call cate.fit(Y, T[:,0], X=XW, Z=Z, cache_values=True, inference='auto') which works perfectly fine; 2) I define cate.model_final=SubsetWrapper(...) which should take a subset of XW (i.e. first 10 columns since the indices are selected like that:

colonne = ["A_"+str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=colonne)

subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model

feature_inds = np.argwhere([(x in subset_names) for x in XW_db.columns.values]).flatten()

);

3) Finally call cate.refit_final(inference='auto') which arises the dimensionality error.

This is why I suspect that there is a "reshape" somewhere which is not accepted. In any case the result of print(Y.shape, T[:,0].shape, XW.shape, Z.shape) seems to me correct: (517,) (517,) (517, 76) (517, 1) isn't it?

Maybe a possible solution (I don't know if it does what I want, i.e. refit the final with just the first ten columns of XW), is taking XW[:,:10] directly but then I don't know how I could refit only the final... The point is that if I do not refit the final basically every X having as first dimension (517,) is accepted and correctly fit.

kbattocchi commented 3 years ago

My question is what happens when you use SubsetWrapper as the initial final model (what you specify in the DMLIV initializer) - does the very first call to fit still work?

My guess is that it won't, in which case it's unsurprising that it also doesn't work when refitting.

federiconuta commented 3 years ago

@kbattocchi This is basically the experiment which I made first. What I was saying is that if I use SubsetWrapper as the initial final model, fit with that model does work which is the same as doing:

cate.fit(Y, T[:,0], X=XW[:,:11], Z=Z, cache_values=True, inference='auto')

where now X is a subset of XW. The initial fit does always work if I do not change the cate.model_final. To directly answer your question, following the steps:

1) define lasso_model_final = lambda: SubsetWrapper(LassoCV(cv=[(fold00, fold11), (fold11, fold00)],n_alphas = 5000,fit_intercept=False,max_iter=10000), feature_inds) 2) define cate as:

cate = DMLIV(model_Y_X= model_Y_X(), model_T_X = model_T_X(), model_T_XZ=model_T_XZ(), model_final= lasso_model_final(), featurizer=dmliv_featurizer(),
             cv=6, random_state=123)

3) making then: cate.fit(Y, T[:,0], X=XW, Z=Z, cache_values=True, inference='auto')

the following error raises:

All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<dml_iv.utilities.SubsetWrapper object at 0x7fea41ecc9a0>' (type <class 'dml_iv.utilities.SubsetWrapper'>) doesn't

I am not understanding actually what is going wrongs since the arrays that I am using are the ones in the panel wrapper basically:

price = df_amox_est.ln_price.to_numpy()
price = np.split(price, 11) # it changes the np array to list of sub arrays.
price = np.stack(price, axis=0 ) #conversion to array
min_price_atc = df_amox_est.ln_priceatcmin.to_numpy() 
min_price_atc = np.split(min_price_atc, 11) # it changes the np array to list of sub arrays.
min_price_atc = np.stack(min_price_atc, axis=0 ) #conversion to array
med_price_atc = df_amox_est.ln_priceatcmedian.to_numpy()
med_price_atc = np.split(med_price_atc, 11) # it changes the np array to list of sub arrays.
med_price_atc = np.stack(med_price_atc, axis=0 ) #conversion to array 
demand = df_amox_est.ln_stdu.to_numpy()
demand = np.split(demand, 11)
demand = np.stack(demand, axis=0 )

# Defining some covariate for each molecule in each country:

nprod = df_amox_est.ln_nprod.to_numpy() 
nprod = np.split(nprod, 11)
nprod = np.stack(nprod, axis=0) 
nfirm = df_amox_est.ln_nfirm.to_numpy() 
nfirm = np.split(nfirm, 11)
nfirm = np.stack(nfirm, axis=0 ) 
natc = df_amox_est.natc.to_numpy() 
natc = np.split(natc, 11)
natc = np.stack(natc, axis=0) 
avg_price = df_amox_est.avg_price.to_numpy()  
avg_price = np.split(avg_price, 11)
avg_price = np.stack(avg_price, axis=0 ) 
min_price = df_amox_est.min_price.to_numpy()  
min_price = np.split(min_price, 11)
min_price = np.stack(min_price, axis=0 ) 
share_generics = df_amox_est.genericshare.to_numpy()  
share_generics = np.split(share_generics, 11)
share_generics = np.stack(share_generics, axis=0 ) 
inflowratio = df_amox_est.inflowrate.to_numpy()  
inflowratio = np.split(inflowratio, 11)
inflowratio = np.stack(inflowratio, axis=0 ) 
outflowratio = df_amox_est.outflowrate.to_numpy()  
outflowratio = np.split(outflowratio, 11)
outflowratio = np.stack(outflowratio, axis=0 )

 #########SECOND BLOCK 

from econml.dml import LinearDML
from sklearn.preprocessing import PolynomialFeatures
from econml.inference import StatsModelsInference
n_products_per_group = 1
n_months = 47
n_groups = 11
n_products = n_groups * n_products_per_group
# p[i, j] = price of product i on month j
p = price
# p_minus[i, j] = average price of products other than i within the same group on month j
p_minus = med_price_atc
# q[i, j] = demand of product i on month j
q = demand
# X[i, j] = (one-hot-encoding of product i): si fa - 3 perchè sarebbe nprod-1+nmonth-1+group-1
X = np.zeros((n_products, n_months, n_products + n_months + n_groups - 2))

for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            index_i = n_products_per_group*g + i
            X[index_i, j, :(n_products)] = 1.0 * (np.arange(0, n_products) == (index_i))
            #INDICE GRUPPO
            X[index_i, j, (n_products):(n_products + n_groups - 1)] =  1.0 * (np.arange(1, n_groups) == g)
            #INDICE QUARTER
            X[index_i, j, (n_products + n_groups - 1):] = 1.0 * (np.arange(1, n_months) == j)

########### THIRD BLOCK
# Creating the inputs to the DML method
Y = q.flatten()
T = np.hstack([p.reshape((n_products*n_months, 1)), p_minus.reshape((n_products*n_months, 1))])
X = X.reshape((n_products*n_months, n_products + n_months + n_groups - 2))
W = np.hstack([inflowratio.reshape((n_products*n_months, 1)),outflowratio.reshape((n_products*n_months, 1)), nprod.reshape((n_products*n_months, 1)), nfirm.reshape((n_products*n_months, 1)),natc.reshape((n_products*n_months, 1)),avg_price.reshape((n_products*n_months, 1)),min_price.reshape((n_products*n_months, 1)),share_generics.reshape((n_products*n_months, 1))])

I the define the folds as in panel wrapper and:

dmliv_featurizer = lambda: PolynomialFeatures(degree=1, include_bias=True) #lascio la costante
dmliv_model_effect = lambda: LassoCV(cv=[(fold00, fold11), (fold11, fold00)],n_alphas = 5000, fit_intercept=False, max_iter=10000)
cate = DMLIV(model_Y_X= model_Y_X(), model_T_X = model_T_X(), model_T_XZ=model_T_XZ(), model_final=dmliv_model_effect(), featurizer=dmliv_featurizer(),
             cv=6, random_state=123)

the subset of the feature is just XW put into a pandas array with the first then columns selected as above:

colonne = ["A_"+str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=colonne)

subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model

feature_inds = np.argwhere([(x in subset_names) for x in XW_db.columns.values]).flatten()
federiconuta commented 3 years ago

@kbattocchi maybe the point is that I am creating X as an array while SubsetWrapper takes as argument a feature_inds coming from a pandas database? My X is XW which is defined as XW = hstack([X[:, :(n_products)], W]) What I would like to do is imply refit the final model on just X[:, :(n_products)] which I am doing by defining XW as a database and taking the first n_products columns from it. The point is that I do not understand what SubsetWrapper is going to take when I specify the feature_indices. Yet my original X term is an array, XW which I translated into pandas just for sake of simplicity. But I don't know if SubsetWrapper is going to take the subset of values of XW or what else...I am really confused about it sorry.

Is that correct? I report here the dimensions of my cate.fit() arguments which according to the help are compatible:

print(Y.shape, T[:,0].shape, XW.shape, Z.shape)

(517,) (517,) (517, 76) (517, 1)