py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Bootstrap #92

Closed federiconuta closed 5 years ago

federiconuta commented 5 years ago

Dear all,

thanks again for this wonderful ML tool. I would like to understand whether, in my case, it is possible to obtain t-statistics by exploiting the bootstrap algorithm you provide. In particular, I have employed both

est = DMLCateEstimator(model_y=MultiTaskElasticNet(alpha=0.1), model_t=MultiTaskElasticNet(alpha=0.1))

and

est = DMLCateEstimator(model_y=MultiTaskElasticNet(alpha=0.1), model_t=MultiTaskElasticNet(alpha=0.1),
                       model_final=Lasso(alpha=0.1), featurizer=PolynomialFeatures(degree=1), random_state=123)

obtaining estimates for my databases (15 in total). What I would like to do is compute SEs or t-statistics directly by exploiting the bootstrap method you provide. My attempt was the following, but I end up making a mess (basically I created a Y, a T and an X suitable for the method and simply fed them into boot_est):

for i in range(1, 16):
    df_part_q['df_part_q_' + str(i)] = pd.DataFrame()
    df_part_p['df_part_p_' + str(i)] = pd.DataFrame()
    Y['Y_' + str(i)] = []
    T['T_' + str(i)] = []
    X['X_' + str(i)] = []

for j in range(1, 16):
    df_part_q['df_part_q_' + str(j)] = pd.concat([quantities_unstacked['quantities_unstacked' + str(j)].T,
                                                  quantities_unstacked_othertatc1['quantities_unstacked_otheratc1_' + str(j)].T], axis=1)
    df_part_p['df_part_p_' + str(j)] = pd.concat([prices_unstacked['prices_unstacked' + str(j)].T,
                                                  prices_unstacked_othertatc1['prices_unstacked_otheratc1_' + str(j)].T], axis=1)
    df_part_r['df_part_r_' + str(j)] = db['db_' + str(j)]['Recalls rest product']

for j in range(1, 16):
    Y['Y_' + str(j)].append(df_part_q['df_part_q_' + str(j)].values)
    T['T_' + str(j)].append(df_part_p['df_part_p_' + str(j)].values)
    X['X_' + str(j)].append(df_part_r['df_part_r_' + str(j)].values)

from econml.bootstrap import BootstrapEstimator

boot_est = BootstrapEstimator(DMLCateEstimator(model_y=MultiTaskElasticNetCV(cv=3),
                                               model_t=MultiTaskElasticNetCV(cv=3)),
                              n_bootstrap_samples=20)

boot_estimate = {}
for j in range(1, 16):
    boot_estimate['boot_est' + str(j)] = []
for j in range(1, 16):
    boot_estimate['boot_est' + str(j)].append(boot_est.fit(Y['Y_' + str(j)][0], T['T_' + str(j)][0].reshape(-1, 1)))

te_pred_interval = {}
for j in range(1, 16):
    te_pred_interval['te_pred_interval' + str(j)] = []

for j in range(1, 16):
    te_pred_interval['te_pred_interval' + str(j)].append(boot_estimate['boot_est' + str(j)][0].const_marginal_effect())

Can you please help me with this? :)

Thank you,

Federico

vasilismsr commented 5 years ago

Hi Federico,

We currently only have support for confidence intervals. Interestingly, we were just discussing that we also need some interface for p-values and t-statistics, as that would also be of interest to users. Most probably we will add such functionality in the next sprint (in a few months).

For now, in terms of p-values, what I would do is the following hack: compute confidence intervals at multiple confidence levels (i.e. alpha = .1, .01, .001, .0001, with lower=alpha/2 and upper=1-alpha/2) until the confidence interval contains zero. The last alpha for which the confidence interval did not include zero is a good proxy of the p-value for the quantity you are investigating.
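
A minimal sketch of that alpha sweep, assuming a helper ci(alpha) that returns a pair of equally shaped arrays (lower, upper) for the quantities of interest, for example by calling the bootstrap interval method with the corresponding lower/upper bounds (the helper name and the alpha grid are illustrative, not part of the EconML API):

import numpy as np

def approx_p_values(ci, alphas=(0.1, 0.01, 0.001, 0.0001)):
    # ci(alpha) must return (lower, upper) arrays of the same shape.
    lower, upper = ci(max(alphas))
    p_val = np.ones_like(lower, dtype=float)   # 1.0 means "the interval never excluded zero"
    # Sweep alpha from largest to smallest; record the last alpha whose interval excluded zero.
    for alpha in sorted(alphas, reverse=True):
        lower, upper = ci(alpha)
        excludes_zero = (lower > 0) | (upper < 0)
        p_val[excludes_zero] = alpha
    return p_val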

federiconuta commented 5 years ago

Many thanks for the reply.

So the fact is that, I don't know why, but when I compute:

for j in range(1, 16):
    te_pred_interval['te_pred_interval' + str(j)].append(boot_estimate['boot_est' + str(j)][0].const_marginal_effect_interval())

The code returns 120 estimates for each DB, which is not what I want. I would instead like N×M estimates, where N×M are the dimensions of the pandas DataFrame.

Am I doing something wrong?

federiconuta commented 5 years ago

P.S.: the estimates are contained in a dictionary of databases that I called a_hat. Basically it holds 15 N×M matrices of estimates, one matrix for each DB. What I would like are the p-values for each one of these estimates :)

vasilismsr commented 5 years ago

Some more detail on what y, T and X are in your application would be great. The code is hard to read.

If your numpy arrays are:

y : (n, q)
T : (n, d)
X : (n, p)
X_test : (m, p)

then running something like:

est = BootstrapEstimator(DMLCateEstimator()).fit(y, T, X)
lower, upper = est.const_marginal_effect_interval(X_test, lower=alpha/2, upper=1-alpha/2)

will return two tensors:

lower : (m, q, d)
upper : (m, q, d)

where lower[t][i][j] is the lower bound of the alpha-confidence interval for the effect of treatment T[j] on outcome Y[i], conditional on the features X being equal to X_test[t].
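
A minimal synthetic sketch of that call pattern, following the usage shown in this thread (old econml API; the dimensions and nuisance models below are purely illustrative, and the confidence level can be adjusted through the interval call as described above):

import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV
from econml.dml import DMLCateEstimator
from econml.bootstrap import BootstrapEstimator

n, m, q, d, p = 200, 10, 2, 2, 3   # samples, test points, outcomes, treatments, features
y = np.random.normal(size=(n, q))
T = np.random.normal(size=(n, d))
X = np.random.normal(size=(n, p))
X_test = np.random.normal(size=(m, p))

est = BootstrapEstimator(DMLCateEstimator(model_y=MultiTaskElasticNetCV(cv=3),
                                          model_t=MultiTaskElasticNetCV(cv=3)),
                         n_bootstrap_samples=20)
est.fit(y, T, X)
# Both bounds have shape (m, q, d): one interval per test point, outcome and treatment.
lower, upper = est.const_marginal_effect_interval(X_test)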

What I was saying is that from this you can also get an estimate of a p-value, by varying alpha and finding, for each quantity of interest, the smallest alpha at which the interval does not contain zero.

federiconuta commented 5 years ago

Yes, sorry. So, let's reason with a single DB for the sake of simplicity. My DB takes the form N x M, where M are 12 years (from 2004 to 2015) and N are the log quantities of each product in the case of Y, and the log prices in the case of T. In particular, let's say I am working on DB number 1, which has 1340 products over 12 time periods. I will hence have: Y['Y_1'], which contains 1340 arrays of length 12 each (representing the log of quantity); T['T_1'] again contains 1340 arrays of length 12 each, this time holding the log of price. What I have done is basically est.fit(Y['Y_1'][0], T['T_1'][0]), obtaining estimates that I arranged into a 1340 x 1340 matrix of coefficients, which is stored in a_hat['a_hat1'].

Now, what I would like to do is to find the bootstrap estimates of the p-value (or better, the standard error) for each one of the 1340 x 1340 estimates. My idea was simply to follow the example of bootstrap estimates in the tutorial https://github.com/microsoft/EconML/blob/master/notebooks/Double%20Machine%20Learning%20Examples.ipynb you provided on GitHub. But I realized that it does not work for me, so I was wondering whether I misread something in the help :)

federiconuta commented 5 years ago

P.S.: in particular, Y['Y_1'] and T['T_1'] have the same dimensions (12 x 1340). X_test in my case is not present, if I have understood its meaning correctly (I mean, I do not split the database into a test set and a training set).

federiconuta commented 5 years ago

UPDATE: I am also trying to run more or less the following:

est = BootstrapEstimator(DMLCateEstimator(model_y=MultiTaskElasticNet(), model_t=MultiTaskElasticNet(),
                                          model_final=lasso)).fit(y, T, X)

The problem is that it does not seem to converge at all (it has been running for about 4 hours now)... maybe I have misunderstood something, but I cannot find anything in the help section. Sorry, and many thanks again!

vasilismsr commented 5 years ago

Your modelling approach seems to me to have the problem of too few samples. You are treating each product as a completely separate entity and estimating a 1000x1000 matrix of parameters with only 12 samples. That seems to me a statistically infeasible task. You would need a huge amount of penalization. Moreover, it might be that the number of parameters you are estimating is so large that the optimization done by scikit-learn when fitting all the lassos and elastic nets takes far too long and most probably does not converge within the max_iter number of gradient-descent iterations, due to the large number of variables to optimize over.

Other than that, you seem to be calling our library the correct way. You should be getting lower and upper confidence bounds for each of the 1 million parameters, though I suspect that given your large number of parameters and small sample size the results will not be meaningful. This seems more like a modelling and data problem that I don't think we can really help you with, as it's not really an SDK problem. Some different approaches to try (but I cannot help more on this):

  1. Featurize your products and don't treat them as individuals.
  2. Segment your products into small categories with few products (2 to 4) and then run our library for each category. This will estimate (4 to 16) parameters for each category. Even here you could decrease the number of parameters further by re-defining the treatments to be: price of the product, and average price of the other products in the category. Then you have only 2 treatments. Then create an X vector which corresponds to the one-hot encoding of the id of the product, and estimate a heterogeneous treatment effect as a function of X. This will estimate the same-price and group-cross-price elasticity of each product. These will be n_products*2 numbers, i.e. 4-8. Moreover, here you will be creating one sample per product: as opposed to having only 12 samples that contain all the product demands and prices, you will create one sample per product that contains its demand, its price and the average price of the other products. (I now recall I already described a solution like that in a prior communication.)
  3. Use FistaRegressor (nuclear norm penalization) as your final stage, which will enforce the 1000x1000 matrix to be low rank.

For all these you can still use the bootstrap estimator to get confidence intervals for the estimated elasticities.

federiconuta commented 5 years ago

Perfect, you have been very clear. Thank you for the patient replies and, again, for the very well-built algorithm!

Federico

federiconuta commented 5 years ago

One last clarification, please. My data are panels (hence observations repeated over 12 time periods). What I would like to do is to split the products into chunks of 2 to 4 products each, within an N x 12 database where N is the number of products and 12 are the years in which they are observed. Is the procedure still correct? I guess so, but I am not 100% sure.

vasilismsr commented 5 years ago

I would do it more like this: split the products into chunks, and for each chunk create a database with N*12 entries, which contains one row for each product i and month j. Say this is row k=(i,j). Then the database contains: 1) Y_k is the demand of product i in month j, so Y is an (N*12,) vector; 2) T_k[0] is the price of product i in month j and T_k[1] is the average price of all other products in month j, so the treatment is an (N*12) x 2 matrix; 3) X_k is the one-hot encoding of the id i of the product, i.e. X has dimension (N*12) x (n_products) and X_k[t]=1{i==t}. If you have other features for the products, you can include them as extra columns in X. If you don't think those other features create heterogeneity in the demand elasticity but you want to control for them, then include them in W; otherwise W=None.

Then call double ML with the numpy Y, T, X, W.

Then, to get the own-price elasticity and group-cross-price elasticity of, say, product i, call: est.const_marginal_effect(Xi), where Xi is a 1 x (n_products) array that contains the one-hot encoding of the id of product i, i.e. Xi[0, i]=1 and Xi[0, t]=0 for t != i.

Also, if you want to calculate the change in demand between two treatment points (i.e. T0=np.array([[p0, gp0]]) and T1=np.array([[p1, gp1]]), where p0 is some base own price of product i and gp0 is the average price of the other products, and similarly p1, gp1 are a target own price and target average other price), then you can also call: est.effect(Xi, T0=T0, T1=T1)
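
A minimal sketch of those two calls, assuming est is a DMLCateEstimator already fitted on the flattened Y, T, X described above (the chunk size, product index and prices below are purely illustrative):

import numpy as np

n_products = 4                      # assumed chunk size
i = 2                               # product whose elasticities we want

# One-hot encoding of product i, shaped 1 x n_products as described above.
Xi = np.zeros((1, n_products))
Xi[0, i] = 1

# Own-price and group-cross-price elasticities of product i.
elasticities = est.const_marginal_effect(Xi)

# Change in demand when moving from base prices (p0, gp0) to target prices (p1, gp1).
T0 = np.array([[1.0, 1.0]])         # base own price and base average price of the other products
T1 = np.array([[1.2, 1.0]])         # target own price, same average other price
delta_demand = est.effect(Xi, T0=T0, T1=T1)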

vasilismsr commented 5 years ago

Also, if you really have only 2 products in each category, then you could alternatively do it the way you are currently doing it, where you estimate a whole 2x2 matrix of cross-price elasticities. To do that, proceed as you currently do, with a database of 12 x 2 entries, where each row contains: 1) Y_i[0], the demand of product 0, and Y_i[1], the demand of product 1; 2) T_i[0], the price of product 0, and T_i[1], the price of product 1; 3) X=None, W=(None or any features you want to control for).

Then est.const_marginal_effect() will return a 2x2 matrix of demand elasticities. You might still want to add some small penalty in the lasso.

vasilismsr commented 5 years ago

One other small thing to be cautious about with a panel would be correlation in the residual errors (e.g. if demand shocks are correlated within each product). You might want to check out this paper: https://arxiv.org/pdf/1909.03489.pdf. This way of sample splitting is not yet implemented in the package; maybe at some point. (This problem applies to the first approach, with the flattened database with N*12 rows.)

This fact might have a small effect on the validity of the bootstrap confidence intervals, but at least this way you will be getting more reasonable point estimates.

federiconuta commented 5 years ago

Wow, that's very exhaustive!!! Many thanks. I will let you know about the progress of my work!

Federico

federiconuta commented 5 years ago

X_k is the one-hot encoding of the id i of the product, i.e. X_k has dimension (N*12) x (n_products) and X_k[t]=1{i==t}

N is the size of the chunk, right? Hence, for each product, let's say with chunk_size = 4, we will have a 4*12 matrix taking 1 just in correspondence of the product (e.g. if I have chunks of size 4: X_k[1] will be a 4*12 matrix with ones in the first column, i.e. the first product for 12 years, and 0 otherwise; X_k[2] the same as X_k[1] but with the second column of ones, and so on for each product independently of the chunks... or better, depending just on the size of the chunks). Thus, I can create X separately, right?

federiconuta commented 5 years ago

I mean, as far as I have understood, the chunks are kind of overlapping. Therefore: if chunk 1 has products 1, 2, 3, 4 observed over 12 time periods (hence a 4*12 matrix):

For product i = 1 I compute the average of price_2, price_3 and price_4, hence I will have: Y_k = demand of prod 1; T_k[0] is the price of prod 1; T_k[1] is the average price of prods 2, 3 and 4. For product i = 2 I compute the average of price_1, price_3 and price_4, hence I will have: Y_k = demand of prod 2; T_k[0] is the price of prod 2; T_k[1] is the average price of prods 1, 3 and 4. ... For prod 25 I compute the average of the prices of the products within the chunk containing prod 25, and hence I will have: Y_k = demand of prod 25; T_k[0] is the price of prod 25; T_k[1] is the average price of the products belonging to its chunk?

federiconuta commented 5 years ago

OK, I have tried out the code... but it raises an AssertionError because the shapes of Y, X and T are not the same... I mean, I don't think that by definition they could be, right? Y is (N*12,); T is (N*12) x 2 and X is (N*12) x (n_products). The assertion error states that if W is empty, then: shape(Y)[0] == shape(T)[0] == shape(X)[0].

vasilismsr commented 5 years ago

Here is an example piece of code that generates fake data for a given chunk with n_products and n_months and then creates the Y, T, X that are needed by our SDK. I don't think I can help beyond that. The first dimension of all these matrices has to be the same and equal to the number of "samples". See the code below. I will need to close this issue as it has gone beyond the use of the bootstrap. If you have more questions related to applying our method to pricing, please open a new issue and we can try to help if possible, though this is beyond the purpose of the issue tracker. We will eventually add a case-study notebook particular to pricing and panels, and many of the topics discussed here might then become clearer. We might also create a panel-data wrapper in the future that would facilitate the interface of DML with panel data, but there is no timeline on when this will happen.

import numpy as np
from econml.dml import LinearDMLCateEstimator
from sklearn.preprocessing import PolynomialFeatures
from econml.inference import StatsModelsInference
n_products = 3
n_months = 100
# p[i, j] = price of product i on month j
p = np.random.uniform(0, 1, size=(n_products, n_months))
# p_minus[i, j] = average price of products other than i on month j
p_minus = np.zeros(p.shape)
# q[i, j] = demand of product i on month j
q = np.zeros(p.shape)
# X[i, j] = (one-hot encoding of product i) excluding product 0, which is used as a baseline
X = np.zeros((n_products, n_months, n_products - 1))
for i in np.arange(n_products):
    for j in np.arange(n_months):
        p_minus[i, j] = np.mean(p[np.arange(n_products) != i, j])
        q[i, j] = - i * p[i, j] - .5 * i * p_minus[i, j] + np.random.normal(0, .1)
        X[i, j, :] = 1.0 * (np.arange(1, n_products) == i)

# Creating the inputs to the DML method
Y = q.flatten()
T = np.hstack([p.reshape((n_products*n_months, 1)), p_minus.reshape((n_products*n_months, 1))])
X = X.reshape((n_products*n_months, n_products - 1))

from sklearn.linear_model import MultiTaskLassoCV, LassoCV, LinearRegression
est = LinearDMLCateEstimator(model_y = LassoCV(cv=3),
                             model_t = MultiTaskLassoCV(cv=3))
est.fit(Y, T, X, None)
est.const_marginal_effect(np.eye(n_products-1))

if you pull the branch: kebatt/refactorDML then you can also use the following in the last lines:

est.fit(Y, T, X, None, inference='statsmodels')
est.const_marginal_effect_interval(np.eye(n_products-1), alpha=.05)

this will provide 95% confidence intervals using asymptotic normality and not bootstrap, which should be faster and more reliable.

Also if you think you might have time fixed effects you might want to add the indicator of the month in your W variables and control for that. However, for these effects to be adequately identifiable you might want to clump several groups of products together into a single database and call the method with all these groups. The code below creates such a database and calls the DMLCateEstimator.

import numpy as np
from econml.dml import LinearDMLCateEstimator
from sklearn.preprocessing import PolynomialFeatures
from econml.inference import StatsModelsInference
n_products_per_group = 3
n_months = 24
n_groups = 20
n_products = n_groups * n_products_per_group
# p[i, j] = price of product i on month j
p = np.random.uniform(0, 1, size=(n_products, n_months))
# p_minus[i, j] = average price of products other than i within the same group on month j
p_minus = np.zeros(p.shape)
# q[i, j] = demand of product i on month j
q = np.zeros(p.shape)
# X[i, j] = (one-hot encoding of product i) excluding product 0, which is used as a baseline
X = np.zeros((n_products, n_months, n_products + n_months - 2))

# Adding some time effects on the price of each product
for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            index_i = n_products_per_group*g + i
            p[index_i, j] += 1.0*(j==1)

# Creating fake demands
for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            minus_i = (np.arange(n_products) >= (n_products_per_group*g))
            minus_i &= (np.arange(n_products) < (n_products_per_group*(g+1)))
            minus_i &= (np.arange(n_products) != (n_products_per_group*g + i))
            index_i = n_products_per_group*g + i
            p_minus[index_i, j] = np.mean(p[minus_i, j])
            q[index_i, j] = - i * p[index_i, j] - .5 * i * p_minus[index_i, j] + np.random.normal(0, .1)
            # adding some time fixed effect on the demand
            q[index_i, j] += 1.0 * (j==1)
            X[index_i, j, :(n_products-1)] = 1.0 * (np.arange(1, n_products) == (index_i))
            # adding the one-hot encoding of the month as part of the features of the observation
            X[index_i, j, (n_products-1):] = 1.0 * (np.arange(1, n_months) == j)

# Creating the inputs to the DML method
Y = q.flatten()
T = np.hstack([p.reshape((n_products*n_months, 1)), p_minus.reshape((n_products*n_months, 1))])
X = X.reshape((n_products*n_months, n_products + n_months - 2))

from sklearn.linear_model import MultiTaskLassoCV, LassoCV, LinearRegression
# Set linear_first_stages to false because this creates extra expanded features of the form X cross W 
# that slows code down. Effect on correctness is small. For full correctness it must be set to True
# if linear models are used as first stage models. You can set to False if for instance Random Forests
# are used as first stages.
est = LinearDMLCateEstimator(model_y = LassoCV(cv=3),
                             model_t = MultiTaskLassoCV(cv=3),
                             linear_first_stages=False)
# Using the one-hot encodings of the products as X and the one-hot encodings of the month as W
# to control for time fixed effects
est.fit(Y, T, X[:, :(n_products-1)], X[:, (n_products-1):])
est.const_marginal_effect(np.eye(n_products-1))

Again, if you pull the branch: kebatt/refactorDML then you can also use the following in the last lines:

est.fit(Y, T, X[:, :(n_products-1)], X[:, (n_products-1):], inference='statsmodels')
est.const_marginal_effect_interval(np.eye(n_products-1), alpha=.05)

federiconuta commented 5 years ago

I will need to close this issue as it has gone beyond the use of the bootstrap. If you have more questions relating to applying our method to pricing please open a new issue and we can try to help if possible, though this is beyond the purpose of the issue tracker. We will eventually add some case study notebook particular to pricing an panels and many of these topics discussed here might become more clear. We might also create some Panel data wrapper in the future that might facilitate the interface of dml with panel data. But no timeline on when this will happen.

Yes I understand. You have already done a lot!

Thank you very much!!

vasilismsr commented 5 years ago

Created enhancement issues #94 and #95 in response to this discussion.

vasilismsr commented 5 years ago

To finalize this: I also forgot the code for the other solution. If your groups have a small size (e.g. 3 products), then you could also go with your original solution, where you have multiple treatments and multiple outcomes. This might actually be a more stable solution. Here is some example code:

import numpy as np
from econml.dml import LinearDMLCateEstimator
from sklearn.linear_model import MultiTaskLasso
from sklearn.preprocessing import PolynomialFeatures
from econml.inference import StatsModelsInference
n_products = 3
n_months = 12
# p[i, j] = price of product i on month j
p = np.random.uniform(0, 1, size=(n_products, n_months))
# p_minus[i, j] = average price of products other than i on month j
p_minus = np.zeros(p.shape)
# q[i, j] = demand of product i on month j
q = np.zeros(p.shape)

for i in np.arange(n_products):
    for j in np.arange(n_months):
        p_minus[i, j] = np.mean(p[np.arange(n_products) != i, j])
        q[i, j] = - i * p[i, j] - .5 * i * p_minus[i, j] + np.random.normal(0, .1)

est = LinearDMLCateEstimator(model_y = MultiTaskLasso(alpha=0.01),
                             model_t = MultiTaskLasso(alpha=0.01),
                             n_splits = 2,
                             linear_first_stages=False)
est.fit(q.T, p.T, inference='statsmodels')
with np.printoptions(precision=3, suppress=True):
    p = est.const_marginal_effect()[0]
    l, u = est.const_marginal_effect_interval()
    arr = np.stack((p, l[0],u[0]), axis=-1)
    print(np.array(["{:.2f} ({:.2f}, {:.2f})".format(arr[i,j,0], arr[i,j,1], arr[i,j,2])
                    for i in range(arr.shape[0]) for j in range(arr.shape[1])]).reshape(l[0].shape))

Remove the inference='statsmodels' if you are using the pip version of the sdk and not the branch kebatt/refactorDML. You can use the bootstrap wrapper instead as you've been doing already. The above will print results of the form:

[['0.01 (-0.11, 0.13)' '-0.13 (-0.41, 0.15)' '-0.07 (-0.38, 0.25)']
 ['-0.23 (-0.37, -0.09)' '-1.07 (-1.23, -0.90)' '-0.11 (-0.34, 0.13)']
 ['-0.43 (-0.68, -0.19)' '-0.50 (-0.69, -0.30)' '-1.94 (-2.20, -1.68)']]

which is the point estimate and the upper and lower confidence bounds for the cross-price elasticity matrix.

federiconuta commented 5 years ago

Thank you very much. By the way, I don't know if another issue should be raised, but there are also problems (at least for me) with importing LinearCateEstimator. I have already pip installed econml.

vasilismsr commented 5 years ago

Yes, sorry, this must be due to the refactoring that we are doing. I am currently using the kebatt/refactorDML branch of our code. If you want to run my code snippets above verbatim, you would need to clone that branch of our SDK, then go to the folder and install the package from there using: python setup.py develop. This will install that branch of the code and will replace the econml installed by pip install.

If you want to use the pip install version, you can just replace the LinearDMLCateEstimator in my code with DMLCateEstimator. The branch I am working off will soon be part of an updated version of the pip install package.

federiconuta commented 5 years ago

Thank you again!!!! Have a nice week :)

federiconuta commented 5 years ago

OK, not to bother you further... but I actually tried out the code and it crashes. It takes a lot of time (it is the first one you proposed, without controlling for time fixed effects and adopting LinearDMLCateEstimator). Do you think it is just because I have plenty of data? In that case I guess I should reduce the data by taking a random sample...
