py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Co-variance matrix is undertermined #262

federiconuta opened this issue 4 years ago

federiconuta commented 4 years ago

Hi all,

I am writing because the following warning appears when I construct the estimator like this:

est = LinearDMLCateEstimator(model_y=LassoCV(cv=[(fold00, fold11), (fold11, fold00)]),
                             model_t=MultiTaskLassoCV(cv=[(fold00, fold11), (fold11, fold00)]),
                             n_splits=[(fold0, fold1), (fold1, fold0)],
                             linear_first_stages=False)

Then I fit in this way:

est.fit(Y, T, X[:, :(n_products)], W, inference='statsmodels')

The error is: /Users/federiconutarelli/anaconda3/envs/pharma/EconML/econml/utilities.py:961: UserWarning: Co-variance matrix is undertermined. Inference will be invalid!

Should I specify the covariance matrix somehow? And what does "undetermined" mean here: undetermined in a mathematical sense, or in the sense that I should set a specific covariance matrix?

I am working in a panel data framework, essentially reproducing the code from issue #94.

Thank you,

Federico

kbattocchi commented 4 years ago

This means that either multiple columns of the feature matrix are (approximately) collinear or there are fewer observations than features (although you should see an additional warning in that case).

The feature matrix that we use in this step is determined by applying the featurizer (if any) to the X matrix, then adding a column of ones (unless you specified that fit_cate_intercept=False when initializing the estimator), and then interacting this matrix with the treatment matrix. So one way you could see this is if your X matrix already has a constant column when we add another one for the CATE intercept.

federiconuta commented 4 years ago

So, if I use SparseLinearDMLCateEstimator rather than LinearDMLCateEstimator, a message appears saying:

The number of features in the final model (< 5) is too small for a sparse model. We recommend using the LinearDMLCateEstimator for this low-dimensional setting.

So maybe this is the case. Perhaps I should describe the X matrix: I start with a panel consisting of one molecule observed for 47 periods in one country, so the number of observations is 47. Basically X consists of a column of ones for the individual fixed effect plus the time encoding, so it is 47x47. Following issue #94, it is:

n_products_per_group = 1
n_months = 47
n_groups = 1
n_products = n_groups * n_products_per_group
p = price
p_minus = median_price
q = demand
X = np.zeros((n_products, n_months, n_products + n_months + n_groups - 2))

for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            index_i = n_products_per_group * g + i
            # Index individual
            X[index_i, j, :n_products] = 1.0 * (np.arange(0, n_products) == index_i)
            # Index group
            X[index_i, j, n_products:(n_products + n_groups - 1)] = 1.0 * (np.arange(1, n_groups) == g)
            # Time index
            X[index_i, j, (n_products + n_groups - 1):] = 1.0 * (np.arange(1, n_months) == j)

Now that we talk about it and I see the picture as a whole, the problem may be that when using LinearDMLCateEstimator I should drop the first level of each encoding, as you suggested, as in:

n_products_per_group = 1
n_months = 47
n_groups = 1
n_products = n_groups * n_products_per_group
p = price
p_minus = median_price
q = demand
X = np.zeros((n_products, n_months, n_products + n_months + n_groups - 3))

for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            index_i = n_products_per_group * g + i
            # Product dummies, dropping the first level
            X[index_i, j, :(n_products - 1)] = 1.0 * (np.arange(1, n_products) == index_i)
            # Group dummies, dropping the first level
            X[index_i, j, (n_products - 1):(n_products + n_groups - 2)] = 1.0 * (np.arange(1, n_groups) == g)
            # One-hot encoding of the month, dropping the first level
            X[index_i, j, (n_products + n_groups - 2):] = 1.0 * (np.arange(1, n_months) == j)

federiconuta commented 4 years ago

I am not sure; however, in my case:

n_products_per_group = 1
n_months = 47
n_groups = 1

In this case the X matrix should account for the fixed effect of the only product (which is constant over the 47 periods) and the time fixed effects. So it should be 47x(n_months*2) or something similar? Further, since I include both individual and time effects, there should be fewer observations than features. But in that case, I do not understand why the sparse estimator's warning appears, i.e.:

The number of features in the final model (< 5) is too small for a sparse model. We recommend using the LinearDMLCateEstimator for this low-dimensional setting.

kbattocchi commented 4 years ago

Hm. If n_products_per_group and n_groups are both 1 then that implies that there's just one product, so there's a bunch of stuff from that example that's completely unneeded in your scenario (e.g. you won't have treatments for the other products in a group because there are no such products). Ultimately, you would just have 46 columns representing dummies for each month except the first.

But note that if you don't have multiple products and you only have one observation per month, then the identification strategy used in that other example can't work here. The idea there was that you would carefully split into folds so that each fold would contain all of the months and all of the products (but necessarily only half of the pairs); however, if you only have a single product there is no way to do this: each fold will contain only half of the months, so the model trained on that fold has no data with which to predict the results for the other half.

federiconuta commented 4 years ago

Oh I see. So basically "the third month" is observed for n products, so the model could be trained on the third month for one set of products and predict the Y for the third month on the other set (ideally half), for instance. Here I cannot do that: since I have a full set of months, each month appears only once, and the algorithm trains the model on a set of months (half) that is not present in the test set. Correct?

So, in your opinion, is there a way to solve the issue, or is it simply impossible given the latter observation?

Thank you again,

Federico

federiconuta commented 4 years ago

What if, for instance, I take the same product observed in two countries and consider 2 groups with 1 product each?

kbattocchi commented 4 years ago

Again, the grouping idea in the other example was because we also wanted a form of cross-price elasticity. If you don't expect the treatment in one country to affect the outcome in the other country, then I'd drop the idea of "groups".

But yes, if you have 2 countries and 47 months and you expect that there are independent fixed effects for each month and for each country, then yes, you can have one country dummy and 46 month dummies and split your observations across folds so that, say, the observations for country 1 for months 1-23 and for country 2 for months 24-47 go in one fold and the other observations go into the second fold.

But this is still not quite enough data: you'll have 47 observations per fold, but if we add a CATE intercept you're trying to estimate the effect of (1 country dummy) + (46 month dummies) + (1 intercept) = 48 features. Even if you choose not to fit an intercept, there is only just enough data to fit the model, so I'd expect dramatic overfitting. Things would be slightly better if you had 3 countries instead of 2, but really it seems like you don't have enough data for any kind of robust estimate unless you make some stronger assumptions (e.g. maybe you think there is seasonality, but not year-to-year variation, in which case you could use 12 month dummies instead of 47).

Also, you might have noticed that in the example you're referencing we have two levels of folds (fold0 is itself broken down into fold00 and fold01) so that we can do cross-validation of the regularization in the first stage models, but to do that you'd need even more data.

So unfortunately, I think it's likely that you just don't have enough data to get good estimates out.

federiconuta commented 4 years ago

> Again, the grouping idea in the other example was because we also wanted a form of cross-price elasticity. If you don't expect the treatment in one country to affect the outcome in the other country, then I'd drop the idea of "groups".

Well, no; indeed that is exactly what I am interested in, but for a couple of countries and a single product. So basically the idea was to estimate own- and cross-price elasticity for a product in a couple of countries (say Italy and Spain). What I would like to obtain is the own-price elasticity for the product in Italy and the cross-price elasticity for the same product in Italy (where "cross" describes how a price change in Spain affects demand in Italy).

Actually I have tried this approach and obtained 4 coefficients. The test-set matrix is (2,1), hence the effect comes out as a (2,2) matrix, and I am quite puzzled about which coefficient is which (in the (2,2) matrix, what does the first column represent? What does the second?).

> But this is still not quite enough data: you'll have 47 observations per fold, but if we add a CATE intercept you're trying to estimate the effect of (1 country dummy) + (46 month dummies) + (1 intercept) = 48 features. Even if you choose not to fit an intercept, there is only just enough data to fit the model, so I'd expect dramatic overfitting. Things would be slightly better if you had 3 countries instead of 2, but really it seems like you don't have enough data for any kind of robust estimate unless you make some stronger assumptions (e.g. maybe you think there is seasonality, but not year-to-year variation, in which case you could use 12 month dummies instead of 47).

Yes that's an issue. Indeed my estimates are quite unstable.

> So unfortunately, I think it's likely that you just don't have enough data to get good estimates out.

Yes I think you are right. I will try to figure out a solution and keep you updated.

Thanks a lot!

kbattocchi commented 4 years ago

I see; in that case you'd have either 2 treatments and 2 outcomes (if you expect that the effect of prices in Spain on demand in Italy is different from the effect of prices in Italy on demand in Spain, and that own-price elasticities are also potentially different between countries), but then no country dummies (because each row represents both treatments and outcomes for that month). In that case, when you call effect(X, T0, T1) (where each row of X contains month dummies, each row of T0 contains the initial prices in each country, and each row of T1 contains the final prices in each country), you'll get back a matrix where each row has the predicted change in demand in each country.

Alternatively, you can have two observations per month with two treatments (own-country and cross-country price) and one outcome (demand), where the interpretation of "own" (and the country whose demand is in the row) depends on a country dummy. Then when you call effect(X, T0, T1) (where each row of X contains month and country dummies, each row of T0 contains the own- and cross-country initial prices, and each row of T1 contains the own- and cross-country final prices), you'll get back a matrix where each row has the predicted change in demand in only the country specified by the dummy.

kbattocchi commented 4 years ago

But note that in the first case, you are now trying to model twice as many treatments (the own country and the cross country prices) so that you once again do not have enough data to identify these effects (we now have twice as many generated features as observations, since we cross the original features with the treatments). We also have twice as many treatments in the second approach, but we have twice as many observations since we have a row for each country for each month. So, unless you have some other way of generating more observations, you'd need to stick to the second approach, and even that is just barely identified. But I hope that the explanation helps you understand how to interpret the outputs.

federiconuta commented 4 years ago

So I am not sure, but I think I am in the second case since:

y.shape
(1, 94)
#after flattening
Y.shape
(94,)
X.shape
(2,47,48)
#after reshaping
X.shape
(94, 48)
#after reshaping
W.shape
(94, 53)
T.shape
(94, 2)

where T[:, 0] has the price for Italy and T[:, 1] the price for Spain. What I do not understand is what you mean by "initial" and "final" prices. The test variable is defined as:

X_test_UE = np.vstack([np.zeros((1,n_products-1)), np.eye(n_products-1)])

and is a (2,1). Moreover I have specified country and group effects in:

n_products_per_group = 1
n_months = 47
n_groups = 2
n_products = n_groups * n_products_per_group
# p[i, j] = price of product i on month j
p = price
p_minus = median_price
# q[i, j] = demand of product i on month j
q = demand
X = np.zeros((n_products, n_months, n_products + n_months + n_groups - 3))

# WHEN USING LINEARCATE (to avoid collinearity)
for g in np.arange(n_groups):
    for i in np.arange(n_products_per_group):
        for j in np.arange(n_months):
            index_i = n_products_per_group * g + i
            # Product dummies, dropping the first level
            X[index_i, j, :(n_products - 1)] = 1.0 * (np.arange(1, n_products) == index_i)
            # Group dummies, dropping the first level
            X[index_i, j, (n_products - 1):(n_products + n_groups - 2)] = 1.0 * (np.arange(1, n_groups) == g)
            # One-hot encoding of the month, dropping the first level
            X[index_i, j, (n_products + n_groups - 2):] = 1.0 * (np.arange(1, n_months) == j)

Do you agree?