jkcshea / ivmte

An R package for implementing the method in Mogstad, Santos, and Torgovitsky (2018, Econometrica).
GNU General Public License v3.0
18 stars 2 forks source link

System singularity error #177

Closed slacouture closed 4 years ago

slacouture commented 4 years ago

Hi! In advance, sorry for the long post: I am trying to be as detailed as I can:

When running the package for estimating the MTRs using splines, I am sometimes receiving a particular error message as if the IV-like spec was perfectly collinear. You can find here a zip-file with the data and the code reproducing the error.

What the code is doing is to try to add more covariates to the estimation to get more variation in the propensity score. In particular, for a set of covariates

cXs <- c("age", "unemployment", "elderly", "pop") # Continuous Xs -- if cXs <- c("age") the program works
dXs <- c("female") # Discrete Xs

I add linear interactions and 2nd and 3rd order values and interact with the instrument Z to have the following formulae:

RF <- "YD ~ 1 + age + unemployment + elderly + pop + female + age2 + unemployment2 + elderly2 + pop2 + age3 + unemployment3 + elderly3 + pop3 + age*female + unemployment*female + elderly*female + pop*female + unemployment*age + elderly*age + pop*age + elderly*unemployment + pop*unemployment + pop*elderly + (Z==2) + (Z==1) + (Z==2)*age + (Z==1)*age + (Z==2)*unemployment + (Z==1)*unemployment + (Z==2)*elderly + (Z==1)*elderly + (Z==2)*pop + (Z==1)*pop + (Z==2)*female + (Z==1)*female + (Z==2)*age2 + (Z==1)*age2 + (Z==2)*unemployment2 + (Z==1)*unemployment2 + (Z==2)*elderly2 + (Z==1)*elderly2 + (Z==2)*pop2 + (Z==1)*pop2 + (Z==2)*age3 + (Z==1)*age3 + (Z==2)*unemployment3 + (Z==1)*unemployment3 + (Z==2)*elderly3 + (Z==1)*elderly3 + (Z==2)*pop3 + (Z==1)*pop3 + (Z==2)*age*female + (Z==1)*age*female + (Z==2)*unemployment*female + (Z==1)*unemployment*female + (Z==2)*elderly*female + (Z==1)*elderly*female + (Z==2)*pop*female + (Z==1)*pop*female + (Z==2)*unemployment*age + (Z==1)*unemployment*age + (Z==2)*elderly*age + (Z==1)*elderly*age + (Z==2)*pop*age + (Z==1)*pop*age + (Z==2)*elderly*unemployment + (Z==1)*elderly*unemployment + (Z==2)*pop*unemployment + (Z==1)*pop*unemployment + (Z==2)*pop*elderly + (Z==1)*pop*elderly"

that I use in the iv-like argument;

FS <- "D ~ 1 + (Z==2) + (Z==1) + age + unemployment + elderly + pop + female + age2 + unemployment2 + elderly2 + pop2 + age3 + unemployment3 + elderly3 + pop3 + age*female + unemployment*female + elderly*female + pop*female + unemployment*age + elderly*age + pop*age + elderly*unemployment + pop*unemployment + pop*elderly"

that I use to specify the propensity argument; and

m1 <- "~ 1 + uSplines(degree = 2, knots = c(.3,.4,.5)) + age + unemployment + elderly + pop + female + age2 + unemployment2 + elderly2 + pop2 + age3 + unemployment3 + elderly3 + pop3 + age*female + unemployment*female + elderly*female + pop*female + unemployment*age + elderly*age + pop*age + elderly*unemployment + pop*unemployment + pop*elderly"

used for the MTR specification

The exact arguments used are:

args <- list(data = data, 
             ivlike =  as.formula(RF),
             target = "genlate",
             genlate.lb = 0,
             genlate.ub = 1,
             m0 = ~ 1,
             m1 = as.formula(m1),
             propensity = as.formula(FS),
             m0.lb = 0, 
             m0.ub = 0,             
             m1.lb = 0, 
             m1.ub = 1)

I then receive the following output:

LP solver: Gurobi ('gurobi')

Obtaining propensity scores...

Generating target moments...
    Integrating terms for control group...
    Integrating terms for treated group...

Generating IV-like moments...
Error in solve.default((1/nrow(X)) * t(X) %*% X) : 
  system is computationally singular: reciprocal condition number = 8.51157e-50
In addition: Warning message:
In rm(tmpOutput) : object 'tmpOutput' not found

However if I run an standard lm model on the RF:

> lm(as.formula(RF), data)

Call:
lm(formula = as.formula(RF), data = data)

Coefficients:
                    (Intercept)                              age                     unemployment                          elderly  
                      1.468e-01                       -1.206e-03                        6.289e+00                       -1.506e+00  
                            pop                           female                             age2                    unemployment2  
                     -1.656e-07                        5.040e-02                       -1.920e-05                       -7.443e+02  
                       elderly2                             pop2                             age3                    unemployment3  
                      4.224e+00                       -1.753e-12                        1.029e-07                        1.904e+04  
                       elderly3                             pop3                       Z == 2TRUE                       Z == 1TRUE  
                     -3.781e+00                        2.494e-18                       -1.776e-01                       -1.929e-01  
                     age:female              unemployment:female                   elderly:female                       pop:female  
                     -1.868e-04                       -3.374e-01                       -1.888e-01                       -2.661e-08  
               age:unemployment                      age:elderly                          age:pop             unemployment:elderly  
                      5.182e-02                        6.607e-03                        5.836e-10                        4.340e+00  
               unemployment:pop                      elderly:pop                   age:Z == 2TRUE                   age:Z == 1TRUE  
                     -4.305e-05                        8.587e-06                        4.948e-04                       -1.446e-03  
        unemployment:Z == 2TRUE          unemployment:Z == 1TRUE               elderly:Z == 2TRUE               elderly:Z == 1TRUE  
                      5.164e+01                        1.413e+01                       -1.754e+00                        2.001e+00  
                 pop:Z == 2TRUE                   pop:Z == 1TRUE                female:Z == 2TRUE                female:Z == 1TRUE  
                     -2.113e-06                        3.816e-07                        4.811e-02                       -4.524e-02  
                age2:Z == 2TRUE                  age2:Z == 1TRUE         unemployment2:Z == 2TRUE         unemployment2:Z == 1TRUE  
                     -4.311e-06                        4.745e-05                       -1.641e+03                       -7.480e+02  
            elderly2:Z == 2TRUE              elderly2:Z == 1TRUE                  pop2:Z == 2TRUE                  pop2:Z == 1TRUE  
                      1.659e+01                       -7.317e+00                        4.498e-12                        9.980e-13  
                age3:Z == 2TRUE                  age3:Z == 1TRUE         unemployment3:Z == 2TRUE         unemployment3:Z == 1TRUE  
                      2.830e-09                       -2.721e-07                        3.062e+04                        1.058e+04  
            elderly3:Z == 2TRUE              elderly3:Z == 1TRUE                  pop3:Z == 2TRUE                  pop3:Z == 1TRUE  
                     -2.918e+01                        6.841e+00                       -4.369e-18                       -1.862e-18  
          age:female:Z == 2TRUE            age:female:Z == 1TRUE   unemployment:female:Z == 2TRUE   unemployment:female:Z == 1TRUE  
                      2.547e-05                       -2.793e-04                        2.538e+00                        1.543e-02  
      elderly:female:Z == 2TRUE        elderly:female:Z == 1TRUE            pop:female:Z == 2TRUE            pop:female:Z == 1TRUE  
                     -4.201e-01                        3.091e-01                       -9.176e-08                        8.100e-08  
    age:unemployment:Z == 2TRUE      age:unemployment:Z == 1TRUE           age:elderly:Z == 2TRUE           age:elderly:Z == 1TRUE  
                     -7.215e-02                       -1.894e-02                        1.801e-03                       -3.353e-03  
             age:pop:Z == 2TRUE               age:pop:Z == 1TRUE  unemployment:elderly:Z == 2TRUE  unemployment:elderly:Z == 1TRUE  
                      1.975e-09                       -1.315e-09                       -1.196e+02                        1.538e+00  
    unemployment:pop:Z == 2TRUE      unemployment:pop:Z == 1TRUE           elderly:pop:Z == 2TRUE           elderly:pop:Z == 1TRUE  
                      2.177e-05                        5.837e-05                        4.169e-06                       -1.071e-05  

Just as a check you can see that the lm package estimates all values, thus it is not dropping any redundant

> a <- lm(as.formula(RF), data)
> length(coefficients(a))
[1] 72
> length(c(x.formula.r, Zs, Inter)) # All items from RF (minus intercept)
[1] 71

One detail that might be relevant is that the unemployment, elderly and pop variables are observed at the municipality level. Is this driving the problem? Is there a way to overcome it?

Thank you!

slacouture commented 4 years ago

To get a notion where the problem is this could help:

If I specify only

cXs <- c("age", "unemployment")

and don't include them in 2nd and 3rd order, so the iv-like specification is

RF <- "YD ~ 1 + age + unemployment + female + age*female + unemployment*female + unemployment*age + (Z==2) + (Z==1) + (Z==2)*age + (Z==1)*age + (Z==2)*unemployment + (Z==1)*unemployment + (Z==2)*female + (Z==1)*female + (Z==2)*age*female + (Z==1)*age*female + (Z==2)*unemployment*female + (Z==1)*unemployment*female + (Z==2)*unemployment*age + (Z==1)*unemployment*age",

the program works.

But if I try to add another covariate, say "pop", so the iv-like specification takes the form

RF<- "YD ~ 1 + age + unemployment + pop + female + age*female + unemployment*female + pop*female + unemployment*age + pop*age + pop*unemployment + (Z==2) + (Z==1) + (Z==2)*age + (Z==1)*age + (Z==2)*unemployment + (Z==1)*unemployment + (Z==2)*pop + (Z==1)*pop + (Z==2)*female + (Z==1)*female + (Z==2)*age*female + (Z==1)*age*female + (Z==2)*unemployment*female + (Z==1)*unemployment*female + (Z==2)*pop*female + (Z==1)*pop*female + (Z==2)*unemployment*age + (Z==1)*unemployment*age + (Z==2)*pop*age + (Z==1)*pop*age + (Z==2)*pop*unemployment + (Z==1)*pop*unemployment"

or if I add only the 2nd order continuous variables so

RF <- "YD ~ 1 + age + unemployment + female + age*female + unemployment*female + unemployment*age + age2 + unemployment2 + (Z==2) + (Z==1) + (Z==2)*age + (Z==1)*age + (Z==2)*unemployment + (Z==1)*unemployment + (Z==2)*female + (Z==1)*female + (Z==2)*age*female + (Z==1)*age*female + (Z==2)*unemployment*female + (Z==1)*unemployment*female + (Z==2)*unemployment*age + (Z==1)*unemployment*age + (Z==2)*age2 + (Z==1)*age2 + (Z==2)*unemployment2 + (Z==1)*unemployment2"

the program breaks.

jkcshea commented 4 years ago

Sorry for being slow to get back to you on this.

There is something peculiar about the design matrix implied by your IV-like specification. Call that design matrix X. Using R's qr() command, one can confirm that X is full column rank. However, this is not the case for t(X) %*% X.

> qr(X)$rank
[1] 72
> qr(t(X) %*% X)$rank
[1] 5

Trying to invert t(X) %*% X using solve() or qr.solve isn't going to work. However, you can perform a Cholesky decomposition, and then invert that instead.

> invXX <- chol2inv(chol(t(X) %*% X))
> invXX[1:4, 1:4]
              [,1]          [,2]          [,3]          [,4]
[1,]  7.492741e+00 -0.0078332681 -1.503016e+02 -101.37416922
[2,] -7.833268e-03  0.0003881945  7.115998e-02    0.01031024
[3,] -1.503016e+02  0.0711599824  1.790002e+04  796.97611202
[4,] -1.013742e+02  0.0103102391  7.969761e+02 1531.45891953

Making that change resolved the problem above. However, I'm not entirely sure what this means for your IV-like specification.

Running your code (see below, though), the model eventually became infeasible after a sufficient number of audits. The design of the package should limit such cases, so I'm looking into that (issue #178).

Give the repo a pull, and let me know if (i) this issue is resolved on your end, too; (ii) you run into the an infeasibility issue. If you restrict the number of audits to be sufficiently small, e.g. audit.max = 2, you should still get a bound, though.

(The code that was uploaded wouldn't run, since the object Z.form was not declared. I assume you meant Zs. Making that change, I was able to replicate the error. So let me know if Z.form and Zs are actually different things.)

a-torgovitsky commented 4 years ago

What does lm do for this type of inversion? If lm is able to do it, then we also want our code to be able to do it. Maybe there is a function somewhere in the lm code that we can use directly?

a-torgovitsky commented 4 years ago

I wonder if it is this: https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.R#L114

z <- .Call(C_Cdqrls, x, y, tol, FALSE)

Looks like z is a list with some stuff. Can we use that?

Getting down to the C function quickly here is probably the fastest way to go as well.

jkcshea commented 4 years ago

Sure, I can give that a shot. Also, this page does a deep dive into the lm() command, discussing the C and Fortran components.

a-torgovitsky commented 4 years ago

That's very useful. We should be doing it the same way base R is.

jkcshea commented 4 years ago

Actually, I missed some details earlier, and am wondering if changes really need to be made. The regressions by the package already use base R, i.e. lm.fit. So getting the beta estimates is not a problem, even for the example above.

The error @slacouture found was was to do with the matrix inversion when constructing the S-weights, e.g. image

The lm() code we looked at calls C to perform the regression using a QR decomposition. But the QR decomposition is to avoid having to invert a matrix. So I won't get back (X'X)^{-1} from the C output.

Let me know if we can indeed benefit from using the C command, though. (In addition to the coefficients and residuals from the regression, the C output includes the QR decomposition for the design matrix.)

slacouture commented 4 years ago

Thanks for looking into this @jkcshea. First of all, regarding the mistake in the replication code, the correct replacement for Z.form is not Zs but paste(Zs, collapse = " + "). Sorry about that. This shouldn't affect the reproduction of the error as you noticed but potentially the estimation as it will be supplying an incorrect specification of propensity.

I pulled the repo and the problem of singularity is solved. However, although I am getting the same error of infeasibility reported in #178, I am receiving a different error message:

Error: Automatic grid expansion limit reached. The LP problem is still unbounded. Either impose additional shape constraints, or increase the size of the initial grid for the audit. Since the initial grid must be a subset of the audit grid, it may be necessary to increase the size of the audit grid also. The most recent options after automatic expansion were:
initgrid.nx = 68
initgrid.nu = 25
audit.nx = 2500
audit.nu = 25 

Only if I expand the audit grid (audit.nu = 50 for instance) I get the same message as in #178.

If I restrict the number of audits as you suggest I do get a bound, but this bound is just crazy. Does this mean I am incorrectly using the program?

a-torgovitsky commented 4 years ago

What are the Xtilde's that you have there in the S-weight problem? Once we run the OLS we should be able to construct the S-weighting function with any additional inversion steps.

jkcshea commented 4 years ago

If I restrict the number of audits as you suggest I do get a bound, but this bound is just crazy. Does this mean I am incorrectly using the program?

Oh no, that's not what I meant. I realize what I said is misleading. As the audit procedure continues, there are more and more constraints in the LP problem. So it makes sense that the problem becomes infeasible later in the audit procedure. However, these infeasibility errors have become uncommon since the package allows for some slack in the equality constraints (that is what the 'criterion minimization' problem refers to). So this just warrants some investigation.

If each iteration doesn't take too long to run, you can increase the number of audits so that your estimates will better satisfy the shape constraints, e.g. until you have 0 violations of the shape constraints. But here we see that only leads to infeasibility.

What are the Xtilde's that you have there in the S-weight problem?

Sorry, the Xtilde is just the matrix of covariates for the IV-like specification.

Once we run the OLS we should be able to construct the S-weighting function with any additional inversion steps.

I interpret the last part as "without any additional inversion steps."

So is there another way I can construct the S-weighting functions from the matrix of covariates and coefficient estimates, without inverting anything?

a-torgovitsky commented 4 years ago

Yes without -- sorry!

Ok, so the issue as I understand it is that lm never actually computes the inverse of the design matrix. And yet we apparently need this object for our S--weights.

However, I think that need is really just apparent, not actual. That is, if we code it differently, we do not ever need to invert the design matrix.

I'm going to use the notation in our working paper. The only thing we need the function s for is to compute \Gamma{sdk}. (Please correct me if I am wrong here --- it's been a while since I looked at this.) The \beta{s} we get directly from the lm/ivregress routine.

Now look at the definition of \Gamma{sdk} on pg. 4. It has the form: E[s(d,X,Z) \times stuff] where stuff is an integral that depends on the basis functions and propensity score. In your example above, this should be the same as regressing stuff against \tilde{X}, then the e{j} part just picks out the jth component of the resulting vector of coefficients.

That means we could just use lm to compute \Gamma_{sdk} without having to ever construct the s function at all.

Let me know if that makes sense.

jkcshea commented 4 years ago

Ah, that's fantastic! Very simple, too. Sure, I can implement that.

a-torgovitsky commented 4 years ago

Ok, great. Some modification is probably needed since our IV--like estimands are of the TSLS form, right? But I believe that should just mean the same idea works with "regressing stuff against \tilde{X}" replaced by "run TSLS with stuff as the outcome variable" --- but please check.

Are the unit tests sufficiently mature that they will catch unforeseen problems with this new formulations? Just skimming through them, I don't see any tests of the form "we know this is the right estimate/bound to get, let's check that we got it." Might be good to put in one of those as a sanity check before making the change.

jkcshea commented 4 years ago

Some modification is probably needed since our IV--like estimands are of the TSLS form, right? But I believe that should just mean the same idea works with "regressing stuff against \tilde{X}" replaced by "run TSLS with stuff as the outcome variable" --- but please check.

Yep, that looks right.

Are the unit tests sufficiently mature that they will catch unforeseen problems with this new formulations? Just skimming through them, I don't see any tests of the form "we know this is the right estimate/bound to get, let's check that we got it." Might be good to put in one of those as a sanity check before making the change.

Ah, but that is indeed what the testthat checks do. e.g. they end with

##-------------------------
## Test bounds
##-------------------------

test_that("LP problem", {
    expect_equal(result$bound, bound)
})

result$bound is estimated using the package. bound is manually calculated.

A relevant test is also that the gamma moments are constructed correctly.

##-------------------------
## Test equivalence of Gamma terms
##-------------------------

test_that("Gamma moments", {
    expect_equal(as.numeric(c(result$gstar$g0, result$gstar$g1)),
                 as.numeric(unlist(g.star.late)))
    expect_equal(as.numeric(c(result$s.set$s1$g0, result$s.set$s1$g1)),
                 as.numeric(unlist(g.ols.d)))
    expect_equal(as.numeric(c(result$s.set$s2$g0, result$s.set$s2$g1)),
                 as.numeric(unlist(g.ols.x1)))
})

This check should be more meaningful now, since the testthat code will have constructed the gamma moments using the S-weighting functions, rather than this new regression method.

Question: @johnnybonney and @cblandhol have requested the S-weights be included in the output before. So should this become an option?

a-torgovitsky commented 4 years ago

Ah ok, I just skimmed the tests so didn't look closely enough. That's great then, should be easy to do the modification without breaking anything. A big win for disciplined unit testing!

jkcshea commented 4 years ago

@johnnybonney and @cblandhol have requested the S-weights be included in the output before.

And we're okay with keeping the old method as an option?

e.g. ivmte(..., return.weights = TRUE).

a-torgovitsky commented 4 years ago

Ah yes, good point, yeah I suppose we should preserve that. However, the new method should definitely be the default.

jkcshea commented 4 years ago

@slacouture The cause of the error was due to scaling of certain variables. The pop variable is in the order of 1e6. This variable is then squared and cubed, resulting in a range of magnitude beyond double precision. This document by Gurobi briefly explains the issue with scaling. And this  other document by Gurobi provides some examples.

I rescaled your pop variable as follows:

data[, pop := pop / 1e5]

The package was able to provide reasonable bounds in under a minute. However, the minimum criterion becomes enormous, which may indicate a problem with the model. Let me know if the changes work on your end.

slacouture commented 4 years ago

Thank you, @jkcshea I was not aware of that issue with scaling on Gurobi. It is working for me now. I am going to check what may be driving the large minimum criterion.