jkcshea / ivmte

An R package for implementing the method in Mogstad, Santos, and Torgovitsky (2018, Econometrica).

Regression-based approach and point-identifying LATE #221

Open johnnybonney opened 2 years ago

johnnybonney commented 2 years ago

Posting a new issue, lest we start talking about too many different things in the same thread.

My original question: Using the moment-based approach, we can point-identify the LATE in simple cases (e.g., no covariates, binary instrument), regardless of the MTR specification. The regression-based approach does not produce such point identification. Is this expected behavior?

Here is my example:

library(ivmte)

args <- list(data = AE,
             outcome = 'worked',
             target = 'late',
             late.from = list(samesex = 0),
             late.to = list(samesex = 1),
             propensity = morekids ~ samesex,
             m0 = ~u + I(u^2),
             m1 = ~u + I(u^2),
             solver = 'gurobi')

# estimate using regression-based approach
late_reg <- do.call(ivmte, args)
print(late_reg)

produces

Bounds on the target parameter: [-0.2995747, 0.1298702]
Audit terminated successfully after 1 round 

and

# explicitly add moments
args$ivlike <- list(worked ~ morekids + samesex + morekids:samesex)
# estimate via moment approach
late_moment <- do.call(ivmte, args)
print(late_moment)

produces

Bounds on the target parameter: [-0.08484221, -0.08484221]
Audit terminated successfully after 1 round

Something else to look into: At https://github.com/jkcshea/ivmte/issues/220#issuecomment-1176310741, @a-torgovitsky is unable to reproduce my initial bounds, and gets [-0.2758405, 0.1061561].

a-torgovitsky commented 2 years ago

Ok, I get the example now. It is consistent with the theory. Here's an explanation, let me know if it doesn't make sense.

The regression approach tries to satisfy the first-order conditions for a linear regression: $E[(Y - \theta'B)B] = 0$, where $\theta'B$ is the implied form of $E[Y | D, Z]$ from your parameterization. Since you have 6 unknown parameters (3 x 2), the dimension of both $\theta$ and $B$ is 6. So the regression approach is trying to minimize a criterion built from 6 moments. It can't do so uniquely, because $Z$ is binary, so there is collinearity among the components of $B$. The solution is therefore driven by the Euclidean norm, which balances all 6 moments.
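
(A quick way to see the collinearity, reusing the late_reg object from the example above, since late_reg$X stores $B$:)

# the design matrix B has 6 columns but only 4 distinct rows when Z is binary
B <- late_reg$X
qr(B)$rank  # should be 4, not 6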

In contrast, the moment approach tries to fit only 4 moments. There's a unique way to do this with 6 parameters as long as one solution exists. The moment approach exactly fits $E[Y | D,Z]$, and thus reproduces the LATE, which is just a function of $E[Y|D,Z]$ together with the propensity score.
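
For reference, here is a direct check of that point-identification result. With a binary instrument and no covariates, the LATE reduces to the standard Wald ratio, which we can compute from the raw data (a sketch using the AE data shipped with ivmte; it should match the moment-based point estimate above):

# LATE via the Wald formula with binary instrument samesex
library(ivmte)
ey <- tapply(AE$worked, AE$samesex, mean)    # E[Y | Z = z]
ed <- tapply(AE$morekids, AE$samesex, mean)  # E[D | Z = z]
(ey["1"] - ey["0"]) / (ed["1"] - ed["0"])    # Wald ratio = LATE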

Is this a deficiency of the regression approach? Maybe. I guess I would say it reflects the fact that the regression approach is less flexible, which in statistics is both a pro and a con. Essentially, what you are doing in the moment-based approach is picking 4 of the 6 moments to focus on. You can see this by running the regression approach with a linear MTR instead of a quadratic one, which effectively zeros out two of the moments (and two of the parameters):

args <- list(data = AE,
             outcome = 'worked',
             target = 'late',
             late.from = list(samesex = 0),
             late.to = list(samesex = 1),
             propensity = morekids ~ samesex,
             m0 = ~ u,
             m1 = ~ u,
             solver = 'gurobi')
print(do.call(ivmte, args))

produces

Point estimate of the target parameter: -0.08484221

Notice that there's nothing special about linear here, it's really just the number of parameters:

args$m0 <- ~ I(u^2)
args$m1 <- args$m0
print(do.call(ivmte, args))

also produces

Point estimate of the target parameter: -0.08484221

As for the discrepancy, I still get Bounds on the target parameter: [-0.2758405, 0.1061561] for the initial bounds. @jkcshea did we come up with any conclusion/explanation for this the last time we dealt with this issue? I thought @johnnybonney and I were matching before...

jkcshea commented 2 years ago

Hm, I cannot replicate either of your bounds. From #214, we learned that at least two things can lead to us having different bounds:

  1. LAPACK/BLAS libraries
  2. Number of threads being used by Gurobi.

When I matched those two settings on my machine to the server's, I was able to perfectly replicate the server's bounds. Hopefully there aren't more sources of discrepancy.

So I've switched back to the generic BLAS/LAPACK libraries:

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

And I'm limiting myself to 1 thread via the option solver.options = list(threads = 1). My bounds are [-0.2466363, 0.07675489]. What happens when you guys restrict the threads on your machines?
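
(Concretely, for the example above, that is just:)

# restrict Gurobi to a single thread via ivmte's solver.options pass-through
args$solver.options <- list(threads = 1)
print(do.call(ivmte, args))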

a-torgovitsky commented 2 years ago

With solver.options = list(threads = 1) I get bounds quite close to yours:

Bounds on the target parameter: [-0.2466971, 0.07695303]
Audit terminated successfully after 1 round 

Does Gurobi say anything about threads and reproducibility in the manual or forums?

johnnybonney commented 2 years ago

Thanks for the detailed explanation, @a-torgovitsky. That makes complete sense. This might come up again later, when we discuss our new local LATE results offline.


If I restrict to one thread, I get results that differ quite a bit from both of you:

Bounds on the target parameter: [-0.208006, 0.03833655]
Audit terminated successfully after 1 round 
jkcshea commented 2 years ago

Does Gurobi say anything about threads and reproducibility in the manual or forums?

Here, Gurobi says that it is deterministic if all inputs are the same. Presumably, the number of threads counts as an input, since the results depend on the number of threads used. There is also a seed, but its default value is 0, so it shouldn't be driving the discrepancies. There are scenarios where Gurobi is non-deterministic, but we are not in any of them.

Here, Gurobi says the solutions can depend on hardware, which appears to be what we're seeing. So to reconcile this with the paragraph above (as described by Gurobi support when answering a user's question about determinism): Gurobi is designed to be deterministic within a machine, but its results can vary across machines.

Here, Gurobi says having more cores will speed up the barrier algorithm, which is what is used to solve the QP and QCQP problems. But it doesn't say anything about the determinism of the solver.
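
If we want to rule out the seed as well, we could pin down both inputs at once. A sketch, assuming ivmte forwards these option names to Gurobi the same way it does threads (Gurobi's seed defaults to 0, so this just makes the default explicit):

# fix both Gurobi inputs discussed above: thread count and random seed
args$solver.options <- list(threads = 1, seed = 0)
print(do.call(ivmte, args))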


@jkcshea did we come up with any conclusion/explanation for this the last time we dealt with this issue? I thought @johnnybonney and I were matching before...

Ah, you matched on the Cholesky decomposition, which was related to the BLAS/LAPACK packages. But your bounds differed.

To do another check on this, I ran @johnnybonney's example on Acropolis. When the code is run on the head node, I get [-0.2469076, 0.07695205]. This is close to what I get on my machine, but not quite. However, when I run the code on the cluster, I get [-0.2466363, 0.07675489], which matches the bounds on my machine. So maybe my machine is similar to the processors in the cluster node? In case anyone is curious, here are the specs of Acropolis from 2019 (perhaps they've upgraded some hardware since then).

Below is a figure showing how the bounds depend on the number of threads. The lower (upper) bounds do not decrease (increase) monotonically with the number of threads. But having that second thread/additional computational power does seem to help. So I wonder, how old is your machine, @johnnybonney? Your bounds with 1 thread are much tighter than @a-torgovitsky's and mine.

[figure: estimated bounds plotted against the number of Gurobi threads]

johnnybonney commented 2 years ago

I'm working on a 2020 MacBook Pro.

a-torgovitsky commented 2 years ago

Hi @johnnybonney I want to revise my answer to your previous question.

I think the reason you don't get a point estimate for the LATE is that ivmte is using the two-step estimator designed for partially identified cases. In the regression approach, ivmte can automatically detect when the MTE function is point identified, and in such cases it estimates the model as if it were point identified. That's what happens in the linear case, for example. However, it's not clear (at least to me) how one would go about checking that the target parameter is point identified. So in the quadratic case you gave, ivmte assumes the model is partially identified and applies an estimator that builds in some slack (via criterion.tol) to account for potential partial identification, which is why you get bounds.

The regression that ivmte runs is this:

library("ivmte")
library("tidyverse")

df <- AE
df <- df %>% select(worked, morekids, samesex)
args <- list(data = df,
             outcome = 'worked',
             target = 'late',
             late.from = list(samesex = 0),
             late.to = list(samesex = 1),
             propensity = morekids ~ samesex,
             m0 = ~u + I(u^2),
             m1 = ~u + I(u^2),
             solver = 'gurobi')

# estimate using regression-based approach
late_reg <- do.call(ivmte, args)
print(late_reg)

B <- late_reg$X

df$p <- late_reg$propensity$phat

# Manually reproduce B.
# With U ~ Unif(0, 1), the d = 0 regressors are (1 - D) * E[b(U) | U > p]
# and the d = 1 regressors are D * E[b(U) | U <= p], for b(u) = 1, u, u^2.
Bm <- matrix(data = NA, nrow = nrow(B), ncol = ncol(B))
Bm[,1] <- 1 - df$morekids                                    # (1 - D)
Bm[,2] <- (1/2)*(1 + df$p)*(1 - df$morekids)                 # (1 - D) * E[U | U > p]
Bm[,3] <- (1/3)*(1 - df$p^3)/(1 - df$p) * (1 - df$morekids)  # (1 - D) * E[U^2 | U > p]
Bm[,4] <- df$morekids                                        # D
Bm[,5] <- df$morekids * df$p/2                               # D * E[U | U <= p]
Bm[,6] <- df$morekids * (df$p^2)/3                           # D * E[U^2 | U <= p]

diff <- abs(B - Bm)
print(paste("Maximum difference is", max(diff), "."))

dfB <- as.data.frame(B)
dfB <- cbind(df$worked, dfB)
names(dfB) <- c("worked", "B1", "B2", "B3", "B4", "B5", "B6")

r <- lm(data = dfB, worked ~ 0 + B1 + B2 + B3 + B4 + B5 + B6)
fit <- predict(r)
print(unique(fit))

df %>%
    group_by(morekids, samesex) %>%
    summarize(ey = mean(worked)) %>% print

produces

Bounds on the target parameter: [-0.2758405, 0.1061561]
Audit terminated successfully after 1 round 

[1] "Maximum difference is 1.11022302462516e-16 ."
[1] 0.5807656 0.5837607 0.4418227 0.4376162
`summarise()` has grouped output by 'morekids'. You can override using the `.groups` argument.
# A tibble: 4 x 3
# Groups:   morekids [2]
  morekids samesex    ey
     <int>   <int> <dbl>
1        0       0 0.581
2        0       1 0.584
3        1       0 0.438
4        1       1 0.442

It's a saturated regression (6 parameters, 4 values of the regressors), so it exactly reproduces $E[Y|D,Z]$, which implies that any exact minimizer will produce the same implied value of the LATE. But again, in the second step, since ivmte is anticipating partial identification, it slacks the criterion a bit to allow inexact minimizers.

An unfortunate thing is that even setting criterion.tol to be small does not eliminate this slack:

args$criterion.tol <- 1e-15
print(do.call(ivmte, args))

yields

Bounds on the target parameter: [-0.1764386, 0.006084727]
Audit terminated successfully after 1 round 

@jkcshea can you remind me:

  1. Is criterion.tol entering multiplicatively or additively?
  2. Is the "normal equation" approach still buried in the code somewhere?
a-torgovitsky commented 2 years ago

@jkcshea the discrepancy might be worth opening an issue about on the Gurobi forums. We could save the .mps file and show them that the solution depends dramatically on the hardware used. Would be interesting to see what they say... although I'm guessing they will just reply that it is badly scaled.

jkcshea commented 2 years ago

I'm working on a 2020 MacBook Pro.

Hm, that's a powerful machine. Then my conjecture that your narrow bounds are linked to a slower processor/less memory is likely wrong.


Is criterion.tol entering multiplicatively or additively?

It enters multiplicatively. Let min.criterion denote the minimized criterion. Then the RHS of the quadratic constraint is min.criterion * (1 + criterion.tol). (But in practice it is implemented slightly differently.)
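
Schematically, writing $Q(\theta)$ for the criterion and $\bar{Q}$ for min.criterion, the second-step programs are

$$\min_\theta / \max_\theta \;\; \text{target}(\theta) \quad \text{s.t.} \quad Q(\theta) \le \bar{Q} \, (1 + \texttt{criterion.tol}),$$

so the slack is proportional to the minimized criterion rather than an absolute amount.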

Is the "normal equation" approach still buried in the code somewhere?

Are you referring to this LP approach for the direct regression? regression-again.pdf If so, yep, it is there. We can invoke it using the option direct = 'lp' (by default, we use direct = 'qp'). The bounds I get from this approach are [-0.08484222, -0.08484221].
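
(For anyone following along, switching approaches on @johnnybonney's example is just:)

# switch the direct-regression criterion from the default QP to LP
args$direct <- 'lp'
print(do.call(ivmte, args))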


@jkcshea the discrepancy might be worth opening on issue on the Gurobi forums for.

Sure, I'll do that. Good news: we're not actually poorly scaled in this example.

Also, @johnnybonney, could you upload your models from when you set threads = 1? If you set debug = TRUE, ivmte will save the models as .rds files in your working directory. I wonder if you are also getting a different minimum criterion.

a-torgovitsky commented 2 years ago

It enters multiplicatively. Let min.criterion denote the minimized criterion. Then the RHS of the quadratic constraint is min.criterion * (1 + criterion.tol). (But in practice it is implemented slightly differently.)

Are you referring to this LP approach for the direct regression? regression-again.pdf If so, yep, it is there. We can invoke it using the option direct = 'lp' (by default, we use direct = 'qp'). The bounds I get from this approach are [-0.08484222, -0.08484221].

Ok, perfect, thanks. Here's what is going on:

So there's an advantage of the direct = "lp" approach that we hadn't appreciated, @jkcshea. The disadvantage, as we found, is that it's generally less stable than the least-squares criterion. @johnnybonney do you have any thoughts/comments on this from the practitioner side?

johnnybonney commented 2 years ago

@jkcshea here are my models with threads = 1: models-JB-thread1.zip

The minimum criterion I get is 0.2442724.


@a-torgovitsky thanks for the updated explanation (and walking through the manual example - I appreciated that).

Here's my anecdotal experience so far, in terms of LP vs. QP. (To be taken with a grain of salt, since I haven't run much with the direct = "lp" approach recently.)

I would think that instability is less of a problem when we can match all the moments perfectly (though you both probably know much better than I whether this is true). So, given the above discussion, my gut instinct (as a practitioner) would be to use LP if I can match the moments perfectly, and in all other cases use QP...

a-torgovitsky commented 2 years ago

Hi @johnnybonney, I think your instinct is right. We probably don't want to mix and match the approaches in the paper, though. Maybe try it each way, and based on the results we can decide which to go with? I assume this is as simple as looping over c("lp", "qp").
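
(i.e., something like this sketch, reusing the args list from the examples above:)

# run the same specification under both direct-regression criteria
for (d in c("lp", "qp")) {
    args$direct <- d
    cat("direct =", d, "\n")
    print(do.call(ivmte, args))
}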

jkcshea commented 2 years ago

After looking at @johnnybonney's models, I don't think the large discrepancy between @johnnybonney's bounds and ours (@a-torgovitsky's and mine) is because Gurobi is very sensitive to hardware. Rather, @johnnybonney got a slightly different minimum criterion. That compounded into very different bounds.

When minimizing the criterion using the model generated by my machine, I get -0.2906495. (This number is negative because it isn't actually the SSR. But it is the objective value from the problem of minimizing the criterion, and it is what I use to define the quadratic constraint). When minimizing the criterion using @johnnybonney's model, I get -0.290659. The difference is only 9.531663e-06. But under @johnnybonney's minimized criterion, I get bounds very close to his.

Recall also that those numbers above have been normalized: they have been divided by the sample size. Gurobi's feasibility tolerance is 1e-06 by default, though. So it may be that violations of the quadratic constraint within Gurobi's default tolerance can lead to noticeably different bounds. In this particular example, that doesn't happen. But Gurobi's guidebook warns about cases like this, where the scale of the constraint is comparable to the tolerance parameters. Not sure if that means we have to rethink normalizing the criterion.
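
If we suspect the tolerance is biting, one thing we could try, assuming ivmte passes the parameter name through to Gurobi the same way it does threads, is tightening the feasibility tolerance below its 1e-6 default:

# sketch: tighten Gurobi's FeasibilityTol (1e-9 is its documented minimum)
args$solver.options <- list(threads = 1, FeasibilityTol = 1e-9)
print(do.call(ivmte, args))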

a-torgovitsky commented 2 years ago

What do you mean by this?

(This number is negative because it isn't actually the SSR. But it is the objective value from the problem of minimizing the criterion, and it is what I use to define the quadratic constraint).

Also, this?

But Gurobi's guidebook warn about cases like this, where the scale of the constraint is comparable to the tolerance parameters. Not sure if that means we have to rethink normalizing the criterion.

The magnitude of the constraint here is like 1e-1, no?

jkcshea commented 2 years ago

(This number is negative because it isn't actually the SSR. But it is the objective value from the problem of minimizing the criterion, and it is what I use to define the quadratic constraint).

The SSR can be written as

$(Y - X \hat\beta)' (Y - X \hat\beta) = Y'Y - 2Y' X \hat \beta + \hat\beta'X' X \hat\beta$

This is our objective function when minimizing the criterion. $Y'Y$ is just a constant, so I ignore it. We also decided to divide the criterion by the sample size $N$ to improve scaling. So the negative number I was showing above is $(- 2Y' X \hat\beta + \hat\beta'X' X \hat\beta) / N$. This number gets divided by $\Vert Y \Vert$ when rescale = TRUE. That's what we are working with in Gurobi. But when ivmte reports the minimized criterion on the console, it reports $SSR / N$.
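
A minimal numerical check of that decomposition on made-up data (nothing ivmte-specific, just the least-squares algebra):

# verify SSR = Y'Y - 2 Y'X b + b'X'X b at the least-squares solution b
set.seed(1)
X <- cbind(1, rnorm(100))
Y <- X %*% c(1, 2) + rnorm(100)
b <- solve(crossprod(X), crossprod(X, Y))   # normal equations
ssr <- sum((Y - X %*% b)^2)
decomp <- c(crossprod(Y) - 2 * crossprod(Y, X %*% b) + t(b) %*% crossprod(X) %*% b)
all.equal(ssr, decomp)  # TRUE up to floating-point error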

But Gurobi's guidebook warn about cases like this, where the scale of the constraint is comparable to the tolerance parameters. Not sure if that means we have to rethink normalizing the criterion.

Ah sorry, the magnitude of the constraint is indeed 1e-1.

My concern was that changing the RHS of the quadratic constraint by ~1e-5 had a big effect on the bounds. So I was concerned that the bounds could also be sensitive to violations within Gurobi's feasibility tolerance of 1e-6. In this particular case, it isn't that sensitive. But I wonder if that can be an issue when the RHS of the quadratic constraint is smaller.
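
One crude way to probe that sensitivity would be to sweep the slack through the scale of the tolerance (a sketch, reusing the args list from the examples above):

# how much do the bounds move as the quadratic-constraint slack varies?
for (tol in c(1e-8, 1e-6, 1e-4)) {
    args$criterion.tol <- tol
    print(do.call(ivmte, args))
}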

a-torgovitsky commented 2 years ago

Got it on the SSR, that makes sense.


One way to get a sense of how sensitive the bounds are to the quadratic constraint would be to look at the dual value for that constraint. Looks like this is how you get it in Gurobi, although not sure if you can query it from the R interface: https://www.gurobi.com/documentation/9.5/refman/qcpi.html

But I'm not following what you're suggesting generally here...?