jkcshea / ivmte

An R package for implementing the method in Mogstad, Santos, and Torgovitsky (2018, Econometrica).
GNU General Public License v3.0

Soft constraints #229

Open a-torgovitsky opened 1 year ago

a-torgovitsky commented 1 year ago

Hi @jkcshea , I have a new idea for solving all of our computational problems. Being ever-optimistic, I actually think this one is going to work.

Here's the idea attached in a note. It should be quite simple to code up. Can you try it when you get a chance?

soft-constraint.pdf

jkcshea commented 1 year ago

Of course! This looks very nice. The criterion.tol argument will become $\kappa$ in this setting then.

Should there be a default value for the tuning parameter when using the soft-constraint approach? E.g., criterion.tol = Inf by default whenever the soft-constraint is used. Or, we can play with test cases and decide later.

a-torgovitsky commented 1 year ago

I think we should probably play with test cases and decide later. Having criterion.tol = Inf may be undesirable since it essentially imposes the "hard-constraint" again with $\epsilon = 0$, and we know that's a source of computational issues.
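For later reference, the two soft-constraint programs being discussed appear to take roughly the following form, with $\kappa$ equal to criterion.tol, $\tau(\theta)$ the target parameter, $\widehat{Q}$ the criterion, and $\Theta$ the (hard) shape restrictions; the exact formulation is in the attached note:

$$\mathrm{lb} = \min_{\theta \in \Theta} \; \tau(\theta) + \kappa \widehat{Q}(\theta), \qquad \mathrm{ub} = \max_{\theta \in \Theta} \; \tau(\theta) - \kappa \widehat{Q}(\theta)$$

As $\kappa \to \infty$, any deviation from the minimized criterion is penalized away, recovering the hard constraint with $\epsilon = 0$.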

jkcshea commented 1 year ago

soft-constraint-simulations.pdf

Apologies for the delay. Above are the simulation results. The soft constraint was a brilliant idea: the estimates are super stable.

There are two issues mentioned in the document.

  1. Sometimes, Gurobi incorrectly decides that the matrix defining the quadratic loss is not PSD and throws an error. These are the only instances where we don't get optimal bounds. That is, conditional on getting bounds (in the simulation), they are always optimal.
  2. There is one test case where the bounds don't get tighter when I increase the penalty on the soft constraint. This seems strange to me... I'm checking whether the solutions to the optimization problem should be monotonic with respect to the tuning parameter. But maybe you already know the answer.
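A toy illustration of point 2 (a made-up quadratic criterion and box constraints solved in closed form; not ivmte code): with a convex criterion and a penalty parameter $\kappa$ playing the role of criterion.tol, the penalized bounds come out nested and monotonically shrinking in $\kappa$ in this example.

```python
import numpy as np

# Toy soft-constraint problem (made up, not ivmte): target = theta1 + theta2
# with theta in [0,1]^2, criterion Q(theta) = (theta1-0.5)^2 + (theta2-0.5)^2,
# whose minimum is 0 at (0.5, 0.5). The penalized problems are separable, so
# each coordinate solves in closed form and is clipped to the box.
def soft_bounds(kappa):
    theta_lb = np.clip(0.5 - 1.0 / (2.0 * kappa), 0.0, 1.0)  # argmin of target + kappa*Q
    theta_ub = np.clip(0.5 + 1.0 / (2.0 * kappa), 0.0, 1.0)  # argmax of target - kappa*Q
    return 2.0 * theta_lb, 2.0 * theta_ub

for kappa in (1.0, 2.0, 10.0, 100.0):
    lb, ub = soft_bounds(kappa)
    print(f"kappa = {kappa:>5}: [{lb:.3f}, {ub:.3f}], width = {ub - lb:.3f}")
```

The widths here shrink toward zero as $\kappa$ grows and each interval contains the next, which is the behavior one would expect from a convex criterion.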
a-torgovitsky commented 1 year ago

Excellent, a couple of questions/comments:

  1. LP refers to using the normal equations from the implied regression, whereas QP is the least squares criterion for that same regression. Is that correct?
  2. LP Case 3, where the bounds don't shrink: from your graph, it looks like it is just the average bounds that don't shrink. I guess I would still find that peculiar. But what about for a given draw of the data? Keeping the draw fixed, do the bounds shrink as you increase criterion.tol?
  3. I really like these graphs, but could you combine the QP/LP graphs for each case? Maybe put two bars for each value of criterion.tol? This would be visually helpful for me.
  4. Can we also see criterion.tol = 1000, 5000, and 10,000?
  5. When the QP fails due to non-PSD, does Gurobi still try to solve? Or does it just not return anything?
jkcshea commented 1 year ago

Here's the updated version:

soft-constraint-simulations.pdf

> 1. LP refers to using the normal equations from the implied regression, whereas QP is the least squares criterion for that same regression. Is that correct?

Yep.

> 2. LP Case 3, where the bounds don't shrink: from your graph, it looks like it is just the average bounds that don't shrink. I guess I would still find that peculiar. But what about for a given draw of the data? Keeping the draw fixed, do the bounds shrink as you increase criterion.tol?

The width of the bounds is usually shrinking, but not always. For example, here's a particular draw where the bounds are neither shrinking nor nested as we increase criterion.tol. (The lower bound decreases when changing criterion.tol from 5 to 10; the bounds get wider when changing criterion.tol from 10 to 50.)

## Test case 3, seed = 225437

    criterion.tol          lb        ub     width
 1:           0.1 -0.08805511 0.3028813 0.3909364
 2:           0.5 -0.08483842 0.3028813 0.3877197
 3:             1 -0.08483842 0.3028813 0.3877197
 4:             5 -0.02488354 0.3502489 0.3751324
 5:            10 -0.07627293 0.2946383 0.3709112
 6:            50 -0.22104037 0.1704597 0.3915000
 7:           100 -0.20173943 0.1704597 0.3721991

I'm looking through the code but haven't found an error yet. I'll keep looking.

> 3. I really like these graphs, but could you combine the QP/LP graphs for each case? Maybe put two bars for each value of criterion.tol? This would be visually helpful for me.

Done!

> 4. Can we also see criterion.tol = 1000, 5000, and 10,000?

Done!

> 5. When the QP fails due to non-PSD, does Gurobi still try to solve? Or does it just not return anything?

Gurobi only returns an error message:

 Error 10020: Q matrix is not positive semi-definite (PSD). Set NonConvex parameter to 2 to solve model.

But when I pass the option NonConvex = 2, Gurobi still spits out the same error message. So we get no solution at all from Gurobi.

a-torgovitsky commented 1 year ago

Ok how about this line of debugging:

It should be the case that

We are not seeing the second one, but are we seeing the first one? Unless I have something wrong, I think this provides a contradiction with both of them being optimizers...


For the non-PSD part... In the first step, when we minimize the criterion, we are using the same quadratic form, right? So Gurobi is ok with that one, but not ok with the one in the second step? How could that be? Assuming that criterion.tol = 1 (a case where we still have problems in Case 4), all that changes is the linear terms in the quadratic objective, right?

jkcshea commented 1 year ago

Thanks @a-torgovitsky, that helped me figure out what I did wrong.


  1. Shrinking bounds

There was a bug in the LP optimization: I incorrectly constructed the treatment effect. This has now been corrected. The bounds now become nested as we increase criterion.tol, and the criterion $Q$ also shrinks as criterion.tol increases. Here are the updated plots. soft-constraint-simulations.pdf

Question: The criterion for the max. and min. problems will differ under this soft constraint. Do we want to report them in the output?


  2. non-PSD error

This error pops up because the scaled quadratic matrix violates some kind of tolerance when Gurobi checks for PSD-ness. For instance, for a given draw of data, if I set criterion.tol = 0.0001, we get bounds (albeit very wide). But if I set criterion.tol = 100, then we get the PSD error.

So why doesn't this error pop up when minimizing the criterion? This is because we scaled the criterion by $1/N$, where $N$ is the sample size (see Issue #208 about this). But when estimating the bounds, I undo that scaling. That is, I first multiply the quadratic constraint by $N$, and then I multiply it again by criterion.tol. If I don't do this, we need criterion.tol to be very large (e.g., 20,000) before we see the bounds shrink. Let me know if this should be undone.

But the problem has to do with the scaling of the constraint matrices. I'm happy to try rescaling the variables again to improve the conditioning of the optimization problem. Since the soft-constraint approach is new, perhaps rescaling will work better here than it did before.
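A small Python illustration of the scaling issue being described (the tolerance is hypothetical, not Gurobi's actual check; this is not ivmte code): a matrix that is PSD up to rounding error passes a tolerance-based check, but multiplying it by a large factor such as $N \times$ criterion.tol amplifies the tiny negative eigenvalue past the tolerance.

```python
import numpy as np

# A Gram matrix B'B with B rank-deficient: PSD in exact arithmetic, with some
# eigenvalues exactly zero. Perturb the diagonal by -1e-10 to mimic the kind
# of rounding error that arises when the matrix is assembled.
rng = np.random.default_rng(0)
B = rng.standard_normal((30, 40))          # rank 30 < 40 columns
Q = B.T @ B - 1e-10 * np.eye(40)

def looks_psd(M, tol=1e-8):
    # A solver-style PSD check (tol is a hypothetical tolerance, not Gurobi's).
    return np.linalg.eigvalsh(M)[0] >= -tol

print(looks_psd(Q))        # tiny violation sits inside the tolerance
print(looks_psd(1e6 * Q))  # a large rescaling pushes it outside the tolerance
```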

a-torgovitsky commented 1 year ago

Great, thanks.

1.

What do you mean by this?

> Question: The criterion for the max. and min. problems will differ under this soft constraint. Do we want to report them in the output?

Is it just that the lower bound problem is target parameter + penalty and the upper bound problem is target parameter - penalty ?


2.

I guess I'm a bit concerned about the interpretation of the size of criterion.tol here. So let's clear that up before looking at the PSD errors.

Scaling the quadratic criterion by $1/N$ meant that it's $$\frac{1}{N}\sum_{i=1}^{N}(Y_{i} - B_{i}'\theta)^{2}$$ So that should mean that it's not diverging with $N$. That seems kind of important.

What are we doing for the LP case? Dividing the moments by $N$ or no?

Can you remind me what the sample sizes are for each of the cases? In particular, Case 4 looks a bit concerning with only 85% QP solves even for very small criterion.tol.

jkcshea commented 1 year ago

1.

> What do you mean by this? Is it just that the lower bound problem is target parameter + penalty and the upper bound problem is target parameter - penalty?

Yeah, the penalty terms will be different for the max and min problems. I believe this implies the implicit "feasible sets" for the max and min problem are also different. Perhaps that's not a big deal, since the two penalties converge as criterion.tol gets very large.

The reason I asked is that ivmte usually reports the minimum criterion. But the criteria under the max and min problems can now differ.


2.

> What are we doing for the LP case? Dividing the moments by $N$ or no?

No dividing by $N$ in the LP case.

> Can you remind me what the sample sizes are for each of the cases? In particular, Case 4 looks a bit concerning with only 85% QP solves even for very small criterion.tol.

$N = 10,000$ in all examples. In case 4, the quadratic matrix has a very wide range of coefficients (e.g., from -11 to 5). The other cases do not have such a wide range.

I should mention that other cases do fail Gurobi's PSD check. But Gurobi is able to make suitable adjustments, e.g., these Gurobi warnings are from Case 2, seed = 219705, criterion.tol = 10, where we still get optimal bounds:

Warning: diagonal adjustment of 1.4e-09 needed to make Q PSD
Warning: diagonal adjustment of 1.4e-09 performed to make Q PSD
Warning: diagonal adjustment of 1.4e-09 performed to make Q PSD

  3. Also, when doing the soft constraint, is it necessary to minimize the criterion? The minimum criterion simply enters the objective as a constant.
a-torgovitsky commented 1 year ago

I believe that 1) and 3) reflect the same issue.

You are right that as long as we don't set criterion.tol to be +Inf, then it is not necessary to minimize the criterion in the first step. I was still thinking that we were minimizing it in a first step...in which case the min.criterion to report is the same as it was before in the "hard constraint" approach, and the same for both upper and lower bound. However, I think it could be nice to preserve the +Inf possibility, for example to cover the case in which min.criterion = 0, and we want to set the criterion equal to zero exactly. (Note that this reduces to the previous "hard constraint" with criterion.tol = 0.) I suppose this can't happen for the least squares criterion (QP), but it could for the first order condition criterion (LP).


For 2), I think we should preserve the $N^{-1}$ scaling for both the LP/QP cases, since this ensures that the criterion does not diverge as $N$ grows. The asymptotics here are going to be based on sending criterion.tol off to infinity, so that no deviation from the minimum criterion is allowed asymptotically. Right now you are basically just setting the penalty to be $N \times \text{criterion.tol}$ silently, which might be confusing to the user. I guess this means we need to have different x-axes for the two results, but I think we are doing that implicitly anyway, since there's no reason the scales should be comparable across the LP/QP approaches.

The reason this is important is because we want to see whether this computational stability arises for "reasonable" values of criterion.tol---i.e. values after which the estimates have already stabilized. Perhaps we are just looking at extreme values that are beyond what anyone would care about. Case 3 suggests we are, while Case 4 suggests maybe not.


Fourth issue/question. Now that we are not doing constrained optimization, we could also change the normal equations (the LP approach) from an absolute value (1-norm) loss to a quadratic loss. That is, we look at the squared Euclidean distance $$\left\Vert \frac{1}{n}\sum_{i=1}^{n} (Y_{i}B_{i} - B_{i}B_{i}'\theta) \right\Vert^{2}$$ We weren't doing this before because the attraction of the normal equations was that we didn't need to solve a QCQP. But now that attraction is gone. Perhaps it's worth trying? The other norm we could try is the max-norm, which can also be reformulated as a linear program.
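The max-norm reformulation mentioned here is the standard epigraph trick. A small Python sketch (illustrative only, not ivmte code; A and b are random stand-ins for the moment matrices) of minimizing $\Vert A\theta - b\Vert_\infty$ as an LP:

```python
import numpy as np
from scipy.optimize import linprog

# Minimize ||A @ theta - b||_inf via the epigraph reformulation:
# min t  s.t.  -t <= (A @ theta - b)_i <= t  for every i.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))            # stand-in moment matrix
b = rng.standard_normal(8)

m, k = A.shape
c = np.r_[np.zeros(k), 1.0]                # objective: the epigraph variable t
A_ub = np.block([[A, -np.ones((m, 1))],    #  A theta - t <= b
                 [-A, -np.ones((m, 1))]])  # -A theta - t <= -b
b_ub = np.r_[b, -b]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * k + [(0, None)])

theta = res.x[:k]
# The LP optimum equals the directly evaluated max-norm of the residuals.
print(res.fun, np.abs(A @ theta - b).max())
```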

One advantage of the normal equation based criterion is that we know that min.criterion = 0 is the smallest possible value (all moments can be met). With the least squares based criterion, the smallest possible value is something like the variance of the residual, which is going to depend on the context, and not be 0 except for trivial cases.

jkcshea commented 1 year ago

Sorry for the slow reply.

  1. Sure, we can keep the criterion minimization problem, it is almost always solved immediately. I'll continue reporting the minimum criterion as with the hard constraint.

  2. Sure, I'll remove the scaling by $N$ I had implemented earlier.

  3. Yes, more than happy to incorporate the $\ell_2$ and $\ell_\infty$ norms. I'll update the plots to include the bounds we get from the new norms.

johnnybonney commented 1 year ago

Is there any update here---both on the method's performance and when we might be able to start using it? (@jkcshea)

jkcshea commented 1 year ago

Hi everyone,

I'm very sorry for falling behind. I've implemented the $\ell_2$ and $\ell_\infty$ norms and got a simulation started. I should have the results by tomorrow. I still need to clean up various parts of the code (e.g., update the bootstrapping) before merging it into the main branch.

@a-torgovitsky This soft constraint is written on a separate branch without all the rescaling/decomposition methods. None of those methods seemed to work. So should we scrap them?

a-torgovitsky commented 1 year ago

By not work you mean they didn't help the soft constraint (which is already pretty stable), right? If so, then yes I think we can scrap them and keep it simple.

jkcshea commented 1 year ago

Ah, by "not work" I meant the decomposition/rescaling methods didn't help with the hard constraints. I haven't tried combining the decomposition/rescaling methods with the soft constraints. Would you like me to?

a-torgovitsky commented 1 year ago

No, I don't think it's necessary to try them.

And I think we can just revert the hard constraint approach back to whatever was most stable. (But I don't think we want to fully get rid of it yet, so that @johnnybonney can compare across the two approaches.)

jkcshea commented 1 year ago

Okay sounds good. The decomposition and rescaling approaches will be on a separate branch.

jkcshea commented 1 year ago

Here are the updated simulations comparing the bounds for the four criteria we consider: $\ell_1$, $\ell_2$, $\ell_\infty$, and least squares.

soft-constraint-simulations.pdf

When deriving the bounds:

a-torgovitsky commented 1 year ago

Thanks @jkcshea , this is great.

@johnnybonney , can you start playing with this in the empirical applications? Some experimentation will likely still be needed, but together with these simulation results I think we now have what we need. From the theoretical perspective I don't think there's any obvious and particularly compelling reason to choose any one of these over the others. Actually, the least squares criterion might be less attractive just because its logical minimum is not 0 whereas the normal equation criteria should be 0 if the data can be matched exactly.

Regarding the width of the bounds, note that the scale of criterion.tol is not comparable across cases, since the criterion function has a different scale for each case. For $\ell_{1}$, it's the sum of the deviations. For $\ell_{\infty}$, it's only the largest deviation. For $\ell_{2}$, the units are changed by squaring. So comparing criterion.tol = 10 across these different norms is not easy to interpret.

johnnybonney commented 1 year ago

Yes, I am eager to start experimenting---thanks for all of this, @jkcshea. Will these be merged onto the main branch?

jkcshea commented 1 year ago

> Regarding the width of the bounds, note that the scale of criterion.tol is not comparable across cases since the criterion... So comparing criterion.tol = 10 across these different norms is not easy to interpret.

Ah yes, good point.

> Will these be merged onto the main branch?

Yes, I hope to do that soon---thanks for your patience! The test code is currently available on this branch: enhance/soft-constraint-updates I will clean it up this weekend. But something that I need to check/think about is whether the bootstrap procedure works with soft constraints. If that requires more thought, you will at least be able to estimate the bounds without inference.

johnnybonney commented 1 year ago

OK, no worries at all---if it is a pain to merge at this point and you would prefer to wait, I have no problem pulling the test branch and working with that version.

jkcshea commented 1 year ago

Almost there. Once #230 is fixed, the update is good to go.

jkcshea commented 1 year ago

(Sorry for an earlier accidental post)

Okay, the master branch is good to go!

When using the conditional moment/regression approach, ivmte now allows for four different criteria:

  1. $\ell_1:~ \widehat{Q}(\theta) = \left \Vert \frac{1}{n} \sum_i (Y_i B_i - B_i B_i' \theta) \right \Vert_1$, set direct = "l1"
  2. $\ell_2:~ \widehat{Q}(\theta) = \left \Vert \frac{1}{n} \sum_i (Y_i B_i - B_i B_i' \theta) \right \Vert_2$ , set direct = "l2"
  3. $\ell_\infty:~ \widehat{Q}(\theta) = \left \Vert \frac{1}{n} \sum_i (Y_i B_i - B_i B_i' \theta) \right \Vert_\infty$, set direct = "linf"
  4. Least squares: $\widehat{Q}(\theta) = \frac{1}{n}\sum_i (Y_i - \theta' B_i)^2$, set direct = "ls"
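For intuition, here is a small Python sketch (synthetic data, not ivmte internals) evaluating the four criteria at a given $\theta$; the exchange further down clarifies that the "l2" value is the squared norm.

```python
import numpy as np

# Synthetic data standing in for the outcome Y_i and basis B_i (not ivmte
# internals). Evaluate the four criteria at an arbitrary theta.
rng = np.random.default_rng(0)
n, k = 200, 3
B = rng.standard_normal((n, k))
theta = np.array([0.5, -1.0, 0.25])
Y = B @ theta + 0.1 * rng.standard_normal(n)

# Normal-equation moments: (1/n) * sum_i (Y_i B_i - B_i B_i' theta)
m = (B.T @ Y - B.T @ B @ theta) / n

q_l1   = np.abs(m).sum()                 # direct = "l1"
q_l2   = np.linalg.norm(m) ** 2          # direct = "l2" (the squared norm)
q_linf = np.abs(m).max()                 # direct = "linf"
q_ls   = np.mean((Y - B @ theta) ** 2)   # direct = "ls"

# Standard norm ordering: max-norm <= Euclidean norm <= sum of absolute values.
print(q_linf <= np.sqrt(q_l2) <= q_l1)  # -> True
```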

You can use hard or soft constraints for each criterion by setting soft = FALSE or soft = TRUE, respectively.

The tuning parameter for the constraints is still criterion.tol. So if you want to enforce your constraints more tightly when soft = FALSE, then choose a small value of criterion.tol. But if you want to enforce your constraints more tightly when soft = TRUE, then choose a large value of criterion.tol. Note from the simulations that you may have to choose very large values of criterion.tol when soft = TRUE in order to get tight bounds.

a-torgovitsky commented 1 year ago

Thanks @jkcshea. Minor correction/clarification: 2. is the squared norm, right?

jkcshea commented 1 year ago

Ah yes, that's right. Would you like me to rename the norm l2?

a-torgovitsky commented 1 year ago

No no, just wanted to clarify that this is indeed what we are doing. Thanks!

johnnybonney commented 1 year ago

Great, thanks @jkcshea! I've been experimenting and have a couple of questions/comments.

Comment 1: Audit grid and hard/soft equivalence

In theory, it should be the case that as $\kappa \rightarrow \infty$, the soft constraints approach is effectively the same as the hard constraints approach. In practice, then, I would expect setting criterion.tol very high (with soft = T) to deliver the same result as soft = F.

Reassuringly, this is true in many cases I have tested. However, in some flexible specifications, whether the two approaches are equivalent depends on the size of the audit grid---with larger audit grids leading to meaningful differences. Here is an example using these data: data.csv (I tried---and failed---to replicate the issue using ivmteSimData)

library(data.table)
library(ivmte)

dt <- fread("data.csv")

args <- list(
  data = dt,
  outcome = "y",
  target = "ate",
  propensity = d ~ factor(z)*factor(x),
  m1 = ~factor(x) * uSplines(degree = 3, knots = c(1/3, 2/3)),
  m0 = ~1, m0.ub = 0, m0.lb = 0,
  initgrid.nx = 6, audit.nx = 6,
  direct = "l1",
  solver = "gurobi"
)

# Start with an audit grid with 100 u-points, and very high kappa
est_hard <- do.call(ivmte, args = c(args, list(audit.nu = 100)))
est_soft <- do.call(ivmte, args = c(args, list(audit.nu = 100,
                                                soft = T,
                                                criterion.tol = 1e6)))

# Now repeat, but increasing audit grid to 1000 u-points
est_hard_v2 <- do.call(ivmte, args = c(args, list(audit.nu = 1000)))
est_soft_v2 <- do.call(ivmte, args = c(args, list(audit.nu = 1000,
                                                   soft = T,
                                                   criterion.tol = 1e6)))

print_fn <- function(audit.nu, soft, bounds) {
  if (soft) {
   message(sprintf("audit.nu = %s, soft = T, criterion.tol = 1e6: [%.3f, %.3f]",
                   audit.nu, bounds[1], bounds[2])) 
  } else {
    message(sprintf("audit.nu = %s, soft = F: [%.3f, %.3f]",
                    audit.nu, bounds[1], bounds[2])) 
  }
}

message("Cubic spline MTR:")
print_fn(audit.nu = 100, soft = F, est_hard$bounds)
print_fn(audit.nu = 100, soft = T, est_soft$bounds)
print_fn(audit.nu = 1000, soft = F, est_hard_v2$bounds)
print_fn(audit.nu = 1000, soft = T, est_soft_v2$bounds)
Cubic spline MTR:
audit.nu = 100, soft = F: [0.595, 0.679]
audit.nu = 100, soft = T, criterion.tol = 1e6: [0.596, 0.675]
audit.nu = 1000, soft = F: [0.595, 0.679]
audit.nu = 1000, soft = T, criterion.tol = 1e6: [0.594, 0.594]

As you can see, when audit.nu = 100, cranking up criterion.tol to something like 1e6 is sufficient for soft = T to closely approximate the soft = F bounds. But if I increase audit.nu to 1000, soft = T yields point identification (while soft = F bounds remain stable).

Is this a bug? Or can this be explained theoretically? (@a-torgovitsky)

Some notes:

The temporary workaround is just to use an audit grid for which setting criterion.tol very high yields equivalence to the soft = F bounds (so that I can rest easy when using lower values of criterion.tol), but this probably isn't the optimal solution.

Comment 2: MOSEK

(This is not urgent, since I prefer to use Gurobi anyway, but I imagine it is something that should be fixed.)

I can never get the soft constraints approach to work with MOSEK, even for relatively simple specifications. For example:

args <- list(
  data = ivmteSimData,
  outcome = "y",
  target = "ate",
  propensity = d ~ factor(z),
  m1 = ~u + I(u^2) + I(u^3) + I(u^4),
  m0 = ~u + I(u^2) + I(u^3) + I(u^4),
  direct = "l1",
  soft = T,
  criterion.tol = 1000,
  solver = "rmosek"
)

mosek_est <- do.call(ivmte, args)
Obtaining propensity scores...

Generating target moments...
    Integrating terms for control group...
    Integrating terms for treated group...

Performing direct MTR regression...
    MTR is not point identified.

Performing audit procedure...
    Solver: MOSEK ('Rmosek')
    Criterion constraint: Soft
    Generating initial constraint grid...

    Audit count: 1
    Minimum criterion: 1.320817e-10
    Obtaining bounds...
    Model was unbounded.

    [...]

    Restarting audit with new settings:
    initgrid.nx =  68 
    initgrid.nu =  25 
    Solver: MOSEK ('Rmosek')
    Criterion constraint: Soft
    Generating initial constraint grid...

    Audit count: 1
    Minimum criterion: 2.793992e-10
    Obtaining bounds...
Bounds on the target parameter: [2.793992e-10, 2]

Error in if (audit$status.codes[1] == 6) { : argument is of length zero
In addition: Warning message:
Arguments 'initgrid.x' and 'audit.x' are ignored because MTR specifications do not include any covariates.

I get this kind of output every time with soft = T and MOSEK---regardless of my choice of criterion.tol and initial/audit grid size. (A side note: the output is off---it makes it appear that bounds were obtained when they were not; there is the weird error; and there is a warning about initgrid.x and audit.x, which makes sense but could be confusing since I didn't set those parameters.)

a-torgovitsky commented 1 year ago

Hi @jkcshea can you confirm what the audit procedure does for the soft case? My guess would be it's just:

  1. Solve the soft-penalized problem with initial grid.
  2. Check optimal solution on the audit grid.
  3. Add points at which audit fails to the initial grid.
  4. Solve the soft-penalized problem with new grid.
  5. Repeat until passing the audit.

Is that right? I just want to confirm before I start putting too much thought into this... Or is there perhaps an initial step of minimizing the criterion that is also being repeated?

jkcshea commented 1 year ago

@johnnybonney As always, thanks for rigorously testing the package. I will play with the package this weekend and see if I can find a bug. I will also try and resolve #228

@a-torgovitsky Yep, your guess is correct. That is the audit procedure for the soft case. All shape restrictions should still hold, since those are imposed using hard constraints. Only the criterion is subject to the soft constraint.

a-torgovitsky commented 1 year ago

Ok, and just to clarify, the minimum criterion is only computed once? Is that at the beginning? Or at the end, with the final audit grid?

jkcshea commented 1 year ago

Ah, sorry for the confusion. This step consists of two things.

  1. Solve the soft-penalized problem with initial grid.

The first is minimizing the criterion. The second is estimating the bounds. This means the minimum criterion is calculated at the beginning of each iteration of the audit, just like with the hard constraint.

Although the minimum criterion doesn't affect the bounds when using the soft constraints, I thought we wanted to keep track of it. Maybe I misunderstood the first part of this post.

a-torgovitsky commented 1 year ago

No that's fine, I think it's just wasted computation right? Since the minimum criterion doesn't affect any of the intermediate steps of the audit, we could just compute it once at the end, couldn't we?

jkcshea commented 1 year ago

Yes, we certainly can. I'll update the package to do that for the soft constraint then.

a-torgovitsky commented 1 year ago

Ok thanks @jkcshea

@johnnybonney , regarding your first point, I'm not sure... One way to help debug: suppose you take the solution to this (at either lb or ub): audit.nu = 100, soft = T, criterion.tol = 1e6: [0.596, 0.675]. Would it pass the audit when audit.nu = 1000? (@jkcshea just to double check, the audit grid with audit.nu = 1000 is going to contain the one with audit.nu = 100, right? We set the grids so that they are nested? Or did we not do that?) If so, then we have a bug. If not, then I'll think some more.

johnnybonney commented 1 year ago

@a-torgovitsky - no, the solution with audit.nu = 100, soft = T would not pass the audit when audit.nu = 1000. There are minor violations (on the order of 1e-4). If I recall correctly, the grids for $u$ are generated following a Halton sequence (rhalton()), so they are all nested.
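The nesting property follows from Halton sequences being extensible: the first points of a longer sequence coincide with the shorter sequence. A quick Python check using the base-2 van der Corput sequence (the 1-D case; the multi-dimensional Halton sequence uses a different prime base per coordinate):

```python
# The base-2 van der Corput sequence: reflect the base-2 digits of i about
# the radix point to get the i-th point in [0, 1).
def van_der_corput(n, base=2):
    seq = []
    for i in range(1, n + 1):
        x, f, kk = 0.0, 1.0 / base, i
        while kk:
            kk, r = divmod(kk, base)
            x += r * f
            f /= base
        seq.append(x)
    return seq

# Extensibility: the first 100 points of a 1000-point sequence are exactly the
# 100-point sequence, so the smaller audit grid is nested in the larger one.
print(van_der_corput(1000)[:100] == van_der_corput(100))  # -> True
```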

Something else I noticed when checking that--- If I run the following

args$m1 <- ~0 + factor(x) + factor(x):uSplines(degree = 3, knots = c(1/3, 2/3))
args$solver.options <- list(presolve = 0) # otherwise infeasibility error
# NB: turning off presolve does change the original hard bounds (they become
# wider). It doesn't change the original soft bounds or the substance of the issue.

and then re-estimate my original examples, I get

Cubic spline MTR:
audit.nu = 100, soft = F: [0.552, 0.626]
audit.nu = 100, soft = T, criterion.tol = 1e6: [0.596, 0.596]
audit.nu = 1000, soft = F: [0.552, 0.617]
audit.nu = 1000, soft = T, criterion.tol = 1e6: [0.561, 0.629]

That is, when I replace factor(x) * uSplines(degree = 3, knots = c(1/3, 2/3)) with its manually factored equivalent, the behavior reverses---increasing audit.nu brings the soft = T bounds away from point identification and closer to the soft = F bounds. I wonder if this is related in some sense to #214 (or #224).

a-torgovitsky commented 1 year ago

Hi @johnnybonney , thanks that's great debugging. This sounds like it's clearly a bug with uSplines then, doesn't it? @jkcshea could you look into this? It seems like there's still something wrong with uSplines, since replacing it with the same manually-coded specification produces something different.

@johnnybonney as an aside, why are you imposing presolve = 0? I know you say it doesn't change the example, but just wondering if you are finding this is necessary in some cases.

johnnybonney commented 1 year ago

> as an aside, why are you imposing presolve = 0?

It is necessary in some cases when using the "hard constraints" approach, otherwise I get the following error:

Error:  The minimization problem was proven to be infeasible or unbounded. The maximization problem was proven to be infeasible or unbounded. Since a minimum criterion was found, the model is most likely feasible but unbounded. This can happen if the initial grid is too small. Try increasing the parameters 'initgrid.nx' and 'initgrid.nu'. If the model is indeed infeasible, consider exporting the model and passing it to the solver for more details.

In the past, we've concluded that this is due to numerical issues related to Gurobi's feasibility tolerance, and one way to try and address it is by turning off presolve (a solution suggested by Gurobi when feasibility issues arise).

The great news here is that I haven't (yet) been seeing these kind of issues with the soft approach, even for high values of criterion.tol.

a-torgovitsky commented 1 year ago

Ah good, yes I was thinking you were needing it for the soft approach, so was worried about that. Ok!

jkcshea commented 1 year ago

> @jkcshea could you look into this? It seems like there's still something wrong with uSplines, since replacing it with the same manually-coded specification produces something different.

Sorry for the radio silence. Yep, I'll look into this, too.

jkcshea commented 1 year ago

Hi everyone,

To catch up on a couple things:

  1. When soft = TRUE, the criterion is now only minimized after the audit ends.
  2. The initial grid for the u variable is indeed a subset of the grid used in the audit.
  3. The audit grids are nested as we increase audit.nu, e.g., the u-grid when audit.nu = 100 is nested in the u-grid when audit.nu = 1000 (as @johnnybonney mentioned, this is because we generate the u-grid using Halton sequences)

I've been reviewing the code for uSplines and haven't found anything broken yet. The changes in the bounds @johnnybonney showed in this post seem to be because the two MTR specifications are not quite the same. In order for the two MTR specifications to match, the intercept argument in uSplines must be set to TRUE. But by default, it is set to FALSE (which was decided in #214, here).

Nevertheless, the bounds aren't identical even when the two MTR specifications are equivalent. Here are the bounds using the original MTR specification (with intercept = TRUE).

audit.nu = 100, soft = F: [0.529, 0.678]
audit.nu = 100, soft = T, criterion.tol = 1e+06: [0.594, 0.594]
audit.nu = 1000, soft = F: [0.539, 0.645]
audit.nu = 1000, soft = T, criterion.tol = 1e+06: [0.595, 0.611]

(I still don't have an answer for why the bounds widen when soft = TRUE and audit.nu increases from 100 to 1000. I'll continue looking into this.) And here are the bounds using the second specification @johnnybonney tried (again, with intercept = TRUE).

audit.nu = 100, soft = F: [0.552, 0.627]
audit.nu = 100, soft = T, criterion.tol = 1e+06: [0.595, 0.595]
audit.nu = 1000, soft = F: [0.552, 0.637]
audit.nu = 1000, soft = T, criterion.tol = 1e+06: [0.594, 0.594]

These bounds are similar to the ones shown above, but not the same. I haven't figured out why this is yet, and will continue investigating/checking for errors.

johnnybonney commented 1 year ago

Thanks for investigating @jkcshea. One clarifying question---you say that these two MTR specifications are not the same when intercept = F:

~factor(x) * uSplines(degree = 3, knots = c(1/3, 2/3))
~0 + factor(x) + factor(x):uSplines(degree = 3, knots = c(1/3, 2/3))

Why are these not the same? (There may be something about splines that I don't understand here.)

My thought process was, if I have a continuous variable v, it will be the case that ~factor(x) * v and ~0 + factor(x) + factor(x):v are equivalent (i.e., if we use them to predict $Y$, they will both produce equivalent values of $\hat{Y}$). The same holds for a cubic polynomial in v---but not for a cubic spline?

jkcshea commented 1 year ago

Ah, I am wrong! I had forgotten that the * operator includes both the interactions and the individual terms: a*b expands to a + b + a:b. Sorry for the confusion.

To correct my mistake: It doesn't matter whether intercept is set to TRUE or FALSE in the uSplines function. Here's why the two MTR specifications weren't actually matching.

Once I included that additional uSplines term, the bounds are almost identical across the two specifications. Here are the bounds using the original MTR specification (intercept = FALSE, as in the original example, but presolve = 0).

audit.nu = 100, soft = F: [0.584, 0.707]
audit.nu = 100, soft = T, criterion.tol = 1e+06: [0.596, 0.675]
audit.nu = 1000, soft = F: [0.584, 0.708]
audit.nu = 1000, soft = T, criterion.tol = 1e+06: [0.594, 0.594]

And here are the bounds using the alternative MTR specification.

audit.nu = 100, soft = F: [0.589, 0.705]
audit.nu = 100, soft = T, criterion.tol = 1e+06: [0.596, 0.675]
audit.nu = 1000, soft = F: [0.588, 0.705]
audit.nu = 1000, soft = T, criterion.tol = 1e+06: [0.594, 0.594]

I'm still looking into why increasing audit.nu doesn't give us point identification when soft = FALSE. I thought it had to do with how the default criterion.tol is 1e-4 when soft = FALSE, and the minimized criterion isn't 0. But here's what I get when I set criterion.tol = 0 when soft = FALSE.

audit.nu = 100, soft = F: [0.596, 0.677]
audit.nu = 1000, soft = F: [0.584, 0.707]

Two strange things.

  1. After reducing criterion.tol to 0, the bounds under audit.nu = 100 are tighter than before, which is expected. But they are also tighter than the bounds under audit.nu = 1000, which is odd.
  2. The bounds under audit.nu = 1000 became wider after reducing criterion.tol to 0. That is also odd.

Gurobi reports that both sets of bounds are estimated optimally. I'll keep looking to figure out what's happening/if I did something wrong.

johnnybonney commented 1 year ago

Great, that makes more sense. There is still one thing that I'm unsure about.

You say that factor(x) + factor(x):uSplines(...) is missing a non-interacted uSplines(...). This is true, but it also includes a uSplines(...) term interacted with x = 1, while this interaction is excluded by factor(x)*uSplines(...). So both approaches have the same number of parameters (36, by looking at gstar.coef) and, as far as I can tell, should still be equivalent.

To be clear, this is exactly how I think ivmte should be treating the specifications (and it mirrors behavior from e.g. lm). But it remains unclear to me how the specifications differ.

jkcshea commented 1 year ago

@johnnybonney you're right... Your two specifications should indeed be the same. I'm still playing with this and haven't figured out why the two sets of bounds differ. Sorry for my confusion.

a-torgovitsky commented 1 year ago

Sorry, I've dropped the ball on this, a bit swamped with teaching this quarter. Is there something you are waiting for my input on?

jkcshea commented 1 year ago

No no, it's me being slow. @johnnybonney posted an example above where two equivalent MTRs were generating different bounds. So I've been looking into why this is happening. We suspected something was wrong with the uSplines function. But I haven't been able to find a bug yet.

Here's a summary of what I have learned.

  1. Under the least-squares criterion, equivalent MTRs should generate the same minimized criteria and bounds. ivmte seems to be doing this.
  2. Under the $\ell_1$ criterion, equivalent MTRs should generate the same bounds as long as the minimized criteria are 0. ivmte seems to be doing this.
  3. But under the $\ell_1$ criterion, equivalent MTRs don't have to generate the same minimized criteria. If the minimized criteria are different, then the constraints when estimating the bounds are different. If the constraints are different, then the bounds can be different.
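A stylized one-parameter illustration of point 3 (toy moment residuals, not ivmte's actual moments; a simple rescaling stands in for the different conditional moments that equivalent MTRs can generate):

```python
import numpy as np

# One scalar parameter and two moment residuals that cannot both be zero
theta = np.linspace(-1.0, 3.0, 4001)

# Two representations of the same model: the second merely rescales the residuals
crit1 = np.abs(theta) + np.abs(theta - 2)          # minimized value 2, on all of [0, 2]
crit2 = 2 * np.abs(theta) + 3 * np.abs(theta - 2)  # minimized value 4, only at theta = 2

# Hard-constraint feasible sets with criterion.tol = 0
feas1 = theta[crit1 <= crit1.min() + 1e-9]
feas2 = theta[crit2 <= crit2.min() + 1e-9]

print(feas1.min(), feas1.max())  # ≈ 0.0, 2.0
print(feas2.min(), feas2.max())  # ≈ 2.0, 2.0
```

With the first representation a whole interval is feasible, while the second pins down a point, so the hard-constraint bounds differ even though the underlying model is the same.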

I believe @johnnybonney's example above is an instance of point 3. The bounds for the equivalent MTRs don't match when using the $\ell_1$ criterion (direct = "l1"). But the bounds match when using the least-squares criterion (direct = "ls").

Here is a more detailed note on this. equivalent-mtr.pdf


A note on uSplines:

The most complicated feature of uSplines is that it can interact with other variables. But I let R handle all of that for us. For example, suppose the user declares m0 = ~ factor(x):uSplines(...). To handle this, I replace uSplines with a temporary variable v so it's like the user declared m0 = ~ factor(x):v. I then let R generate the design matrix, which takes care of the interactions and collinearities. I then replace v in all the interactions with the actual spline.
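As a rough sketch of that substitution (a hypothetical regex-based illustration, not the package's actual code):

```python
import re

formula = "~ factor(x):uSplines(degree = 3, knots = c(1/3, 2/3))"

# Step 1: swap the uSplines(...) call (which may itself contain nested
# parentheses, e.g. c(1/3, 2/3)) for a temporary placeholder variable v
placeholder = re.sub(r"uSplines\((?:[^()]|\([^()]*\))*\)", "v", formula)
print(placeholder)  # ~ factor(x):v

# Step 2 (in the package): hand the placeholder formula to R's model.matrix,
# which expands the interactions and drops collinear columns.
# Step 3 (in the package): replace each column involving v with the
# corresponding spline basis columns.
```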

jkcshea commented 1 year ago

@johnnybonney I believe the primary issues from your original post are resolved. That is, I'm no longer getting meaningful differences in the bounds when using soft or hard constraints. Also, the results no longer change with the size of audit.nu, as they did before.

audit.nu = 100, soft = F: [0.595, 0.679]
audit.nu = 1000, soft = F: [0.595, 0.679]

audit.nu = 100, soft = T, criterion.tol = 1e6: [0.596, 0.675]
audit.nu = 1000, soft = T, criterion.tol = 1e6: [0.596, 0.675]
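For intuition on why a large criterion.tol pins the soft-constraint bounds down so sharply, here is a stylized one-parameter sketch (a toy criterion and grid search, not ivmte's actual program):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 2001)  # parameter grid, playing the role of the MTR coefficient
crit = (theta - 0.5) ** 2            # toy criterion, minimized at 0.5

def soft_bounds(kappa):
    """Soft-constraint bounds: optimize the target plus a penalized criterion."""
    lb = theta[np.argmin(theta + kappa * crit)]    # penalized minimization
    ub = theta[np.argmin(-theta + kappa * crit)]   # penalized maximization
    return lb, ub

lb1, ub1 = soft_bounds(1)
lb100, ub100 = soft_bounds(100)
print(lb1, ub1)      # ≈ 0.0, 1.0
print(lb100, ub100)  # ≈ 0.495, 0.505
```

As the penalty grows, both penalized optima collapse toward the criterion minimizer, which is consistent with the soft-constraint bounds stabilizing once criterion.tol is large.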

The bug wasn't with uSplines, but with the audit procedure. Here's what I did wrong. When using hard constraints, I need to add and remove the criterion constraint in each iteration of the audit. That is, in each iteration of the audit when soft = FALSE, ivmte does the following:

  1. Estimate minimum criterion
  2. Estimate bounds subject to the criterion constraint
  3. If the audit fails, remove the criterion constraint
  4. Repeat steps 1--3 until the audit passes

But when soft = TRUE, we don't estimate the criterion until we pass the audit, so step 3 is unnecessary. My mistake was that I left step 3 in, so I was incorrectly deleting a constraint. This has now been resolved.
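The corrected control flow can be sketched as follows (toy Python with a hypothetical model stub, not the package's R code):

```python
def audit(model, soft, max_iter=10):
    """Toy version of the audit loop described above."""
    for _ in range(max_iter):
        if not soft:
            q_min = model.minimize_criterion()
            model.add_criterion_constraint(q_min)   # hard-constraint case only
        bounds = model.estimate_bounds()
        if model.audit_passes(bounds):
            return bounds                           # done: constraint stays in place
        if not soft:
            model.remove_criterion_constraint()     # step 3: only when soft = FALSE
        model.expand_grid()                         # add violated points and retry
    raise RuntimeError("audit did not converge")

class FakeModel:
    """Minimal stub that fails the first audit and passes the second."""
    def __init__(self):
        self.constraints, self.checks = [], 0
    def minimize_criterion(self): return 0.0
    def add_criterion_constraint(self, q): self.constraints.append(q)
    def remove_criterion_constraint(self): self.constraints.pop()
    def estimate_bounds(self): return (0.595, 0.679)
    def audit_passes(self, bounds):
        self.checks += 1
        return self.checks >= 2
    def expand_grid(self): pass

hard, soft_m = FakeModel(), FakeModel()
audit(hard, soft=False)
audit(soft_m, soft=True)
print(len(hard.constraints), len(soft_m.constraints))  # 1 0
```

The buggy version ran the removal step even when soft = TRUE, popping a constraint that had never been added for the criterion.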

a-torgovitsky commented 1 year ago

Hi @jkcshea thanks for your work on this.

If I understand correctly, the remaining issue here is what you describe (very clearly) in the note you posted: under the hard constraint, and using the l1 norm, "equivalent" MTR specifications can lead to different bounds. Is that correct?

Based on what you wrote, it seems like the soft constraint approach will not suffer from this problem. Do you think that's true? If so, that's one more point in favor of the soft constraint I suppose.

jkcshea commented 1 year ago

under the hard constraint, and using the l1 norm, "equivalent" MTR specifications can lead to different bounds. Is that correct?

Yep

Based on what you wrote, it seems like the soft constraint approach will not suffer from this problem.

Ah sorry for the confusion. Sadly, we can still get different bounds under “equivalent” MTR specifications when using soft constraints… @johnnybonney’s example is one such case. If my math was correct, the reason is that equivalent MTRs can generate different conditional moments to match on. If we can match all the conditional moments, then we will get the same bounds for the equivalent MTRs. But if we don’t match all the moments, we can get different bounds. Under the hard constraints, this happens because the feasible sets in the optimization problems for the equivalent MTRs can be different (when we don’t match all moments). Under the soft constraint, this happens because the objective functions in the optimization problems for the equivalent MTRs can be different (when we don’t match all moments).
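Here is a stylized one-parameter sketch of the soft-constraint side of this (toy moment residuals; a simple rescaling stands in for the different conditional moments, and is not ivmte's actual construction):

```python
import numpy as np

theta = np.linspace(-1.0, 3.0, 4001)
kappa = 10.0

# Two representations of the same model: the second merely rescales the residuals
crit1 = np.abs(theta) + np.abs(theta - 2)
crit2 = 2 * np.abs(theta) + 3 * np.abs(theta - 2)

# Soft-constraint lower bounds: minimize the target plus the penalized criterion
lb1 = theta[np.argmin(theta + kappa * crit1)]
lb2 = theta[np.argmin(theta + kappa * crit2)]
print(lb1, lb2)  # ≈ 0.0 and 2.0: same model, different penalized optima
```

The rescaling changes the objective function rather than the feasible set, but the optima still differ, matching the explanation above.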

Let me know if that doesn’t make sense. Perhaps I made another mistake somewhere.

