Closed cblandhol closed 2 years ago
@jkcshea is busy with time-critical job market stuff at the moment. But I have a lead on problems like these that I found in another project using a similar approach. So when he gets some breathing room maybe he could implement it and we could see if that helps. (Or you could try---it shouldn't be too hard, but the code is rather involved at this point so it might be hard to find the right spot.)
The approach I found worked well in another context was very simple. The direct approach has a second-step QCQP problem, and the quadratic constraint there is the source of the problem. It is set up like $\|Ax - b\|^2 \leq \text{constant}$, where $A$ is the design matrix for the regression. Expanded out, it looks like $x'A'Ax - 2x'A'b + b'b \leq \text{constant}$. The problem is that $A'A$ is dense, which is tough for the QCQP algorithms. The simple and surprising solution is to include some additional variables $y$, constrained as $y = Ax$. Then change the quadratic constraint to $\|y - b\|^2 \leq \text{constant}$, which expands to $y'Iy - 2y'b + b'b \leq \text{constant}$, where now instead of $A'A$ we have $I$, which is of course sparse. I was shocked to see that this works... but apparently Gurobi is not as smart as I thought, at least not for QCQPs.
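As a numerical sanity check of the identity behind this reformulation, here is a small Python sketch (toy matrices with made-up values; the package itself is written in R) verifying that with y = Ax the two forms of the quadratic constraint agree:

```python
# Check that substituting y = Ax leaves the quadratic constraint unchanged:
# ||Ax - b||^2 (dense quadratic matrix A'A) equals ||y - b||^2 (identity matrix).
# A, b, x below are small illustrative stand-ins, not the actual design matrix.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def sq_norm(v):
    return sum(v_i * v_i for v_i in v)

A = [[1.0, 2.0],
     [0.0, 1.0],
     [3.0, 1.0]]
b = [1.0, 2.0, 3.0]
x = [0.5, -1.0]

# Original constraint form: ||Ax - b||^2, whose quadratic matrix is A'A (dense).
lhs_original = sq_norm([axi - bi for axi, bi in zip(matvec(A, x), b)])

# Reformulated: introduce slack variables y = Ax, then the constraint is
# ||y - b||^2, whose quadratic matrix is the identity (sparse).
y = matvec(A, x)
lhs_reformulated = sq_norm([yi - bi for yi, bi in zip(y, b)])

assert abs(lhs_original - lhs_reformulated) < 1e-12
print(lhs_original)
```

The cost, as noted below in the thread, is that y has one component per observation.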
Alternatively, you could try changing `criterion.tol` to something strictly positive. When it's zero, the second-step program is not convex (since it's like x^2 = c vs. x^2 <= c).
Hi everyone,
Sorry for taking so long to get to this, thank you for being patient with me.
I am currently carrying out some checks for the procedure described by @a-torgovitsky above. I'm still running into "numerical issues." Unfortunately, that is about all the information returned by Gurobi's status codes.
But performance really takes a hit under this alternate approach, because the number of new variables equals the number of observations, and the AE data set has 200,000+ observations.
@cblandhol and @johnnybonney: After cleaning up the code and completing some checks, I will push a separate branch with the new procedure so you can put it to the test. Feel free to also post any new examples I should look at.
@a-torgovitsky Let me know if you want to allow the user to toggle between this approach and the original approach.
Ahh yes, I didn't consider that. (The other setting I was using this trick in didn't have the constraints scaling with n.)
Are you saying you get the "numerical issues" warning with the reformulated version?
How much of a hit does performance take? And how does the solve time (and accuracy) change with the sample size?
Okay, the code is now pushed to the branch `bugfix/issue209`.

`git fetch origin bugfix/issue209`
> Are you saying you get the "numerical issues" warning with the reformulated version?

With a large enough subset of the AE data---yes. If I use a smaller subset (e.g., 10,000 observations), then I won't run into numerical issues. But the solutions may not be optimal. Nevertheless, this reformulation does seem to work.
> How much of a hit does performance take? And how does the solve time (and accuracy) change with the sample size?
Performance with 10,000 observations (with AE data):
Performance with 20,000 observations (with AE data):
Performance with 50,000 observations (with AE data):
Performance with 100,000 observations (with AE data):
Regarding accuracy:
Ok, thanks. That's pretty rough!
When you say suboptimal, you mean the same message that @cblandhol was reporting?
Is that on the first step problem, the second step problem, or both?
Have you tried adjusting `criterion.tol`?
Also, do you have a sense of how sparse (or not sparse) the matrix A'A is here?
> When you say suboptimal, you mean the same message that @cblandhol was reporting?
Ah yes sorry, there are indeed two messages. So here is the suboptimal warning message @cblandhol showed. This happened when minimizing the criterion.
```
Warning: The solver was unable to satisfy the optimality
tolerance when minimizing the criterion, so a suboptimal
solution is returned. Tolerance parameters for the solver
can be passed through the argument 'solver.options'.
```
But since a criterion was estimated, the function keeps running and tries to find the bounds. It is here that we run into the error message about numerical issues.
```
Obtaining bounds...
Model resulted in numerical issues.
```
We can also get suboptimal bounds. In all the cases mentioned above using subsets of the AE data, I was referring to the bounds when discussing suboptimality. So here are the optimality statuses for the criterion:
> Have you tried adjusting criterion.tol?
Adjusting `criterion.tol` really improves things.
But it still isn't enough to estimate the model using the full AE data.
10,000 obs. from AE data, `criterion.tol = 0.01`:
20,000 obs. from AE data, `criterion.tol = 0.01`:
50,000 obs. from AE data, `criterion.tol = 0.01`:
100,000 obs. from AE data, `criterion.tol = 0.01`:
Full AE data, `criterion.tol = 0.01`:
Even when `criterion.tol = 1`, I still run into numerical issues.
> Also, do you have a sense of how sparse (or not sparse) the matrix A'A is here?
Pretty sparse, 95% of the entries are 0.
The MTR specification is made up of splines of degree 0 interacted with indicator variables, e.g., `(x == 2):uSplines(degree = 0, knots = c(...))`.
So that's where the sparsity comes from.
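To illustrate where that sparsity comes from, here is a toy Python sketch (a hypothetical design, not the AE data or the package's code): columns built from an indicator interacted with a degree-0 spline basis are supported on disjoint sets of rows, so cross products between columns from different indicator groups are exactly zero in A'A.

```python
# Toy design: columns are (x == g) * 1{u in bin}, for groups g and u-bins.
# Columns from different groups never overlap row-wise, so A'A is sparse.

x_vals = [1, 1, 2, 2, 2, 3]               # toy covariate values
u_vals = [0.1, 0.6, 0.2, 0.7, 0.9, 0.4]   # toy unobservable draws

def basis(x, u):
    """One column per (group, u-bin) pair: indicator times degree-0 spline."""
    cols = []
    for g in (1, 2, 3):
        for lo, hi in ((0.0, 0.5), (0.5, 1.0)):
            cols.append(1.0 if (x == g and lo <= u < hi) else 0.0)
    return cols

A = [basis(x, u) for x, u in zip(x_vals, u_vals)]
k = len(A[0])
AtA = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]

# Entries linking group-1 columns to any other group's columns are zero.
off_block = [AtA[i][j] for i in range(2) for j in range(2, k)]
assert all(v == 0.0 for v in off_block)

zeros = sum(1 for i in range(k) for j in range(k) if AtA[i][j] == 0.0)
print(f"{zeros}/{k * k} entries of A'A are zero")
```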
Thanks that's very helpful. The sparsity to me suggests that we definitely want to keep the original formulation in. It probably has an advantage in this case.
What I find interesting is that the full AE data (300-400k observations if I recall?) is so intractable even while the 100k sample seems to be just fine. Can you try different 100k subsamples? It would be useful to know whether this is due to the size per se, or due to some particular feature of that 100k subsample you are trying.
You're right, it may just be certain samples.
I got optimal criteria/bounds up until 185,000. Afterwards, the criteria are no longer optimal.
For samples > 185,000, there are times when we can estimate bounds (e.g., at 201,000 I estimated bounds), and times when we can't (e.g., 185,500 and 202,000 both ran into numerical errors).
But again, the criteria here are suboptimal---I suppose that is like having a much larger `criterion.tol` value.
Is the scale of the criterion invariant to the sample size? (i.e. are you dividing by "n")?
Yep, the criterion is being divided by `n` and its scale is invariant to sample size.
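A quick way to see the invariance, sketched in Python with toy data: duplicating every observation doubles the sum of squared residuals, but leaves the n-normalized criterion unchanged.

```python
# Dividing the least-squares criterion by n makes its scale invariant to
# sample size: doubling the data doubles ||Ax - b||^2 and doubles n.
# A, b, x are small made-up stand-ins for illustration only.

def criterion(rows, b, x, divide_by_n=True):
    ssr = 0.0
    for row, bi in zip(rows, b):
        resid = sum(a * xi for a, xi in zip(row, x)) - bi
        ssr += resid * resid
    return ssr / len(rows) if divide_by_n else ssr

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [0.0, 1.0, 3.0]
x = [0.1, 1.2]

small = criterion(A, b, x)
large = criterion(A * 2, b * 2, x)   # every observation duplicated
assert abs(small - large) < 1e-12
```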
Is it always the case that suboptimal criteria => numerical issues for bounds? And is it always the case that optimal criteria => optimal bounds?
> Is it always the case that suboptimal criteria => numerical issues for bounds

Not always.
If I use a sample of 190,000 and `criterion.tol = 0`, then I get a suboptimal criterion and suboptimal bounds.
So no numerical issues for bounds.
There are also cases where we get a suboptimal criterion but optimal bounds.
A sample of 200,000 with `criterion.tol = 0.01` is an example of this.
> and, is it always the case that optimal criteria => optimal bounds

Also not always.
If I use a sample of 150,000 and `criterion.tol = 0`, then I get an optimal criterion but a suboptimal upper bound.
Ok, let me reverse the question:
Is it always the case that (numerical issues) or (suboptimal) for bounds => suboptimal criteria?
Also, do you have a case where `criterion.tol > 0` but you also have:
- optimal criterion, and
- (numerical issues) or (suboptimal) for bounds?
> Is it always the case that (numerical issues) or (suboptimal) for bounds => suboptimal criteria

Not always: `criterion.tol = 0` and `n = 175000` returns suboptimal bounds but an optimal criterion.
(The extra details are there in case I need to revisit these examples. The seed for subsampling the data in these examples, and all examples above, is `10L`.)
> Also, do you have a case where criterion.tol > 0 but you also have: optimal criterion and (numerical issues) or (suboptimal) for bounds

Yes: `criterion.tol = 0.01` and `n = 201500`.
Optimal criterion, but numerical issues for bounds.
Ok, here's a slightly more involved exercise that I think would be informative.
Restrict ourselves to cases where we get an optimal criterion.
Then keep increasing `criterion.tol` by some amount (say `.01`) until both the LB and UB solve cleanly.
Do this for a certain number of subsamples of size n, for different n.
(I don't know how computationally intensive this is, but maybe we just start with 20 subsamples for each n?)
Plot out the distribution of the `criterion.tol` needed to get clean solves for each value of n.
It may be that it makes sense to do something like this in the module, where we increase `criterion.tol` iteratively until solving cleanly.
This exercise will be informative about whether that's a sensible solution.
After we've figured this out, we can then go back to look at cases where optimality is not met in the criterion problem to begin with.
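The iterative tolerance-increase loop described above can be sketched as follows. This is a Python illustration with a stand-in `solve_bounds` function (hypothetical; in the package the optimality check would come from the solver's status, not a fixed threshold):

```python
# Sketch of the proposed procedure: start at criterion.tol = 0 and raise it
# in steps until both bounds solve cleanly, recording the tolerance needed.
# `solve_bounds` is a hypothetical stand-in that mimics a solver which only
# reports optimal bounds once the tolerance exceeds a sample-specific level.

def solve_bounds(criterion_tol, threshold=0.025):
    """Stand-in for one bound estimation; True means both bounds optimal."""
    return criterion_tol >= threshold

def smallest_clean_tol(step=0.01, max_tol=1.0):
    tol = 0.0
    while tol <= max_tol:
        if solve_bounds(tol):   # both LB and UB solve cleanly
            return tol          # record the final criterion.tol
        tol += step             # otherwise increase and retry
    return None                 # never solved cleanly within max_tol

tol_found = smallest_clean_tol()
```

Repeating this over many subsamples for each sample size gives the distribution of the `criterion.tol` needed for clean solves.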
Here are the results from the exercise described above.
For each sample size and value of `criterion.tol`, I resampled the AE data 200 times and estimated the bounds.
Here is the fraction of resamples with optimal criteria.
Also, here is the fraction of resamples with violations in the audit grid (red line), where `audit.max = 30`.
The average number of violated points (conditional on there being any violations; blue line) diminishes with sample size.
These violations are very small, on the order of 1e-05.
They seem to arise because the QCQP is rescaled, so solutions that pass Gurobi's feasibility tolerance may not pass `ivmte`'s feasibility tolerance.
This was discussed somewhere in #198.
Here is the fraction of optimal bounds, conditional on optimal criteria.
I allow for cases where there are still violations in the audit grid, since those violations are few and small in magnitude.
So for sample sizes <= 150,000, the fraction of optimal bounds is estimated using 170+ resamples (the ones excluded did not have optimal criteria).
For sample sizes > 150,000, the fraction of optimal bounds is estimated using 120--150 resamples.
Note that the x-axis is not scaled linearly, since I was trying various magnitudes of `criterion.tol`.
For sample sizes <= 150,000, over 90% of the bounds are optimal when `criterion.tol = 0.02` or above.
For larger samples, there doesn't seem to be much improvement in stability at higher levels of `criterion.tol`.
Are you saying that even for the final graph you can't get N = 200,000 to go up to 100% by choosing `criterion.tol` to be sufficiently large? What if you take `criterion.tol = +Inf`, for example?
I'm sorry, I really haven't done a good job pushing `criterion.tol` to large values.
Let me really push `criterion.tol` up and see what happens.
A quick check suggests that `criterion.tol = 5` can still lead to suboptimal bounds.
Also, these bounds are very uninformative, i.e., approximately [-1, 1] when the outcome is binary.
Setting `criterion.tol = 10` results in optimal bounds.
But again, the bounds are uninformative.
But are you incrementally increasing `criterion.tol` the way I described in the previous post?
Hm, I realized I am not---sorry for the confusion.
So just to be clear, here is what I should do:
1. Start with `criterion.tol = 0`.
2. Estimate the bounds; if either bound is suboptimal, increase `criterion.tol` and return to the previous step.
3. Record the final `criterion.tol`.
Suppose I do the steps above many times.
Then for sample size N, I have a distribution of values of `criterion.tol` where the bounds are optimal.
Yeah that's the procedure I had intended.
With this procedure all of the bounds should eventually be optimal for some `criterion.tol`.
So I guess we also want to have a sense of what those optimal bounds are too.
This is probably much more computationally demanding than what you already did... so you could just lower the number of simulations, I think.
Ah okay. Apologies for the confusion and thanks for clarifying. I'll get this thing running.
Sorry this took longer than expected. The results look quite different from before. This is because I was previously sampling without replacement from the AE data set when generating the random data. This was a mistake---I am now sampling with replacement.
Here is a smoothed distribution of `criterion.tol` where both lower and upper bounds are estimated optimally.
For each sample size, I resampled the AE data 400 times and estimated the bounds.
I exclude cases where the criteria are suboptimal, leaving us with 310--350 estimates for each sample size.
But increasing `criterion.tol` does not always lead to optimal bounds.
E.g., I get optimal bounds when `criterion.tol = 0.0004`, the sample size is 200,000, and the seed is 21.
However, I get suboptimal bounds if I set `criterion.tol = 0.0005`.
Ok, this is good. Let's zoom in on that example you mention (n = 200,000 and seed 21).
Can you plot whether the bounds are suboptimal as a function of `criterion.tol`?
Sure, here it is.
The bounds were optimal once `criterion.tol` exceeded 0.0025.
Hi @jkcshea, thanks for this. Thinking about it more, there is probably a better way to implement the reformulation I suggested, one that scales better with n.
To recap, my initial proposal was:
> The approach I found worked well in another context was very simple. The direct approach has a second-step QCQP problem, and the quadratic constraint there is the source of the problem. It is set up like $\|Ax - b\|^2 \leq \text{constant}$, where $A$ is the design matrix for the regression. Expanded out, it looks like $x'A'Ax - 2x'A'b + b'b \leq \text{constant}$. The problem is that $A'A$ is dense, which is tough for the QCQP algorithms. The simple and surprising solution is to include some additional variables $y$, constrained as $y = Ax$. Then change the quadratic constraint to $\|y - b\|^2 \leq \text{constant}$, which expands to $y'Iy - 2y'b + b'b \leq \text{constant}$, where now instead of $A'A$ we have $I$, which is of course sparse. I was shocked to see that this works... but apparently Gurobi is not as smart as I thought, at least not for QCQPs.
The problem you found was that A is n x k, where n is the sample size. So y is n-dimensional, and the number of slack variables increases with n.
Suppose instead that we first apply a Cholesky decomposition to $A'A$. So $A'A = C'C$, where $C$ is a $k \times k$ triangular matrix. After doing that, we add slack variables $y = Cx$, which are now k-dimensional. The quadratic constraint then becomes $y'y - 2x'A'b + b'b \leq \text{constant}$, which has a nice sparse quadratic matrix ($I_{k}$), and all that we have added in turn is $y = Cx$, which is just a k-dimensional system of linear equations. (Not only that, but it too is fairly sparse, since $C$ is triangular.)
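A small Python check of this algebra (toy matrices; the package would do the factorization in R with `chol()`): the Cholesky-based slack variables y = Cx reproduce the quadratic term exactly while being only k-dimensional.

```python
# Verify that with A'A = C'C and y = Cx, the quadratic term y'y equals
# x'A'Ax, and that y is k-dimensional rather than n-dimensional.
# A and x are small made-up stand-ins for illustration.
import math

def cholesky(M):
    """Hand-rolled Cholesky: returns lower-triangular L with L L' = M."""
    k = len(M)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L

A = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [1.0, 0.0]]   # n = 4, k = 2
x = [0.5, -1.0]

k = len(A[0])
AtA = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]
L = cholesky(AtA)                                  # A'A = L L'
C = [[L[j][i] for j in range(k)] for i in range(k)]  # C = L', so C'C = A'A

y = [sum(C[i][j] * x[j] for j in range(k)) for i in range(k)]  # y = Cx
quad_reformulated = sum(yi * yi for yi in y)                   # y'y
quad_original = sum(x[i] * AtA[i][j] * x[j] for i in range(k) for j in range(k))

assert abs(quad_original - quad_reformulated) < 1e-9
assert len(y) == k   # slack variables are k-dimensional, not n-dimensional
```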
Could you give that a shot and see if it performs better?
That's very nice... Yes, I can certainly implement that.
Sorry this is taking longer than expected.
I'm still testing the code.
This new approach is about as fast as the original procedure.
But I have noticed that this approach can lead to strange cases where the minimized criterion is negative.
(N = 202,000, seed = 10, sample AE data without replacement)
I know that shouldn't be possible, but Gurobi is returning such solutions without spitting errors about suboptimal criteria.
I believe this has to do with how we rescale the QP problem, as it no longer happens when `rescale = FALSE`.
My guess is that the relative tolerances when solving these rescaled QP problems get thrown off at large sample sizes.
So this is what I am currently investigating.
Once I figure out what's going on above, I will generate the distribution of `criterion.tol` that ends with optimal bounds, to see if this approach improves the stability of the optimization problems.
> But I have noticed that this approach can lead to strange cases where the minimized criterion is negative.

So you're using this reformulation for the initial QP too? (I'm not sure whether that is a good idea or a bad idea.)

> I believe this has to do with how we rescale the QP problem, as it no longer happens when rescale = FALSE.

This is totally possible. It might be that the rescaling we were doing before is no longer necessary (and perhaps harmful) with this new reformulation.
> So you're using this reformulation for the initial QP too? (I'm not sure whether that is a good idea or a bad idea.)

Hm, neither am I. I will continue to diagnose the problem to make sure it isn't because I've made some coding error. After that, I'll try restricting this approach to just the QCQP.

> It might be that the rescaling we were doing before is no longer necessary (and perhaps harmful) with this new reformulation.

If we end up using this approach even for the QP, I'll add an error message about the `rescale` option if we get a negative criterion.
The negative criteria appear to arise because Gurobi cannot satisfy the constraints defining the new variable `y = Cx`.
Oddly, Gurobi thinks its solution is optimal and feasible.
To be clear, I am not rescaling the solutions and finding that the equality constraints are violated; I'm finding that Gurobi is violating its own feasibility tolerance.
I'm trying to construct a MWE for this to post on Gurobi's support page.
Seeing how this strange behavior is tied to the auxiliary variables, I restricted the new approach to the QCQP only.
The distribution of `criterion.tol` required for optimal bounds is much, much worse.
This is the distribution of `criterion.tol` from before.
This is the same simulation from above, but zoomed out.
The multiple modes reflect how I chose the grid for `criterion.tol`.
Now here is the distribution using the Cholesky approach.
I don't know why this is happening yet.
But I wanted to provide an update.
I'm very happy to keep investigating if you think it is worthwhile.
I am also re-running the simulations with `rescale = FALSE`, just to see what happens.
Yeah I think more investigation is warranted.
The new graph in particular looks quite odd.
You may want to also look at the options for the Cholesky decomposition (I assume you are using `chol`). Since we are only doing it once, I expect we can do it to high precision without a serious computational burden.
I am indeed using the `chol()` function.
Since we are decomposing positive semidefinite matrices, I am using the `pivot` option.
The decomposition is being done correctly, though.
Playing with the tolerance option doesn't have much of an effect.
Here is the density of `criterion.tol` for optimal bounds, estimated using a wider bandwidth.
Here is the simulation where `rescale = FALSE` and we're using the Cholesky decomposition.
It improves a lot.
Sample size does not seem to matter.
I will see what happens with the original procedure when `rescale = FALSE`.
Can you remind me what `rescale = TRUE` is doing exactly?
The results you show with `rescale = FALSE` are encouraging.

We do two things with `rescale = TRUE`:
Hello everyone,
@cblandhol @johnnybonney A new regression approach has been implemented.
This is done on the branch `enhance/direct-regress-lp`.
The package can be used exactly as before.
Below is a nice explanation of the procedure by @a-torgovitsky.
The regression approach solves LP problems instead of QP/QCQP problems. The hope was that solving LP problems would improve the stability of the regression approach. But the figures below suggest that the Cholesky approach is more stable.
The figures are generated by estimating the bounds for an increasing sequence of `criterion.tol` values.
Displayed is the histogram of the smallest value of `criterion.tol` such that both the lower and upper bounds are estimated optimally.
This is the same exercise done three posts above.
I'm using histograms this time because the kernel density estimates smoothed things out too much.
But that is my fault---I should've checked the histogram or played more with bandwidths.
Sorry, @a-torgovitsky.
Regardless of the sample size, the figures suggest the Cholesky approach is actually very stable.
Setting `rescale = TRUE` worsens the stability, but only slightly.
The LP approach requires larger values of `criterion.tol` to obtain optimal bounds.
(The right tails of the histograms are sparse because I used a non-uniform grid for `criterion.tol`.)
Hi @jkcshea thanks for this. I'm a bit surprised that the QP/QCQP approach is more stable (even with the Cholesky decomposition), but that's good that it is.
Can we get a more absolute sense of how stable the different procedures are? So maybe for a fixed criterion tolerance, what proportion of each approach results in "clean" solves all the way through. I guess that would mean:
Also, might as well record the time-to-solve when doing this. Would be useful to see if we take a big performance hit in the QP/QCQP relative to the LP.
@cblandhol and @johnnybonney could you experiment with these methods on the empirical applications and let us know what problems you run into? I think we're settling on something that's going to be pretty computationally stable.
Sure, I can do this.
I should also try other models. The simulations so far use the model @cblandhol posted up top.
We will experiment!
**Distribution of `criterion.tol` required for optimal bounds**

The histograms I posted above are quite misleading.
I had misread how fine the grid for `criterion.tol` was.
So here are bar charts showing the distribution of the smallest value of `criterion.tol` such that both the lower and upper bounds are estimated optimally.
I'm only showing this for sample sizes of 200,000 for brevity.
The plots for different sample sizes are very similar.
The Cholesky approach seems very stable if `rescale = FALSE`.
The LP approach is not as stable and has a very long right tail.
I found this to be the case under different models as well.
**Proportion of optimal bounds conditional on `criterion.tol`**

Here are the proportions of estimates returning optimal bounds for each method of estimation, conditional on `criterion.tol`.
The plots look very similar for different sample sizes.
The Cholesky approach with `rescale = FALSE` continues to dominate.
The Cholesky approach with `rescale = TRUE` improves rapidly as `criterion.tol` is increased.
The LP approach only seems to improve when `criterion.tol > 0.01`.
When `N = 100,000`, LP is actually the slowest on average.
But when `N = 200,000`, LP is the fastest on average.
Thanks @jkcshea
How do I read the first plots (distribution of minimum `criterion.tol`)? The probabilities don't sum to 1 within method?
Can you remind me why the computational difficulty increases with N? The size of the problem (number of variables/constraints) stays the same, doesn't it? (Or maybe it doesn't, due to the audit procedure taking its defaults from the sample size?)
> How do I read the first plots (distribution of minimum criterion.tol)? The probabilities don't sum to 1 within method?
Ah sorry, I didn't explain that I broke the distribution into two separate bar charts. So for each color, summing the bars over both bar charts will give you probability 1.
> Can you remind me why the computational difficulty increases with N?
The number of variables and constraints indeed remains the same across sample size.
The times shown above are the total run time of the `ivmte` function, so this includes the time taken to generate the $B$ matrix.
I will re-run the simulation looking only at the time Gurobi took to solve the problem.
I realize now that was what you wanted originally.
But here is something odd:
The total run time should be longer when the sample is large.
That is what happens on my machine, because it takes more time to construct the $B$ matrix.
So I don't know how the LP approach got faster when the sample doubled in size...
Perhaps it is something related to the server.
I will investigate.
Ok I see, sorry---didn't read the two graphs correctly. I see what you did here.
I guess breaking down run-time between Gurobi and non-Gurobi might be useful. Just sort of kicking around the tires here to make sure we understand what's going on. I'm surprised---but not disappointed---that the Cholesky approach seems to work better than LP. That's a good outcome.
Okay, I re-ran the simulations for the LP approach to the direct regression.
The simulations for the Cholesky approaches are still running.
I wasn't able to reproduce the run-times shown above on the server...
But the new run-times look much more reasonable.
So here is a plot showing the total run-time (Gurobi and non-Gurobi) for large and small samples and different values of `criterion.tol`.
Almost all of that time is non-Gurobi.
The time difference between `N = 100,000` and `N = 200,000` makes sense:
Generating the $B$ matrix requires us to carry out $N$ integrals.
So more time is spent constructing the $B$ matrix when $N$ is large.
Here is the total time spent on Gurobi, as well as a break down of those times across the different optimization problems. I'm amazed how little time is actually spent on solving these LP problems. As a reminder, these times are for cases where the bounds are optimal.
Edit: The point of these plots was to check whether the Gurobi run-time varied with the size of the sample.
There seems to be a relationship at small values of `criterion.tol`.
But this relationship seems to fall apart as `criterion.tol` gets larger.
Here are the plots for the Cholesky approach with `rescale = FALSE` (i.e., the more stable Cholesky approach).
The relationship between total run-time and sample size looks much more reasonable.
The Gurobi run-times make up a tiny fraction of the total run-time and do not seem to depend on sample size, either.
**Run-times (N = 200,000)**

The total run-times for the LP approach and the Cholesky approach are roughly the same.
The Gurobi run-time is shorter under the LP approach, but the absolute difference is very small (less than 0.05 seconds, relative to total run-times of ~40 seconds).
And here are plots showing the run-times for each optimization problem. The LP problems are typically faster to solve on average.
Ok thanks @jkcshea. Just so I understand what's going on: the $B$ matrix (which involves the order-$N$ analytic integration) has to be generated in either approach, right? So the only non-Gurobi difference is computing the Cholesky decomposition? Or am I forgetting a step?
> Just so I understand what's going on, the $B$ matrix (which involves the order-$N$ analytic integration) has to be generated in either approach right?

Yep.

> So the only non-Gurobi difference is computing the Cholesky decomposition?

In the LP problem, we also have to construct the slack variables and account for them throughout. We don't have to worry about that in the QP problem. This is relatively minor, but I thought I should point it out to be precise.
Ok, that makes sense. I think we can close this issue now? @jkcshea, please do the honors if you agree. A lot of work, but it seems like we hit on a rock-solid solution, so that's a great success! @cblandhol and @johnnybonney, feel free to re-open if you find something that seems related to numerical stability of the direct approach.
Ah, but which approach will we go with? Or do we want to keep both the LP and QCQP approaches?
Edit: Well, presumably we go with the Cholesky approach and set `rescale = FALSE` by default.
But do we want to keep the LP approach?
I would just as soon keep these around as features that can be used in the future, for debugging at least.
So maybe there are two options:
- `rescale` is boolean
- `direct` is `QP` or `LP`

The defaults are `rescale = FALSE` and `direct = QP`.
What do you think?
@johnnybonney and I have been re-running the results for our Angrist and Evans application with the direct MTR approach and are running into the "numerical issues" error mentioned in issue #186. Do you have any advice on how to deal with this issue?
Below I have made a reproducible example which illustrates the issue with the same specification and similar number of covariate values we use in our application, but which uses the dataset provided in the package.
Setting `presolve = FALSE` results in an uninformative error message:

The MST estimator does not run into the same issue: