Closed cblandhol closed 2 years ago
@jkcshea is busy with time-critical job market stuff at the moment. But I have a lead on problems like these that I found in another project using a similar approach. So when he gets some breathing room maybe he could implement it and we could see if that helps. (Or you could try---it shouldn't be too hard, but the code is rather involved at this point so it might be hard to find the right spot.)
The approach I found worked well in another context was very simple. The direct approach has a second-step QCQP problem, and the quadratic constraint there is the source of the problem. It is set up like $\|Ax - b\|^2 \leq \text{constant}$, where $A$ is the design matrix for the regression. Expanded out, it looks like $x'A'Ax - 2x'A'b + b'b \leq \text{constant}$. The problem is that $A'A$ is dense, which is tough for the QCQP algorithms. The simple and surprising solution is to include some additional variables $y$, constrained as $y = Ax$. Then change the quadratic constraint to $\|y - b\|^2 \leq \text{constant}$, which expands to $y'Iy - 2y'b + b'b \leq \text{constant}$, where now instead of $A'A$ we have $I$, which is of course sparse. I was shocked to see that this works... but apparently Gurobi is not as smart as I thought, at least not for QCQPs.
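As a numerical sanity check of the identity behind this reformulation, here is a small Python sketch (toy matrices with made-up values; the package itself is written in R) verifying that with y = Ax the two forms of the quadratic constraint agree:

```python
# Check that substituting y = Ax leaves the quadratic constraint unchanged:
# ||Ax - b||^2 (dense quadratic matrix A'A) equals ||y - b||^2 (identity matrix).
# A, b, x below are small illustrative stand-ins, not the actual design matrix.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def sq_norm(v):
    return sum(v_i * v_i for v_i in v)

A = [[1.0, 2.0],
     [0.0, 1.0],
     [3.0, 1.0]]
b = [1.0, 2.0, 3.0]
x = [0.5, -1.0]

# Original constraint form: ||Ax - b||^2, whose quadratic matrix is A'A (dense).
lhs_original = sq_norm([axi - bi for axi, bi in zip(matvec(A, x), b)])

# Reformulated: introduce slack variables y = Ax, then the constraint is
# ||y - b||^2, whose quadratic matrix is the identity (sparse).
y = matvec(A, x)
lhs_reformulated = sq_norm([yi - bi for yi, bi in zip(y, b)])

assert abs(lhs_original - lhs_reformulated) < 1e-12
print(lhs_original)
```

The cost, as noted below in the thread, is that y has one component per observation.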
Alternatively, you could try changing `criterion.tol` to something strictly positive. When it's zero, the second-step program is not convex (since it's like x^2 = c vs. x^2 <= c).
Hi everyone,
Sorry for taking so long to get to this, thank you for being patient with me.
I am currently carrying out some checks for the procedure described by @a-torgovitsky above. I'm still running into "numerical issues." Unfortunately, that is about all the information returned by Gurobi's status codes.
But performance really takes a hit under this alternate approach, because the number of new variables equals the number of observations, and the AE data set has 200,000+ observations.
@cblandhol and @johnnybonney: After cleaning up the code and completing some checks, I will push a separate branch with the new procedure so you can put it to the test. Feel free to also post any new examples I should look at.
@a-torgovitsky Let me know if you want to allow the user to toggle between this approach and the original approach.
Ahh yes, I didn't consider that. (The other setting I was using this trick in didn't have the constraints scaling with n.)
Are you saying you get the "numerical issues" warning with the reformulated version?
How much of a hit does performance take? And how does the solve time (and accuracy) change with the sample size?
Okay, the code is now pushed to the branch `bugfix/issue209`.

`git fetch origin bugfix/issue209`
> Are you saying you get the "numerical issues" warning with the reformulated version?

With a large enough subset of the AE data---yes. If I use a smaller subset (e.g., 10,000 observations), then I won't run into numerical issues. But the solutions may not be optimal. Nevertheless, this reformulation does seem to work.
> How much of a hit does performance take? And how does the solve time (and accuracy) change with the sample size?
Performance with 10,000 observations (with AE data):
Performance with 20,000 observations (with AE data):
Performance with 50,000 observations (with AE data):
Performance with 100,000 observations (with AE data):
Regarding accuracy:
Ok, thanks. That's pretty rough!
When you say suboptimal, you mean the same message that @cblandhol was reporting?
Is that on the first step problem, the second step problem, or both?
Have you tried adjusting `criterion.tol`?
Also, do you have a sense of how sparse (or not sparse) the matrix A'A is here?
> When you say suboptimal, you mean the same message that @cblandhol was reporting?
Ah yes sorry, there are indeed two messages. So here is the suboptimal warning message @cblandhol showed. This happened when minimizing the criterion.
```
Warning: The solver was unable to satisfy the optimality
tolerance when minimizing the criterion, so a suboptimal
solution is returned. Tolerance parameters for the solver
can be passed through the argument 'solver.options'.
```
But since a criterion was estimated, the function keeps running and tries to find the bounds. It is here that we run into the error message about numerical issues.
```
Obtaining bounds...
Model resulted in numerical issues.
```
We can also get suboptimal bounds. In all the cases mentioned above using subsets of the AE data, I was referring to the bounds when discussing suboptimality. So here are the optimality statuses for the criterion:
> Have you tried adjusting criterion.tol?
Adjusting `criterion.tol` really improves things.
But it still isn't enough to estimate the model using the full AE data.
10,000 obs. from AE data, `criterion.tol = 0.01`:
20,000 obs. from AE data, `criterion.tol = 0.01`:
50,000 obs. from AE data, `criterion.tol = 0.01`:
100,000 obs. from AE data, `criterion.tol = 0.01`:
Full AE data, `criterion.tol = 0.01`:
Even when `criterion.tol = 1`, I still run into numerical issues.
> Also, do you have a sense of how sparse (or not sparse) the matrix A'A is here?
Pretty sparse, 95% of the entries are 0.
The MTR specification is made up of splines of degree 0 interacted with indicator variables, e.g., `(x == 2):uSplines(degree = 0, knots = c(...))`.
So that's where the sparsity comes from.
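To illustrate where that sparsity comes from, here is a toy Python sketch (a hypothetical design, not the AE data or the package's code): columns built from an indicator interacted with a degree-0 spline basis are supported on disjoint sets of rows, so cross products between columns from different indicator groups are exactly zero in A'A.

```python
# Toy design: columns are (x == g) * 1{u in bin}, for groups g and u-bins.
# Columns from different groups never overlap row-wise, so A'A is sparse.

x_vals = [1, 1, 2, 2, 2, 3]               # toy covariate values
u_vals = [0.1, 0.6, 0.2, 0.7, 0.9, 0.4]   # toy unobservable draws

def basis(x, u):
    """One column per (group, u-bin) pair: indicator times degree-0 spline."""
    cols = []
    for g in (1, 2, 3):
        for lo, hi in ((0.0, 0.5), (0.5, 1.0)):
            cols.append(1.0 if (x == g and lo <= u < hi) else 0.0)
    return cols

A = [basis(x, u) for x, u in zip(x_vals, u_vals)]
k = len(A[0])
AtA = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]

# Entries linking group-1 columns to any other group's columns are zero.
off_block = [AtA[i][j] for i in range(2) for j in range(2, k)]
assert all(v == 0.0 for v in off_block)

zeros = sum(1 for i in range(k) for j in range(k) if AtA[i][j] == 0.0)
print(f"{zeros}/{k * k} entries of A'A are zero")
```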
Thanks that's very helpful. The sparsity to me suggests that we definitely want to keep the original formulation in. It probably has an advantage in this case.
What I find interesting is that the full AE data (300-400k observations if I recall?) is so intractable even while the 100k sample seems to be just fine. Can you try different 100k subsamples? It would be useful to know whether this is due to the size per se, or due to some particular feature of that 100k subsample you are trying.
You're right, it may just be certain samples.
I got optimal criteria/bounds up until 185,000. Afterwards, the criteria are no longer optimal.
For samples > 185,000, there are times when we can estimate bounds (e.g., at 201,000 I estimated bounds), and times when we can't (e.g., 185,500 and 202,000 both ran into numerical errors).
But again, the criteria here are suboptimal---I suppose that is like having a much larger `criterion.tol` value.
Is the scale of the criterion invariant to the sample size? (i.e. are you dividing by "n")?
Yep, the criterion is being divided by `n` and its scale is invariant to sample size.
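A quick way to see the invariance, sketched in Python with toy data: duplicating every observation doubles the sum of squared residuals, but leaves the n-normalized criterion unchanged.

```python
# Dividing the least-squares criterion by n makes its scale invariant to
# sample size: doubling the data doubles ||Ax - b||^2 and doubles n.
# A, b, x are small made-up stand-ins for illustration only.

def criterion(rows, b, x, divide_by_n=True):
    ssr = 0.0
    for row, bi in zip(rows, b):
        resid = sum(a * xi for a, xi in zip(row, x)) - bi
        ssr += resid * resid
    return ssr / len(rows) if divide_by_n else ssr

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [0.0, 1.0, 3.0]
x = [0.1, 1.2]

small = criterion(A, b, x)
large = criterion(A * 2, b * 2, x)   # every observation duplicated
assert abs(small - large) < 1e-12
```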
Is it always the case that suboptimal criteria => numerical issues for bounds? And is it always the case that optimal criteria => optimal bounds?
> Is it always the case that suboptimal criteria => numerical issues for bounds

Not always.
If I use a sample of 190,000 and `criterion.tol = 0`, then I get a suboptimal criterion and suboptimal bounds.
So no numerical issues for bounds.
There are also cases where we get a suboptimal criterion but optimal bounds.
A sample of 200,000 with `criterion.tol = 0.01` is an example of this.
> and, is it always the case that optimal criteria => optimal bounds

Also not always.
If I use a sample of 150,000 and `criterion.tol = 0`, then I get an optimal criterion but a suboptimal upper bound.
Ok, let me reverse the question:
Is it always the case that (numerical issues) or (suboptimal) for bounds => suboptimal criteria?
Also, do you have a case where `criterion.tol > 0` but you also have:
- optimal criterion, and
- (numerical issues) or (suboptimal) for bounds?
> Is it always the case that (numerical issues) or (suboptimal) for bounds => suboptimal criteria

Not always: `criterion.tol = 0` and `n = 175000` returns suboptimal bounds but an optimal criterion.
(The extra details are there in case I need to revisit these examples. The seed for subsampling the data in these examples, and all examples above, is `10L`.)
> Also, do you have a case where criterion.tol > 0 but you also have: optimal criterion and (numerical issues) or (suboptimal) for bounds

Yes: `criterion.tol = 0.01` and `n = 201500`.
Optimal criterion, but numerical issues for bounds.
Ok, here's a slightly more involved exercise that I think would be informative.
Restrict ourselves to cases where we get an optimal criterion.
Then keep increasing `criterion.tol` by some amount (say `.01`) until both the LB and UB solve cleanly.
Do this for a certain number of subsamples of size n, for different n.
(I don't know how computationally intensive this is, but maybe we just start with 20 subsamples for each n?)
Plot out the distribution of the `criterion.tol` needed to get clean solves for each value of n.
It may be that it makes sense to do something like this in the module, where we increase `criterion.tol` iteratively until solving cleanly.
This exercise will be informative about whether that's a sensible solution.
After we've figured this out, we can then go back to look at cases where optimality is not met in the criterion problem to begin with.
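The iterative tolerance-increase loop described above can be sketched as follows. This is a Python illustration with a stand-in `solve_bounds` function (hypothetical; in the package the optimality check would come from the solver's status, not a fixed threshold):

```python
# Sketch of the proposed procedure: start at criterion.tol = 0 and raise it
# in steps until both bounds solve cleanly, recording the tolerance needed.
# `solve_bounds` is a hypothetical stand-in that mimics a solver which only
# reports optimal bounds once the tolerance exceeds a sample-specific level.

def solve_bounds(criterion_tol, threshold=0.025):
    """Stand-in for one bound estimation; True means both bounds optimal."""
    return criterion_tol >= threshold

def smallest_clean_tol(step=0.01, max_tol=1.0):
    tol = 0.0
    while tol <= max_tol:
        if solve_bounds(tol):   # both LB and UB solve cleanly
            return tol          # record the final criterion.tol
        tol += step             # otherwise increase and retry
    return None                 # never solved cleanly within max_tol

tol_found = smallest_clean_tol()
```

Repeating this over many subsamples for each sample size gives the distribution of the `criterion.tol` needed for clean solves.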
Here are the results from the exercise described above.
For each sample size and value of `criterion.tol`, I resampled the AE data 200 times and estimated the bounds.
Here is the fraction of resamples with optimal criteria.
Also, here is the fraction of resamples with violations in the audit grid (red line), where `audit.max = 30`.
The average number of violated points (conditional on there being any violations; blue line) diminishes with sample size.
These violations are very small, on the order of 1e-05.
They seem to arise because the QCQP is rescaled, so solutions that pass Gurobi's feasibility tolerance may not pass `ivmte`'s feasibility tolerance.
This was discussed somewhere in #198.
Here is the fraction of optimal bounds, conditional on optimal criteria.
I allow for cases where there are still violations in the audit grid, since those violations are few and small in magnitude.
So for sample sizes <= 150,000, the fraction of optimal bounds is estimated using 170+ resamples (the ones excluded did not have optimal criteria).
For sample sizes > 150,000, the fraction of optimal bounds is estimated using 120--150 resamples.
Note that the x-axis is not scaled linearly, since I was trying various magnitudes of `criterion.tol`.
For sample sizes <= 150,000, over 90% of the bounds are optimal when `criterion.tol = 0.02` or above.
For larger samples, there doesn't seem to be much improvement in stability at higher levels of `criterion.tol`.
Are you saying that even for the final graph you can't get N = 200,000 to go up to 100% by choosing `criterion.tol` to be sufficiently large? What if you take `criterion.tol = +Inf`, for example?
I'm sorry, I really haven't done a good job pushing `criterion.tol` to large values.
Let me really push `criterion.tol` up and see what happens.
A quick check suggests that `criterion.tol = 5` can still lead to suboptimal bounds.
Also, these bounds are very uninformative, i.e., approximately [-1, 1] when the outcome is binary.
Setting `criterion.tol = 10` results in optimal bounds.
But again, the bounds are uninformative.
But are you incrementally increasing `criterion.tol` the way I described in the previous post?
Hm, I realized I am not---sorry for the confusion.
So just to be clear, here is what I should do:
1. Start with `criterion.tol = 0`.
2. Estimate the bounds; if either bound is suboptimal, increase `criterion.tol` and return to the previous step.
3. Record the final `criterion.tol`.
Suppose I do the steps above many times.
Then for sample size N, I have a distribution of values of `criterion.tol` where the bounds are optimal.
Yeah that's the procedure I had intended.
With this procedure all of the bounds should eventually be optimal for some `criterion.tol`.
So I guess we also want to have a sense of what those optimal bounds are too.
This is probably much more computationally demanding than what you already did... so you could just lower the number of simulations, I think.
Ah okay. Apologies for the confusion and thanks for clarifying. I'll get this thing running.
Sorry this took longer than expected. The results look quite different from before. This is because I was previously sampling without replacement from the AE data set when generating the random data. This was a mistake---I am now sampling with replacement.
Here is a smoothed distribution of `criterion.tol` where both lower and upper bounds are estimated optimally.
For each sample size, I resampled the AE data 400 times and estimated the bounds.
I exclude cases where the criteria are suboptimal, leaving us with 310--350 estimates for each sample size.
But increasing `criterion.tol` does not always lead to optimal bounds.
E.g., I get optimal bounds when `criterion.tol = 0.0004`, the sample size is 200,000, and the seed is 21.
However, I get suboptimal bounds if I set `criterion.tol = 0.0005`.
Ok, this is good. Let's zoom in on that example you mention (n = 200,000 and seed 21).
Can you plot whether the bounds are suboptimal as a function of `criterion.tol`?
Sure, here it is.
The bounds were optimal once `criterion.tol` exceeded 0.0025.
Hi @jkcshea, thanks for this. Thinking about it more, there is probably a better way to implement the reformulation I suggested, one that scales better with n.
To recap, my initial proposal was:
> The approach I found worked well in another context was very simple. The direct approach has a second-step QCQP problem, and the quadratic constraint there is the source of the problem. It is set up like $\|Ax - b\|^2 \leq \text{constant}$, where $A$ is the design matrix for the regression. Expanded out, it looks like $x'A'Ax - 2x'A'b + b'b \leq \text{constant}$. The problem is that $A'A$ is dense, which is tough for the QCQP algorithms. The simple and surprising solution is to include some additional variables $y$, constrained as $y = Ax$. Then change the quadratic constraint to $\|y - b\|^2 \leq \text{constant}$, which expands to $y'Iy - 2y'b + b'b \leq \text{constant}$, where now instead of $A'A$ we have $I$, which is of course sparse. I was shocked to see that this works... but apparently Gurobi is not as smart as I thought, at least not for QCQPs.
The problem you found was that A is n x k, where n is the sample size. So y is n-dimensional, and the number of slack variables increases with n.
Suppose instead that we first apply a Cholesky decomposition to $A'A$. So $A'A = C'C$, where $C$ is a $k \times k$ triangular matrix. After doing that, we add slack variables $y = Cx$, which are now k-dimensional. The quadratic constraint then becomes $y'y - 2x'A'b + b'b \leq \text{constant}$, which has a nice sparse quadratic matrix ($I_{k}$), and all that we have added in turn is $y = Cx$, which is just a k-dimensional system of linear equations. (Not only that, but it too is fairly sparse, since $C$ is triangular.)
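A small Python check of this algebra (toy matrices; the package would do the factorization in R with `chol()`): the Cholesky-based slack variables y = Cx reproduce the quadratic term exactly while being only k-dimensional.

```python
# Verify that with A'A = C'C and y = Cx, the quadratic term y'y equals
# x'A'Ax, and that y is k-dimensional rather than n-dimensional.
# A and x are small made-up stand-ins for illustration.
import math

def cholesky(M):
    """Hand-rolled Cholesky: returns lower-triangular L with L L' = M."""
    k = len(M)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L

A = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [1.0, 0.0]]   # n = 4, k = 2
x = [0.5, -1.0]

k = len(A[0])
AtA = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]
L = cholesky(AtA)                                  # A'A = L L'
C = [[L[j][i] for j in range(k)] for i in range(k)]  # C = L', so C'C = A'A

y = [sum(C[i][j] * x[j] for j in range(k)) for i in range(k)]  # y = Cx
quad_reformulated = sum(yi * yi for yi in y)                   # y'y
quad_original = sum(x[i] * AtA[i][j] * x[j] for i in range(k) for j in range(k))

assert abs(quad_original - quad_reformulated) < 1e-9
assert len(y) == k   # slack variables are k-dimensional, not n-dimensional
```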
Could you give that a shot and see if it performs better?
That's very nice... Yes, I can certainly implement that.
Sorry this is taking longer than expected.
I'm still testing the code.
This new approach is about as fast as the original procedure.
But I have noticed that this approach can lead to strange cases where the minimized criterion is negative.
(N = 202,000, seed = 10, sample AE data without replacement)
I know that shouldn't be possible, but Gurobi is returning such solutions without spitting errors about suboptimal criteria.
I believe this has to do with how we rescale the QP problem, as it no longer happens when `rescale = FALSE`.
My guess is that the relative tolerances when solving these rescaled QP problems get thrown off at large sample sizes.
So this is what I am currently investigating.
Once I figure out what's going on above, I will generate the distribution of `criterion.tol` that ends with optimal bounds, to see if this approach improves the stability of the optimization problems.
> But I have noticed that this approach can lead to strange cases where the minimized criterion is negative.

So you're using this reformulation for the initial QP too? (I'm not sure whether that is a good idea or a bad idea.)

> I believe this has to do with how we rescale the QP problem, as it no longer happens when rescale = FALSE.

This is totally possible. It might be that the rescaling we were doing before is no longer necessary (and perhaps harmful) with this new reformulation.
> So you're using this reformulation for the initial QP too? (I'm not sure whether that is a good idea or a bad idea.)

Hm, neither am I. I will continue to diagnose the problem to make sure it isn't because I've made some coding error. After that, I'll try restricting this approach to just the QCQP.

> It might be that the rescaling we were doing before is no longer necessary (and perhaps harmful) with this new reformulation.

If we end up using this approach even for the QP, I'll add an error message about the `rescale` option if we get a negative criterion.
The negative criteria appear to arise because Gurobi cannot satisfy the constraints defining the new variable `y = Cx`.
Oddly, Gurobi thinks its solution is optimal and feasible.
To be clear, I am not rescaling the solutions and finding that the equality constraints are violated; I'm finding that Gurobi is violating its own feasibility tolerance.
I'm trying to construct a MWE for this to post on Gurobi's support page.
Seeing how this strange behavior is tied to the auxiliary variables, I restricted the new approach to the QCQP only.
The distribution of `criterion.tol` required for optimal bounds is much, much worse.
This is the distribution of `criterion.tol` from before.
This is the same simulation from above, but zoomed out.
The multiple modes reflect how I chose the grid for `criterion.tol`.
Now here is the distribution using the Cholesky approach.
I don't know why this is happening yet.
But I wanted to provide an update.
I'm very happy to keep investigating if you think it is worthwhile.
I am also re-running the simulations with `rescale = FALSE`, just to see what happens.
Yeah I think more investigation is warranted.
The new graph in particular looks quite odd.
You may want to also look at the options for the Cholesky decomposition (I assume you are using `chol`). Since we are only doing it once, I expect we can do it to high precision without a serious computational burden.
I am indeed using the `chol()` function.
Since we are decomposing positive semidefinite matrices, I am using the `pivot` option.
The decomposition is being done correctly, though.
Playing with the tolerance option doesn't have much of an effect.
Here is the density of `criterion.tol` for optimal bounds, estimated using a wider bandwidth.
Here is the simulation where `rescale = FALSE` and we're using the Cholesky decomposition.
It improves a lot.
Sample size does not seem to matter.
I will see what happens with the original procedure when `rescale = FALSE`.
Can you remind me what `rescale = TRUE` is doing exactly?
The results you show with `rescale = FALSE` are encouraging.

We do two things with `rescale = TRUE`:
Hello everyone,
@cblandhol @johnnybonney A new regression approach has been implemented.
This is done on the branch `enhance/direct-regress-lp`.
The package can be used exactly as before.
Below is a nice explanation of the procedure by @a-torgovitsky.
The regression approach solves LP problems instead of QP/QCQP problems. The hope was that solving LP problems would improve the stability of the regression approach. But the figures below suggest that the Cholesky approach is more stable.
The figures are generated by estimating the bounds for an increasing sequence of `criterion.tol` values.
Displayed is the histogram of the smallest value of `criterion.tol` such that both the lower and upper bounds are estimated optimally.
This is the same exercise done three posts above.
I'm using histograms this time because the kernel density estimates smoothed things out too much.
But that is my fault---I should've checked the histogram or played more with bandwidths.
Sorry, @a-torgovitsky.
Regardless of the sample size, the figures suggest the Cholesky approach is actually very stable.
Setting `rescale = TRUE` worsens the stability, but only slightly.
The LP approach requires larger values of `criterion.tol` to obtain optimal bounds.
(The right tails of the histograms are sparse because I used a non-uniform grid for `criterion.tol`.)
Hi @jkcshea thanks for this. I'm a bit surprised that the QP/QCQP approach is more stable (even with the Cholesky decomposition), but that's good that it is.
Can we get a more absolute sense of how stable the different procedures are? So maybe for a fixed criterion tolerance, what proportion of each approach results in "clean" solves all the way through. I guess that would mean:
Also, might as well record the time-to-solve when doing this. Would be useful to see if we take a big performance hit in the QP/QCQP relative to the LP.
@cblandhol and @johnnybonney could you experiment with these methods on the empirical applications and let us know what problems you run into? I think we're settling on something that's going to be pretty computationally stable.
Sure, I can do this.
I should also try other models. The simulations so far use the model @cblandhol posted up top.
We will experiment!
**Distribution of `criterion.tol` required for optimal bounds**

The histograms I posted above are quite misleading.
I had misread how fine the grid for `criterion.tol` was.
So here are bar charts showing the distribution of the smallest value of `criterion.tol` such that both the lower and upper bounds are estimated optimally.
I'm only showing this for sample sizes of 200,000 for brevity.
The plots for different sample sizes are very similar.
The Cholesky approach seems very stable if `rescale = FALSE`.
The LP approach is not as stable and has a very long right tail.
I found this to be the case under different models as well.
**Proportion of optimal bounds conditional on `criterion.tol`**

Here are the proportions of estimates returning optimal bounds for each method of estimation, conditional on `criterion.tol`.
The plots look very similar for different sample sizes.
The Cholesky approach with `rescale = FALSE` continues to dominate.
The Cholesky approach with `rescale = TRUE` improves rapidly as `criterion.tol` is increased.
The LP approach only seems to improve when `criterion.tol > 0.01`.
When `N = 100,000`, LP is actually the slowest on average.
But when `N = 200,000`, LP is the fastest on average.
Thanks @jkcshea
How do I read the first plots (distribution of minimum `criterion.tol`)? The probabilities don't sum to 1 within method?
Can you remind me why the computational difficulty increases with N? The size of the problem (number of variables/constraints) stays the same, doesn't it? (Or maybe it doesn't, due to the audit procedure taking its defaults from the sample size?)
> How do I read the first plots (distribution of minimum criterion.tol)? The probabilities don't sum to 1 within method?
Ah sorry, I didn't explain that I broke the distribution into two separate bar charts. So for each color, summing the bars over both bar charts will give you probability 1.
> Can you remind me why the computational difficulty increases with N?
The number of variables and constraints indeed remains the same across sample size.
The times shown above are the total run time of the `ivmte` function, so this includes the time taken to generate the $B$ matrix.
I will re-run the simulation looking only at the time Gurobi took to solve the problem.
I realize now that was what you wanted originally.
But here is something odd:
The total run time should be longer when the sample is large.
That is what happens on my machine, because it takes more time to construct the $B$ matrix.
So I don't know how the LP approach got faster when the sample doubled in size...
Perhaps it is something related to the server.
I will investigate.
Ok I see, sorry---didn't read the two graphs correctly. I see what you did here.
I guess breaking down run-time between Gurobi and non-Gurobi might be useful. Just sort of kicking around the tires here to make sure we understand what's going on. I'm surprised---but not disappointed---that the Cholesky approach seems to work better than LP. That's a good outcome.
Okay, I re-ran the simulations for the LP approach to the direct regression.
The simulations for the Cholesky approaches are still running.
I wasn't able to reproduce the run-times shown above on the server...
But the new run-times look much more reasonable.
So here is a plot showing the total run-time (Gurobi and non-Gurobi) for large and small samples and different values of `criterion.tol`.
Almost all of that time is non-Gurobi.
The time difference between `N = 100,000` and `N = 200,000` makes sense:
Generating the $B$ matrix requires us to carry out $N$ integrals.
So more time is spent constructing the $B$ matrix when $N$ is large.
Here is the total time spent on Gurobi, as well as a break down of those times across the different optimization problems. I'm amazed how little time is actually spent on solving these LP problems. As a reminder, these times are for cases where the bounds are optimal.
Edit: The point of these plots was to check whether the Gurobi run-time varied with the size of the sample.
There seems to be a relationship at small values of `criterion.tol`.
But this relationship seems to fall apart as `criterion.tol` gets larger.
Here are the plots for the Cholesky approach with `rescale = FALSE` (i.e., the more stable Cholesky approach).
The relationship between total run-time and sample size looks much more reasonable.
The Gurobi run-times make up a tiny fraction of the total run-time and do not seem to depend on sample size, either.
**Run-times (N = 200,000)**

The total run-times for the LP approach and the Cholesky approach are roughly the same.
The Gurobi run-time is shorter under the LP approach, but the absolute difference is very small (less than 0.05 seconds, relative to total run-times of ~40 seconds).
And here are plots showing the run-times for each optimization problem. The LP problems are typically faster to solve on average.
Ok thanks @jkcshea. Just so I understand what's going on: the $B$ matrix (which involves the order-$N$ analytic integration) has to be generated in either approach, right? So the only non-Gurobi difference is computing the Cholesky decomposition? Or am I forgetting a step?
> Just so I understand what's going on, the $B$ matrix (which involves the order-$N$ analytic integration) has to be generated in either approach right?

Yep.

> So the only non-Gurobi difference is computing the Cholesky decomposition?

In the LP problem, we also have to construct the slack variables and account for them throughout. We don't have to worry about that in the QP problem. This is relatively minor, but I thought I should point it out to be precise.
Ok, that makes sense. I think we can close this issue now? @jkcshea, please do the honors if you agree. A lot of work, but it seems like we hit on a rock-solid solution, so that's a great success! @cblandhol and @johnnybonney, feel free to re-open if you find something that seems related to numerical stability of the direct approach.
Ah, but which approach will we go with? Or do we want to keep both the LP and QCQP approaches?
Edit: Well, presumably we go with the Cholesky approach and set `rescale = FALSE` by default.
But do we want to keep the LP approach?
I would just as soon keep these around as features that can be used in the future, for debugging at least.
So maybe there are two options:
- `rescale` is boolean
- `direct` is `QP` or `LP`

The defaults are `rescale = FALSE` and `direct = QP`.
What do you think?
@johnnybonney and I have been re-running the results for our Angrist and Evans application with the direct MTR approach and are running into the "numerical issues" error mentioned in issue #186. Do you have any advice on how to deal with this issue?
Below I have made a reproducible example which illustrates the issue with the same specification and similar number of covariate values we use in our application, but which uses the dataset provided in the package.
Setting `presolve = FALSE` results in an uninformative error message:

The MST estimator does not run into the same issue: