Closed by a-torgovitsky 3 years ago
Here's the writeup: stability.pdf
This is great! The writeup is clear, and other than making sure I scale everything correctly, this should be straightforward to implement.
Ok, great!
Maybe put this into a development branch? Or create a new/old function temporarily? Not sure what the best approach is.
But we want to be able to check the unscaled/scaled approach against each other in several examples where we are confident the unscaled approach is working, so that we are sure there aren't any flaws in my reasoning.
Ah yes, I've been making all the changes on separate branches, and then merging them into the master branch once I think they are ready. So I'll continue using that approach, which will make it very easy to compare the scaled/unscaled versions of the code.
Excellent! And like I commented in the note, I think checking the point identified case first is a useful check, since we don't have any computational issues there.
I suppose it might be useful to turn off scaling for debugging purposes too. Is that going to be difficult to add an option to allow that?
Good idea. Adding the option shouldn't be a problem at all.
Sorry for taking a while on this.
1. The procedure does not eliminate suboptimal solutions
I've implemented the procedure.
It can be toggled by setting rescale to TRUE or FALSE. However, setting rescale = TRUE does not necessarily resolve the issue where the solver is unable to satisfy the optimality tolerance. The following example from #196 still results in this error, regardless of the value for rescale.
args <- list(data = AE,
target = "att",
outcome = "worked",
m0 = ~ u + I(u^2) + yob + u*yob,
m1 = ~ u + I(u^2) + I(u^3) + yob + u*yob,
propensity = morekids ~ samesex,
debug = TRUE,
rescale = FALSE,
noisy = TRUE)
r1 <- do.call(ivmte, args)
## Now rescale
args$rescale <- TRUE
r2 <- do.call(ivmte, args)
Another issue is that the solutions differ quite a bit depending on the value of rescale. For instance, below are the bounds from the block of code above.
> r1
Bounds on the target parameter: [-0.2871085, 0.05524949]
Audit reached audit.max (25)
> r2
Bounds on the target parameter: [-0.3084282, 0.1287807]
Audit reached audit.max (25)
I've checked the code several times, and I can't seem to find any new errors.
The problem seems to stem from the QCP, as I have yet to see this problem when minimizing the criterion.
Also, the minimum criterion does not vary with the value of rescale.
The scaling does play a role, though. The matrices and vectors defining the quadratic components of the QP and QCP can be very large even after implementing the rescaling procedure, since they sum over the observations. This is the case with the AE data set, which has 200,000+ observations. So I tried scaling the quadratic components by 1/n, where n is the number of observations. In this case, the bounds are no longer suboptimal solutions to the QCP when rescale = FALSE. But surprisingly, the bounds are suboptimal when rescale = TRUE. The difference between the unscaled and rescaled solutions is much smaller, though.
## Solutions when quadratic component is scaled by 1/n
> r1
Bounds on the target parameter: [-0.2647924, 0.03654437]
Audit terminated successfully after 1 round
> r2
Bounds on the target parameter: [-0.2677133, 0.04401816]
Audit terminated successfully after 1 round
2. Difference in solutions between unscaled and rescaled problems depends on number of variables and constraints
Although rescaling the components of the QCP should not matter in theory, it does in practice, as demonstrated above. The extent of the difference also seems to vary with the number of linear constraints. Below are plots (and code) from a simple simulation where I solve random QCPs, varying the number of linear constraints. Each point is generated by 1000 QCPs. The difference between the optimal objective values grows with the number of constraints.
If this is a computational issue, I'm not sure if there is much we can do. But I thought I should share this in case it indicates I've done something wrong.
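For reference, here is a minimal sketch of the kind of experiment I ran (this is not the actual simulation code; the dimensions, the random QCP, and the 1/100 scaling factor are purely illustrative):

```r
## Sketch: solve the same random QCP with and without scaling the quadratic
## constraint, and compare the optimal objective values.
library(gurobi)

solve_qcp <- function(Q, cvec, A, b, scale = 1) {
    model <- list(
        obj        = cvec,                         # minimize c'x
        modelsense = "min",
        A          = A,                            # linear constraints A x <= b
        rhs        = b,
        sense      = rep("<", length(b)),          # "<" is treated as <= by Gurobi
        lb         = rep(-Inf, length(cvec)),      # allow free variables
        quadcon    = list(list(Qc  = scale * Q,    # x'Qx <= 1, scaled on both sides
                               rhs = scale * 1))
    )
    gurobi(model, list(OutputFlag = 0))$objval
}

set.seed(1)
k <- 5                              # number of variables
m <- 50                             # number of linear constraints
M <- matrix(rnorm(k * k), k, k)
Q <- crossprod(M) + diag(k)         # positive definite quadratic form
cvec <- rnorm(k)
A <- matrix(rnorm(m * k), m, k)
b <- runif(m, 1, 2)

c(unscaled = solve_qcp(Q, cvec, A, b, scale = 1),
  rescaled = solve_qcp(Q, cvec, A, b, scale = 1 / 100))
```

In theory the two calls solve the same problem; the simulation varies m and records how far apart the two optimal values end up.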
I forgot to mention that the code is on a separate branch, enhance/stability.
Did you check whether scaling ever has an impact in point identified cases?
The matrices and vectors defining the quadratic components of the QP and QCP can be very large even after implementing the rescaling procedure since they sum over the observations. This is the case with the AE data set, which has 200,000+ observations. So I tried scaling the quadratic components by 1/n, where n is the number of observations.
This is going to change the problem though, isn't it? If you only scale the quadratic components by 1/n but not anything else, then you've changed the relative weight in the objective.
- Difference in solutions between unscaled and rescaled problems depends on number of variables and constraints
Although rescaling the components of the QCP should not matter in theory, it does in practice, as demonstrated above. The extent of the difference also seems to vary with the number of linear constraints. Below are plots (and code) from a simple simulation where I solve random QCPs, varying the number of linear constraints. Each point is generated by 1000 QCPs. The difference between the optimal objective values grows with the number of constraints.
But these are differences on the order of 1e-6, if I am reading the y-axis correctly. I don't think we can expect more precision than that from Gurobi on a program with a quadratic constraint or objective.
Did you check whether scaling ever has an impact in point identified cases?
So far, in the examples I've played with, scaling does not significantly impact point estimates.
For instance, in the following example, the point estimates differ by 1.5e-4, but that is about 0.05% of each estimate.
> dtcf <- ivmte:::gendistCovariates()$data.full
> args <- list(data = dtcf,
+ target = 'ate',
+ outcome = 'ey',
+ m0 = ~ x1 + x2 + u + I(u^2),
+ m1 = ~ x1 + x2 + u + I(u^2),
+ propensity = ~p,
+ treat = 'd')
> ## Do not rescale
> args$rescale <- FALSE
> do.call(ivmte, args)
Point estimate of the target parameter: 0.315994
Warning message:
MTR is point identified via linear regression. Shape constraints are ignored.
> ## Now rescale
> args$rescale <- TRUE
> do.call(ivmte, args)
Point estimate of the target parameter: 0.3158347
Warning message:
MTR is point identified via linear regression. Shape constraints are ignored.
>
In regard to how scaling is performed in the point identified case:
When m0 and m1 have intercept terms, the scaling procedure in the writeup results in a collinear design matrix. That is, the intercept terms of m0 and m1 will be collinear with the new intercept $\xi_0$. So checking for point identification is done by checking the rank of the design matrix before rescaling everything.
Estimation is then performed by solving a QP, with the only linear constraint being equation (10) in the writeup, i.e. that $\xi_0$ is a linear combination of the other $\xi_k$.
If you only scale the quadratic components by 1/n but not anything else, then you've changed the relative weight in the objective.
Ah sorry, I worded that very poorly. What I meant to say was that I scaled the entire quadratic objective when minimizing the criterion, and I scaled the entire quadratic constraint when obtaining the bounds. i.e. I scaled the sum of squared residuals by 1/n. Let me know if I'm wrong to have done this.
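To spell out why I thought this was innocuous in theory (generic notation here, not the writeup's): rescaling the criterion by a positive constant does not change its minimizers, and scaling both sides of the quadratic constraint does not change its feasible set:

$$
\arg\min_{\theta} \tfrac{1}{n}\sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2
= \arg\min_{\theta} \sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2,
\qquad
\tfrac{1}{n}\sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2 \le \tfrac{\bar{Q}}{n}
\;\Longleftrightarrow\;
\sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2 \le \bar{Q}.
$$

So only the numerical conditioning of the problem changes, not its solution set.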
Difference in solutions between unscaled and rescaled problems depends on number of variables and constraints
Usually the differences are indeed small, but they are noticeable sometimes. Here is one example where the lower bounds for the unscaled and scaled case differ by over 3%. (But maybe 3% isn't a big deal?) Gurobi does not spit out the suboptimality warning in this example.
> args <- list(data = dtcf,
+ target = 'ate',
+ outcome = 'ey',
+ m0 = ~ u + I(u^2),
+ m1 = ~ u + I(u^2),
+ propensity = ~p,
+ treat = 'd',
+ point = FALSE) ## Force partial identification
> ## Do not rescale
> args$rescale <- FALSE
> do.call(ivmte, args)
Bounds on the target parameter: [-0.04050069, -0.0325813]
Audit terminated successfully after 1 round
> ## Now rescale
> args$rescale <- TRUE
> do.call(ivmte, args)
Bounds on the target parameter: [-0.03910859, -0.03322879]
Audit terminated successfully after 1 round
When the suboptimality warning is thrown, the differences can indeed be quite large. In the following example, I was able to get the suboptimal warning even when minimizing the criterion. The unscaled estimate has a minimum criterion of 9, whereas the scaled one has a minimum criterion of 37. As a result, the bounds differ quite a bit.
> dtcf$x0 <- 1
> args <- list(data = dtcf,
+ target = 'ate',
+ outcome = 'ey',
+ m0 = ~ x0 + x1 + x2 + u + I(u^2),
+ m1 = ~ x0 + x1 + x2 + u + I(u^2),
+ propensity = ~p,
+ treat = 'd')
> ## Do not rescale
> args$rescale <- FALSE
> do.call(ivmte, args)
Warning: The LP solver was unable to satisfy the optimality tolerance when minimizing the criterion,
so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through
the argument 'solver.options'.
Warning: The LP solver was unable to satisfy the optimality tolerance for the minimization problem,
so a suboptimal solution is returned. The LP solver was unable to satisfy the optimality tolerance for
the maximization problem, so a suboptimal solution is returned. Tolerance parameters for the LP
solver can be passed through the argument 'solver.options'.
Bounds on the target parameter: [0.1175554, 0.3361068]
Audit terminated successfully after 1 round
> ## Now rescale
> args$rescale <- TRUE
> do.call(ivmte, args)
...
Warning: The LP solver was unable to satisfy the optimality tolerance when minimizing the criterion,
so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through
the argument 'solver.options'.
Warning: The LP solver was unable to satisfy the optimality tolerance for the minimization problem,
so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through
the argument 'solver.options'.
Bounds on the target parameter: [-0.08836716, 0.4813966]
Audit terminated successfully after 2 rounds
Let's start with the simplest problem first:
In regard to how scaling is performed in the point identified case: When m0 and m1 have intercept terms, the scaling procedure in the writeup results in a collinear design matrix. That is, the intercept terms of m0 and m1 will be collinear with the new intercept $\xi_0$. So checking for point identification is done by checking the rank of the design matrix before rescaling everything.
If the rescaled design matrix is collinear, but the unrescaled design matrix is not, then the rescaling is changing the optimization problem. So this is an issue.
I see now why it's happening. A constant in m0 becomes a B regressor like (1-D) -- recentered/scaled it is still (1-D). A constant in m1 becomes a B regressor like D -- recentered/scaled it is still D. So when we add a constant to the rescaled regression we have (1-D) + D = 1 ---> perfectly collinear.
I think this is a function of poor indexing on my part. If we just had any old regression, obviously we couldn't normalize the constant term because its upper bound is the same as its lower bound. In our regression, we actually have two constant terms potentially -- it's just that they don't look constant when we pool the regression across D = 0 and D = 1 arms.
So we shouldn't be normalizing either of those terms.
Here's a revised document that fixes this. stability.pdf Try that and let me know if it fixes the point identified case. If so, then we can re-examine the partially identified case. (Whatever was going on there may have been related to this issue as well.)
Thanks for that writeup; it has been implemented (but not yet pushed---let me know if you would like it, and I can clean up the code).
If both MTRs include intercepts, then a simple linear regression is performed whether or not everything is rescaled. That is, constraint (6) from the new writeup imposes no real restrictions. However, if either one of the MTRs is missing the intercept, then a QP problem is solved, and constraint (6) will need to be imposed.
Something annoying I found was that R will sometimes add tiny noise to the data when it is being manipulated. For instance, I passed an MTR with two constants, e.g.
dtcf$x0 <- 1 ## Add a constant variable to the data
args$m0 <- ~ 1 + x0 + x1 + u ## Pass MTR with two intercepts
If the data is not being rescaled/manipulated, then R will detect there is a problem with collinearity when performing the regression.
However, when the data is being rescaled/manipulated, then those intercept terms will actually be 1 + noise, and the noise gets rescaled. R then fails to detect collinearity when performing the regressions, and the coefficient estimates become quite ugly. To get around this, I still perform the regression using the unscaled data to test for collinearity, even if rescale = TRUE.
Let me know if you'd like me to look further into the new collinearity issue, or if my description at the top does not align with what you had in mind or the writeup.
If both MTRs include intercepts, then a simple linear regression is performed whether or not everything is rescaled. That is, constraint (6) from the new writeup imposes no real restrictions. However, if either one of the MTRs is missing the intercept, then a QP problem is solved, and constraint (6) will need to be imposed.
I don't understand. There should be no reason to be solving a QP here if we are point identified. Equation (6) is not a restriction to be imposed --- it just tells you how to interpret the intercept after you've run the rescaled regression
If the data is not being rescaled/manipulated, then R will detect there is a problem with collinearity when performing the regression. However, when the data is being rescaled/manipulated, then those intercept terms will actually be 1 + noise, and the noise gets rescaled. R then fails to detect collinearity when performing the regressions, and the coefficient estimates become quite ugly. To get around this, I still perform the regression using the unscaled data to test for collinearity, even if rescale = TRUE.
That should be fine.
To reiterate: the main purpose of rescaling for the point identified case is as a check to make sure everything is working before we go to the partially identified case, where we have less intuition.
So we want to see that rescaling makes no difference across many point identified examples.
Maybe take 500 random subsets of the AE data and check for each one whether you get the same thing, across say 3 or 4 specifications?
Solving with both the QP and lm is also a good check, but again in practice we do not want to be solving a QP for the point identified case.
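To be concrete, something like the following sketch is what I have in mind (illustrative only, assuming the AE data is loaded as in the earlier examples; the specification, subsample size, and the name of the point-estimate element in the return object are assumptions, not the actual test script):

```r
## Sketch: compare rescaled vs. unrescaled point estimates on random 1000-row
## subsets of the AE data for one specification.
library(ivmte)

set.seed(1)
diffs <- replicate(500, {
    sub <- AE[sample(nrow(AE), 1000), ]
    args <- list(data = sub,
                 target = "att",
                 outcome = "worked",
                 m0 = ~ u + yob,
                 m1 = ~ u + yob,
                 propensity = morekids ~ samesex)
    args$rescale <- FALSE
    est0 <- do.call(ivmte, args)
    args$rescale <- TRUE
    est1 <- do.call(ivmte, args)
    est1$point.estimate - est0$point.estimate  # element name is an assumption
})
summary(abs(diffs))
```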
Equation (6) is not a restriction to be imposed --- it just tells you how to interpret the intercept after you've run the rescaled regression
Hm, but without equation (6), the computer cannot determine how many free parameters there really are, right?
For example, suppose both m0 and m1 are of the form ~ 1 + x, so that there are four unknown coefficients in total. We can estimate these four coefficients using a regression. Whether or not we rescale, the design matrix will have four columns. From this, the computer will correctly determine that this is a problem with four unknowns.
Now suppose I drop the intercepts so that m0 and m1 are ~ 0 + x. Without rescaling, the design matrix has two columns. When performing a simple regression, the computer recognizes this is a problem with two unknowns. But when rescaling, the design matrix will have four columns, because of the new intercept terms. Without imposing equation (6), the computer will treat the problem as one with four (unrestricted) unknowns---but really, it's a problem with two unknowns, since the intercept is fully determined by the minimums of x and the coefficients on x for D = 0 and D = 1.
Using MTRs without intercepts, I tried comparing the SSR and coefficient estimates from an unscaled regression against those of QP problems with and without equation (6) as a constraint. The SSR and coefficient estimates from the unscaled regression only match those of the QP with equation (6) as a constraint.
Let me know if I'm still getting it wrong, though.
Maybe take 500 random subsets of the AE data and check for each one whether you get the same thing, across say 3 or 4 specifications?
Sure, I'll get that started.
Now suppose I drop the intercepts so that m0 and m1 are ~ 0 + x. Without rescaling, the design matrix has two columns. When performing a simple regression, the computer recognizes this is a problem with two unknowns. But when rescaling, the design matrix will have four columns, because of the new intercept terms. Without imposing equation (6), the computer will treat the problem as one with four (unrestricted) unknowns---but really, it's a problem with two unknowns, since the intercept is fully determined by the minimums of x and the coefficients on x for D = 0 and D = 1.
Got it, thanks for the clarification -- I understand what you are saying now.
So to summarize:
There's still no reason to use a QP though, because CLS has a closed-form solution. Bruce Hansen's textbook Section 8.2 has a clear explanation of this. I wonder if there is a stable package (or maybe even a command in base R) that implements CLS? I am a bit wary of trying to code up the closed-form solution ourselves because it involves inverting the design matrix. As we know, that's not the most stable way to compute a linear regression. I expect someone has devised a better computational approach (and if so, it would probably be contained in whatever R package implements this).
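For concreteness, the closed-form solution I have in mind is the standard one for least squares subject to linear equality constraints (writing the constraint as $R'\beta = c$):

$$
\tilde{\beta}_{\mathrm{cls}}
= \hat{\beta}_{\mathrm{ols}}
- (X'X)^{-1} R \left[ R'(X'X)^{-1} R \right]^{-1} \left( R'\hat{\beta}_{\mathrm{ols}} - c \right),
$$

which is exactly where the $(X'X)^{-1}$ term, and hence the conditioning concern, comes from.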
Alternatively, what if we just drop the centering and only rescale? So Btilde = B/(ub - lb). Then we don't have this issue with the constants. Will that give us the right scaling? (I think Btilde has to be between -1 and 1? But might have gotten my inequalities confused)
There's still no reason to use a QP though, because CLS has a closed--form solution. Bruce Hansen's textbook Section 8.2 has a clear explanation of this.
Oh great, thank you for the reference.
I wonder if there is a stable package (or maybe even a command in base R) that implements CLS? I am a bit wary of trying to code up the closed--form solution ourselves because it involves inverting the design matrix.
Sure, I can check.
Alternatively, what if we just drop the centering and only rescale? So Btilde = B/(ub - lb).
Ah, maybe you mean to just scale each covariate x by 1 / max(abs(x))? That will make the rescaled covariates lie between -1 and 1 (but not necessarily span the entire interval).
Maybe take 500 random subsets of the AE data and check for each one whether you get the same thing, across say 3 or 4 specifications?
So I tried these three specifications for both m0 and m1:
- ~ u
- ~ u + yob
- ~ u + u:yob + yob

I would also modify them so that:
- m0 and m1 have intercepts
- m0 has an intercept
- m1 has an intercept
- neither m0 nor m1 have intercepts

So there were 12 different pairs of MTRs considered.
I estimated each specification on 500 subsamples of the AE data, each with 1000 observations. For specifications of type ~ u and ~ u + yob, everything checks out. That is, differences in point estimates, coefficient estimates, and SSRs are of the magnitude 1e-7 or smaller.
For specifications of type ~ u + u:yob + yob, there are a handful of cases where the point estimates and MTR coefficient estimates differ by much larger amounts (up to about 10% of the estimate). Differences in the SSRs remain small, though. It turns out these cases all occur when the AE sample is such that there is little variation in the propensity scores, which results in almost-collinear design matrices.
Ah, maybe you mean to just scale each covariate x by 1 / max(abs(x))? That will make the rescaled covariates lie between -1 and 1 (but not necessarily span the entire interval).
I had actually meant 1/ (ub(x) - lb(x)). But 1/max(abs(x)) seems like it would work too. Can you work out some cases to see any pros/cons between them?
I estimated each specification on 500 subsamples of the AE data, each with 1000 observations. For specifications of type ~ u and ~ u + yob, everything checks out. That is, differences in point estimates, coefficient estimates, and SSRs are of the magnitude 1e-7 or smaller.
For specifications of type ~ u + u:yob + yob, there are a handful of cases where the point estimates and MTR coefficient estimates differ by much larger amounts (up to about 10% of the estimate). Differences in the SSRs remain small, though. It turns out these cases all occur when the AE sample is such that there is little variation in the propensity scores, which results in almost-collinear design matrices.
The comparison you are doing here is between scaled and unscaled? Or between QP and lm?
What happens if you kick up the sample size to say 5000? I imagine the AE instrument is quite weak with only 1000 observations.
I had actually meant 1/ (ub(x) - lb(x)). But 1/max(abs(x)) seems like it would work too. Can you work out some cases to see any pros/cons between them?
Sure thing.
The comparison you are doing here is between scaled and unscaled? Or between QP and lm?
Oh sorry, I wasn't clear on that. So if there are intercepts in both the MTRs, the comparisons are between scaled and unscaled (so R is just running a linear regression). All these comparisons check out.
The discrepancies only arise for ~ u + u:yob + yob whenever an intercept is omitted, i.e. when comparing unscaled lm against scaled QP.
What happens if you kick up the sample size to say 5000? I imagine the AE instrument is quite weak with only 1000 observations.
Ah you're absolutely right. With the larger sample size, all of the comparisons check out.
The discrepancies only arise for ~ u + u:yob + yob whenever an intercept is omitted, i.e. when comparing unscaled lm against scaled QP.
So you mean ~ 0 + u + u:yob + yob? Or am I missing something in the formula specification?
Anyway, this is probably because solving a QP directly involves inverting matrices (I think...unclear exactly what is going on in Gurobi) vs. lm, which is going to solve the system of equations using a QR decomposition. That's the reason we want to use lm whenever possible. With the new normalizing schemes (scaling but not centering) hopefully that won't be a problem.
In the worst case, we can normalize by centering/scaling if there is a constant and just by scaling if there is not a constant. The only reason I can think for using a no-constant specification is if you want a specific interpretation for a set of dummies.
So you mean ~ 0 + u + u:yob + yob? Or am I missing something in the formula specification?
Argh, sorry I'm not doing a good job communicating. Yes, that's right.
Okay, it looks like the best package for CLS is lsei. It is based on Fortran code, and calls on C, so it's really fast. It also seems very robust to scaling. For the sake of documentation, I've included below a brief description of the other packages I considered, and why I did not go with them. @slacouture also expressed that he was looking for a suitable CLS package.
To better test lsei, I ran a simulation generating random data sets, and performed some CLS regression on them. The constraints include both equality and inequality constraints. The scaling of the data was varied in the simulation, beginning with data whose covariates were of similar magnitude, and ending with data where one of the covariates was extremely large. In this simulation, gurobi was able to solve 47% of the cases. lsei solved them all.
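To give a sense of the setup, here is an illustrative sketch of a single CLS problem of the kind used in the simulation (the data-generating process and the constraints are made up; this is not the actual simulation code):

```r
## Sketch: a CLS problem with one equality and one inequality constraint,
## solved with lsei. The third regressor is deliberately badly scaled.
library(lsei)

set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), 1e6 * rnorm(n))
y <- as.vector(X %*% c(1, 2, 3e-6) + rnorm(n))

eqA   <- matrix(c(0, 1, 0), nrow = 1)   # equality:   beta2  = 2
eqRhs <- 2
A     <- matrix(c(0, 0, 1), nrow = 1)   # inequality: beta3 >= 0
rhs   <- 0

## minimize ||X b - y|| subject to eqA %*% b = eqRhs and A %*% b >= rhs
lsei::lsei(a = X, b = y, c = eqA, d = eqRhs, e = A, f = rhs)
```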
In regard to scaling, we've considered three methods.
1. Center and scale (the original writeup), so that each regressor lies in the [0, 1] interval.
2. Scale by 1 / (max(x) - min(x)). But if a variable with a large mean has a small variance, then its scaled counterpart will still be large relative to the rest of the covariates.
3. Scale by 1 / max(abs(x)). This guarantees that each regressor lies in the [-1, 1] interval.

As with the previous simulation, I generated some random data, and varied the scaling of one of the covariates across the simulations. I then tried all three methods of rescaling the data. The CLS regressions were performed on both gurobi and lsei. The fraction of cases gurobi could handle was 53% for method 1, 40% for method 2, and 50% for method 3. lsei could handle them all, though.
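In code, the three scalings look like this for a single regressor x (illustrative; method 1 assumes centering at the minimum, as in the writeup):

```r
x <- c(5, 7, 12, 30)                         # any numeric regressor
scale1 <- (x - min(x)) / (max(x) - min(x))   # method 1: center and scale -> [0, 1]
scale2 <- x / (max(x) - min(x))              # method 2: scale by the range only
scale3 <- x / max(abs(x))                    # method 3: scale by max |x| -> [-1, 1]
```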
For documentation, this is the code for the simulations described above. gurobi-vs-lsei.zip
For documentation, here are the packages I considered.
- pcls, but that is a bit annoying to use since the user must find a feasible solution first.
- qprog, but this was unstable and relies on the solve command.
- lsqlin, but that only allows for equality constraints.
- colf_nls, but that only allows for box constraints.
- mlsei, but this is for multivariate regression, and the linear constraints pertain to the coefficients for different outcomes.
- cgam, but does not seem to allow for simple linear constraints on the coefficients.

In regard to scaling, we've considered three methods.
- Center and scale (the original writeup).
Sorry, that should've been "the second writeup."
Well that's a bit disappointing, but the Lawson and Hanson book that is referenced in lsei seems like exactly the book I was looking for and presumed existed but didn't know the name of.
So that's a good lead -- I'll read up on it and see if some of their insights can be applied to our problem.
How was the speed comparison of Gurobi vs. lsei?
Also, when you say Gurobi couldn't handle a case, what does that mean exactly?
Do you have the distribution of the return codes for cases where it failed?
If we were only solving the QP in the first step, we could just use lsei. Unfortunately we have the second step QCQP to also contend with.
Any package that solves a second-order cone program (SOCP) or semi-definite program (SDP) would also work.
I wonder if some of these might be better than Gurobi because they are more tailored?
I found two that solve SOCPs: scs and CLSOCP.
There are likely others; the R optimization task view looks like a great reference: https://cran.r-project.org/view=Optimization
I also found this code for an algorithm the author claims is "safe" https://rstudio-pubs-static.s3.amazonaws.com/239135_7a61d7e587314277b79f47673cd74920.html Just recording it here for reference.
Do you think you could try scs and CLSOCP and compare them to Gurobi for the second-step problem?
How was the speed comparison of Gurobi vs. lsei?
lsei is really fast... It takes a quarter of the time to run compared to Gurobi. This made me suspicious. So I decided to compare more closely the estimates for Gurobi and lsei. I considered only the cases where the Gurobi status was OPTIMAL. Here's what I found.
When the scaling of the data is okay, then the estimates are almost identical (e.g. differ by 1e-9 percent on average). When the scaling gets bad, the estimates can differ by up to 5% on average. However, there are extreme cases where one of the coefficients estimated by Gurobi is 5 times that of the coefficient estimated by lsei. Again, I'm only doing this comparison whenever Gurobi returns an OPTIMAL status. So I'm wondering if some of the estimates by lsei are suboptimal. However, since lsei does not return status codes, it's hard to know unless I really dive into the code.
The overall average difference (pooling across reasonably scaled and poorly scaled data) in the coefficient estimates is still small, though, and ranges between 0.1% and 1.3%. So perhaps the differences between Gurobi and lsei are not a big deal.
Also, when you say Gurobi couldn't handle a case, what does that mean exactly? Do you have the distribution of the return codes for cases where it failed?
Ah, sorry.
I considered Gurobi unsuccessful in its optimization whenever it returned a status code other than OPTIMAL.
Below is one of the distributions of the status codes. Recall that I generated many data sets, and changed the scale of one of the variables, e.g. I multiplied the variable by a constant between a and b. The left panel shows the status codes across the whole simulation. The middle panel shows the distribution of status codes when I restrict attention to the bottom half of the scaled data, e.g. I've scaled one of its variables by a constant between a and (a + b) / 2. And the right panel shows the distribution of status codes when I restrict attention to the top half of the scaled data, e.g. I've scaled one of its variables by a constant between (a + b) / 2 and b.
I wonder if some of these might be better than Gurobi because they are more tailored? ... Do you think you could try scs and CLSOCP and compare them to Gurobi for the second-step problem?
Yep, I can do that.
Also, the CPLEX website and the cplexAPI manual seem to both suggest that it is possible for CPLEX to solve QCQPs in R. So I'll try to set that up as well.
So I decided to compare more closely the estimates for Gurobi and lsei. I considered only the cases where the Gurobi status was OPTIMAL.
But this still leaves 47% of cases where Gurobi was not OPTIMAL, doesn't it?
So I would still say lsei is dominating Gurobi. Or am I misunderstanding your results?
So I'm wondering if some of the estimates by lsei are suboptimal. However, since lsei does not return status codes, it's hard to know unless I really dive into the code.
So when you say that lsei solved all of the problems what do you mean exactly? Just that it didn't return an error code? (But it never returns error codes?)
An easy way to check whether it is suboptimal is just to compare the optimal criterion value to what Gurobi finds. Presumably it should always be smaller (in a minimization problem) if it is finding the optimal solution.
I considered Gurobi unsuccessful in its optimization whenever it returned a status code other than OPTIMAL.
I think that's the right way to evaluate performance. We don't want to have to be second guessing this all the time (and certainly don't want to user to be faced with uncertain return codes).
Also, the CPLEX website and the cplexAPI manual seem to both suggest that it is possible for CPLEX to solve QCQPs in R. So I'll try to set that up as well.
I have a vague recollection from a different project that this is quite hard to do with the CPLEX API and may not be possible. But I'm not sure. Anyway, Gurobi is essentially a fork of CPLEX (albeit several years removed now), so it's not clear to me that it's going to be much better.
Probably better to prioritize scs since this is more closely tailored to the types of problems we are looking at. My guess is that Gurobi is just trying to "do too much" and not exploiting the structure that we have in our problem.
So I would still say lsei is dominating Gurobi. Or am I misunderstanding your results?
Nope, that's right.
So when you say that lsei solved all of the problems what do you mean exactly? Just that it didn't return an error code? (But it never returns error codes?)
Yeah, lsei will either return coefficient estimates, or throw out the following error regarding least distance problems:
> lsei::lsei(a = X, b = y, c = eqA, d = eqRhs, e = A, f = rhs)
Error in ldp2(et, ft) : Incompatible inequalities in ldp()
The optimal criterion values calculated by Gurobi and lsei match whenever Gurobi is optimal. But my concern had to do with cases where we cannot compare lsei against Gurobi, since Gurobi couldn't provide an optimal solution. But my concerns are mostly allayed by the plots below.
The first plot compares the SSR from the constrained regressions by gurobi and lsei. On the x-axis is how much I scaled one of the variables by. Blue dots represent gurobi, red dots represent lsei. Circular dots indicate optimal solutions. I considered lsei solutions to be optimal so long as all the constraints were satisfied. The crosses indicate suboptimal solutions (for Gurobi, this coincided with cases where constraints were violated). The plot seems to suggest the solutions obtained by lsei are reliable, even when I cannot compare them to Gurobi.
I decided to really push the scaling to see how lsei performs when near the point of breaking. I found that you can really push the scaling before the SSR starts to behave strangely. (The scaling can go much further than what I show in the plot below before lsei returns an error.) Even then, the coefficient estimates were still very reasonable. So I think lsei is indeed a good package to use.
Probably better to prioritize scs since this is more closely tailored to the types of problems we are looking at.
Sure thing.
Ok, sounds good, just a question on this:
Yeah, lsei will either return coefficient estimates, or throw out the following error regarding least distance problems:
lsei::lsei(a = X, b = y, c = eqA, d = eqRhs, e = A, f = rhs) Error in ldp2(et, ft) : Incompatible inequalities in ldp()
Can you explain that? Is that an error related to numerical problems? (Because it sounds like something else...)
The R function runs the Fortran function ldp, and returns that message whenever Fortran indicates it is not successful in finding a solution.
r = .Fortran("ldp", e, m, m, n, f, x = x, xnorm, w, index,
mode = mode, PACKAGE = "lsei")[c("x", "mode")]
if (r$mode != 1)
stop("Incompatible constraints in ldp()")
Below is the Fortran code, which is available here. Fortran actually returns more detailed error codes, but unfortunately we won't know which.
C SUCCESSFUL RETURN.
110 MODE=1
RETURN
C ERROR RETURN. N .LE. 0.
120 MODE=2
RETURN
C RETURNING WITH CONSTRAINTS NOT COMPATIBLE.
130 MODE=4
RETURN
I don't quite understand Fortran code, but the Fortran function LDP() calls on a non-negative least squares algorithm NNLS(), and the output from that determines the error code returned by LDP().
I see, so just to be clear: you think that lsei is returning an uninformative error code when it fails. (Even though with additional programming the author could have perhaps pulled a more detailed error code from the underlying Fortran routine.)
Yep, that's right.
Sorry this is taking longer than perhaps expected; I have to read up on SOCPs and SDPs.
In the case of SOCPs, the free solvers require the user to input non-rotated quadratic cones. However, rewriting a QCQP problem as an SOCP problem requires a rotated quadratic cone. I've attached a writeup explaining why this is a problem if we want to avoid quadratic constraints. Perhaps I am doing something wrong, though.
qcqp-to-socp.pdf [UPDATED: Includes explanation of why I am writing cones the way that I am.]
However, Mosek lets you input rotated cones. Like Gurobi and CPLEX, it offers free academic licenses.
I'm still looking into reformulating QCQPs into SDPs.
I realized I wasn't being clear in the PDF on why I was writing the conic constraint the way that I was. I've updated the PDF to include an explanation.
I'd be really surprised if it's not possible. It's probably just some OR tricks that we're not seeing.
I took a stab at it (see the last page). qcqp-to-socp.pdf I think it looks right, but I've never worked with this class of problems. What do you think?
I got a hold of the Lawson and Hanson book and read through the algorithm used for lsei.
(By the way, ldp is for a more general class of problems --- lsei translates the problem into one of the form ldp. That's why the errors talk about ldp. Then ldp in turn depends on an algorithm for solving non-negative least squares -- nnls.)
It's pretty impressive stuff. It is simple and very tailored to the structure of the problem. After reading it I am not at all surprised that it dominates Gurobi.
So hopefully the SOCP solvers will have similar advantages.
In reading the book I also found an alternative way to scale the design matrix that should improve scaling and in particular is designed to improve the condition number. (Different than what we had come up with.) So we could try that too with Gurobi as a last-ditch effort.
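I haven't checked yet whether it matches the book's suggestion, but a standard scaling of this kind is column equilibration: divide each column of the design matrix by its Euclidean norm before solving, then undo the scaling on the coefficients. A sketch:

```r
## Sketch of column equilibration (a standard conditioning fix; not necessarily
## the Lawson--Hanson scheme).
equilibrate <- function(X) {
    s <- apply(X, 2, function(col) sqrt(sum(col^2)))  # column norms
    list(Xs = sweep(X, 2, s, "/"), s = s)
}

set.seed(1)
X <- cbind(1, rnorm(100), 1e6 * rnorm(100))
y <- rnorm(100)

eq <- equilibrate(X)
coef_scaled <- qr.solve(eq$Xs, y)   # least squares on the well-conditioned matrix
coef_scaled / eq$s                  # map coefficients back to the original scale
```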
It's probably just some OR tricks that we're not seeing.
Correction: OR tricks that I wasn't seeing (the simplest of them, too!).
Thanks for those notes; I've now been able to get CLSOCP to work. (Still trying to get scs to work.)
So hopefully the SOCP solvers will have similar advantages.
Unfortunately, the performance of CLSOCP does not look great.
One problem seems related to how each observation in the data generates two new variables in the SOCP setup. One variable is trivial, and is set equal to the dependent variable. But the other variable is not, and is equal to the residual. So while CLSOCP can sometimes match Gurobi for small data sets, e.g. n = 3, the solution becomes quite different by the time n = 10. It also takes longer to run when n gets large.
Another problem is that CLSOCP doesn't seem to do a good job optimizing. In the case of minimization problems, the optimal objective values from CLSOCP are often much larger than those of Gurobi.
As a note: CLSOCP uses the algorithm linked a few posts above, i.e. that algorithm is not another alternative.
One problem seems related to how each observation in the data generates two new variables in the SOCP setup.
Maybe I'm wrong to think that, though, since the link on the algorithm used by CLSOCP suggests it's easier to convert a QCQP to a SOCP and solve that, rather than solve the QCQP directly.
Hopefully scs will be better.
It's certainly problematic that this formulation increases in difficulty with n.
Does that seem like it is inherent to any SOCP formulation?
It might also be useful to compare to a general purpose solver that only uses derivative information. e.g. this one --- although there are many, so maybe that's not the best.
This would be a useful benchmark at least to compare Gurobi against.
It's certainly problematic that this formulation increases in difficulty with n. Does that seem like it is inherent to any SOCP formulation?
Ah I finally figured out how to get scs to work, and it is much, much better than CLSOCP. However, it still suffers from this issue of becoming more difficult to solve as n grows. It can keep up with Gurobi in terms of speed and optimization up to about 300 observations before it starts to regularly return suboptimal solutions and slow down.
The SOCP formulation I proposed doesn't require us to introduce a new non-trivial variable for every observation. However, it involves those rotated cones, and it looks like we need Mosek to implement that formulation. But maybe there is another way to formulate the SOCP problem to avoid this---I'll give it another shot.
It might also be useful to compare to a general purpose solver that only uses derivative information. e.g. this one --- although there are many, so maybe that's not the best.
Sure, I'll look into that as well.
In the case of SOCPs, the free solvers require the user to input non-rotated quadratic cones. However, rewriting a QCQP problem as an SOCP problem requires a rotated quadratic cone. I've attached a writeup explaining why this is a problem if we want to avoid quadratic constraints. Perhaps I am doing something wrong, though.
Going back to the original problem, which was the idea that these solvers only take non-rotated quadratic cones. This Mosek page shows how to convert a rotated quadratic cone into a standard quadratic cone.
Your original proposed reformulation did not have the # of variables increasing in n, did it?
It seems based on the Mosek page (which has a wealth of information) that it should be able to phrase our problem with a non-rotated quadratic cone constraint by combining section 3.3.1 and section 3.1.
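For reference, the standard identity behind that conversion is

$$
2 x_1 x_2 \ge \lVert z \rVert^2,\; x_1 \ge 0,\; x_2 \ge 0
\quad\Longleftrightarrow\quad
\left\lVert \begin{pmatrix} x_1 - x_2 \\ \sqrt{2}\, z \end{pmatrix} \right\rVert_2 \le x_1 + x_2,
$$

so a rotated cone constraint can be passed to a solver that only accepts standard quadratic cones by introducing the linear combinations $x_1 + x_2$ and $x_1 - x_2$.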
Your original proposed reformulation did not have the # of variables increasing in n, did it? It seems based on the Mosek page (which has a wealth of information) that it should be able to phrase our problem with a non-rotated quadratic cone constraint by combining section 3.3.1 and section 3.1.
Oh, right, it was not increasing in n. Yes, I'll give this a go as well.
Good news: manually undoing the rotation in the SOCP works.
So we have a way to estimate QCQPs as SOCPs using scs without adding new variables for every observation. Just to be clear, I have not incorporated this into ivmte yet.
I noticed that rescaling the problem really improves the performance of scs. For example, the quadratic constraint can involve fairly large matrices and vectors if the sample size is large. If I don't rescale the quadratic constraint, scs and gurobi are about equally fast and typically get similar solutions. However, there are some instances where scs is way off. But by dividing the entire quadratic constraint by the sample size, scs always matches gurobi, and takes a quarter of the time to run.
However, because of the way scs requires the user to submit the problem, we must introduce a new slack variable for every inequality constraint. This can potentially be problematic in the audit, but we can test this if we do decide to incorporate scs into ivmte.
For the sake of documentation, I've updated the notes above to record how a QCQP problem can be reformulated to be compatible with scs.
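For context, the slack variables come from the conic standard form that scs uses,

$$
\min_{x}\; c^{\top} x
\quad\text{subject to}\quad
A x + s = b,\; s \in \mathcal{K},
$$

so every linear inequality $a_j^{\top} x \le b_j$ has to be entered as an equality with a slack $s_j \ge 0$ sitting in the nonnegative part of the cone $\mathcal{K}$.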
Great to hear!
Rescaling here just means dividing by sample size? Were you doing that with Gurobi, and if so was it making a difference?
In this case, rescaling does simply mean dividing by the sample size.
Also, it is only done for the quadratic constraint.
I will do another test with poorly scaled data, and then see how scs responds when I rescale everything according to the writeup you posted earlier.
And I was indeed dividing the quadratic constraint by the sample size with Gurobi. Doing this did resolve some instances where we were getting suboptimal results. However, it didn't always lead to optimal results, and I believe it sometimes even brought about the suboptimal warnings.
Oh sorry, I realized I misunderstood your question.
Were you doing that with Gurobi, and if so was it making a difference?
I believe you are simply asking about this recent comparison of scs against Gurobi, and not the package. So yes, I was dividing by the sample size with Gurobi when comparing it against scs. But it did not really make a difference for Gurobi.
And I was indeed dividing the quadratic constraint by the sample size with Gurobi. Doing this did resolve some instances where we were getting suboptimal results. However, it didn't always lead to optimal results, and I believe it sometimes even brought about the suboptimal warnings.
What I was saying above pertains to the ivmte package.
Sorry, I'm very confused now.
If I don't rescale the quadratic constraint, scs and gurobi are about equally fast and typically get similar solutions. However, there are some instances where scs is way off. But by dividing the entire quadratic constraint by the sample size, scs always matches gurobi, and takes a quarter of the time to run.
We are going through all of this effort because Gurobi is not terminating cleanly. So why should matching Gurobi be the benchmark for success?
You are saying here that scs matches Gurobi, but is faster. But speed was not our concern. Our concern was that Gurobi is terminating at suboptimal solutions.
Yes, the last few posts were for tests that Gurobi can actually solve. I did these because the alternative packages do not always warn when their solutions are suboptimal. For example, for problems that Gurobi can solve, CLSOCP would often return different solutions that turned out to be suboptimal. However, unlike Gurobi, CLSOCP didn't warn me that they were suboptimal. So I thought it was worth checking that the packages we are considering are comparable to Gurobi for simpler problems. And if they turn out to be much worse than Gurobi, then it seems unlikely they would do better in problems that even Gurobi struggles with.
lsei likewise doesn't report the optimization status, but its performance with the simpler problems gives some credibility to its solutions for problems that Gurobi struggles with. And I will certainly check how scs performs for problems where Gurobi terminates early!
Ah, sorry for emphasizing the time. A question about speed came up earlier, so I thought I'd include the details on that.
Got it, now it makes sense! The true test will be how scs performs on the problems that Gurobi fails on.
I wrote up a procedure for doing so in the attached note. @jkcshea let me know if it makes sense and looks feasible.
Here's a demonstration of why I think it will work:
[code and output omitted]
My guess is that this will solve #196 as well as countless future problems we have yet to encounter!