Closed by a-torgovitsky 3 years ago
Here's the writeup: stability.pdf
This is great! The writeup is clear, and other than making sure I scale everything correctly, this should be straightforward to implement.
Ok, great!
Maybe put this into a development branch? Or create a new/old function temporarily? Not sure what the best approach is.
But we want to be able to check the unscaled/scaled approach against each other in several examples where we are confident the unscaled approach is working, so that we are sure there aren't any flaws in my reasoning.
Ah yes, I've been making all the changes on separate branches, and then merging them into the master branch once I think they are ready. So I'll continue using that approach, which will make it very easy to compare the scaled/unscaled versions of the code.
Excellent! And like I commented in the note, I think checking the point identified case first is a useful check, since we don't have any computational issues there.
I suppose it might be useful to turn off scaling for debugging purposes too. Is that going to be difficult to add an option to allow that?
Good idea. Adding the option shouldn't be a problem at all.
Sorry for taking a while on this.
1. The procedure does not eliminate suboptimal solutions
I've implemented the procedure.
It can be toggled by setting rescale to TRUE or FALSE. However, setting rescale = TRUE does not necessarily resolve the issue where the solver is unable to satisfy the optimality tolerance. The following example from #196 still results in this error, regardless of the value for rescale.
args <- list(data = AE,
target = "att",
outcome = "worked",
m0 = ~ u + I(u^2) + yob + u*yob,
m1 = ~ u + I(u^2) + I(u^3) + yob + u*yob,
propensity = morekids ~ samesex,
debug = TRUE,
rescale = FALSE,
noisy = TRUE)
r1 <- do.call(ivmte, args)
## Now rescale
args$rescale <- TRUE
r2 <- do.call(ivmte, args)
Another issue is that the solutions differ quite a bit depending on the value of rescale. For instance, below are the bounds from the block of code above.
> r1
Bounds on the target parameter: [-0.2871085, 0.05524949]
Audit reached audit.max (25)
> r2
Bounds on the target parameter: [-0.3084282, 0.1287807]
Audit reached audit.max (25)
I've checked the code several times, and I can't seem to find any new errors.
The problem seems to stem from the QCP, as I have yet to see this problem when minimizing the criterion.
Also, the minimum criterion does not vary with the value of rescale.
The scaling does play a role, though. The matrices and vectors defining the quadratic components of the QP and QCP can be very large even after implementing the rescaling procedure, since they sum over the observations. This is the case with the AE data set, which has 200,000+ observations. So I tried scaling the quadratic components by 1/n, where n is the number of observations. In this case, the bounds are no longer suboptimal solutions to the QCP when rescale = FALSE. But surprisingly, the bounds are suboptimal when rescale = TRUE. The difference between the unscaled and rescaled solutions is much smaller, though.
## Solutions when quadratic component is scaled by 1/n
> r1
Bounds on the target parameter: [-0.2647924, 0.03654437]
Audit terminated successfully after 1 round
> r2
Bounds on the target parameter: [-0.2677133, 0.04401816]
Audit terminated successfully after 1 round
2. Difference in solutions between unscaled and rescaled problems depends on number of variables and constraints
Although rescaling the components of the QCP should not matter in theory, it does in practice, as demonstrated above. The extent of the difference also seems to vary with the number of linear constraints. Below are plots (and code) from a simple simulation where I solve random QCPs, varying the number of linear constraints. Each point is generated by 1000 QCPs. The difference between the optimal objective values grows with the number of constraints.
If this is a computational issue, I'm not sure if there is much we can do. But I thought I should share this in case it indicates I've done something wrong.
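For reference, here is a minimal sketch of the kind of experiment I ran (this is not the actual simulation code; the dimensions, the random QCP, and the 1/100 scaling factor are purely illustrative):

```r
## Sketch: solve the same random QCP with and without scaling the quadratic
## constraint, and compare the optimal objective values.
library(gurobi)

solve_qcp <- function(Q, cvec, A, b, scale = 1) {
    model <- list(
        obj        = cvec,                         # minimize c'x
        modelsense = "min",
        A          = A,                            # linear constraints A x <= b
        rhs        = b,
        sense      = rep("<", length(b)),          # "<" is treated as <= by Gurobi
        lb         = rep(-Inf, length(cvec)),      # allow free variables
        quadcon    = list(list(Qc  = scale * Q,    # x'Qx <= 1, scaled on both sides
                               rhs = scale * 1))
    )
    gurobi(model, list(OutputFlag = 0))$objval
}

set.seed(1)
k <- 5                              # number of variables
m <- 50                             # number of linear constraints
M <- matrix(rnorm(k * k), k, k)
Q <- crossprod(M) + diag(k)         # positive definite quadratic form
cvec <- rnorm(k)
A <- matrix(rnorm(m * k), m, k)
b <- runif(m, 1, 2)

c(unscaled = solve_qcp(Q, cvec, A, b, scale = 1),
  rescaled = solve_qcp(Q, cvec, A, b, scale = 1 / 100))
```

In theory the two calls solve the same problem; the simulation varies m and records how far apart the two optimal values end up.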
I forgot to mention that the code is on a separate branch, enhance/stability.
Did you check whether scaling ever has an impact in point identified cases?
The matrices and vectors defining the quadratic components of the QP and QCP can be very large even after implementing the rescaling procedure since they sum over the observations. This is the case with the AE data set, which has 200,000+ observations. So I tried scaling the quadratic components by 1/n, where n is the number of observations.
This is going to change the problem though, isn't it? If you only scale the quadratic components by 1/n but not anything else, then you've changed the relative weight in the objective.
- Difference in solutions between unscaled and rescaled problems depends on number of variables and constraints
Although rescaling the components of the QCP should not matter in theory, it does in practice, as demonstrated above. The extent of the difference also seems to vary with the number of linear constraints. Below are plots (and code) from a simple simulation where I solve random QCPs, varying the number of linear constraints. Each point is generated by 1000 QCPs. The difference between the optimal objective values grows with the number of constraints.
But these are differences on the order of 1e-6, if I am reading the y-axis correctly. I don't think we can expect more precision than that from Gurobi on a program with a quadratic constraint or objective.
Did you check whether scaling ever has an impact in point identified cases?
So far, in the examples I've played with, scaling does not significantly impact point estimates.
For instance, in the following example, the point estimates differ by 1.5e-4, but that is about 0.05% of each estimate.
> dtcf <- ivmte:::gendistCovariates()$data.full
> args <- list(data = dtcf,
+ target = 'ate',
+ outcome = 'ey',
+ m0 = ~ x1 + x2 + u + I(u^2),
+ m1 = ~ x1 + x2 + u + I(u^2),
+ propensity = ~p,
+ treat = 'd')
> ## Do not rescale
> args$rescale <- FALSE
> do.call(ivmte, args)
Point estimate of the target parameter: 0.315994
Warning message:
MTR is point identified via linear regression. Shape constraints are ignored.
> ## Now rescale
> args$rescale <- TRUE
> do.call(ivmte, args)
Point estimate of the target parameter: 0.3158347
Warning message:
MTR is point identified via linear regression. Shape constraints are ignored.
>
In regard to how scaling is performed in the point identified case:
When m0 and m1 have intercept terms, the scaling procedure in the writeup results in a collinear design matrix. That is, the intercept terms of m0 and m1 will be collinear with the new intercept $\xi_0$. So checking for point identification is done by checking the rank of the design matrix before rescaling everything.
Estimation is then performed by solving a QP, with the only linear constraint being equation (10) in the writeup, i.e. that $\xi_0$ is a linear combination of the other $\xi_k$.
If you only scale the quadratic components by 1/n but not anything else, then you've changed the relative weight in the objective.
Ah sorry, I worded that very poorly. What I meant to say was that I scaled the entire quadratic objective when minimizing the criterion, and I scaled the entire quadratic constraint when obtaining the bounds. i.e. I scaled the sum of squared residuals by 1/n. Let me know if I'm wrong to have done this.
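To spell out why I thought this was innocuous in theory (generic notation here, not the writeup's): rescaling the criterion by a positive constant does not change its minimizers, and scaling both sides of the quadratic constraint does not change its feasible set:

$$
\arg\min_{\theta} \tfrac{1}{n}\sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2
= \arg\min_{\theta} \sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2,
\qquad
\tfrac{1}{n}\sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2 \le \tfrac{\bar{Q}}{n}
\;\Longleftrightarrow\;
\sum_{i=1}^{n}\big(y_i - B_i'\theta\big)^2 \le \bar{Q}.
$$

So only the numerical conditioning of the problem changes, not its solution set.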
Difference in solutions between unscaled and rescaled problems depends on number of variables and constraints
Usually the differences are indeed small, but they are noticeable sometimes. Here is one example where the lower bounds for the unscaled and scaled case differ by over 3%. (But maybe 3% isn't a big deal?) Gurobi does not spit out the suboptimality warning in this example.
> args <- list(data = dtcf,
+ target = 'ate',
+ outcome = 'ey',
+ m0 = ~ u + I(u^2),
+ m1 = ~ u + I(u^2),
+ propensity = ~p,
+ treat = 'd',
+ point = FALSE) ## Force partial identification
> ## Do not rescale
> args$rescale <- FALSE
> do.call(ivmte, args)
Bounds on the target parameter: [-0.04050069, -0.0325813]
Audit terminated successfully after 1 round
> ## Now rescale
> args$rescale <- TRUE
> do.call(ivmte, args)
Bounds on the target parameter: [-0.03910859, -0.03322879]
Audit terminated successfully after 1 round
When the suboptimality warning is thrown, the differences can indeed be quite large. In the following example, I was able to get the suboptimal warning even when minimizing the criterion. The unscaled estimate has a minimum criterion of 9, whereas the scaled one has a minimum criterion of 37. As a result, the bounds differ quite a bit.
> dtcf$x0 <- 1
> args <- list(data = dtcf,
+ target = 'ate',
+ outcome = 'ey',
+ m0 = ~ x0 + x1 + x2 + u + I(u^2),
+ m1 = ~ x0 + x1 + x2 + u + I(u^2),
+ propensity = ~p,
+ treat = 'd')
> ## Do not rescale
> args$rescale <- FALSE
> do.call(ivmte, args)
Warning: The LP solver was unable to satisfy the optimality tolerance when minimizing the criterion,
so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through
the argument 'solver.options'.
Warning: The LP solver was unable to satisfy the optimality tolerance for the minimization problem,
so a suboptimal solution is returned. The LP solver was unable to satisfy the optimality tolerance for
the maximization problem, so a suboptimal solution is returned. Tolerance parameters for the LP
solver can be passed through the argument 'solver.options'.
Bounds on the target parameter: [0.1175554, 0.3361068]
Audit terminated successfully after 1 round
> ## Now rescale
> args$rescale <- TRUE
> do.call(ivmte, args)
...
Warning: The LP solver was unable to satisfy the optimality tolerance when minimizing the criterion,
so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through
the argument 'solver.options'.
Warning: The LP solver was unable to satisfy the optimality tolerance for the minimization problem,
so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through
the argument 'solver.options'.
Bounds on the target parameter: [-0.08836716, 0.4813966]
Audit terminated successfully after 2 rounds
Let's start with the simplest problem first:
In regard to how scaling is performed in the point identified case: When m0 and m1 have intercept terms, the scaling procedure in the writeup results in a collinear design matrix. That is, the intercept terms of m0 and m1 will be collinear with the new intercept $\xi_0$. So checking for point identification is done by checking the rank of the design matrix before rescaling everything.
If the rescaled design matrix is collinear, but the unrescaled design matrix is not, then the rescaling is changing the optimization problem. So this is an issue.
I see now why it's happening. A constant in m0 becomes a B regressor like (1-D) -- recentered/scaled it is still (1-D). A constant in m1 becomes a B regressor like D -- recentered/scaled it is still D. So when we add a constant to the rescaled regression we have (1-D) + D = 1 ---> perfectly collinear.
I think this is a function of poor indexing on my part. If we just had any old regression, obviously we couldn't normalize the constant term because its upper bound is the same as its lower bound. In our regression, we actually have two constant terms potentially -- it's just that they don't look constant when we pool the regression across D = 0 and D = 1 arms.
So we shouldn't be normalizing either of those terms.
Here's a revised document that fixes this. stability.pdf Try that and let me know if it fixes the point identified case. If so, then we can re-examine the partially identified case. (Whatever was going on there may have been related to this issue as well.)
Thanks for that writeup; it has been implemented (but not yet pushed---let me know if you would like it, and I can clean up the code).
If both MTRs include intercepts, then a simple linear regression is performed whether or not everything is rescaled. That is, constraint (6) from the new writeup imposes no real restrictions. However, if either one of the MTRs is missing the intercept, then a QP problem is solved, and constraint (6) will need to be imposed.
Something annoying I found was that R will sometimes add tiny noise to the data when it is being manipulated. For instance, I passed an MTR with two constants, e.g.
dtcf$x0 <- 1 ## Add a constant variable to the data
args$m0 <- ~ 1 + x0 + x1 + u ## Pass MTR with two intercepts
If the data is not being rescaled/manipulated, then R will detect there is a problem with collinearity when performing the regression.
However, when the data is being rescaled/manipulated, then those intercept terms will actually be 1 + noise, and the noise gets rescaled. R then fails to detect collinearity when performing the regressions, and the coefficient estimates become quite ugly. To get around this, I still perform the regression using the unscaled data to test for collinearity, even if rescale = TRUE.
Let me know if you'd like me to look further into the new collinearity issue, or if my description at the top does not align with what you had in mind or the writeup.
If both MTRs include intercepts, then a simple linear regression is performed whether or not everything is rescaled. That is, constraint (6) from the new writeup imposes no real restrictions. However, if either one of the MTRs is missing the intercept, then a QP problem is solved, and constraint (6) will need to be imposed.
I don't understand. There should be no reason to be solving a QP here if we are point identified. Equation (6) is not a restriction to be imposed --- it just tells you how to interpret the intercept after you've run the rescaled regression
If the data is not being rescaled/manipulated, then R will detect there is a problem with collinearity when performing the regression. However, when the data is being rescaled/manipulated, then those intercept terms will actually be 1 + noise, and the noise gets rescaled. R then fails to detect collinearity when performing the regressions, and the coefficient estimates become quite ugly. To get around this, I still perform the regression using the unscaled data to test for collinearity, even if rescale = TRUE.
That should be fine.
To reiterate: the main purpose of rescaling for the point identified case is as a check to make sure everything is working before we go to the partially identified case, where we have less intuition.
So we want to see that rescaling makes no difference across many point identified examples.
Maybe take 500 random subsets of the AE data and check for each one whether you get the same thing, across say 3 or 4 specifications?
Solving with both the QP and lm is also a good check, but again in practice we do not want to be solving a QP for the point identified case.
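To be concrete, something like the following sketch is what I have in mind (illustrative only, assuming the AE data is loaded as in the earlier examples; the specification, subsample size, and the name of the point-estimate element in the return object are assumptions, not the actual test script):

```r
## Sketch: compare rescaled vs. unrescaled point estimates on random 1000-row
## subsets of the AE data for one specification.
library(ivmte)

set.seed(1)
diffs <- replicate(500, {
    sub <- AE[sample(nrow(AE), 1000), ]
    args <- list(data = sub,
                 target = "att",
                 outcome = "worked",
                 m0 = ~ u + yob,
                 m1 = ~ u + yob,
                 propensity = morekids ~ samesex)
    args$rescale <- FALSE
    est0 <- do.call(ivmte, args)
    args$rescale <- TRUE
    est1 <- do.call(ivmte, args)
    est1$point.estimate - est0$point.estimate  # element name is an assumption
})
summary(abs(diffs))
```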
Equation (6) is not a restriction to be imposed --- it just tells you how to interpret the intercept after you've run the rescaled regression
Hm, but without equation (6), the computer cannot determine how many free parameters there really are, right?
For example, suppose both m0 and m1 are of the form ~ 1 + x, so that there are four unknown coefficients in total. We can estimate these four coefficients using a regression. Whether or not we rescale, the design matrix will have four columns. From this, the computer will correctly determine that this is a problem with four unknowns.
Now suppose I drop the intercepts so that m0 and m1 are ~ 0 + x. Without rescaling, the design matrix has two columns. When performing a simple regression, the computer recognizes this is a problem with two unknowns. But when rescaling, the design matrix will have four columns, because of the new intercept terms. Without imposing equation (6), the computer will treat the problem as one with four (unrestricted) unknowns---but really, it's a problem with two unknowns, since the intercept is fully determined by the minimums of x and the coefficients on x for D = 0 and D = 1.
Using MTRs without intercepts, I tried comparing the SSR and coefficient estimates from an unscaled regression against those of QP problems with and without equation (6) as a constraint. The SSR and coefficient estimates from the unscaled regression only match those of the QP with equation (6) as a constraint.
Let me know if I'm still getting it wrong, though.
Maybe take 500 random subsets of the AE data and check for each one whether you get the same thing, across say 3 or 4 specifications?
Sure, I'll get that started.
Now suppose I drop the intercepts so that m0 and m1 are ~ 0 + x. Without rescaling, the design matrix has two columns. When performing a simple regression, the computer recognizes this is a problem with two unknowns. But when rescaling, the design matrix will have four columns, because of the new intercept terms. Without imposing equation (6), the computer will treat the problem as one with four (unrestricted) unknowns---but really, it's a problem with two unknowns, since the intercept is fully determined by the minimums of x and the coefficients on x for D = 0 and D = 1.
Got it, thanks for the clarification -- I understand what you are saying now.
So to summarize:
There's still no reason to use a QP though, because CLS has a closed-form solution. Bruce Hansen's textbook Section 8.2 has a clear explanation of this. I wonder if there is a stable package (or maybe even a command in base R) that implements CLS? I am a bit wary of trying to code up the closed-form solution ourselves because it involves inverting the design matrix. As we know, that's not the most stable way to compute a linear regression. I expect someone has devised a better computational approach (and if so, it would probably be contained in whatever R package implements this).
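For concreteness, the closed-form solution I have in mind is the standard one for least squares subject to linear equality constraints (writing the constraint as $R'\beta = c$):

$$
\tilde{\beta}_{\mathrm{cls}}
= \hat{\beta}_{\mathrm{ols}}
- (X'X)^{-1} R \left[ R'(X'X)^{-1} R \right]^{-1} \left( R'\hat{\beta}_{\mathrm{ols}} - c \right),
$$

which is exactly where the $(X'X)^{-1}$ term, and hence the conditioning concern, comes from.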
Alternatively, what if we just drop the centering and only rescale? So Btilde = B/(ub - lb). Then we don't have this issue with the constants. Will that give us the right scaling? (I think Btilde has to be between -1 and 1? But might have gotten my inequalities confused)
There's still no reason to use a QP though, because CLS has a closed--form solution. Bruce Hansen's textbook Section 8.2 has a clear explanation of this.
Oh great, thank you for the reference.
I wonder if there is a stable package (or maybe even a command in base R) that implements CLS? I am a bit wary of trying to code up the closed--form solution ourselves because it involves inverting the design matrix.
Sure, I can check.
Alternatively, what if we just drop the centering and only rescale? So Btilde = B/(ub - lb).
Ah, maybe you mean to just scale each covariate x by 1 / max(abs(x))? That will make the rescaled covariates lie between -1 and 1 (but not necessarily span the entire interval).
Maybe take 500 random subsets of the AE data and check for each one whether you get the same thing, across say 3 or 4 specifications?
So I tried these three specifications for both m0 and m1:
- ~ u
- ~ u + yob
- ~ u + u:yob + yob

I would also modify them so that:
- m0 and m1 have intercepts
- m0 has an intercept
- m1 has an intercept
- neither m0 nor m1 have intercepts

So there were 12 different pairs of MTRs considered.
I estimated each specification on 500 subsamples of the AE data, each with 1000 observations. For specifications of type ~ u and ~ u + yob, everything checks out. That is, differences in point estimates, coefficient estimates, and SSRs are of the magnitude 1e-7 or smaller.
For specifications of type ~ u + u:yob + yob, there are a handful of cases where the point estimates and MTR coefficient estimates differ by much larger amounts (up to about 10% of the estimate). Differences in the SSRs remain small, though. It turns out these cases all occur when the AE sample is such that there is little variation in the propensity scores, which results in almost-collinear design matrices.
Ah, maybe you mean to just scale each covariate x by 1 / max(abs(x))? That will make the rescaled covariates lie between -1 and 1 (but not necessarily span the entire interval).
I had actually meant 1/ (ub(x) - lb(x)). But 1/max(abs(x)) seems like it would work too. Can you work out some cases to see any pros/cons between them?
I estimated each specification on 500 subsamples of the AE data, each with 1000 observations. For specifications of type ~ u and ~ u + yob, everything checks out. That is, differences in point estimates, coefficient estimates, and SSRs are of the magnitude 1e-7 or smaller.
For specifications of type ~ u + u:yob + yob, there are a handful of cases where the point estimates and MTR coefficient estimates differ by much larger amounts (up to about 10% of the estimate). Differences in the SSRs remain small, though. It turns out these cases all occur when the AE sample is such that there is little variation in the propensity scores, which results in almost-collinear design matrices.
The comparison you are doing here is between scaled and unscaled? Or between QP and lm?
What happens if you kick up the sample size to say 5000? I imagine the AE instrument is quite weak with only 1000 observations.
I had actually meant 1/ (ub(x) - lb(x)). But 1/max(abs(x)) seems like it would work too. Can you work out some cases to see any pros/cons between them?
Sure thing.
The comparison you are doing here is between scaled and unscaled? Or between QP and lm?
Oh sorry, I wasn't clear on that. So if there are intercepts in both the MTRs, the comparisons are between scaled and unscaled (so R is just running a linear regression). All these comparisons check out.
The discrepancies only arise for ~ u + u:yob + yob whenever an intercept is omitted, i.e. when comparing unscaled lm against scaled QP.
What happens if you kick up the sample size to say 5000? I imagine the AE instrument is quite weak with only 1000 observations.
Ah you're absolutely right. With the larger sample size, all of the comparisons check out.
The discrepancies only arise for ~ u + u:yob + yob whenever an intercept is omitted, i.e. when comparing unscaled lm against scaled QP.
So you mean ~ 0 + u + u:yob + yob? Or am I missing something in the formula specification?
Anyway, this is probably because solving a QP directly involves inverting matrices (I think...unclear exactly what is going on in Gurobi) vs. lm, which is going to solve the system of equations using a QR decomposition. That's the reason we want to use lm whenever possible. With the new normalizing schemes (scaling but not centering) hopefully that won't be a problem.
In the worst case, we can normalize by centering/scaling if there is a constant and just by scaling if there is not a constant. The only reason I can think for using a no-constant specification is if you want a specific interpretation for a set of dummies.
So you mean ~ 0 + u + u:yob + yob? Or am I missing something in the formula specification?
Argh, sorry I'm not doing a good job communicating. Yes, that's right.
Okay, it looks like the best package for CLS is lsei. It is based on Fortran code, and calls on C, so it's really fast. It also seems very robust to scaling. For the sake of documentation, I've included below a brief description of the other packages I considered, and why I did not go with them. @slacouture also expressed that he was looking for a suitable CLS package.
To better test lsei, I ran a simulation generating random data sets, and performed some CLS regression on them. The constraints include both equality and inequality constraints. The scaling of the data was varied in the simulation, beginning with data whose covariates were of similar magnitude, and ending with data where one of the covariates was extremely large. In this simulation, gurobi was able to solve 47% of the cases. lsei solved them all.
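To give a sense of the setup, here is an illustrative sketch of a single CLS problem of the kind used in the simulation (the data-generating process and the constraints are made up; this is not the actual simulation code):

```r
## Sketch: a CLS problem with one equality and one inequality constraint,
## solved with lsei. The third regressor is deliberately badly scaled.
library(lsei)

set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), 1e6 * rnorm(n))
y <- as.vector(X %*% c(1, 2, 3e-6) + rnorm(n))

eqA   <- matrix(c(0, 1, 0), nrow = 1)   # equality:   beta2  = 2
eqRhs <- 2
A     <- matrix(c(0, 0, 1), nrow = 1)   # inequality: beta3 >= 0
rhs   <- 0

## minimize ||X b - y|| subject to eqA %*% b = eqRhs and A %*% b >= rhs
lsei::lsei(a = X, b = y, c = eqA, d = eqRhs, e = A, f = rhs)
```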
In regard to scaling, we've considered three methods.
1. Center and scale (the original writeup), so that each regressor lies in the [0, 1] interval.
2. Scale by 1 / (max(x) - min(x)). But if a variable with a large mean has a small variance, then its scaled counterpart will still be large relative to the rest of the covariates.
3. Scale by 1 / max(abs(x)). This guarantees that each regressor lies in the [-1, 1] interval.

As with the previous simulation, I generated some random data, and varied the scaling of one of the covariates across the simulations. I then tried all three methods of rescaling the data. The CLS regressions were performed on both gurobi and lsei. The fraction of cases gurobi could handle was 53% for method 1, 40% for method 2, and 50% for method 3. lsei could handle them all, though.
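In code, the three scalings look like this for a single regressor x (illustrative; method 1 assumes centering at the minimum, as in the writeup):

```r
x <- c(5, 7, 12, 30)                         # any numeric regressor
scale1 <- (x - min(x)) / (max(x) - min(x))   # method 1: center and scale -> [0, 1]
scale2 <- x / (max(x) - min(x))              # method 2: scale by the range only
scale3 <- x / max(abs(x))                    # method 3: scale by max |x| -> [-1, 1]
```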
For documentation, this is the code for the simulations described above. gurobi-vs-lsei.zip
For documentation, here are the packages I considered.
- pcls, but that is a bit annoying to use since the user must find a feasible solution first.
- qprog, but this was unstable and relies on the solve command.
- lsqlin, but that only allows for equality constraints.
- colf_nls, but that only allows for box constraints.
- mlsei, but this is for multivariate regression, and the linear constraints pertain to the coefficients for different outcomes.
- cgam, but does not seem to allow for simple linear constraints on the coefficients.

In regard to scaling, we've considered three methods.
- Center and scale (the original writeup).
Sorry, that should've been "the second writeup."
Well that's a bit disappointing, but the Lawson and Hanson book that is referenced in lsei seems like exactly the book I was looking for and presumed existed but didn't know the name of.
So that's a good lead -- I'll read up on it and see if some of their insights can be applied to our problem.
How was the speed comparison of Gurobi vs. lsei?
Also, when you say Gurobi couldn't handle a case, what does that mean exactly?
Do you have the distribution of the return codes for cases where it failed?
If we were only solving the QP in the first step, we could just use lsei. Unfortunately we have the second step QCQP to also contend with.
Any package that solves a second-order cone program (SOCP) or semi-definite program (SDP) would also work.
I wonder if some of these might be better than Gurobi because they are more tailored?
I found two that solve SOCPs: scs and CLSOCP.
There are likely others; the R optimization task view looks like a great reference: https://cran.r-project.org/view=Optimization
I also found this code for an algorithm the author claims is "safe" https://rstudio-pubs-static.s3.amazonaws.com/239135_7a61d7e587314277b79f47673cd74920.html Just recording it here for reference.
Do you think you could try scs and CLSOCP and compare them to Gurobi for the second-step problem?
How was the speed comparison of Gurobi vs. lsei?
lsei is really fast... It takes a quarter of the time to run compared to Gurobi. This made me suspicious. So I decided to compare more closely the estimates for Gurobi and lsei. I considered only the cases where the Gurobi status was OPTIMAL. Here's what I found.
When the scaling of the data is okay, then the estimates are almost identical (e.g. differ by 1e-9 percent on average). When the scaling gets bad, the estimates can differ by up to 5% on average. However, there are extreme cases where one of the coefficients estimated by Gurobi is 5 times that of the coefficient estimated by lsei. Again, I'm only doing this comparison whenever Gurobi returns an OPTIMAL status. So I'm wondering if some of the estimates by lsei are suboptimal. However, since lsei does not return status codes, it's hard to know unless I really dive into the code.
The overall average difference (pooling across reasonably scaled and poorly scaled data) in the coefficient estimates is still small, though, and ranges between 0.1% and 1.3%. So perhaps the differences between Gurobi and lsei are not a big deal.
Also, when you say Gurobi couldn't handle a case, what does that mean exactly? Do you have the distribution of the return codes for cases where it failed?
Ah, sorry.
I considered Gurobi unsuccessful in its optimization whenever it returned a status code other than OPTIMAL.
Below is one of the distributions of the status codes. Recall that I generated many data sets, and changed the scale of one of the variables, e.g. I multiplied the variable by a constant between a and b. The left panel shows the status codes across the whole simulation. The middle panel shows the distribution of status codes when I restrict attention to the bottom half of the scaled data, e.g. I've scaled one of its variables by a constant between a and (a + b) / 2. And the right panel shows the distribution of status codes when I restrict attention to the top half of the scaled data, e.g. I've scaled one of its variables by a constant between (a + b) / 2 and b.
I wonder if some of these might be better than Gurobi because they are more tailored? ... Do you think you could try scs and CLSOCP and compare them to Gurobi for the second-step problem?
Yep, I can do that.
Also, the CPLEX website and the cplexAPI manual seem to both suggest that it is possible for CPLEX to solve QCQPs in R. So I'll try to set that up as well.
So I decided to compare more closely the estimates for Gurobi and lsei. I considered only the cases where the Gurobi status was OPTIMAL.
But this still leaves 47% of cases where Gurobi was not OPTIMAL, doesn't it?
So I would still say lsei is dominating Gurobi. Or am I misunderstanding your results?
So I'm wondering if some of the estimates by lsei are suboptimal. However, since lsei does not return status codes, it's hard to know unless I really dive into the code.
So when you say that lsei solved all of the problems what do you mean exactly? Just that it didn't return an error code? (But it never returns error codes?)
An easy way to check whether it is suboptimal is just to compare the optimal criterion value to what Gurobi finds. Presumably it should always be smaller (in a minimization problem) if it is finding the optimal solution.
I considered Gurobi unsuccessful in its optimization whenever it returned a status code other than OPTIMAL.
I think that's the right way to evaluate performance. We don't want to have to be second guessing this all the time (and certainly don't want to user to be faced with uncertain return codes).
Also, the CPLEX website and the cplexAPI manual seem to both suggest that it is possible for CPLEX to solve QCQPs in R. So I'll try to set that up as well.
I have a vague recollection from a different project that this is quite hard to do with the CPLEX API and may not be possible. But I'm not sure. Anyway, Gurobi is essentially a fork of CPLEX (albeit several years removed now), so it's not clear to me that it's going to be much better.
Probably better to prioritize scs since this is more closely tailored to the types of problems we are looking at. My guess is that Gurobi is just trying to "do too much" and not exploiting the structure that we have in our problem.
So I would still say lsei is dominating Gurobi. Or am I misunderstanding your results?
Nope, that's right.
So when you say that lsei solved all of the problems what do you mean exactly? Just that it didn't return an error code? (But it never returns error codes?)
Yeah, lsei will either return coefficient estimates, or throw out the following error regarding least distance problems:
> lsei::lsei(a = X, b = y, c = eqA, d = eqRhs, e = A, f = rhs)
Error in ldp2(et, ft) : Incompatible inequalities in ldp()
The optimal criterion values calculated by Gurobi and lsei match whenever Gurobi is optimal. But my concern had to do with cases where we cannot compare lsei against Gurobi, since Gurobi couldn't provide an optimal solution. But my concerns are mostly allayed by the plots below.
The first plot compares the SSR from the constrained regressions by gurobi and lsei. On the x-axis is how much I scaled one of the variables by. Blue dots represent gurobi, red dots represent lsei. Circular dots indicate optimal solutions. I considered lsei solutions to be optimal so long as all the constraints were satisfied. The crosses indicate suboptimal solutions (for Gurobi, this coincided with cases where constraints were violated). The plot seems to suggest the solutions obtained by lsei are reliable, even when I cannot compare them to Gurobi.
I decided to really push the scaling to see how lsei performs when near the point of breaking. I found that you can really push the scaling before the SSR starts to behave strangely. (The scaling can go much further than what I show in the plot below before lsei returns an error.) Even then, the coefficient estimates were still very reasonable. So I think lsei is indeed a good package to use.
Probably better to prioritize scs since this is more closely tailored to the types of problems we are looking at.
Sure thing.
Ok, sounds good, just a question on this:
Yeah, lsei will either return coefficient estimates, or throw out the following error regarding least distance problems:
lsei::lsei(a = X, b = y, c = eqA, d = eqRhs, e = A, f = rhs) Error in ldp2(et, ft) : Incompatible inequalities in ldp()
Can you explain that? Is that an error related to numerical problems? (Because it sounds like something else...)
The R function runs the Fortran function ldp, and returns that message whenever Fortran indicates it is not successful in finding a solution.
r = .Fortran("ldp", e, m, m, n, f, x = x, xnorm, w, index,
mode = mode, PACKAGE = "lsei")[c("x", "mode")]
if (r$mode != 1)
stop("Incompatible constraints in ldp()")
Below is the Fortran code, which is available here. Fortran actually returns more detailed error codes, but unfortunately we won't know which.
C SUCCESSFUL RETURN.
110 MODE=1
RETURN
C ERROR RETURN. N .LE. 0.
120 MODE=2
RETURN
C RETURNING WITH CONSTRAINTS NOT COMPATIBLE.
130 MODE=4
RETURN
I don't quite understand Fortran code, but the Fortran function LDP() calls on a non-negative least squares algorithm NNLS(), and the output from that determines the error code returned by LDP().
I see, so just to be clear: you think that lsei is returning an uninformative error code when it fails. (Even though with additional programming the author could have perhaps pulled a more detailed error code from the underlying Fortran routine.)
Yep, that's right.
Sorry this is taking longer than perhaps expected; I have to read up on SOCPs and SDPs.
In the case of SOCPs, the free solvers require the user to input non-rotated quadratic cones. However, rewriting a QCQP problem as an SOCP problem requires a rotated quadratic cone. I've attached a writeup explaining why this is a problem if we want to avoid quadratic constraints. Perhaps I am doing something wrong, though.
qcqp-to-socp.pdf [UPDATED: Includes explanation of why I am writing cones the way that I am.]
However, Mosek lets you input rotated cones. Like Gurobi and CPLEX, it offers free academic licenses.
I'm still looking into reformulating QCQPs into SDPs.
I realized I wasn't being clear in the PDF on why I was writing the conic constraint the way that I was. I've updated the PDF to include an explanation.
I'd be really surprised if it's not possible. It's probably just some OR tricks that we're not seeing.
I took a stab at it (see the last page). qcqp-to-socp.pdf I think it looks right, but I've never worked with this class of problems. What do you think?
I got a hold of the Lawson and Hanson book and read through the algorithm used for lsei.
(By the way, ldp is for a more general class of problems --- lsei translates the problem into one of the form ldp. That's why the errors talk about ldp. Then ldp in turn depends on an algorithm for solving non-negative least squares -- nnls.)
It's pretty impressive stuff. It is simple and very tailored to the structure of the problem. After reading it I am not at all surprised that it dominates Gurobi.
So hopefully the SOCP solvers will have similar advantages.
In reading the book I also found an alternative way to scale the design matrix that should improve scaling and in particular is designed to improve the condition number. (Different than what we had come up with.) So we could try that too with Gurobi as a last-ditch effort.
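I haven't checked yet whether it matches the book's suggestion, but a standard scaling of this kind is column equilibration: divide each column of the design matrix by its Euclidean norm before solving, then undo the scaling on the coefficients. A sketch:

```r
## Sketch of column equilibration (a standard conditioning fix; not necessarily
## the Lawson--Hanson scheme).
equilibrate <- function(X) {
    s <- apply(X, 2, function(col) sqrt(sum(col^2)))  # column norms
    list(Xs = sweep(X, 2, s, "/"), s = s)
}

set.seed(1)
X <- cbind(1, rnorm(100), 1e6 * rnorm(100))
y <- rnorm(100)

eq <- equilibrate(X)
coef_scaled <- qr.solve(eq$Xs, y)   # least squares on the well-conditioned matrix
coef_scaled / eq$s                  # map coefficients back to the original scale
```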
It's probably just some OR tricks that we're not seeing.
Correction: OR tricks that I wasn't seeing (the simplest of them, too!).
Thanks for those notes; I've now been able to get CLSOCP to work. (Still trying to get scs to work.)
So hopefully the SOCP solvers will have similar advantages.
Unfortunately, the performance of CLSOCP does not look great.
One problem seems related to how each observation in the data generates two new variables in the SOCP setup. One variable is trivial, and is set equal to the dependent variable. But the other variable is not, and is equal to the residual. So while CLSOCP can sometimes match Gurobi for small data sets, e.g. n = 3, the solution becomes quite different by the time n = 10. It also takes longer to run when n gets large.
Another problem is that CLSOCP doesn't seem to do a good job optimizing. In the case of minimization problems, the optimal objective values from CLSOCP are often much larger than those of Gurobi.
As a note: CLSOCP uses the algorithm linked a few posts above, i.e. that algorithm is not another alternative.
One problem seems related to how each observation in the data generates two new variables in the SOCP setup.
Maybe I'm wrong to think that, though, since the link on the algorithm used by CLSOCP suggests it's easier to convert a QCQP to a SOCP and solve that, rather than solve the QCQP directly.
Hopefully scs will be better.
It's certainly problematic that this formulation increases in difficulty with n.
Does that seem like it is inherent to any SOCP formulation?
It might also be useful to compare to a general purpose solver that only uses derivative information. e.g. this one --- although there are many, so maybe that's not the best.
This would be a useful benchmark at least to compare Gurobi against.
It's certainly problematic that this formulation increases in difficulty with n. Does that seem like it is inherent to any SOCP formulation?
Ah I finally figured out how to get scs to work, and it is much, much better than CLSOCP. However, it still suffers from this issue of becoming more difficult to solve as n grows. It can keep up with Gurobi in terms of speed and optimization up to about 300 observations before it starts to regularly return suboptimal solutions and slow down.
The SOCP formulation I proposed doesn't require us to introduce a new non-trivial variable for every observation. However, it involves those rotated cones, and it looks like we need Mosek to implement that formulation. But maybe there is another way to formulate the SOCP problem to avoid this---I'll give it another shot.
It might also be useful to compare to a general purpose solver that only uses derivative information. e.g. this one --- although there are many, so maybe that's not the best.
Sure, I'll look into that as well.
In the case of SOCPs, the free solvers require the user to input non-rotated quadratic cones. However, rewriting a QCQP problem as an SOCP problem requires a rotated quadratic cone. I've attached a writeup explaining why this is a problem if we want to avoid quadratic constraints. Perhaps I am doing something wrong, though.
Going back to the original problem, which was the idea that these solvers only take non-rotated quadratic cones. This Mosek page shows how to convert a rotated quadratic cone into a standard quadratic cone.
Your original proposed reformulation did not have the # of variables increasing in n, did it?
It seems based on the Mosek page (which has a wealth of information) that it should be able to phrase our problem with a non-rotated quadratic cone constraint by combining section 3.3.1 and section 3.1.
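For reference, the standard identity behind that conversion is

$$
2 x_1 x_2 \ge \lVert z \rVert^2,\; x_1 \ge 0,\; x_2 \ge 0
\quad\Longleftrightarrow\quad
\left\lVert \begin{pmatrix} x_1 - x_2 \\ \sqrt{2}\, z \end{pmatrix} \right\rVert_2 \le x_1 + x_2,
$$

so a rotated cone constraint can be passed to a solver that only accepts standard quadratic cones by introducing the linear combinations $x_1 + x_2$ and $x_1 - x_2$.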
Your original proposed reformulation did not have the # of variables increasing in n, did it? It seems based on the Mosek page (which has a wealth of information) that it should be able to phrase our problem with a non-rotated quadratic cone constraint by combining section 3.3.1 and section 3.1.
Oh, right, it was not increasing in n. Yes, I'll give this a go as well.
Good news: manually undoing the rotation in the SOCP works.
So we have a way to estimate QCQPs as SOCPs using scs without adding new variables for every observation. Just to be clear, I have not incorporated this into ivmte yet.
I noticed that rescaling the problem really improves the performance of scs. For example, the quadratic constraint can involve fairly large matrices and vectors if the sample size is large. If I don't rescale the quadratic constraint, scs and gurobi are about equally fast and typically get similar solutions. However, there are some instances where scs is way off. But by dividing the entire quadratic constraint by the sample size, scs always matches gurobi, and takes a quarter of the time to run.
However, because of the way scs requires the user to submit the problem, we must introduce a new slack variable for every inequality constraint. This can potentially be problematic in the audit, but we can test this if we do decide to incorporate scs into ivmte.
For the sake of documentation, I've updated the notes above to record how a QCQP problem can be reformulated to be compatible with scs.
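For context, the slack variables come from the conic standard form that scs uses,

$$
\min_{x}\; c^{\top} x
\quad\text{subject to}\quad
A x + s = b,\; s \in \mathcal{K},
$$

so every linear inequality $a_j^{\top} x \le b_j$ has to be entered as an equality with a slack $s_j \ge 0$ sitting in the nonnegative part of the cone $\mathcal{K}$.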
Great to hear!
Rescaling here just means dividing by sample size? Were you doing that with Gurobi, and if so was it making a difference?
In this case, rescaling does simply mean dividing by the sample size.
Also, it is only done for the quadratic constraint.
I will do another test with poorly scaled data, and then see how scs responds when I rescale everything according to the writeup you posted earlier.
And I was indeed dividing the quadratic constraint by the sample size with Gurobi. Doing this did resolve some instances where we were getting suboptimal results. However, it didn't always lead to optimal results, and I believe it sometimes even brought about the suboptimal warnings.
Oh sorry, I realized I misunderstood your question.
Were you doing that with Gurobi, and if so was it making a difference?
I believe you are simply asking about this recent comparison of scs against Gurobi, and not the package. So yes, I was dividing by the sample size with Gurobi when comparing it against scs. But it did not really make a difference for Gurobi.
And I was indeed dividing the quadratic constraint by the sample size with Gurobi. Doing this did resolve some instances where we were getting suboptimal results. However, it didn't always lead to optimal results, and I believe it sometimes even brought about the suboptimal warnings.
What I was saying above pertains to the ivmte package.
Sorry, I'm very confused now.
If I don't rescale the quadratic constraint, scs and gurobi are about equally fast and typically get similar solutions. However, there are some instances where scs is way off. But by dividing the entire quadratic constraint by the sample size, scs always matches gurobi, and takes a quarter of the time to run.
We are going through all of this effort because Gurobi is not terminating cleanly. So why should matching Gurobi be the benchmark for success?
You are saying here that scs matches Gurobi, but is faster. But speed was not our concern. Our concern was that Gurobi is terminating at suboptimal solutions.
Yes, the last few posts were for tests that Gurobi can actually solve. I did these because the alternative packages do not always warn when their solutions are suboptimal. For example, for problems that Gurobi can solve, CLSOCP would often return different solutions that turned out to be suboptimal. However, unlike Gurobi, CLSOCP didn't warn me that they were suboptimal. So I thought it was worth checking that the packages we are considering are comparable to Gurobi for simpler problems. And if they turn out to be much worse than Gurobi, then it seems unlikely they would do better in problems that even Gurobi struggles with.
lsei likewise doesn't report the optimization status, but its performance with the simpler problems gives some credibility to its solutions for problems that Gurobi struggles with. And I will certainly check how scs performs for problems where Gurobi terminates early!
Ah, sorry for emphasizing the time. A question about speed came up earlier, so I thought I'd include the details on that.
Got it, now it makes sense! The true test will be how scs performs on the problems that Gurobi fails on.
I wrote up a procedure for doing so in the attached note. @jkcshea let me know if it makes sense and looks feasible.
Here's a demonstration of why I think it will work:
[code and output omitted]
My guess is that this will solve #196 as well as countless future problems we have yet to encounter!