Improve scaling of optimization problems

a-torgovitsky commented 3 years ago

I wrote up a procedure for doing so in the attached note. @jkcshea let me know if it makes sense and looks feasible.

Here's a demonstration of why I think it will work:

library("ivmte")

df <- AE
args <- list(data = df,
             target = "att",
             outcome = "worked",
             m0 = ~ u + I(u^2) + yob + u*yob,
             m1 = ~ u + I(u^2) + I(u^3) + yob + u*yob,
             propensity = morekids ~ samesex,
             noisy = FALSE)
results <- do.call(ivmte, args)
B <- results$X

library("gurobi")

model <- list()
model$Q <- (t(B) %*% B) # the quadratic portion of our criterion

 objective
model$A <- matrix(rep(0, dim(B)[2]), nrow = 1)
model$rhs <- 0

# Relatively poorly scaled -- large magnitude differences
r <- gurobi(model)

# Scale each column and try again
Bsc <- apply(B, MARGIN = 2, FUN = function (x) (x - min(x))/diff(range(x)))
model$Q <- (t(Bsc) %*% Bsc)
# Looking much better
r <- gurobi(model)

output:

Warning:  The LP solver was unable to satisfy the optimality tolerance for the maximization problem, so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through the argument 'solver.options'. 
Warning:  The LP solver was unable to satisfy the optimality tolerance for the minimization problem, so a suboptimal solution is returned. The LP solver was unable to satisfy the optimality tolerance for the maximization problem, so a suboptimal solution is returned. Tolerance parameters for the LP solver can be passed through the argument 'solver.options'. 
Gurobi Optimizer version 9.0.2 build v9.0.2rc0 (linux64)
Optimize a model with 1 rows, 11 columns and 0 nonzeros
Model fingerprint: 0x2e512046
Model has 36 quadratic objective terms
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  Objective range  [0e+00, 0e+00]
  QObjective range [1e+01, 9e+08]
  Bounds range     [0e+00, 0e+00]
  RHS range        [0e+00, 0e+00]
Warning: Model contains large quadratic objective coefficients
         Consider reformulating model or setting NumericFocus parameter
         to avoid numerical issues.
Presolve removed 1 rows and 11 columns
Presolve time: 0.00s
Presolve: All rows and columns removed

Barrier solved model in 0 iterations and 0.00 seconds
Optimal objective 0.00000000e+00
Gurobi Optimizer version 9.0.2 build v9.0.2rc0 (linux64)
Optimize a model with 1 rows, 11 columns and 0 nonzeros
Model fingerprint: 0x58e202e2
Model has 36 quadratic objective terms
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  Objective range  [0e+00, 0e+00]
  QObjective range [8e+04, 5e+05]
  Bounds range     [0e+00, 0e+00]
  RHS range        [0e+00, 0e+00]
Presolve removed 1 rows and 11 columns
Presolve time: 0.00s
Presolve: All rows and columns removed

Barrier solved model in 0 iterations and 0.00 seconds
Optimal objective 0.00000000e+00

My guess is that this will solve #196 as well as countless future problems we have yet to encounter!

jkcshea commented 3 years ago

Unfortunately, scs did not do very well in the second round of tests. In the plots below, gurobi is in black and scs is in red. Similar to the earlier set of tests, I am feeding each package a sequence of minimization problems, and the scaling is getting worse and worse. For each scaling, I generate 500 random problems, and then average over the 500 problems.

scs was more likely than gurobi to be unable to obtain a solution. Of the cases where it was able to obtain a solution, it was also more likely to be suboptimal.

I also noticed that scs doesn't do a good job satisfying the constraints in difficult problems. The constraints include 2 linear equality constraints, 2 linear inequality constraints, and the quadratic constraint. In the corresponding plot below, I think I'm being pretty generous in that I use a relative tolerance of 1e-1 to determine whether the constraints are satisfied. Even then, all the constraints were satisfied at most 60% of the time. So the fact that scs obtains a lower objective value than gurobi in these minimization problems on average seems meaningless.

I'll try those other packages you suggested to see if they are any better.

Selection_056

a-torgovitsky commented 3 years ago

Argh, that's disappointing.

What is the tuning parameter in these simulations? (The parameter kappa in the (1 + kappa)Q^{\star} constraint.) If its zero, can you try setting it to be something small (say kappa = .1) and see if that has an impact?

a-torgovitsky commented 3 years ago

Also, here's a different scaling strategy to try. I'd be interested to know if this improves any of the algorithms. stability-condition-number.pdf

jkcshea commented 3 years ago

What is the tuning parameter in these simulations? (The parameter kappa in the (1 + kappa)Q^{\star} constraint.) If its zero, can you try setting it to be something small (say kappa = .1) and see if that has an impact?

Increasing the tuning parameter improve the performance of both packages, but Gurobi still outperforms scs. In the simulations below, scs is still violating some of the constraints.

I noticed something while checking the code, though. Depending on kappa, Gurobi may struggle more with the rescaled problem (I did not rescale the data in the plots above). I don't think I'm making a coding error, since the solutions to the unscaled and rescaled problems match when I'm not feeding gurobi and scs poorly scaled data. Nevertheless, notice that rescaling the data does seem to improve the likelihood that Gurobi is able to obtain an optimal solution, so long as kappa is large enough and also the data is scaled poorly enough. Strangely, scs seems rather unresponsive to whether or not the data is rescaled---but I'm quite skeptical of the scs solutions.

Also, here's a different scaling strategy to try. I'd be interested to know if this improves any of the algorithms.

Sure, I'll give that a shot.

a-torgovitsky commented 3 years ago

What are the axes here again? X axis is a measure of how badly the problem is scaled? Is higher worse? Y axis is proportion of times the solver exits cleanly?

And for each rate x kappa you are doing 500 randomly drawn programs?

jkcshea commented 3 years ago

I'm so sorry, I didn't realize I cut off all the axis labels. I've updated the figures above to correct for this.

X axis is a measure of how badly the problem is scaled? Is higher worse?

Yes and yes. In the simulations, there is a covariate x1, and another x2 = x1 ^ 2. The x-axis measures how I scale x1. So as we move rightward along the x-axis, x1 is growing, and the difference in magnitude between x1 and x2 is also growing. This is what I'm doing to create poorly-scaled problems.

Y axis is proportion of times the solver exits cleanly?

For the plots with titles "Optimal solution": Yes. The proportions are conditional on the solver actually returning a solution. So these proportions exclude cases where there was a numerical error, or the program was supposedly infeasible/unbounded.

For the plots titled "Solution is returned", the y-axis is the proportion of times the solver is able to return a solution at all. The solution need not be optimal, though.

And for each rate x kappa you are doing 500 randomly drawn programs?

And yes. [Note: earlier graphs weren't as smooth as I had only used 200 randomly drawn programs. This has been corrected.]

Related to your second question, the fraction of times gurobi is able to return a solution (which may or may not be optimal) may also worsen with rescaling. Again, scs seems less sensitive to us rescaling the problem.

a-torgovitsky commented 3 years ago

Thanks. Can you explain this again?

For the plots titled "Solution is returned", the y-axis is the proportion of times the solver is able to return a solution at all. The solution need not be optimal, though.

The universe of Gurobi return codes we are considering is what?

Solved
Numerical Error
Suboptimal

So is it just "numerical error" that doesn't return a solution?

jkcshea commented 3 years ago

The universe of Gurobi return codes we are considering is what?

Solved

Numerical Error

Suboptimal

So is it just "numerical error" that doesn't return a solution?

In addition to that list, there is also INF_OR_UNBD. These happen less than 1% of the time. As we've discussed, the problem should not be infeasible, so it is probably the case that the problem is unbounded. This is my fault for not imposing enough restrictions on the problem to prevent this.

To clarify: I have not been incorporating all these solvers into ivmte, where the INF_OR_UNBD status should be very unlikely to happen because of all the default shape constraints on the MTRs. I tried incorporating one of the packages earlier, and it became very messy because of how much ivmte does. It has been faster and less error-prone to write code specific for this purpose of testing these solvers. But these simulations still follow the two-step approach of

Estimating a minimum criterion, subject to linear equality and inequality constraints (CLS)
Minimizing a linear objective subject to the same linear equality and inequality constraints, in addition to the new quadratic constraint (QCQP).

Let me know if you prefer me to integrate these packages into ivmte during this testing phase.

a-torgovitsky commented 3 years ago

I think INF_OR_UNBD can happen due to numerical problems (either in the primal or in the dual, which is why its "inf" or "unbounded") since there is a tolerance for detecting infeasibility as well as optimality.

Is it possible to modify this simplified example so that you know its feasible and bounded? Feasibility should be automatic due to the construction of step 1 and step 2. Boundedness is just a matter of putting box constraints on all variables.

jkcshea commented 3 years ago

Is it possible to modify this simplified example so that you know its feasible and bounded? Feasibility should be automatic due to the construction of step 1 and step 2. Boundedness is just a matter of putting box constraints on all variables.

Ah what a simple and obvious solution.

I've added the box constraints, and they indeed eliminate the INF_OR_UNBD statuses from Gurobi. These constraints seem to impact gurobi and scs differently.

For the unscaled problems, gurobi runs into more numerical errors when there are these new box constraints, compared to when there are not (plots immediately above). In contrast, scs runs into fewer such errors.
Rescaling the problem eliminates many of the numerical errors for gurobi. Rescaling seems to have little effect for scs.

gurobi is more likely to return the OPTIMAL status when there are box constraints, compared to when there are not. In contrast, scs is much less likely to return an optimal solution.
Rescaling the problem increases the likelihood of the OPTIMAL status from gurobi. Rescaling seems to have little effect for scs.

a-torgovitsky commented 3 years ago

Still not clear what the difference is between "Optimal" and "Solution returned."

How about this:

1 plot per solver, per kappa, per scaled/unscaled
On each plot, one line for each of the return codes. The sum of the lines is 1 for any given scale point.

That way we don't have this unclear distinction between "Optimal" and "Solution returned"

jkcshea commented 3 years ago

Yes, that layout is much clearer, sorry for the confusion.

Here are the updated plots for gurobi. Each row of plots corresponds to different values of kappa. Each column corresponds to whether I rescaled the data.

And here are the updated plots for scs.

a-torgovitsky commented 3 years ago

Thanks that is much clearer. What type of rescaling are you using here?

I'll be interested to see if the conditioning number rescaling works better. That is something I found suggested in the Lawson and Hanson book.

jkcshea commented 3 years ago

Oh, the plots above are still using the rescaling procedure from before. I'll have the ones using the conditioning number rescaling ready soon.

jkcshea commented 3 years ago

And here are the simulations with the conditioning number rescaling. They are almost identical to the earlier simulations. The only noticeable difference is the small dip that gurobi now exhibits when kappa = 0, rescale = TRUE, and the scaling along the x-axis is around 0.075.

a-torgovitsky commented 3 years ago

Ok, so what's the punchline here do you think? Gurobi + rescaling + kappa > 0 and we're good? Are we ready to try that on some examples that reproduce the structure of ivmte more closely?

jkcshea commented 3 years ago

We also imposed the box constraints to eliminate those INF_OR_UNBD gurobi statuses. But I've found that the box constraints really impact how successful rescaling is.

I re-ran the simulation, this time playing with the sizes of the box constraints. In the plot titles below, "Box scale = k" means I scaled the endpoints of the box constraints by k (box constraints were such that the lower bound was negative, and the upper bound was positive, so scaling the box constraints shouldn't 'skew' the estimates to one side). If k = Inf, that means I removed the box constraints. I am still using the conditioning number rescaling method.

When kappa = 0, then having box constraints isn't that helpful for rescaling.

But when kappa = 0.2, having some sort of box constraint---even if very wide---seems to really improve the stability.

In practice, how to choose the box constraints is not clear to me, since we may have no idea of the magnitude of the MTR coefficients. But perhaps the coefficient estimates when minimizing the criterion can be some sort of starting point. Do you think this is worth considering?

jkcshea commented 3 years ago

Something that I think is clear, though, is that we probably don't want to use scs. I still plan to try a general purpose solver to compare against gurobi. But after that, I think we can go ahead and do full-scale testing using the package.

Which of the rescaling methods do you want to use? Earlier, we found that your original proposal to rescale and recenter the covariates performed very well. And these recent simulations have found the conditioning number approach to be comparable.

a-torgovitsky commented 3 years ago

We could try a rule-of-thumb for adding box constraints by (as you suggest) inflating the coefficients found in the first step by some factor. If so, then we would want to incorporate this into the audit procedure and check that none of the box constraints are binding. If they are binding, then we relax the constraints and try again. This makes the audit procedure more complicated (and we've already had difficulty with it being perhaps too complicated...), but maybe it is worth trying if we see that the box constraints are helpful for stability.

Agreed that scs seems like it is no longer a contender.

As for rescaling, I think the most recent one where we use the norms of the columns is probably the best. This is something that has some theoretical basis, whereas the other rescaling approach was just something I made up from some naive intuition.

jkcshea commented 3 years ago

Sure, I'll get that all set up then.

jkcshea commented 3 years ago

Just to wrap up the discussion on other QCQP solvers, below are the simulation results for two other packages I tried. gurobi remains our best option.

The first is the NlcOptim package, which uses a sequential quadratic programming method. The package either returns a solution or returns nothing at all. There is no report of the optimization status. Nevertheless, the solutions are generally worse than those of gurobi, i.e. NlcOptim has larger objective values in the minimization problems. NlcOptim also doesn't do as good a job of satisfying the constraints as gurobi. As a reminder, Gurobi was always returning the OPTIMAL status when kappa >= 0.2 and Rescale = TRUE.

The other package I tried was the alabama package, which uses an augmented Lagrangian method. I chose this because it allowed the user to input the gradient function for the objective function, and the Hessians for the constraints. I thought this would give it an advantage over NlcOptim, but it did quite poorly (perhaps because the package is from 2015 and is older than NlcOptim). The documentation unfortunately only explains what the status code is when the optimization algorithm converges, and all other codes indicate some sort of failure to converge. Very rarely did the algorithm converge, and almost never were the solutions better than those of gurobi or NlcOptim.

a-torgovitsky commented 3 years ago

Ok, thanks! I guess Mosek is the only other serious contender that we haven't tried. Worth giving it a shot? How much effort will it take?

jkcshea commented 3 years ago

I had to install Mosek once upon a time, and if I recall correctly, it was fairly straightforward. So I'll give it a shot once completing this simulation exercise with ivmte.

a-torgovitsky commented 3 years ago

Ok, sounds good.

a-torgovitsky commented 3 years ago

By the way, have you at any point experimented with using sparse matrices for Gurobi? The example here uses them: https://www.gurobi.com/documentation/9.1/examples/bilinear_r.html

I wonder if it might help in some cases?

jkcshea commented 3 years ago

Ah yes, somewhat. We are actually using sparse matrices when passing all the constraints to Gurobi. This was an artifact of the old, old audit procedure, where I was being very inefficient. So to limit memory usage, I made all the matrices sparse.

However, I never tested whether Gurobi's ability to solve a problem was affected by the sparsity of the matrices. I can try commenting out the code that sparse-ifies the matrix defining the constraints, just to see what happens.

a-torgovitsky commented 3 years ago

Probably worth checking just to see. Perhaps Gurobi is able to make use of the sparsity in its algorithm.

jkcshea commented 3 years ago

I'm sorry this still isn't finished, but I've been trying to understand several odd issues.

1. Adding in the box constraints can really change the bounds.

Recall that we considered trying to include box constraints centered around the solutions from the criterion minimization. The box constraints would also automatically expand whenever they are binding when estimating the target parameters.

The solution that minimizes the criterion need not be near the solution that minimizes/maximizes the target parameter. So often, we do find the box constraints to be binding, and they do need to be expanded. With the problems being convex, I thought that Gurobi would still find its way to the same solution regardless of whether a box constraint is implemented. But this does not seem to be the case. That is, the bounds differ depending on whether I impose the box constraints, yet the optimization statuses for those bounds were all OPTIMAL.

Likewise, the results obtained using Gurobi to minimize the criterion may differ greatly from those using lsei. Again, the reason is that their solutions to the minimization problems are likely to be different. This leads to different box constraints, and potentially different solutions.

2. lsei breaks down when there are a lot of parameters

@slacouture identified this problem when he specified the MTRs to be splines with 100 knots. For documentation, here are his example: lsei examples.zip

The reason for this is that lsei performs a singular value decomposition of the design matrix using the Fortran function svdrs. However, svdrs breaks down in these cases, and returns a bunch of NA values. The remainder of the lsei function then breaks down.

3. lsei has its own scaling issues

@slacouture also identified this issue when he tried to estimate a model where the MTRs only had 5 terms, and Gurobi did much better than lsei. The lsei solutions were also wrong---it returned a vector of 0s.

Without getting into too much detail (but if you need more detail, I can do my best to provide it), lsei reformulates the original CLS problem as a non-negative least squares (NNLS) problem, which is then passed to Fortran. The scaling of the NNLS problem is quite different from the CLS problem. Even though Gurobi's solution satisfies the constraints of the problem; if you rescale them to align with the scaling of the NNLS problem, they will violate the constraints. Mathematically, this should not happen, and it appears to be a precision issue. If you rescale the problem before feeding it to lsei, then lsei returns a reasonable solution.

4. lsei has issues imposing upper and lower bounds

lsei has problems imposing certain box constraints because of the NNLS procedure. I'm still trying to understand what exactly is going on here. (But at this point, perhaps we do not want to continue using lsei.)

5. Rescaling should not change the bounds, but they do

The changes can be quite large, e.g. from -0.2 to 0.5. The optimization status for the bounds are also OPTIMAL. I'm still trying to figure out what is going on.

6. Sparsity doesn't affect anything

This is just to follow up on the last post.

a-torgovitsky commented 3 years ago

Adding in the box constraints can really change the bounds.

That is, the bounds differ depending on whether I impose the box constraints, yet the optimization statuses for those bounds were all OPTIMAL.

That makes sense because they are different optimization problems. But it only makes sense if the box constraints are binding at the solution. If they aren't binding, then there's a contradiction (given that the problem is convex) in finding different solutions with and without the box constraints. Both solutions cannot be optimal if the box constraints aren't binding.

lsei

I think we can just drop it. We already have found that Gurobi works better in the second step. Now you have additional evidence that it works better in the first step.

Rescaling should not change the bounds, but they do

Hard to say without an example. But a good debugging strategy when comparing two programs is to just compare the solutions. Start with the rescaled program A. Capture its solution (i.e. the variables). Unscale them. Evaluate the objective and constraints. How do they compare to the solution for the unscaled program B, and its objective and constraints at optimum? Another useful exercise is to use the solution of program A as a starting point to program B.

jkcshea commented 3 years ago

Regarding box constraints: Yes, this was my mistake. I had forgotten that I had implemented code that increases kappa when the solver runs into a numerical error. What happened was that some of those box constraints led to numerical errors. The audit then restarts, but with a higher kappa. Among all the test output, I had missed the message Audit restarting: criterion.tol increased to ..., and kappa is controlled by criterion.tol. It actually seems more reliable to not use box constraints. Sorry for the suggestion.

lsei This has now been removed.

Regarding rescaling Some of these cases were due to the same reason as above, i.e. I missed the message that kappa was being increased. However, there are indeed some odd cases. Here is the setup for one.

set.seed(1652L)
N <- 2000
args <- list(data = AE[sample(seq(nrow(AE)), size = N), ],
             target = "att",
             outcome = "worked",
             m0 = ~ u + I(u^2) +  I(u^3) + yob + I(yob^2),
             m1 = ~ u + I(u^2) +  I(u^3) + yob + I(yob^2),
             propensity = morekids ~ samesex,
             point = FALSE,
             m0.inc = TRUE,
             initgrid.nu = 2,
             initgrid.nx = 1,
             criterion.tol = 0.0)

## Unscaled estimate
set.seed(10L) ## Determines the initial grid
args$rescale <- FALSE
res.noscale <- do.call(ivmte, args)

## Scaled estimate
set.seed(10L) ## Determines the initial grid
args$rescale <- TRUE
res.rescale <- do.call(ivmte, args)

The optimization status for both specifications are optimal, but the lower bounds differ by quite a bit.

> res.noscale

Bounds on the target parameter: [-0.1406556, 0.2163217]
Audit terminated successfully after 2 rounds 

> res.rescale

Bounds on the target parameter: [0.1995454, 0.2008883]
Audit terminated successfully after 2 rounds

It turns out that the quadratic constraint is actually getting violated in the unscaled estimate. (All the linear constraints are satisfied, though.) Attached are several more simpler examples that I plan to send to Gurobi. Some of the violations can be enormous.

quad-constraint-test.zip

Below is a simulation to see how rescaling the data impacts the performance of Gurobi. For documentation, I've attached the simulation. rescale-test.zip

[EDIT: Sorry, I forgot to mention basic setup. There are 500 iterations, each one uses a random subset of 5000 observations from the AE data. The MTRs are ~ u + I(u^2) + I(u^3) + u:yob + yob + I(yob^2).]

In all cases, the criterion minimization ends up being optimal.

Scaling the data reduces the frequency of numerical errors. However, it seems like these numerical errors turn into suboptimal solutions Increasing kappa from 0 to 0.01 alone seems to do a better job than rescaling.

The figure below shows the frequency that the lower and upper bounds are (i) both optimal; (ii) not both optimal, but no numerical errors; (iii) at least one numerical error.

The next figure shows how the lower and upper bounds change with rescaling. The differences are reported in percentages, relative to the unscaled estimate. This figure may be misleading, though, since (i) the bounds are relatively small in magnitude, so small absolute differences become large proportional differences (but perhaps that is still unacceptable); (ii) there are some huge outliers.

This plot looks at how the linear constraints may be violated. (This check is still done manually, not via Gurobi, as discused in #199.) Scaling seems to resolve this issue, although the violations are small.

Finally, this plot looks at how the quadratic constraints may be violated. Again, scaling seems to resolve this issue, although the violations are small. However, as demonstrated at the very top of this post, minor violations in the quadratic constraints have the potential to greatly change the estimate of the bounds.

a-torgovitsky commented 3 years ago

Ok, interesting. I think what is becoming clear now is that we definitely want something like kappa = .01 as the default.

Here's another debugging strategy for the problematic runs. You have three programs that come out of any given run:

Minimum criterion
LB subject to minimum criterion
UB subject to minimum criterion

Do the minimum criteria change between the scaled/unscaled runs? If so, what happens if you manually fix the minimum criterion in the LB/UB solves so that they are the same for both scaled/unscaled runs? Still get large discrepancies?

This will help pin down whether the discrepancies are coming from

Step 1: Minimum criterion
Step 2: LB/UB subject to minimum criterion

I think it's possible that small variations in the optimal value found in step 1 could allow for large variations in bounds in step 2. If that is in fact the case, then the solution might be to tighten up the tolerances in step 1 so that we get a more accurate idea of what the minimum criterion value is.

jkcshea commented 3 years ago

This will help pin down whether the discrepancies are coming from Step 1: Minimum criterion Step 2: LB/UB subject to minimum criterion

It looks like the discrepancies are coming from both steps.

So here is the comparison of criterions with and without scaling the data from the simulation above. The difference is very small, i.e. an average a difference of 0.11, compared to an average minimum criterion of 1200.

Below are the results from changing the minimum criterions in the unscaled runs to match those of the scaled runs. The results are similar if I instead change the minimum criterions in the scaled runs to match those of the unscaled runs.

The discrepancies for lower bounds have decreased on average, but the opposite has occurred for upper bounds.

The fraction of times when the quadratic constraint is violated is roughly the same as before.

I ran the simulation once more, this time additionally lowering Gurobi's feasibility tolerance to 1e-08 (compared to its default of 1e-06). Just to be clear, I am fixing the criterion, as well as lowering the tolerance. This did reduce the difference between the bounds obtained using unscaled and scaled data, but did not eliminate the difference entirely.

It also increased the fraction of cases where the linear constraints were being violated. Previously, at the default tolerance, around 1% of them were being violated in the unscaled case with kappa = 0. Now 3--4% of the constraints are being violated. This may be expected given the lower tolerance. But if even when I check the constraints using a tolerance of 1e-06, I find that the 2-3% of them are violated. This outcome is a bit odd to me...

In contrast, reducing the tolerance had little effect on the fraction of times the quadratic constraints were being violated.

I tried the same exercise of fixing the minimum criterion on the example I posted above where the bounds change dramatically when scaling the data. For convenience, here is the example again.

set.seed(1652L)
N <- 2000
args <- list(data = AE[sample(seq(nrow(AE)), size = N), ],
             target = "att",
             outcome = "worked",
             m0 = ~ u + I(u^2) +  I(u^3) + yob + I(yob^2),
             m1 = ~ u + I(u^2) +  I(u^3) + yob + I(yob^2),
             propensity = morekids ~ samesex,
             point = FALSE,
             m0.inc = TRUE,
             initgrid.nu = 2,
             initgrid.nx = 1,
             criterion.tol = 0.0)

## Unscaled estimate
set.seed(10L) ## Determines the initial grid
args$rescale <- FALSE
res.noscale <- do.call(ivmte, args)

## Scaled estimate
set.seed(10L) ## Determines the initial grid
args$rescale <- TRUE
res.rescale <- do.call(ivmte, args)

> res.noscale

Bounds on the target parameter: [-0.1406556, 0.2163217]
Audit terminated successfully after 2 rounds 

> res.rescale

Bounds on the target parameter: [0.1995454, 0.2008883]
Audit terminated successfully after 2 rounds

Their minimum criterions differ only by 4.22e-07.

If I change the criterion in the scaled run to match that of the unscaled run, the scaled bounds are roughly the same. But if I instead change the criterion in the unscaled run to match that of the scaled run, I end up with a numerical error.

So despite how Gurobi often violates the quadratic constraints in the simulation, there seem to be cases where Gurobi is very sensitive to them.

a-torgovitsky commented 3 years ago

Are there any examples where kappa = .01 + scaled is causing problems though? I thought from your previous post that there was. But now looking at these graphs it seems kappa = .01 + scaled is completely fine.

jkcshea commented 3 years ago

Ah you are right. In these simulations using the AE data, setting kappa = 0.01 and scaling the data does not cause any problems.

In the earlier simulations, there were indeed some problems, but I was performing the tests using different data. It got very messy and error-prone when I tried to fully incorporate the solvers we were testing into ivmte, while also allowing for the flexibility required for testing (e.g. when we considered using lsei for minimizing the criterion, but another package for the bounds). It was much cleaner to simulate some simple data sets, scale them if desired, and then pass them to each solver. Since I was simulating the data, I was really able to push the scaling issues.

Since settling on Gurobi, I thought it made sense to revisit the AE data since that was where we saw these issues originally. (I'm still willing to try Mosek, though, as discussed earlier.)

I'll generate some data sets with terrible scaling and see how the package handles them.

a-torgovitsky commented 3 years ago

Mosek does seem like its worth a shot if it's not a huge investment.

jkcshea commented 3 years ago

Below is a simulation from some problematic data.

Here is a summary of the data.

       y                x1                x2                 x3         
 Min.   : 61.41   Min.   :-0.5860   Min.   :-1.92347   Min.   :-17.277  
 1st Qu.:129.94   1st Qu.: 0.1858   1st Qu.:-0.43966   1st Qu.: -4.596  
 Median :156.73   Median : 0.5141   Median : 0.03454   Median :  1.326  
 Mean   :159.62   Mean   : 0.5527   Mean   : 0.10255   Mean   :  1.693  
 3rd Qu.:189.49   3rd Qu.: 0.9055   3rd Qu.: 0.67075   3rd Qu.:  7.952  
 Max.   :282.78   Max.   : 1.6081   Max.   : 2.21336   Max.   : 20.072

The MTRs m0 and m1 are ~ x1 + I(x1^2) + x2 + I(x2^2) + x3 + I(x3^2) + u + I(u^2). So even with the quadratic terms, the scaling is not actually horrible. The coefficients on the MTR terms range from 120 (for one of the intercepts), down to 1e-07 (for one of the quadratic terms). As discussed below, the problem seems to be with the magnitude of the outcome variable y.

In the simulations, the minimum criterion is always optimal by the end of the audit (it may start off being suboptimal, though). The relative difference in the minimum criterion between the unscaled and scaled runs is small.

When kappa = 0, Gurobi really struggles. Rescaling does not help, but increasing kappa does.

Scaling the data marginally improves Gurobi's ability to satisfy the linear constraints. Increasing kappa is still better, though.

Scaling the data has no effect on being able to satisfy the quadratic constraints. Increasing kappa does, though.

The complaints of Gurobi regarding scaling have always been about the matrices defining the constraints. So I found it odd that Gurobi struggled so much with this particular problem since the scaling wasn't terrible.

So here are some things I tried:

Increasing/decreasing the scaling of the data---this had little effect.
Reducing the range of the MTR coefficients---also had little effect.
Using very simple MTRs, e.g., ~ x1---also had little effect.

What ultimately made the difference was the magnitude of the outcome variable, which determines the magnitude of the RHS of the quadratic constraint. Since the outcome variable in this simulated data is large, what I did was divide the quadratic constraint by n^2, where n is the sample size. Previously, we only divided it by n. Here are the simulation results when doing this.

Minimizing the criterion is as before.

When obtaining the bounds, scaling the data still has little effect. But increasing kappa leads to optimal solutions. There are a very small handful of INF_OR_UNBD statuses, though, which we did not have before.

The case with the constraints is roughly the same. That is, scaling seems to help a little with satisfying the constraints, but relaxing kappa is more effective.

To test this, I re-ran those AE simulations, but no longer divided the quadratic constraint by n. I also tried multiplying the constraint by 10. This led to a lot more suboptimal solutions from Gurobi when kappa = 0. When kappa = 0.01, the solutions were mostly optimal, but there were now some cases of numerical errors, but still no suboptimal solutions. Dividing the quadratic constraint by n^2 did not have much effect, though, although it did introduce some numerical errors in the setting where kappa = 0.01.

The last point suggests that scaling down the quadratic constraint can be helpful if the outcome variable is large, but may be harmful if the outcome variable is small. How to automate the choice of how much to scale the quadratic constraint by is not clear, though. If we're not afraid of adding more options to the package, perhaps this is another one we want to allow the user to have to improve Gurobi's stability?

a-torgovitsky commented 3 years ago

Something I don't understand about your results. The graphs where you are comparing the minimum criterions. Why would kappa affect those at all? The kappa parameter only comes in to the second step (finding lb/ub).

Second is the scaling of the outcome, which seems like it might be useful? (Although less useful than setting kappa > 0.) But I wasn't able to fully follow what you are doing in the different approaches to that. Could you write that up formally so I can stare at the optimization problems? Then we can try to think of a disciplined way to rescale the outcome that hopefully doesn't require input from the user.

jkcshea commented 3 years ago

The graphs where you are comparing the minimum criterions. Why would kappa affect those at all? The kappa parameter only comes in to the second step (finding lb/ub).

The MTR coefficient estimates differ for different values of kappa. As a result, the points in the audit grid that are being violated also differ. So after the first audit, the shape constraints we include when minimizing the criterion need not be the same for different values of kappa. As a result, we get different minimum criterions.

In only 2% of the most recent simulations do the criterions for kappa = 0 and kappa = 0.01 differ by more than 1%. On average, they differ by less than 0.08%. However, in absolute terms, this 0.08% can be quite large---it is equal to 386.

Could you write that up formally so I can stare at the optimization problems? Then we can try to think of a disciplined way to rescale the outcome that hopefully doesn't require input from the user.

Sure, here it is. It is very simple, and we are already doing it to an extent, but perhaps not to its full potential.

stability-scale-quadratic.pdf

a-torgovitsky commented 3 years ago

I see, I see. It is possible to configure our settings for the audit such that it is "deterministic" isn't it? (i.e. such that we know there will always be only a single audit step) That seems like a good debugging strategy to isolate how much of this is coming from the audit, and how much would be present even if we were to try to solve one large program at the outset.

However, in absolute terms, this 0.08% can be quite large---it is equal to 386.

The size of the criterion differences is not particularly important per se. It's how these gaps map into differences in bounds that are ~concerning~ (edit: that are of concern. I don't know if the magnitude is concerning anymore. I'm just saying the bounds are what we care about.).

Did you get a chance to compare to Mosek?

I am a bit behind but will read the outcome scaling note soon.

jkcshea commented 3 years ago

It is possible to configure our settings for the audit such that it is "deterministic" isn't it? (i.e. such that we know there will always be only a single audit step)

Ah, yes it is. Here are the simulations run where I've set audit.max = 1. So differences in the minimum criterion between scaled and unscaled estimates are now eliminated.

Here are the optimization statuses for the lower and upper bounds. The patterns are largely the same as before, i.e. increasing kappa is most helpful. There are also fewer numerical issues, but that may be because the problems have fewer constraints imposed (since there is only one iteration of the audit).

A big change is that the average difference in bounds between the scaled and unscaled estimates have drastically decreased. Previously, the average was about 90%, now it is between 10--15%.

There are no violations of the linear constraints. The quadratic constraint continues to be violated, though.

Did you get a chance to compare to Mosek?

Sorry for being slow on this. It is mostly set up, but some things remain to be done (e.g. allowing for options). (Since we ultimately want to compare these solvers in the context of the ivmte package, and Mosek is a competitor of Gurobi, I figured I'd go ahead and fully integrate into the package.)

Recall that Mosek has an excellent page on SOCPs. That's because Mosek doesn't let us declare QCQPs. The problem is that the matrix defining the quadratic constraint is sometimes not positive definite. It then isn't possible to transform the QCQP into an SOCP without further work. I haven't dealt with this yet.

Many iterations of the simulation above lead to this problem. But I am working on a simpler simulation comparing Gurobi and Mosek, and will post the results once they're ready.

jkcshea commented 3 years ago

Here is the simulation comparing Gurobi against Mosek. Gurobi is always on the left.

No differences in the criterion minimization.

When kappa = 0, Gurobi does much better. But when kappa > 0, Mosek does much, much better. The STALL (10006) status code from Mosek indicates that Mosek terminated the optimization because of "slow progress." The manual warns that the solutions may therefore not be feasible or optimal. Note the new cases specific to Mosek where the QCQP cannot be reformulated into a SOCP without further work (QUAD. MATRIX NOT PD).

The bounds from Mosek also looks to be less sensitive to scaling.

Linear constraints are almost never violated by Gurobi, and are never violated by Mosek.

Quadratic constraints continue to be violated, albeit the violations are very small. But Mosek does a better job, especially when kappa > 0.

[Edit 1: Oops, forgot one more plot. This one compares the bounds between the two solvers. Without scaling, and kappa = 0, Gurobi's solutions seem to be 2-3% better than those of Mosek. i.e., for lower bounds, Gurobi 3% smaller, and for upper bounds, Gurobi is 2% larger. But these differences disappear once kappa is adjusted or the data are scaled.]

[Edit 2: Ah, I remember why I originally did not post this figure. The comparison uses only the lower or upper bounds that are optimal for both Gurobi and Mosek. When kappa = 0, and the data are unscaled, there are only ~20 iterations where both the Gurobi and Mosek solutions were optimal, so those comparisons are quite unreliable. If kappa = 0 and the data are scaled, there are no iterations where both solvers returned optimal solutions. But when kappa > 0, there are 200+ of such iterations, so those comparisons are more meaningful, and they suggest the solvers return the same solutions.]

[And for documentation, this is the simulation.]

mosek-vs-gurobi.zip

a-torgovitsky commented 3 years ago

Ok. Going forward I don't think we need to keep looking at the kappa = 0 case. We know that kappa > 0 is important for stability.

I am confused by this:

The problem is that the matrix defining the quadratic constraint is sometimes not positive definite.

Do you mean positive definite or positive semi-definite? Either way I don't understand:

If PSD, then we know that's not right, because the quadratic constraint must actually be positive PSD given the structure of the quadratic term as an outer product. (e.g. E[XX'] in linear regression) So this must be some issue of precision. Is this Mosek that is telling you it is not PSD?
If PD, then I'm confused because that's basically requiring a strict inequality (>0, PD) vs. a weak inequality (>= 0, PSD). Generally in computation this is not a meaningful distinction.

jkcshea commented 3 years ago

Ah I'm sorry, I'm wrong, there is no problem really.

So to clarify my mistake, the error I was getting was from R, and was this:

Error in chol.default(Qd) : 
  the leading minor of order 6 is not positive definite

where Qd is the E[XX'] matrix. The reason I was doing a Cholesky decomposition was because I was trying to reformulate the QCQP as an SOCP. The error was popping up because Qd was not full rank.

But there's no reason for me to use that decomposition, because Mosek's modeling cookbook has a much better reformulation than what I posted earlier. So once I implement that, everything should be okay.

jkcshea commented 3 years ago

Okay, here are the simulations where the SOCPs are constructed following Mosek's webpage. So this simulation generates data that is likely to be collinear, but now we can handle them using Mosek. All simulations have kappa = 0.01, and terminate after one audit. mosek-vs-gurobi-updated.zip

No issues with minimizing the criterion, both solvers get the same criterion.

Gurobi performs much better than Mosek when the data is unscaled---not a single solution is optimal for Mosek. But Mosek responds extremely well to scaling---almost every solution becomes optimal. In contrast, Gurobi does not seem to respond to scaling.

As suggested by the previous plots, Gurobi's bounds for scaled and unscaled data are about the same. In contrast, Mosek's bounds (especially the lower bounds) respond to the scaling. So the difference between optimal and suboptimal bounds is potentially large for Mosek. But this doesn't necessarily mean Mosek's bounds are better (as shown below).

Violations of linear constraints are infrequent and small for Gurobi. No violations for Mosek.

However, Mosek is much more likely to violate the quadratic constraint.

Here are comparisons of the optimal bounds between Gurobi and Mosek. Recall that without scaling, Mosek has no optimal solutions, so there is nothing to be compared. But when there is scaling, the negative values in the plot indicate that Mosek did a better job optimizing (i.e. larger upper bounds, smaller lower bounds). However, the difference is very, very small, about 0.0001%. Taking into account the earlier plots, this suggests that Gurobi's suboptimal solutions are actually very close to the optimal solutions, whereas Mosek's suboptimal solutions are farther away.

a-torgovitsky commented 3 years ago

Ok, what does it look like when you let the audit run its course?

Are we still seeing absolute problems with either Gurobi or Mosek? Or are we now just comparing small differences that aren't going to be of practical importance?

a-torgovitsky commented 3 years ago

A comment and a question regarding your scaling note.

Comment: Normalizing the criterion by n just makes sense anyway (regardless of stability). That way we can always interpret the criterion as the average of the squared residuals, independently of n.

Question: You say we are already scaling by setting \alpha = n. Did you have evidence that some other choice of \alpha was better? Or was this just a question of whether to use \alpha = n or not?

jkcshea commented 3 years ago

At this point, it looks like the differences between Gurobi and Mosek are small and will not have practical importance. For example, here are the simulations allowing for multiple audits.

Criterion minimization is stable, although there are slight differences in the criterion values, and how much the criterion changes with and without scaling. But the plots below suggest this doesn't have much of an effect on the bounds.

The optimization statuses for the bounds look similar to before. That is, Gurobi is not very responsive to scaling. Mosek struggles to return an optimal solution without scaling, and responds well to scaling.

Gurobi's bounds differ by up to 25% when the data is scaled vs. unscaled. Since Mosek is responsive to scaling the data, its bounds may differ by up to 150%. But again, as shown below, the bounds across the two solvers are essentially the same.

No violations of the linear constraints. But there continue to be violations of the quadratic constraints by both solvers.

This plot compares the number of audits performed under each solver. Without scaling, the number of audits often differ by 1, with the average number of audits being about 3 for both solvers. With scaling, the number of audits are almost always the same.

And finally, here is the comparison of the optimal bounds obtained using each solver. Ignore the 'No scaling' results, since there are very few iterations where the lower/upper bound was optimal for both solvers. But when the data are scaled, then the final bounds returned by each solver are rouhgly the same.

Question: You say we are already scaling by setting \alpha = n. Did you have evidence that some other choice of \alpha was better? Or was this just a question of whether to use \alpha = n or not?

Yep. A while back I was playing with another simulation, here, where the outcome variable was continuous and could be large, e.g. 250. The results in the post show that setting \alpha = n^2 really improved the stability. It seemed like Gurobi didn't like how the RHS of the quadratic constraint was large.

As a quick example, here is what happens if I set \alpha = n^2 in this simulation, where the outcome can also be large, e.g. 400. In almost 100% of the iterations, both upper and lower bounds are optimal (whereas this was previously around 25% for Gurobi, and 50% for Mosek).

I have also tried setting \alpha = n^2 for the AE data. But the outcome variable is binary in the AE data, so the RHS of the quadratic constraint is relatively small to begin with. This did not have much of an effect, although it did introduce some numerical errors.

So maybe there's a window that Gurobi likes the RHS of the quadratic constraint to be in? If you think that's reasonable, but there isn't an obvious way to derive what this window is, then maybe another round of simulations?

jkcshea commented 3 years ago

Oh, some simple suggestions from Gurobi that I had forgotten.

we recommended that right-hand sides of inequalities representing physical quantities (even budgets) should be scaled so that they are on the order of 10^4 or less.

When defining your variables and constraints, it is important to choose units that are consistent with tolerances. To give an example, a constraint with a 10^10 right-hand side value is not going to work well with the default 10^{-6} feasibility tolerance.

We recommend that you scale the matrix coefficients so that their range is contained in six orders of magnitude or less, and hopefully within [10^{-3}, 10^{6}]

There will be a trade off between scaling down the RHS and keeping the coefficients within [10^{-3}, 10^{6}], but this may be a useful reference to start from.

jkcshea commented 3 years ago

With the way we are rescaling things, it should be the case that all entries in the scaled E[XX'] matrix are in [-1, 1]. So we know we can always scale the quadratic constraint down by a factor of 1000, and the quadratic matrix coefficients will remain in the suggested window of [10^{-3}, 10^{6}] (I assume the sign isn't important).

The scale of the coefficients on the linear portion of the quadratic constraint will depend on the scale of the outcome variable. So how much we can shrink these will depend on the scale of the outcome variable.

If you think it's reasonable, I can draft up an approach on how to set \alpha so that the coefficients and RHS of the quadratic constraint are closer to Gurobi's suggested ranges.

a-torgovitsky commented 3 years ago

I think we should always divide by n in the criterion.

We can also divide by something else, but it shouldn't depend on n. So n^2 I don't like.

Suppose Y is dollars. This is basically like saying do we want to take Y to be raw dollars, or $1,000s of dollars.

I'm not sure what the right amount is. Why did you suggest 1000? I didn't follow that. What about the norm of Y? Or scale it the same way we did for X?

jkcshea / ivmte

Improve scaling of optimization problems #198