a-torgovitsky commented 2 years ago

Ok @jkcshea I have an idea for addressing the issues we are seeing in #218 #221 #224 #225 (and possibly others...)

We want solutions to the system $Ax = b$. Reading through this old tome of wisdom, it seems like QR decompositions of $A$ are preferred from a numerical standpoint to Cholesky decompositions of $A'A$. Basically because one can get more precision on the QR decomposition than the Cholesky decomposition (not because of the decomposition per se, but because one is on the design matrix and the other is on the Gram matrix.) This could be exactly the problem we are having with spurious small terms showing up in the Cholesky decomposition.

So suppose we QR decompose $A = QR$ where $Q$ is orthogonal (so $Q'Q = I$) and $R$ is upper triangular.

Least squares criterion

The criterion is $x'A'Ax - 2x'A'b + b'b$ Substituting the QR decomposition, this becomes: $x'R'Q'QRx - 2x'R'Q'b + b'b$ I see three things we could do here:

No substitution, but use orthogonality to simplify to $x'R'Rx - 2x'R'Q'b + b'b$.
Substitute $y = QRx$, so the objective becomes $y'y - 2y'b + b'b$
Substitute $y = Rx$, so the objective becomes $y'y - 2y'Q'b + b'b$ where the quadratic term uses the orthogonality of $Q$.

Not clear to me which one is better in terms of scaling, so we should try all of them.

Normal equations approach

The criterion is the 1-norm of $A'Ax - A'b$ Substituting the QR decomposition, this becomes: $R'Q'QRx - R'Q'b$ Again I see three things we could do here:

No substitution, use orthogonality to simplify to $R'Rx - R'Q'b$
Substitute $y = QRx$, so now we are looking at the 1-norm of $R'Q'y - R'Q'b$
Substitute $y = Rx$, then use orthogonality, so we are looking at the 1-norm of $R'y - R'Q'b$.

Again not clear to me what's going to be better.

Can you give these a shot and see if they resolve the current problems were are having across various issues? Would have been good if we had had the foresight to set up a suite of test problems that we could run different methods on, so that we don't go in circles... Not sure how difficult it would be to dig up our old threads and check these new approaches on those problems too.

jkcshea commented 2 years ago

Hm, that looks like a very useful but also challenging book...

Sure, it's no problem to implement and try these different approaches. I'll also collect some examples from various threads for benchmarking.

jkcshea commented 2 years ago

Hi everyone,

Just thought I'd give an update on this since this is taking longer than I expected. I have implemented the various methods above. I'm currently testing them to make sure I did things correctly. For convenience, the code allows the user to choose between the methods we've implemented already, as well as the ones above.

So for the LP approach:

direct = "lp0" uses the original approach, i.e., no QR decomposition.
direct = "lp1" uses the the first QR approach for LPs proposed by @a-torgovitsky (which is really identical to the original approach since $R'R = R'Q'QR = A'A$).
direct = "lp2" uses the second QR approach.
direct = "lp3" uses the third QR approach.

Likewise for the QCQP approach:

direct = "qp0" uses the original Cholesky approach.
direct = "qp1" uses the the first QR approach for QCQPs.
etc.

But these different approaches may not be compatible with other features of the package. e.g., rescale, equal.coef, boostrapping haven't been dealt with.

a-torgovitsky commented 2 years ago

Thanks for the update @jkcshea . Were there any difficulties in particular that you ran into with this?

I think the lp1 approach is going to be appreciably different from lp0. Even though it ends up in $R'R = A'A$, how it gets there is different, because lp0 uses Cholesky and lp1 uses QR. The difference is whether we decompose $A'A$ after multiplying (Cholesky) which can compound small errors, vs. decomposing $A$ first and then multiplying. Remember the overarching issue here is that we are seeing very small numbers come out of the Cholesky decomposition which seems to be leading to bad scaling.

jkcshea commented 2 years ago

Were there any difficulties in particular that you ran into with this?

Oh no, it was just messy, that's all. You know, adjust one matrix, but that affects another matrix in another function, etc.

The difference is whether we decompose after multiplying (Cholesky) which can compound small errors, vs. decomposing first and then multiplying.

Ah, I am glad you pointed this out. All of the numerical issues recently have been to do with the direct regression. So the Cholesky was never applied to the LP approach. So lp0 was using the original criterion $A'Ax - A'b$. I'll adjust lp0 to use the Cholesky as well, then.

But I'll also allow for the original criterion, e.g., let lp4 use $A'Ax - A'b$. I know that, in R, $A'A \neq R'R$ and $A \neq QR$. They differ ever so slightly because of limitations in numerical precision. Maybe some of these differences can stem from very small numbers in $R$ and $Q$. So maybe its worth keeping the original criterion around.

a-torgovitsky commented 2 years ago

Ok, I didn't realize that, I thought lp0 was what we were using all along for the LP approach. Anyway, let's just compare them all. If we have the test problems set up (and saved in the package), then we should be able to deal with this issue more efficiently and hopefully resolve them for good. Something we should have done at the start, but you know how that goes... I'm hopeful that the QR approach will yield big gains in stability based on what I read in the Lawson and Hansen book.

jkcshea commented 2 years ago

Uploading the code for the simulation for reference.

issue-277-simulation.zip

jkcshea commented 2 years ago

Hi everyone,

Apologies for taking so long with this. Attached is a collection of plots showing how Gurobi performs when using various matrix decompositions. A couple things.

I limited the sample size to $N = 2,000$ in the simulations since methods lp2 and qp2 won't work with large data sets (the number of new variables introduced is equal to the number of observations). But sample size seems to affect stability, so I'm running the simulation again with $N = 10,000$.
As one of the measurements of stability, I should have recorded the range of magnitudes in the constraint matrices. I will include that next time.
Let me know if you want any other measurements/plots/test cases/etc.

decomp-test.pdf

a-torgovitsky commented 2 years ago

Hi @jkcshea , thanks this looks good.

Let's just get rid of lp2/qp2---not scaling well with the sample size is a deal-breaker. Sorry I didn't realize that when I wrote it down.

I'll reserve judgment on the rest until we see the new results with larger sample size and also range of magnitude in the constraint matrices. The latter will be an important indicator I think.

jkcshea commented 2 years ago

Here are the updated results using a sample size of 10,000.

decomp-test.pdf

In terms of which method is the most stable, there doesn't seem to be a clear winner. It seems to depend on the example.

Optimal bounds: For the LP problem, it seems detrimental to introduce new variables using the QR or Cholesky decompositions. But for the QCQP problem, it can be helpful when criterion.tol = 0.

Flipped bounds: Introducing new variables is helpful for the LP problem, but not for the QCQP problem.

# of audits: All methods are roughly the same.

# of violations: Decompositions can be helpful in certain cases, and detrimental in others.

Run time: All methods are roughly the same.

Order of magnitude in the constraint matrix: Introducing new variables using the QR or Cholesky decomposition leads to small entries. It can also lead to larger entries and large ranges of magnitudes.

a-torgovitsky commented 2 years ago

Thanks @jkcshea , these are informative.

Could you give me some example programs that

Have large magnitude ranges
Have some sort of error or imprecision when solving

.mps files would be good, then I can play around with various approaches easily Maybe one example from each approach (4 x lp/qp)

jkcshea commented 2 years ago

Sorry for the slow response. Here are the 8 examples: error-examples.zip

In each folder are the .mps and .rds files for the models. Notice there are two models: one is for the criterion problem, and the other is for the bounds. While modelsense is set as "min" for the criterion problem, it is not set for the bounds problem. If you're using the Gurobi shell, this can be done by model.ModelSense = 1 (min) and model.ModelSense = -1 (max), where model is the variable containing the model.

An annoying thing to mention: Given the scale of the simulation, it was run on Acropolis. So examples that were unstable or erroneous on Acropolis may be fine on our machines. For example, for qp0 and qp4, none of the cases that failed on Acropolis failed on my machine. But for lp0, ... lp4, qp1, and qp3, the attached examples are infeasible/unbounded/subopitmal.

With the exception of qp0 and qp1, the range of magnitudes in these examples are also large: 13 orders or more. For qp0 and qp1 (where we do not introduce new variables), the range is only 5.

a-torgovitsky commented 2 years ago

Thanks.

Two quick questions:

Are these from different models/draws of data?
For the bound problem in lp0 there is also a large magnitude range in the objective vector, which is different from what we've been looking at, which I thought was the range in the LHS constraint matrix. How common is it to have such a large range in the coefficient vector? Why does it occur? (i.e. what combination of target parameter/MTR specification gives rise to it?)

jkcshea commented 2 years ago

Are these from different models/draws of data?

Yes, they vary in both the model and the data. Should I instead prepare 8 .mps files for the same model and data? Maybe that is more helpful for comparing these methods.

there is also a large magnitude range in the objective vector... How common is it to have such a large range in the coefficient vector? Why does it occur?

Hm, I'm not sure. I will look into this.

a-torgovitsky commented 2 years ago

I've done some more reading on this, and it really just reinforces that we need to better understand why we are getting these large ranges. What is it about the problem or preliminary computations that leads to them?

Focusing on lp0 first seems like a good idea, since it is very poorly scaled, but doesn't have any transformations/substitutions, so is easier to understand. Here are the stats:

Gurobi Optimizer version 9.1.2 build v9.1.2rc0 (linux64)
Thread count: 8 physical cores, 16 logical processors, using up to 16 threads
Optimize a model with 525 rows, 240 columns and 4728 nonzeros
Model fingerprint: 0x7e240fbc
Coefficient statistics:
  Matrix range     [2e-13, 1e+00]
  Objective range  [1e-13, 1e+00]
  Bounds range     [0e+00, 0e+00]
  RHS range        [3e-14, 7e-01]

We already have the question of what components of the objective vector are taking such large and small values. What about for the constraint matrix? For the RHS?

The point is that in this model we should be able to understand exactly which variables (meaning, e.g. the quadratic term in the spline of knot sequence 5, interacted with age) or constraints (e.g. the one corresponding to increasing, or to matching moment 5, or whatever) are creating the problems. Then, hopefully, we can figure out a way to innocuously reformulate the problem to improve things. We did this in the past when introducing rescaling to handle a variable like age^2. But it seems like we missed something crucial.

jkcshea commented 2 years ago

For the bound problem in lp0 there is also a large magnitude range in the objective vector... How common is it to have such a large range in the coefficient vector? Why does it occur? (i.e. what combination of target parameter/MTR specification gives rise to it?)

The objective has entries with order of magnitude -13 or less about 2% of the time (4 instances out of 200 simulations). (The objective has entries with order of magnitude -10 or less about 18% of the time.)

There are two entries in the objective vector with magnitude -13. Both involve interactions with a spline of degree = 3 with knots at 0.1, 0.2, ... 0.9.

Interaction between the spline and x1, a binary variable with mean approx. 0.5.
Interaction between the spline and factor variable x2, in particular factor(x2)3. x2 == 3 with approx. 0.26 probability.

We have tiny entries in the objective because:

The target parameter is the ATU. So when integrating out u, we integrate from the propensity score phat up to 1. phat is sometimes estimated to be just below one of the knots , e.g. phat = 0.19958 but one of the knots is 0.2. So when we integrate some of the spline bases from phat up to 1, we get tiny values (magnitude of -13). We then interact those tiny values with x1 and average them to get the tiny entries in the objective vector. We don't have this issue if we change the target parameter to ATT, for example, since we integrate from 0 to phat instead.
The interactions don't "align." When integrating out u for one of the spline bases, the average integral has magnitude -4, so it isn't always tiny. However, for the observations with x2 == 3, the integral has magnitude -11. The corresponding propensity score isn't actually right up against a knot, but the spline basis is just very small in that region. So when we interact that particular basis spline with factor(x2), we end up with a tiny entry in the objective vector.

What about for the constraint matrix? For the RHS?

Tiny entries in the constraint matrix and RHS all pertain to the moments, i.e., trying to minimize $A'A - A'b$. They always corresponded to the interaction between a particular basis spline (the second one) and x1. Again, this is because there are some propensity scores very close to a knot. So when we try to minimize $A'Ax - A'b$, we have tiny entries in $A$, and thus tiny entries in $A'A$ and $A'b$.

Then, hopefully, we can figure out a way to innocuously reformulate the problem to improve things.

In the LP problem, it seems like tiny integrals of spline bases are indicative of scaling issues. After all, they are more likely to generate tiny values in $A$, which leads to tiny entries in $A'A$ and $A'b$. Maybe we can use those tiny values to inform us which variables and which constraints need to be rescaled.

a-torgovitsky commented 2 years ago

Hi @jkcshea , thanks this is super useful.

For 1, you say

We then interact those tiny values with x1 and average them to get the tiny entries in the objective vector.

Just to be clear, does the interaction itself create any issues? (For example, if x1 = 0, do we get 0 back, or do we get a small number, like 1e-13?) Or are you just saying that we are taking a small number and multiplying by roughly .5 (the mean of x1)?

Anyway, the broader point seems to be this: If the spline is interacted with categorical x and if $p(x,z)$ has mass points near the knot points then we can have integrals over very small intervals These integrals will be tiny for higher order spline terms (quadratics, cubics)

Is that an accurate summary? Do you think that is an accurate description of what leads to bad scaling in $A$ and the objective for the other examples we have too?

jkcshea commented 2 years ago

Just to be clear, does the interaction itself create any issues?

Fortunately not. R seems to be treating zeros as zeroes. Small entries in the vectors/matrices are due to there being small numbers that are supposed to be non-zero.

Anyway, the broader point seems to be this: ... Is that an accurate summary?

Yep, that sounds accurate

a-torgovitsky commented 2 years ago

Did you check the other examples too? The problems there also arise from the same issue with pscore and knots?

If so, then that's great---we've isolated the problem!

The solution is not clear to me, but I'll keep thinking about it.

jkcshea commented 2 years ago

I've finished going through the LP examples, but not the QCQP examples. I can do those next.

But from the LP examples, it looks like the decompositions can cause those tiny values.

For instance, in the third test case I used for diagnosing issue, there should not be any small entries in any of the matrices/vectors defining the LP problem. This holds if we do not use any decompositoins and substitutions. But once we introduce decompositions, we get all those tiny entries. This is because the matrices generated by the QR and Cholesky decompositions contain very small entries. So when we use those matrices to define the Gram matrix or substitute in variables, we get very small values.

a-torgovitsky commented 2 years ago

Ok, I see case 3, which has a specification like this:

args <- list(data = df,
             target = "att",
             outcome = "worked",
             m0 = ~ u + I(u^2) + yob + u*yob,
             m1 = ~ u + I(u^2) + I(u^3) + yob + u*yob,
             propensity = morekids ~ samesex,
             noisy = FALSE)

So there are no splines here. Yet the results you shared before say that there are always 8 order range of magnitude in coefficients, before decomposing the A matrix. So what is generating that here?

jkcshea commented 2 years ago

Did you check the other examples too?

Okay, I've gone through the QP examples now.

Small entries in the constraints

In the examples I considered, small entries in the constraints are never because of shape constraints. They are always because of the decomposition matrices used to substitute in variables. This is true for both the QR and Cholesky approaches.

Small entries in the objective

I found a typo in my code when using the QCQP approach. (It meant I was carrying the objective vector from the criterion problem to the bound problem...) So I'll have to re-run those examples to see:

How stable these approaches are.
How often we get small values in the objective for the QCQP approach. And if we get any, I'll find out why.

So there are no splines here. Yet the results you shared before say that there are always 8 order range of magnitude in coefficients, before decomposing the A matrix. So what is generating that here?

This is because the design matrix $A$ has entries with magnitudes ranging from -3 to 1. But when defining the equality constraints, we use the gram matrix $A'A$. The magnitudes of the Gram matrix range from -5 to 3.

a-torgovitsky commented 2 years ago

Thinking about the spline issue... Why doesn't rescaling the corresponding variable fix this? The variable is "smalll" but the coefficient would also be expected to be small.

jkcshea commented 2 years ago

I corrected the objective vector for the QCQP problem. Here is an updated version of the figures (only QCQP figures have changed).

decomp-test.pdf

In 2/4 test cases, the Cholesky approach seems more stable than the QR approach.

Regarding the small entries in the objective vector

In the case of the bounds problem, I forgot that the construction of the objective vector for the bounds is the same regardless of whether we use the LP or QCQP approach. So the frequency of and reason behind small values in the objective vector is the same for the LP and QCQP approaches. e.g. a degree 0 spline is interacted with a dummy variable $X$ such that when $X = 1$, the propensity score $p(X, Z)$ is very close to a knot.
In the case of the criterion problem, I will have to check.

Why doesn't rescaling the corresponding variable fix this?

Hm, not sure. I'll update the code for the rescale option. I can then re-run the simulation to see how things look. I've kept track of all the seeds used to generate each iteration of the simulation. So if needed, I can compare models with rescale = TRUE and rescale = FALSE easily.

a-torgovitsky commented 2 years ago

Ok, just to be clear, all of this is with rescale = FALSE? I would expect rescaling would have fixed this problem.

Also, you mention as an example a degree 0 interacted spline with knots close to a propensity score support point. I had assumed above that these were all cubic spline problems. Is that not the case? Even with a constant spline we can have this issue popping up?

jkcshea commented 2 years ago

Ok, just to be clear, all of this is with rescale = FALSE?

Yep, that's right.

Also, you mention as an example a degree 0 interacted spline with knots close to a propensity score support point. I had assumed above that these were all cubic spline problems.

Ah, I'm sorry, this is also right. That was indeed a cubic spline.

a-torgovitsky commented 2 years ago

Ok, that makes more sense.

Do you remember why we started doing rescale = FALSE? There was some example where rescaling hurt... Or maybe it was just because we thought we had solved the problem

It seems like rescaling could be exactly what we need for variables like certain third order spline pieces that are going to always be small

jkcshea commented 2 years ago

Do you remember why we started doing rescale = FALSE? There was some example where rescaling hurt...

This looks to be the example. https://github.com/jkcshea/ivmte/issues/209#issuecomment-1133813106

But looking at my code, I don't know if I rescaled things correctly the first time... I'll finish getting the rescale option to work with all the methods and see how the simulation goes.

jkcshea commented 2 years ago

Hello all,

Here is an update on the rescaling. I've only done this for the QCQP problems. Below are a subset of the types of plots from the earlier document, and an explanation of how I rescaled things. I compare the unscaled results to the rescaled results.

decomp-rescale-test.pdf

Unfortunately, rescaling still looks bad. I have yet to fully understand why. One of the problems seems to be that rescaling worsens the range of magnitudes of the coefficients.

When rescaling, I normalize the columns of certain matrices by their $\ell_2$ norms. (This idea was originally proposed here.) The problem is that some of these columns have tiny norms, e.g., of the magnitude 1e-11. So if I divide something of reasonable magnitude by this tiny norm, I'm going to get a huge number. This means the rescaled constraint matrices can have enormous ranges of magnitude. I doubt we want to arbitrarily decide what norm is considered too small to be worth normalizing.

Something I tried was scaling individual constraints. But what I found was that the range of magnitudes in a single constraint can also be large. For example, after rescaling the variables, the range of magnitudes in a single constraint can be as large as 12. So scaling such a constraint isn't going to reduce its range of magnitudes.

Right now we are scaling variables independently of the constraints (well, we aren't even scaling constraints). Maybe there is a way to algorithmically scale them together. I can look into this later if desired.

Also, just to wrap up a loose thread:

We don't often get small entries in the quadratic constraint vector (the linear part).
If we do, it is again because the knots are close to a propensity score.

a-torgovitsky commented 2 years ago

Maybe there is a way to algorithmically scale them together. I can look into this later if desired.

This is what the Gurobi manual suggests in the numerical guidelines: https://www.gurobi.com/documentation/9.5/refman/advanced_user_scaling.html However they don't discuss exactly how besides providing a simple example. My guess would be that if it could be done algorithmically, then Gurobi would already do it...but maybe not. Might be worth asking on their forums.

Anyway, I think the first thing to try is to do the rescaling more like Gurobi suggests in their example above. Taking their example on that page, suppose you multiply all the $x$ coefficients by $10^{5}$ as they suggest, but forget about rescaling the second constraint. Then the range is still going to improve from $[10^{-7}, 10^{4}]$ to $[10^{-2}, 10^{5}]$ In contrast, if you took the $l{2}$ norm of x, you'd be dividing the coefficients of $x$ by something that's basically 1, which wouldn't change things at all. So their example suggests that dividing by the $l{2}$ norm is not the right way to rescale.

I guess one rule of thumb would be to multiply all elements of a column by $10^{k}$ where $k$ is whatever integer would make the smallest magnitude be something like $10^{0}$ or maybe $10^{-2}$.

jkcshea commented 2 years ago

Sure, I'll implement this and see how things look.

a-torgovitsky commented 2 years ago

The Gurobi example linked above says:

Essentially the trick is to simultaneously scale a column and a row to achieve a smaller range in the coefficient matrix.

But simultaneous here doesn't seem like what we think of as "simultaneous" as economists, right? i.e. if you change the order (scale the row first, then the column) then you arrive at the same reformulation? If so, that suggests we could do something sequentially: scale all columns first, then scale all rows. This seems a lot easier. (But again, if this is effective then why isn't Gurobi doing it automatically?)

jkcshea commented 2 years ago

Here's an updated version of the earlier plots. We now have a qp5 approach where ivmte rescales the columns of the constraint matrix to have minimum magnitude of -3. Afterwards, ivmte rescales the rows to have a minimum magnitude of -3. In some examples, this approach is far superior to the earlier decomposition approaches.

decomp-rescale-test-colfirst.pdf

Whether we rescale the columns or rows first matters, though. The figures above rescale the columns first, and then rescales the rows. The results are not as good when we rescale the rows first. decomp-rescale-test-rowfirst.pdf (There's a switch in the code that allows me to reverse the order. But I didn't make it an option in the package.)

Here are some simple examples of $2 \times 2$ matrices where the order of rescaling can matter. rescale-example.pdf

jkcshea commented 2 years ago

Although likely obvious, I should mention that this simple approach of rescaling the columns and then rows is not necessarily optimal. Sometimes, individual columns and rows can be rescaled again to further reduce the range of magnitudes. I tried to see if I could come up with a way to automate this but was unsuccessful. Things get very confusing very quickly since the order of rescaling columns/rows matters.

In case someone wants to look into this, here is how I visualized the problem. Perhaps this is not worth pursuing, though.

a-torgovitsky commented 2 years ago

Hi @jkcshea , thanks for this. What I had in mind was rescaling in combination with decompositions. Here it looks like you are either doing a decomposition + L2 rescaling, or you are doing the row/column rescaling with order of magnitude in qp5 but not doing a decomposition. Is that right?

jkcshea commented 2 years ago

Here it looks like you are either doing a decomposition + L2 rescaling, or you are doing the row/column rescaling with order of magnitude in qp5 but not doing a decomposition.

Yep.

What I had in mind was rescaling in combination with decompositions.

Ahh, okay. So I should implement the decompositions as before, and then rescale the rows and columns so that the minimum order of magnitude is -3?

a-torgovitsky commented 2 years ago

Yes let's try that. I think we can ditch the L2 rescaling entirely---that doesn't really make sense.

jkcshea commented 2 years ago

Sorry this is taking so long. I've run into a lot of issues trying to test this approach. Nevertheless, here are the updated figures from before using the new rescaling/decomposition combination. decomp-rescale-test.pdf

If there are variable substitutions, the rescaling is very unsuccessful. Why this is still eludes me, but I will keep investigating. It is possible I made a coding error and have been unable to find it. But the optimal bounds when rescale = TRUE matches those when rescale = FALSE, so I'm at least passing the simplest of checks.

Page 11 of the PDF shows the minimum magnitude of the entries in the constraint matrix. This figure illustrates two problems I've also been trying to resolve.

The minimum magnitude is often -5 That shouldn't happen, the minimum magnitude should always be -3. This does not happen on my machine but happens on the server. I don't understand why.
The sum of the height of the blue bars is not 200 (the total number of simulations). This is happening because minimizing the criterion results in numerical errors. When this happens, ivmte fails to return the Gurobi model. But this should not be the case. ivmte should always return the Gurobi model, even when there is an error. That way the user can investigate the model.

a-torgovitsky commented 2 years ago

By variable substitutions you mean for the quadratic constraint? Are you maybe trying to rescale the new artificial variables too?

Some of these issues seem like straight bugs. My recommendation is always to get those sorted first before trying to think about the technical/computational/etc. issues. Otherwise you risk spending time on artifacts of coding errors.

jkcshea commented 2 years ago

By variable substitutions you mean for the quadratic constraint?

Yes, so we get the nice identity matrix.

Are you maybe trying to rescale the new artificial variables too?

Ah, I was worried I did this too. But I am not doing this.

Some of these issues seem like straight bugs. My recommendation is always to get those sorted first before trying to think about the technical/computational/etc. issues. Otherwise you risk spending time on artifacts of coding errors.

Sounds good, I'll focus on the two bug-like issues.

jkcshea commented 2 years ago

Okay, I've resolved the bugs now. These were the problems.

Problem: I was not able to access the model in every iteration of the simulation. Why: There are many cases where ivmte does not return an estimate. e.g., numerical issues, infeasibility, unboundedness. I thought I coded ivmte so that it spits out an error message and returns output for the user to investigate in every one of these scenarios. But I was wrong. Note that this did not affect the stability of the package. It just stopped me from extracting the model from every iteration of the simulation. Solution: Expanded error-catching code to cover more foreseeable errors. I also included a catch-all to cover unforeseen errors. This catch-all returns the original error message from the solver, along with the model. That way, the user can investigate the error on their own.

An unforeseen error worth mentioning: When using the QR decomposition, Gurobi sometimes spits out an error saying "the quadratic matrix is not positive semidefinite". I've checked the code many times, we're constructing our matrix correctly and it should be PSD. I believe the tiny entries from the decompositoins are causing problems. This case on the Gurobi support forum is the most similar to ours. The OP has a quadratic matrix that is very close to being PSD but not quite. The Gurobi staff took a look at this and found that Gurobi breaks down when this happens. I tried implementing the solution that the Gurobi staff suggested---didn't work for us.

Problem: The minimum magnitude in the constraint matrix after rescaling should always be -3. But for some iterations, we got -5. Why: I forgot to rescale the new rows of the constraint matrix that get added during the audit procedure. Columns were correctly being rescaled, though. Solution: Also rescale the rows.
Problem: Very large ranges of magnitudes in the linear constraint matrix. Why: The decompositions generate tiny numbers. So when we rescale the columns, we can end up rescaling by large magnitudes. This can make 0s become something tiny-tiny like 1e-25. Then when we rescale the rows, these new tiny-tiny entries make us scale certain rows up by huge magnitudes. Solution: Entries smaller than 1e-13 in magnitude are converted to 0 after every step of rescaling.

Here are the updated figures: decomp-rescale-test.pdf The performance is about the same as before.

Rescaling doesn't seem to make things better.
Combining decompositions with the column/row rescaling seems worse than rescaling by l2 norms.
The matrix decompositions have large ranges of magnitude. So even after resacling, the ranges of magnitude in the constraint matrices are large.

a-torgovitsky commented 2 years ago

Thanks...having a hard time following the thread here...

The rescaling being done here is only for the quadratic constraint, or is it also for the linear constraints? In the QCQP, the decompositions only affect the quadratic constraint, don't they? Or by

Problem: Very large ranges of magnitudes in the linear constraint matrix.

do you mean the linear constraints for the slack variables added when using the decomposition?

jkcshea commented 2 years ago

The rescaling being done here is only for the quadratic constraint, or is it also for the linear constraints?

Oh, but we're rescaling both columns and rows of the linear constraint matrix. If we rescale the columns, that means we're rescaling the unknown variable $x$. Shouldn't both the quadratic and linear constraints always be affected by rescaling $x$?

For instance, consider the linear constraint $Ax \le b$. Let $K$ be a diagonal matrix for rescaling $x$. Then the constraint above is the same as $A K K^{-1} x \le b$. We can rewrite this as $\widetilde{A} \widetilde{x} \le b$, where $\widetilde{A} = AK$ and $\widetilde{x} = K^{-1} x$. So we need to adjust the linear constraint matrix if we want to rescale $x$.

In case of a quadratic constraint $x'Qx + q'x \le b$: We can rewrite this as $x'K^{-1}KQK K^{-1}x + q'K K^{-1}x \le b$. Which is the same as $\widetilde{x}'\widetilde{Q} \widetilde{x} + \widetilde{q}' \widetilde{x} \le b$, where $\widetilde{Q} = KQK$ and $\widetilde{q} = Kq$. So we need to adjust both the matrix $Q$ and the vector $q$ in the quadratic constraint when rescaling $x$.

Or by Problem: Very large ranges of magnitudes in the linear constraint matrix. do you mean the linear constraints for the slack variables added when using the decomposition?

I do believe that these linear constraints for the slack variables that seem to be the problem. The matrix decompositions include very small numbers, and the decompositions become part of the linear constraints. In Test Cases 1, 2, and 3, the rescaled estimates that do not involve decompositions are more stable than the rescaled estimates involving decompositions.

a-torgovitsky commented 2 years ago

Hi @jkcshea , sorry, I'm just really completely lost now. I spent an hour reading through the entire thread again, and there are just too many things going on to keep track of.

I agree with your most recent post that rescaling would affect both the linear and quadratic cases.

But can we take a step back here? Do you have a short summary of what we know so far? Did it appear that qp5 --- new rescaling, but no decompositions --- worked well? Is that the leading option right now?

jkcshea commented 2 years ago

But can we take a step back here? Do you have a short summary of what we know so far?

Of course, sorry for the confusion.

The short summary is that qp5 (new rescaling, no decomposition) worked the best. But it only improves stability if criterion.tol = 0. If criterion.tol > 0, we're better off not making any changes to the QCQP problem.

Here's a longer summary of what we've tried.

Suppose we want to max or min $c'x$ subject to

Linear constraints $Ax \le b$.
Quadratic constraint $x'Qx + q'x \le d$.

We tried two approaches to rescaling. The objective was to improve the range of magnitudes in $A$.

Rescale $x$ using the $\ell_2$ norm of the columns of $A$. Did not improve stability.
Rescale $x$ based on the magnitudes of the rows and columns of $A$. This was able to improve stability when criterion.tol = 0. When criterion.tol > 0, we are better off not rescaling.

We also tried two constructions of slack variables. The objective was to eliminate $Q$ from the quadratic constraint.

Slack variables based on a Cholesky decomposition. It improved stability in one of our test cases. But this is only if criterion.tol = 0.
Slack variables based on a QR decomposition. Did not improve stability.

Finally, we tried combinations of rescaling with the slack variables. But this was always worse than only rescaling or only having slack variables.

a-torgovitsky commented 2 years ago

Thanks When you say this:

The short summary is that qp5 (new rescaling, no decomposition) worked the best. But it only improves stability if criterion.tol = 0. If criterion.tol > 0, we're better off not making any changes to the QCQP problem.

What evidence are you referring to? Is it for example from comparing decomp-rescale-test-3.pdf Figures 2.1 and 2.2?

jkcshea commented 2 years ago

Ah yes that's right. The most recent copy of decomp-rescale-test.pdf from this post is the one I'm referring to (which is indeed the third iteration).

In that document, qp5 corresponds to the red "Basic" bar. We see improvements when comparing Figures 2.1 vs 2.2. We also see improvements when comparing Figures 2.3 vs 2.4.

Note that the "QR, no sub" approach (qp1, green bars in the figure) matches the performance of qp5. But that's because it is essentially the same as qp5. What we did in qp1 was apply the QR decomposition to the design matrix $B$. So if $B = QR$, then $B'B = R'R$. qp1 uses $R'R$ as the matrix in the quadratic constraint. qp5 uses $B'B$ instead.

a-torgovitsky commented 2 years ago

Ok, that makes sense. But then the performance of the decomposition methods is really bad in Case 1 under rescaling. Are you sure that's right?

When doing both a rescale and a decomposition are you rescaling first, then decomposing? Or decomposing first, then rescaling?

jkcshea commented 2 years ago

When decomposing and rescaling, I'm decomposing first and then rescaling. I can try switching the order. Not sure if that will prevent the decompositions from returning matrices with tiny entries and large ranges of magnitude. But we can try and find out.

a-torgovitsky commented 2 years ago

What order were we doing before with the l2 rescaling?

jkcshea commented 2 years ago

The same order: decompose and then rescale.

jkcshea / ivmte

QR decomposition #227

Least squares criterion

Normal equations approach