jkcshea / ivmte

An R package for implementing the method in Mogstad, Santos, and Torgovitsky (2018, Econometrica).
GNU General Public License v3.0

Fully constraining constant splines #218

Open johnnybonney opened 2 years ago

johnnybonney commented 2 years ago

I am working with MTR specifications of the form

~factor(FE)*uSplines(degree = 0, knots = knot_list) +
    x1 + x1:uSplines(degree = 0, knots = knot_list) +
    x2 + x2:uSplines(degree = 0, knots = knot_list)

where FE is a set of fixed effects (states, 50 values), and x1 and x2 are continuous covariates. knot_list is a set of knots that are equally-spaced between 0 and 1 (e.g., knot_list = c(1/3, 2/3)). These MTRs interact linear functions in $x$ with constant splines in $u$.

I want to set up an initial grid and an audit grid that ease the computational burden while still sufficiently constraining the problem. Intuitively, I thought the following would be a good way to do so:

  1. Set initgrid.nu = audit.nu so that there is at least one grid point on each spline segment (so if knots were at 1/3 and 2/3, I need initgrid.nu to be 3 or more, depending on how the grid points are selected)
  2. Set audit.nx equal to the number of unique covariate values in the data (~3000)
  3. Set initgrid.nx high enough for each value of FE to be covered at least once

(1) and (2) should make sure that the audit grid exhaustively covers these MTR specifications. (3) avoids unbounded solutions. The audit should then iterate to take care of violations driven by x1 and x2.
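
For concreteness, here is roughly the kind of call I have in mind (the data frame df and the target/ivlike/propensity pieces below are just placeholders, not my actual specification; only the MTR formulas and grid arguments match what I described above):

library(ivmte)

knot_list <- c(1/3, 2/3)

res <- ivmte(data = df,
             target = "ate",              # placeholder target parameter
             m0 = ~ factor(FE)*uSplines(degree = 0, knots = knot_list) +
                 x1 + x1:uSplines(degree = 0, knots = knot_list) +
                 x2 + x2:uSplines(degree = 0, knots = knot_list),
             m1 = ~ factor(FE)*uSplines(degree = 0, knots = knot_list) +
                 x1 + x1:uSplines(degree = 0, knots = knot_list) +
                 x2 + x2:uSplines(degree = 0, knots = knot_list),
             ivlike = y ~ d + z,          # placeholder IV-like specification
             propensity = d ~ z,          # placeholder propensity model
             solver = "gurobi",
             initgrid.nu = 6,             # at least one point per spline segment
             initgrid.nx = 3000,          # enough to cover every value of FE
             audit.nu = 10,
             audit.nx = 3000)             # ~ number of unique covariate values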


Here's the problem that I'm running into. Consider the case of knot_list = c(1/3, 2/3). I set initgrid.nu = 6 and audit.nu = 10. I would think the initial grid would be more than enough to cover three spline segments. Even if I set initgrid.nx equal to the number of covariate values in the data, the audit still iterates...

Performing audit procedure...
    Solver: Gurobi ('gurobi')
    Generating initial constraint grid...

    Audit count: 1
    Minimum criterion: 0.2234947
    Obtaining bounds...
    Violations:  33 
    Expanding constraint grid to include 33 additional points...

    Audit count: 2
    Minimum criterion: 0.2234947
    Obtaining bounds...

This suggests that the intuition behind (1) is incorrect, or that the audit is iterating when it shouldn't.


I have two questions:

  1. Does my intuition behind how to tackle the constraint grid make sense given the MTR specifications, or am I doing something questionable?
  2. Should the audit procedure be iterating?

I couldn't reproduce an example, so if you want me to post the data and code, just let me know.

a-torgovitsky commented 2 years ago

I'm not sure exactly how the audit grid for u is created, but that's going to be important, because the grid points are only going to be sufficient to constrain your spline if they happen to line up exactly with the knot points. So you would need to ensure that with audit.nu = 10 and knot_list = c(1/3, 2/3) the audit grid contains the points 0, 1/3, 2/3, 1. (I suppose with a constant spline you could get away without the endpoints 0 and 1 too.) My guess is that audit.nu is going to be in steps of .1 for example, so that 1/3 and 2/3 are not included in it, but I'll leave it to @jkcshea to provide the definitive answer. I guess one feature that might be useful is to allow the user to pass their own grids for the audit.

johnnybonney commented 2 years ago

Do the u grid points have to line up with the knot points? I had thought that you would need at least one point somewhere on each spline segment, but not necessarily at the knots themselves.

For example, with knot_list = c(1/3, 2/3), you would need one u point somewhere in the interval [0, 1/3], one in (1/3, 2/3], and one in (2/3, 1]. But is that not sufficient to constrain the spline?

a-torgovitsky commented 2 years ago

No, you're right: for a constant spline you just need one point on each segment, and it doesn't matter where. I was thinking of a linear spline, for example, where you would want the edges of each segment (the knots) included in order to impose boundedness.

jkcshea commented 2 years ago

Hi everyone,

Apologies for the slow reply (at a wedding).

The audit grid for u is created using a Halton sequence. With initgrid.nu = 6, the interior points for u are

> sort(ivmte::rhalton(6))
[1] 0.125 0.250 0.375 0.500 0.625 0.750

(The end points 0 and 1 are always included in the grid.) So this grid covers all segments of your spline.
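
A quick way to double-check the coverage for your knots (using findInterval from base R):

# Each grid point is assigned to a segment of the degree-0 spline;
# all three segments should appear in the table.
knot_list <- c(1/3, 2/3)
u_grid <- c(0, sort(ivmte::rhalton(6)), 1)
segment <- findInterval(u_grid, c(0, knot_list, 1), rightmost.closed = TRUE)
table(segment)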

I am alarmed by this:

Even if I set initgrid.nx equal to the number of covariate values in the data, the audit still iterates...

I could not generate an example to mimic this, though. Could I please get the data and code to replicate your error?

johnnybonney commented 2 years ago

Sure---here is a folder with the R script and data: example.zip

If I run that script on my machine with that data, the audit output I get starts with what I quoted in my initial post.

jkcshea commented 2 years ago

This looks to be an issue of tolerances. That is, Gurobi spits out a solution that it considers feasible. But when I check that solution for feasibility, I find constraints being violated at points in the audit grid. (I thought I dealt with this a long time ago...)

To see what was going on, I set audit.max = 1. When using the generic BLAS/LAPACK packages, I get 2110 violations. But all are small: the maximum violation has a magnitude of 8.446128e-05, and the mean is 1.197183e-05. The bounds I got were [-0.1104314, 0.13426].

When using the ATLAS BLAS/LAPACK packages, I only get 1 violation. Again, it is small: 1.018487e-06. The bounds are almost identical to the ones above.

But there are warnings from Gurobi that the model is not scaled well. When I set rescale = TRUE, all the violations disappear and the bounds are effectively unchanged. The warnings about the badly scaled problem also disappear. @johnnybonney see if that works for you.

When I switch to MOSEK, I get no violations. This is regardless of which BLAS/LAPACK libraries I am using, and whether I rescale the problem or not. Also, MOSEK solves the problem almost immediately, whereas Gurobi needs about 5 minutes to obtain each bound.

However, when using MOSEK, the bounds are sensitive to whether I rescale the model. MOSEK estimates the same bounds as above when rescale = FALSE. But it estimates very different bounds when rescale = TRUE: [-0.3528038, 0.3449428]. MOSEK also complains about various elements in the constraint matrix being close to 0, regardless of whether I rescale the problem.

I believe I can use Gurobi and MOSEK to check for feasibility. That should hopefully eliminate these contradictions in the audit. I'll look into how that can best be done. I'll see if there's anything I can do with MOSEK's complaints about elements in the constraint matrix being close to 0.

johnnybonney commented 2 years ago

I see. Yep, setting rescale = TRUE prevents the audit from iterating, and I get bounds!

I actually get different bounds than you get above, [-0.3587889, 0.3645447]---these are closer to the rescale = TRUE bounds you got with MOSEK. It's odd that we get different bounds if we are both using Gurobi + rescale = TRUE, right? (I also get a warning that the solver was unable to satisfy the optimality tolerance for both problems.)

Edit: Thanks for pointing this out:

Also, MOSEK solves the problem almost immediately, whereas Gurobi needs about 5 minutes to obtain each bound.

I hadn't used MOSEK before, but I just installed it, and you are right... MOSEK is wicked fast at this compared to Gurobi.


While we are in this thread, I also wanted to ask about this point brought up by @a-torgovitsky:

I guess one feature that might be useful is to allow the user to pass their own grids for the audit.

For my purposes, this would be super useful, primarily to pass audit grids that are guaranteed to fully cover the spline segments.

Would it be easy to implement?

jkcshea commented 2 years ago

It's odd that we get different bounds if we are both using Gurobi + rescale = TRUE, right?

Indeed, I will try to figure out why... I know I've checked this in the past, and the differences were small, e.g., on the order of 1e-07. Maybe I missed something, or one solver is much more stable than the other. Note that MOSEK requires us to convert QCQPs into conic problems.

(I also get a warning that the solver was unable to satisfy the optimality tolerance for both problems.)

Ah yes, me too, I had forgotten to mention that.

I guess one feature that might be useful is to allow the user to pass their own grids for the audit. Would it be easy to implement?

Sorry I forgot to respond to this point from earlier. Yes, it should not be hard.

If we decide to do this: Maybe the user can pass matrices containing the grids for covariates, and vectors for the grids of u. The current names of arguments pertaining to grids are initgrid.nx, initgrid.nu, audit.nx, audit.nu. So for passing custom grids, we could use initgrid.x, initgrid.u, audit.x, audit.u?
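
For example, the interface could look something like this (purely hypothetical, since none of these arguments exist yet; df below is just a toy stand-in for the user's data):

# Hypothetical custom grids: u grids as numeric vectors in [0, 1],
# covariate grids as data frames / matrices of covariate values.
knot_list <- c(1/3, 2/3)
df <- data.frame(FE = factor(rep(1:3, each = 4)),
                 x1 = rnorm(12), x2 = rnorm(12))
my_u <- c(0, 1/6, 1/2, 5/6, 1)                 # one point on each constant-spline segment
my_x <- unique(df[, c("FE", "x1", "x2")])      # full covariate support
# res <- ivmte(..., initgrid.u = my_u, audit.u = my_u,
#              initgrid.x = my_x, audit.x = my_x)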

a-torgovitsky commented 2 years ago

Questions/comments:

  1. @jkcshea do we have the option set to force Gurobi to treat the QCQP as convex? (I can't remember what it's called...) Just wondering if maybe it is failing to detect the problem as convex and thus treating it as a non-convex bilinear program, which would take longer. The output would likely be informative of this.
  2. Any idea what's leading to the small elements in the constraint matrix? Seems like what we are seeing here are just symptoms of that.
  3. I like @jkcshea's suggestions for naming. Assuming the user passes an initial grid, do we have a plan for how to expand it if they don't also pass an audit grid? (Edit: I guess we just take some points from the audit grid and add them to the constraint grid as usual? Maybe nothing different is needed...)

jkcshea commented 2 years ago

  1. @jkcshea do we have the option set to force Gurobi to treat the QCQP as convex?

Yep, Gurobi knows the problem is convex. We automatically set nonconvex = 0 when solving the QP and QCQP.

  2. Any idea what's leading to the small elements in the constraint matrix? Seems like what we are seeing here are just symptoms of that.

It looks to be the Cholesky decomposition, which is returning values like 8.009545e-18. I suppose we can avoid this by setting those entries to be 0. But I recall we do not want to arbitrarily decide when to round something to 0.

  3. I guess we just take some points from the audit grid and add them to the constraint grid as usual? Maybe nothing different is needed...

Yep, I am thinking of expanding the grid as usual. It should be fine, but we'll see.

a-torgovitsky commented 2 years ago

It looks to be the Cholesky decomposition, which is returning values like 8.009545e-18.

Hmm, but that's below machine precision...? If you try rounding to the nearest 1e-15 does that get reflected in Gurobi's assessment of the scaling?

jkcshea commented 2 years ago

Gurobi still says the problem is poorly scaled after the rounding. Part of this is because there is an age^2 variable in the MTR. But even after removing that, there are still some large entries in the Cholesky decomposition matrix. Here's what Gurobi reports:

  Matrix range     [1e-13, 3e+03]
  Objective range  [8e-06, 2e+01]
  QObjective range [2e-04, 2e-04]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e+00, 1e+00]
Warning: Model contains large matrix coefficient range
         Consider reformulating model or setting NumericFocus parameter
         to avoid numerical issues.

a-torgovitsky commented 2 years ago

What if you just replace small values like 8.009545e-18 with 0?

jkcshea commented 2 years ago

The warnings seem to be because of this matrix range:

  Matrix range     [1e-13, 3e+03]

So I need to replace entries with magnitudes 1e-12 or less with 0 in order to get rid of the warnings. If I only replace entries with magnitudes 1e-13 or less with 0, I still get the warnings.
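
For reference, the replacement is just a hard threshold on the magnitudes, along these lines:

# Toy illustration: zero out entries at or below the chosen threshold.
# A is a stand-in for the linear constraint matrix.
A <- matrix(c(8e-18, 5e-13, 2e-12, -0.7, 1.2, 3e3), nrow = 2)
A[abs(A) <= 1e-12] <- 0
range(abs(A[A != 0]))   # remaining nonzero magnitude range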

a-torgovitsky commented 2 years ago

Ok, I guess what I'm trying to ask is whether 1e-12 (or 1e-13 or 1e-18) is materially different for Gurobi than 0. Is there any guidance about this from Gurobi discussions of numerical issues?

jkcshea commented 2 years ago

I had forgotten this feature of Gurobi: "Note that Gurobi will treat any constraint coefficient with absolute value under $10^{-13}$ as zero." That is from the Advanced User Scaling section of their Guidelines for Numerical Issues. There doesn't seem to be an option to change this tolerance level.

The guideline simply suggests we rescale the problem to avoid extremely large or small coefficients. "We recommend that you scale the matrix coefficients so that their range is contained in six orders of magnitude or less, and hopefully within $[10^{-3},10^{6}]$."

On this thread from the Gurobi Support Portal, the Gurobi Staff suggest that the poster set small coefficients to be 0 or rescale the problem. The responses are essentially the same in this, this, and this thread, all from the Gurobi Support Portal.

a-torgovitsky commented 2 years ago

From the guide:

Essentially the trick is to simultaneously scale a column and a row to achieve a smaller range in the coefficient matrix.

Emphasis added. I guess this is what we were doing when we put in rescaling, which seems to work well here. But we're not rescaling the rows, right? We are just rescaling the columns? Can you remind me exactly how our rescaling works?

jkcshea commented 2 years ago

We do two things:

  1. We divide each column of the design matrix by its Euclidean norm. Here is the write-up on rescaling from last year (from this post): stability-condition-number.pdf
  2. We divide the quadratic objective/constraint by the number of observations and the norm of the outcome vector.

So we do not scale the rows, just the columns.
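
To be concrete, the column step amounts to something like this (a minimal sketch, not the package's exact internals):

# Divide each column of the design matrix by its Euclidean norm.
rescale_columns <- function(A) {
  norms <- sqrt(colSums(A^2))
  norms[norms == 0] <- 1                         # leave all-zero columns untouched
  list(A = sweep(A, 2, norms, "/"), norms = norms)
}
A <- cbind(1, rnorm(100), rnorm(100) * 1e4)      # toy design matrix with a badly scaled column
out <- rescale_columns(A)
round(sqrt(colSums(out$A^2)), 10)                # every column now has unit norm
# Coefficients from the rescaled problem are divided by out$norms to map back.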

We had a similar discussion about this a long time ago here. At the time, we thought it would be too difficult to write code to automatically rescale the problem to avoid any scaling issues. But it looks like we're halfway there. Here are some options Gurobi suggests for dealing with scaling issues.

a-torgovitsky commented 2 years ago

Hmm, maybe we should take another look at whether it's possible to automatically rescale things in a better way.

Would help to have some diagnostics for this example.

With both rescale = FALSE and rescale = TRUE, what are the largest and smallest magnitude elements of each column and each row? Do the smallest/largest elements seem to be associated with one particular variable/constraint?

jkcshea commented 2 years ago

I'm going to focus on the linear constraint matrix. I cannot report the largest/smallest elements by row and column since:

  1. I'm ignoring the matrix defining the quadratic objective/constraint, because it contains only a constant along a portion of the diagonal (thanks to the Cholesky decomposition), i.e. it has no scaling issues.
  2. I'm also ignoring all entries less than 1e-13, since Gurobi treats those as 0.


rescale = FALSE

The smallest element in the constraint matrix is 1.035984e-13. In terms of which variable (column) this entry corresponds to: it is the interaction between factor(state)25 and one of the spline basis functions for the untreated group. There isn't anything particularly special about state 25, though; it is the 11th most popular state in the data (out of 50) and accounts for 2.5% of the observations. There are 10 more interactions between factor(state) and spline basis functions for the untreated group with small entries in the range [1e-13, 1e-12]. In terms of which constraints (rows) these small entries correspond to: they all relate to the constraints for the Cholesky decomposition.

The maximum entry is 87025.72. It corresponds to the age^2 variable for the treated group. There are 5 more entries exceeding 1e4, corresponding to interactions between age^2 and spline terms for both the treated and untreated groups. All of these entries are again related to the Cholesky decomposition.


rescale = TRUE

The smallest element in the constraint matrix is 1.068914e-13. It corresponds to the variable factor(state)36 for the untreated group. There are more small entries corresponding to factor(state) for the untreated group in the range [1e-13, 1e-12]. Unlike before, these variables are not interacted with spline terms. They are also divided between Cholesky and shape constraints.

The largest element is 20.34285. It corresponds to the interaction between factor(state)41 and one of the spline terms for the treated group. Again, nothing special about state 41. It is the 27th most popular state and accounts for 1.2% of the observations. There are over 100 other rows with entries exceeding 20. But unlike before, these are all related to the shape constraints.

a-torgovitsky commented 2 years ago

I cannot report the largest/smallest elements by row and column since:

How about some histograms? The x-axis would be the order of magnitude, the y-axis the number of rows/columns whose minimum/maximum element has that magnitude, with one plot for each combination of rows/columns and min/max.
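
Something roughly like this is what I have in mind (a sketch with a random stand-in for the constraint matrix, dropping entries below 1e-13 since Gurobi treats those as zero):

# A is a toy stand-in for the linear constraint matrix.
A <- matrix(rnorm(2000) * 10^sample(-14:3, 2000, replace = TRUE), nrow = 40)
mag <- abs(A)
mag[mag < 1e-13] <- NA                          # treated as exact zeros by Gurobi
row_min <- apply(mag, 1, min, na.rm = TRUE)
row_max <- apply(mag, 1, max, na.rm = TRUE)
hist(log10(row_min), breaks = 20, xlab = "order of magnitude",
     main = "Minimum entry per row")
hist(log10(row_max), breaks = 20, xlab = "order of magnitude",
     main = "Maximum entry per row")
# Column versions: replace apply(mag, 1, ...) with apply(mag, 2, ...).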


I suppose the Cholesky constraints are hard to interpret.

So let's focus on this:

rescale = TRUE: The smallest element in the constraint matrix is 1.068914e-13. It corresponds to the variable factor(state)36 for the untreated group. There are more small entries corresponding to factor(state) for the untreated group in the range [1e-13, 1e-12]. Unlike before, these variables are not interacted with spline terms. They are also divided between Cholesky and shape constraints.

What shape constraint in particular? Is the 1e-13 term genuinely that small, or is there some sort of numerical artifact where a number that should be zero is treated as nearly but not exactly zero? That's what I'm trying to get at here... I'm not sure I see how a number like 1e-13 can come out of what we're doing unless it's through some aspect of the program (or one of the libraries we are calling) spitting it out as a number that we know should be exactly zero.

jkcshea commented 2 years ago

Here are the histograms focusing on the case rescale = TRUE.

Histograms by rows/constraints

[histograms of the minimum and maximum entry magnitudes by row/constraint]

Sorry, I was wrong about shape constraints having small entries. All small entries have to do with the Cholesky decomposition. There are only 3 rows/constraints whose minimum entry has magnitude 1e-13.

There are some rows with minimum magnitudes between 1e-13 and 1e-4. Here is the data behind the second plot: [table omitted]

Histograms by columns/variables

[histograms of the minimum and maximum entry magnitudes by column/variable]

a-torgovitsky commented 2 years ago

Ok useful to know that this is concentrated on the Cholesky decomposition.

@jkcshea does the tolerance option in chol have any impact? I think that it controls how close to 0 the zero elements of the Cholesky decomposition are. So if they are being set to 1e-12 or 1e-13 instead of 1e-14 or 1e-15 (which Gurobi would ignore), then that could be what is causing the issue.

Alternatively, I wonder if an LDL decomposition might be more stable? https://en.wikipedia.org/wiki/Cholesky_decomposition#LDL_decomposition https://search.r-project.org/CRAN/refmans/fastmatrix/html/ldl.html

The idea would be to decompose the Gram matrix as $A'A = LDL'$ and then set $y = L'x$, so that the quadratic part becomes $y'Dy$. Perhaps this distributes the various orders of magnitude across the constraints in a more balanced way?
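
A rough sketch of what I mean, deriving the factors from the ordinary Cholesky decomposition so it doesn't depend on any particular package (fastmatrix::ldl should give the same thing):

# M is a toy stand-in for the Gram matrix A'A.
A <- matrix(rnorm(60), nrow = 20)
M <- crossprod(A)                          # A'A
U <- chol(M)                               # M = U'U, with U upper triangular
d <- diag(U)^2                             # diagonal of D
L <- t(U / diag(U))                        # unit lower triangular factor
max(abs(M - L %*% diag(d) %*% t(L)))       # ~ machine precision
# With y = t(L) %*% x, the quadratic form x'Mx equals y'Dy, so the hope is that
# the extreme magnitudes end up concentrated in d rather than in the constraints.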

jkcshea commented 2 years ago

@jkcshea does the tolerance option in chol have any impact?

Sadly not. The tol parameter controls when the Cholesky algorithm terminates. Here's what the R manual says:

The value of tol is passed to LAPACK, with negative values selecting the default tolerance of (usually) nrow(x) * .Machine$double.neg.eps * max(diag(x)). The algorithm terminates once the pivot is less than tol.

But from what I understand, the "pivot" represents a permutation matrix. I don't understand how it can be less than some tolerance... The LAPACK code is available here, but I haven't gone through it yet.

Alternatively, I wonder if an LDL decomposition might be more stable?

Sure, I can implement that and see how it performs.

jkcshea commented 2 years ago

Sorry, I was previously trying tolerances from 1e-14 to 1e-1. But I realized I should've also tried some larger tolerances, e.g. tol = 0.5, 1, 2. When doing this, we do see the small entries disappear from the Cholesky decomposition. (Have yet to figure out what tol is doing, though.)

For reference, under the default tolerance of 1e-14, the bounds are [-0.3436182, 0.3392346].

When trying tol = 0.5, the small entries in the Cholesky decomposition disappear. The constraint matrix range is now [3e-07, 2e+01]. But then it is no longer the case that $A'A = \mathrm{chol}(A'A)' \mathrm{chol}(A'A)$. For 95% of the entries, the differences are on the order of 1e-16. For the remaining 5%, the order ranges from 1e-6 to 1e-1. The bounds are quite close to the original estimate, though: [-0.3537302, 0.3926399]. (However, Gurobi still warns that both bounds are suboptimal.)

When tol = 1, things get a bit weird. The constraint matrix range becomes [1e-03, 2e+01]. The difference between $A'A$ and $\mathrm{chol}(A'A)'\mathrm{chol}(A'A)$ grows: now the matrices match on only 73% of the entries, and for the remaining 27% the differences range from 1e-3 to 1. The audit doesn't seem to end, and the number of violations ranges from 4,000 to 14,000.

When tol = 2, things get weirder. The difference between $A'A$ and $\mathrm{chol}(A'A)'\mathrm{chol}(A'A)$ doesn't actually change by much. The constraint matrix range is still [1e-03, 2e+01]. But in one of the audits, the minimum criterion was negative. Again, the audit does not seem to end and there are always thousands of violations.
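
For reference, this is roughly how I am comparing the two sides after changing tol (a sketch with a toy Gram matrix; I am assuming the pivoted factorization here, since tol only applies when pivot = TRUE):

# With the real, badly scaled Gram matrix, the discrepancies are the ones reported above.
A <- matrix(rnorm(200), nrow = 40)
M <- crossprod(A)                                  # toy stand-in for A'A
R <- chol(M, pivot = TRUE, tol = 0.5)
p <- attr(R, "pivot")
M_rebuilt <- crossprod(R)[order(p), order(p)]      # undo the pivoting
summary(c(abs(M - M_rebuilt)))                     # distribution of the discrepancies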

If we do want to play with the tolerances, then I'll try to figure out what exactly tol is. Maybe we want to automate the choice of tol. Or maybe we want to introduce another option for the user to adjust tol.

a-torgovitsky commented 2 years ago

Maybe there's something in the LAPACK manual that will explain?

Just to be clear here on the big picture:

  1. We are still having scaling problems.
  2. It seems to be caused by very small terms that show up due to the Cholesky factorization.
  3. There's a question as to whether these are "meaningfully" small terms, or just numerical artifacts.
  4. If they are meaningfully small, then I guess we're stuck. But if they are numerical artifacts there may be a graceful way to get rid of them.