This solve takes a really long time (does it actually finish?)

a-torgovitsky commented 4 years ago

library("ivmte")
args <- list(data = ivmteSimData,
             ivlike =  c(y ~ z + x,
                         y ~ d + x,
                         y ~ d | z),
             subset = l(x <= 8, 1 == 1, z %in% c(1,3)),
             target = "ate",
             m0 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
             m1 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
             propensity = d ~ z + x,
             noisy = TRUE)
r <- do.call(ivmte, args)

Any idea what's going on here?

a-torgovitsky commented 4 years ago

I think it actually crashes R...

jkcshea commented 4 years ago

Hm, I am actually unable to recreate the error. This example runs for me, and the subsetting is working as it should. Below is some more detailed output.

> ## Check the size of the data after subsetting
> nrow(ivmteSimData[ivmteSimData$x <= 8, ])
[1] 4932
> nrow(ivmteSimData)
[1] 5000
> nrow(ivmteSimData[ivmteSimData$z %in% c(1, 3), ])
[1] 2517
> 
> ## Perform estimate
> args <- list(data = ivmteSimData,
+              ivlike =  c(y ~ z + x,
+                          y ~ d + x,
+                          y ~ d | z),
+              subset = l(x <= 8, 1 == 1, z %in% c(1,3)),
+              target = "ate",
+              m0 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
+              m1 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
+              propensity = d ~ z + x,
+              noisy = TRUE)
> r <- do.call(ivmte, args)
Obtaining propensity scores...
Generating target moments...
    Integrating terms for control group...
    Integrating terms for treated group...
Generating IV-like moments...
[1] "Subset condition"
x <= 8
[1] "Number of observations after subsetting"
[1] 4932
[1] "Dist. of X"

   1    2    3    4    5    6    7    8 
  13   98  357  873 1364 1234  744  249 
[1] "Dist. of Z"

   0    1    2    3 
 594 1851 1853  634 
    Moment 1...
    Moment 2...
    Moment 3...
[1] "Subset condition"
1 == 1
[1] "Number of observations after subsetting"
[1] 5000
[1] "Dist. of X"

   1    2    3    4    5    6    7    8    9   10 
  13   98  357  873 1364 1234  744  249   60    8 
[1] "Dist. of Z"

   0    1    2    3 
 607 1876 1876  641 
    Moment 4...
    Moment 5...
    Moment 6...
[1] "Subset condition"
z %in% c(1, 3)
[1] "Number of observations after subsetting"
[1] 2517
[1] "Dist. of X"

  1   2   3   4   5   6   7   8   9  10 
  6  50 183 441 685 632 369 119  29   3 
[1] "Dist. of Z"

   1    3 
1876  641 
    Moment 7...
    Moment 8...
Performing audit procedure...
    Generating audit grid...
    Generating initial constraint grid...

    Audit count: 1
    Minimum criterion: 0.006529669
    Obtaining bounds...
    Violations: 0
    Audit finished.

Bounds on the target parameter: [-0.5115238, -0.4698897]

a-torgovitsky commented 4 years ago

How long does it take for you after "Obtaining bounds..."? I think I let it run for like 5-10 minutes before giving up. Not clear to me why this solve would take so long with only 8 moments and relatively few parameters.

a-torgovitsky commented 4 years ago

Tried this on a different computer...still stuck at "Obtaining bounds..." after an hour

jkcshea commented 4 years ago

It looks like the issue is with lpsolveapi, still investigating... With Gurobi and CPLEX, everything runs in under 4 seconds for me.

a-torgovitsky commented 4 years ago

Ah brilliant... yes I am using lpSolveAPI Might be a bug on their part

jkcshea commented 4 years ago

It is indeed looking like the problem is simply the package lpSolveAPI. I do have to input the models in a particular way for each package, but based on some tests I ran with smaller examples, it looks like that is being done correctly. It looks likelpSolveAPI simply isn't able to solve the LP problem.

Here is one example where lpSolveAPI seems to be failing.

args <- list(data = ivmteSimData,
             ivlike =  c(y ~ z + x,
                         y ~ d + x,
                         y ~ d | z),
             target = "ate",
             m0 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
             m1 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
             propensity = d ~ z + x,
             noisy = TRUE)

In the example above, gurobi and cplexapi require two rounds of audits. When using lpsolveapi, the results from the first audit match those of gurobi and cplexapi. But in the second round of the audit, lpsolveapi has unbounded estimates. Below is some debugging output, showing the objective values from the various LP problems.

   Audit count: 1
LP solve obj:  0 
LP solve status:  1 
    Minimum criterion: 0
    Obtaining bounds...
LP solve obj:  -0.6959481 
LP solve status:  1 
LP solve obj:  -0.2065187 
LP solve status:  1 
    Violations:  19 
    Expanding constraint grid to include 19 additional points...

    Audit count: 2
LP solve obj:  0 
LP solve status:  1 
    Minimum criterion: 0
    Obtaining bounds...
LP solve obj:  1e+30 
LP solve status:  0 
LP solve obj:  -1e+30 
LP solve status:  0 
    LP solutions are unbounded.
    Expanding constraint grid, restarting audit.
    Generating audit grid...
    Generating initial constraint grid...

By default, the initial grid is automatically expanded up to 3 iterations whenever the results are unbounded. lpSolveApi is unable to obtain a solution by then.

Here's another example:

args <- list(data = ivmteSimData,
             ivlike =  c(y ~ (z == 1) + (z == 2) + (z == 3) + x,
                         y ~ d + x,
                         y ~ d | z),
             target = "ate",
             m0 = ~ uSplines(degree = 1, knots = c(.25, .5, .75)) + x,
             m1 = ~ uSplines(degree = 1, knots = c(.25, .5, .75)) + x,
             propensity = d ~ z + x,
             noisy = TRUE,
             m0.lb = 0.7)

gurobi and cplexapi solve this without any problems again. While lpSolveAPI is able to obtain an upper bound, it is unable to attain a lower bound. But with 2 expansions of the initial grid, it eventually does. The bounds obtained match those of gurobiand cplexapi.

    Audit count: 1
LP solve obj:  0 
LP solve status:  1 
    Minimum criterion: 0
    Obtaining bounds...
LP solve obj:  -0.6194033 
LP solve status:  1 
LP solve obj:  -1e+30 
LP solve status:  0 
    LP solutions are unbounded.
    Expanding constraint grid, restarting audit.
    Generating audit grid...
    Generating initial constraint grid...

    Audit count: 1
LP solve obj:  0 
LP solve status:  1 
    Minimum criterion: 0
    Obtaining bounds...
LP solve obj:  -0.6194033 
LP solve status:  1 
LP solve obj:  -1e+30 
LP solve status:  0 
    LP solutions are unbounded.
    Expanding constraint grid, restarting audit.
    Generating audit grid...
    Generating initial constraint grid...

    Audit count: 1
LP solve obj:  0 
LP solve status:  1 
    Minimum criterion: 0
    Obtaining bounds...
LP solve obj:  -0.6194033 
LP solve status:  1 
LP solve obj:  -1e+30 
LP solve status:  0 
    LP solutions are unbounded.
    Expanding constraint grid, restarting audit.
    Generating audit grid...
    Generating initial constraint grid...

    Audit count: 1
LP solve obj:  0 
LP solve status:  1 
    Minimum criterion: 0
    Obtaining bounds...
LP solve obj:  -0.6194033 
LP solve status:  1 
LP solve obj:  -0.4676609 
LP solve status:  1 
    Violations: 0
    Audit finished.

Bounds on the target parameter: [-0.6194033, -0.4676609]

Is there anything we can do about this other than warn the user about using lpsolveapi?

a-torgovitsky commented 4 years ago

The behavior you reported isn't cause for concern because ivmte is handling things correctly: Trying to resolve with an expanded grid, and after a set number of failures just quitting. That would be informative enough that a user would know they should try a different solver.

The behavior I'm seeing is different -- R just freezes completely with no output from ivmte. That's not what you are getting? Maybe we should compare versions of lpsolveapi...mine is 5.5.2.0-17.4

jkcshea commented 4 years ago

Ah, we're using the same version of lpsolveapi, and I do indeed get the same problem with the original example you posted. That is, I'm stuck with this:

 Audit count: 1
    Minimum criterion: 0.006529651
    Obtaining bounds...
[1] "running lpsolve minmization"

(The extra line is debugging code) I can't break the run, and have to restart the session.

It looks like lpsolveapi isn't able to able to perform the minimization problem?

a-torgovitsky commented 4 years ago

Ok yeah then we're getting the same behavior now.

Is there any way to debug it? If not, probably just best to save the matrices out to disk, then post it as an issue in the lpsolveapi repository.

jkcshea commented 4 years ago

Just heard back from the author of lpSolveAPI. Here's what he had to say:

I would recommend not using lpSolveAPI. The R package is just a wrapper for lp_solve which is a separate open source solver. Development on lp_solve effectively stopped ~2007. At this point I don't think it is even worth making bug reports to lp_solve. At best they will be acknowledged but most likely not fixed.

I followed up and asked if he had any suggestions on a good open source LP solver.

If we have to switch to another LP solver, there is GLPK. It looks to still be maintained, and is available for Windows, Mac and Linux.

a-torgovitsky commented 4 years ago

Ahhh...ok. So my primary concern here is that the user can just easily install the solver with install.packages Rather than screw around with licenses/dependencies/etc. with Gurobi/CPLEX. Is that the case for GLPK (or any other relatively maintained solver)?

jkcshea commented 4 years ago

So my primary concern here is that the user can just easily install the solver with install.packages

Hm, almost. Unix users need to download the tarball and install the solver. But apparently Mac users also have access it to it via their Homebrew package manager. Windows users can download the executable to install the solver. Afterwards, they can install the Rglpk package to interface with the solver.

Looking at the manual, Rglpk can read CPLEX models. So if we want to give it a shot, its incorporation will hopefully be very straightforward. (Edit: actually this reasoning does not make sense, but it still should not be hard)

a-torgovitsky commented 4 years ago

Are there any other options that can be installed with just one command?

If we can't find anything that's really easy to install, then maybe we just add a warning in the vignette and elsewhere that lpsolveapi is buggy and outdated.

jkcshea commented 4 years ago

I came across this paper:

comparison-of-lp-solvers.pdf

Here are the LP solvers they consider, after some screening:

Only CVXOPT is potentially suitable. Or rather, CVXR, which looks to be an offshoot. It can be installed using the install.packages command. Stephen Boyd, the author of Convex Optimization, is on the package paper ("To apper, Journal of Statistical Software, 2019"), so I assume that means this actually works. So I can try incorporating this.

For documenting the problems with the other LP solvers in the table:

MINOS No longer free.

CLP Requires separate installation of solver. R API here: https://rdrr.io/cran/clpAPI/man/clpAPI-package.html

GLPK Requires separate installation of solver. R API: Rglpk package.

PCx Only for Unix. Looks very out of date. Not for R.

PPL Doesn't interface with R.

JOptimizer Seems like its for Java? Website is dead, can't find more information.

LiPS Doesn't interface with R.

a-torgovitsky commented 4 years ago

Ok, thanks -- glad you stopped there because there are 100s of LP solvers. Maybe just look at CRAN on the package index page? Searching for "Linear Prog" or "Convex Opt" should be sufficient. I think CVXR is a good solution if we need to switch to that. Or, like I said before, we just stay with lpsolveapi and make a note that it's got some problems...

Let me know what you think the time cost would be. If it's a lot of effort to add a new solver then it's probably not a good use of your time.

jkcshea commented 4 years ago

Searching for "Linear Prog" or "Convex Opt" should be sufficient.

I should've done this from the beginning. There are surprisingly few packages, actually. The only one that I had not come across already was linprog, but that depends on lpSolve, which is the high-level API to lp_solve.

I think implementing CVXR may take an hour or so. I just need to learn the syntax, and then adjust the matrices/vectors defining the LP problem accordingly.

a-torgovitsky commented 4 years ago

Ok, let's give that a shot.

It's not a bad use of time actually. CVX has many implementations and is based on this beautiful, beautiful book, so learning how it works is a good human capital investment.

Just make sure to set up a number of test cases so that it's easy to confirm afterwards that RCVX, Gurobi, and CPLEX are all reproducing the same thing.

jkcshea commented 4 years ago

Almost done! I just need to include some test cases, and deal with a cosmetic issue (i.e. messages).

So my primary concern here is that the user can just easily install the solver with install.packages

While the website says `install.packages("CVXR") is all you need to do , that was not actually true for me.

I had to install the following libraries on my Linux machine:

libgmp-dev
libmpfr-dev
libmpc-dev

If we decide to keep cvxr as one of our solvers, then perhaps we should include a note on those libraries in the manual or paper.

I don't know what the installation procedure is for Windows and Mac users, though.

@johnnybonney If I remember, you are a Windows user? If so, could you see if you can install the package CVXR simply using install.packages("CVXR")?

@cblandhol Are you by chance a Mac user?

Does that problematic example above run for cvxr?

Yes, but the bounds do not perfectly match those of gurobi and cplexapi. gurobi and cplexapi spit out [-0.5115238, -0.4698897]. cvxr spits out [-0.4710606, -0.4709039]. Still better than lpsolveapi, though.

Do we want to get rid of lpsolveapi?

Here is a tweaked version of that problematic example that began this thread. It runs for lpsolveapi.

args <- list(data = ivmteSimData,
             ivlike =  c(y ~ z + x,
                         y ~ d + x,
                         y ~ d),
             subset = l(x <= 8, 1 == 1, z %in% c(1,3)),
             target = "ate",
             m0 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
             m1 = ~ uSplines(degree = 3, knots = c(.25, .5, .75)) + x,
             propensity = d ~ z + x)
r <- do.call(ivmte, args)

Again, lpsolveapi seems to be wrong: lpsolveapi spits out a bound of [-0.4682947, -0.4682947]. gurobi and cplex spit out [-0.5123849, -0.4681759] cvxr spits out [-0.5123864, -0.468186].

But lpsolveapi is fast. This example runs in less than 3 seconds using gurobi, lpsolveapi, and cplexapi. With cvxr, this takes 51 seconds.

Note: I don't recall running into this problem before, where there are large, noticeable differences between the solutions obtained by the proprietary solvers and lpsolveapi. So I don't think lpsolveapi is always wrong.

johnnybonney commented 4 years ago

@johnnybonney If I remember, you are a Windows user? If so, could you see if you can install the package CVXR simply using install.packages("CVXR")?

That's right, I use Windows. install.packages("CVXR") seems to be sufficient to install the package on my computer. (And yes, @cblandhol is a Mac user)

a-torgovitsky commented 4 years ago

Hmm, the long run time with cvxr is a bit concerning. This is probably because it is not designed for linear programming per se (but rather convex programs more generally). Maybe the better thing is to stick with lpsolveapi and just note to the user that it can be problematic, so they are strongly encouraged to switch to gurobi or cplex.

a-torgovitsky commented 4 years ago

Let's just stick with what we had. I'll make a note in the documentation to be a bit wary of lpsolveapi.

jkcshea commented 4 years ago

Sure. So just to clarify, we want to remove support for CVXR? (which is completely fine)

a-torgovitsky commented 4 years ago

Might as well, for the same reason as we removed lpsolve before. The more solvers we have, the more potential things to maintain. So if cvxr just doesn't work well, I don't see any point of having it around.

jkcshea commented 4 years ago

Done!

jkcshea / ivmte

This solve takes a really long time (does it actually finish?) #136