markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

LEMON behavior with fullmatch recovery #216

Closed josherrickson closed 2 years ago

josherrickson commented 2 years ago
> data(nuclearplants)
> mm <- match_on(pr ~ cost + t1 + t2, data=nuclearplants)
> options("fullmatch_try_recovery" = TRUE) 
> m1l <- fullmatch(mm, data=nuclearplants, max.controls = 2, solver = "LEMON")
> m1r <- fullmatch(mm, data=nuclearplants, max.controls = 2, solver = "RELAX-IV")
>
> summary(m1l)
Structure of matched sets:
1:1 1:2 0:1 
  7   3   9 
Effective Sample Size:  11 
(equivalent number of matched pairs).
>
> all.equal(m1l, m1r, check.attributes = FALSE)
[1] TRUE

When we have recovery, both LEMON and RELAX-IV produce the same match. However, if we disable recovery, things are different.

> options("fullmatch_try_recovery" = FALSE)
> m2l <- fullmatch(mm, data=nuclearplants, max.controls = 2, solver = "LEMON")
> m2r <- fullmatch(mm, data=nuclearplants, max.controls = 2, solver = "RELAX-IV")]
Warning message:
In fullmatch.matrix(mm, data = nuclearplants, max.controls = 2,  :
  Matching failed. (Restrictions impossible to meet?)
 Enter ?matchfailed for more info.
>
> summary(m2l)
Structure of matched sets:
1:2 0:1 
 10   2 
Effective Sample Size:  13.3 
(equivalent number of matched pairs).

On the plus side, LEMON appears to have "recovery" baked in automatically. However, because we do some preprocessing, the matches differ quite a lot.

The preprocessing for recovery first re-fits without max.controls, then figures out how many controls to drop to ensure all matched sets are within max.controls. The unrestricted match is:

> stratumStructure(fullmatch(mm, data = nuclearplants))
1:1 1:3 1:4 1:8 
  7   1   1   1 

(Note that there are 22 controls). So what's happening here is that:

What's the best move forward here:

  1. LEMON never goes through recovery. Pros are that 1) we maximize effective sample size, 2) LEMON always returns the same match and 3) we avoid one call to the solver, but one con is that LEMON and RELAX will produce different matches.
  2. LEMON always goes through recovery. We get consistent results for RELAX, but are throwing away easy gains in effective sample size.
  3. Leave it as is. Downside is we get different results for same solver depending on the fullmatch_try_recovery option, but it feels the most true to the existing codebase. We'll need to expand the documentation to explain that the fullmatch_try_recovery option functions slightly differently by solver.

I'm leaning towards 1. LEMON's "recovery" appears to me to be preferred over our hacky version. I don't think its a big deal that the two solvers produce different results in some scenarios; they're different implementations overall.

benthestatistician commented 2 years ago

I agree that 1. is the preferable behavior, and am also open to supplanting the logic of .fullmatch.with.recovery(). But first I'd like to understand more specifically how LEMON's implicit recovery is working.

It may in effect be implementing a version of #200, which would upgrade to the current .fullmatch.with.recovery() logic. I.e.,

  1. Translate matching problem into its network flow representation, a min-cost flow problem. (I.e., problem of moving specified amount of flow through network in such a way as to minimize cost.)
  2. Before attempting to solve the min-cost flow problem, solve associated max flow problem for the network. If the computed max flow exceeds the total flow specified in the min-cost flow problem, proceed to 4 below. If not,
  3. Change the min-cost flow problem by reducing its total flow to equal the max possible flow as determined in step 2.
  4. Proceed with solution of max-flow problem, back-translating results into a match in the usual way.

Does this sound like what's going on? If so, then perhaps it's what we should be doing even in cases where we're going to use the RELAX-IV min-cost flow solver at step 4. As an alternative and improvement over the current recovery logic.

If something like this is what's going on, I'd like to understand more about the details. Ordinarily, reductions of the flow parameter in (3) above would correspond to reducing omit.fraction, but in extreme cases they might also necessitate reduction of the number of treatment group subjects to be matched. Perhaps in those cases the problem should just be declared infeasible. I'd like to understand whether those extreme cases can arise and how they're handled.

josherrickson commented 2 years ago

Ben and I spoke offline, and I did more research. There are four algorithms implemented in LEMON's mincostflow: Cycle Cancelling, Capacity Scaling, Cost Scaling, and Network Simplex. At the point of running the above, the default was Network Simplex as LEMON describes it as the fastest. However, as I found and Ben discussed, it does some sort of manipulation when presented with an infeasible problem.

On the other hand, the Cycle Cancelling algorithm produces results that are most similar to RELAX and cruicially for the issue at hand, simply errors when presented with an infeasible problem.

Therefore we are changing the default to Cycle Cancelling. (For the record, when calling fullmatch, the solver argument can take in either "RELAX-IV" or "LEMON", or the LEMON function, e.g. LEMON("CostScaling"), so all four algoriths are available.)