markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

Improve `tolerance` logic when `omit.fraction` >0 #85

Open benthestatistician opened 10 years ago

benthestatistician commented 10 years ago

SubDivStrat seems to draw no distinction between problems in which we're matching all available controls and problems where we're matching a small number of them. I have the impression that this is gumming up problems where we're selecting a small fraction of controls from a large reservoir: they take a long time, and if you look at summary applied to the result you're often told that the error of the optimization routine is much larger in magnitude than the eventual sum of matched distances.

The first task is to think through what's going on here to decide if it makes any sense, and how it might be improved. Quite likely the follow up will include (a) extending users the option of setting per-subproblem tol parameters, (b) rewriting this part of the code.

benthestatistician commented 5 years ago

Presently we have (in fullmatch.matrix()):

 total.n <- sum(dim(x))

  TOL <- tol * total.n

    tol.frac <- (nrow + ncol - 2)/(total.n - 2 * np)

where x is the distance "matrix", np is the number of suproblems and (nrow, ncol) describe a current subproblem. This does seem a poor formula for problems with a small number of available controls: The supposition seems to be that the sum of row and column dimensions is a good stand-in for the total number of matches, either by subproblem or overall; whereas of course it isn't.

If omf is an omit fraction between 0 and 1, then a better proxy for the number of in the subproblem matches would be the largest number of matches that full matching might make in for that subproblem, namely nrow + ncol*omf -2. We'd make a corresponding adjustment for cases where the problem is "flipped," i.e. this block of fullmatch.matrix() is invoked:

 {
      maxc <- min(1/mnctl, ncol)
      minc <- max(1/mxctl, 1/nrow)
      omf.calc <- -1 * omf
      d <- t(d)
    }
benthestatistician commented 5 years ago

Notes about implementation of this change:

  1. As of now, tolerance related calculations are split in an awkward way between the body of fullmatch.matrix and .fullmatch(). Better to allocate tolerances to subproblems all in one go, within the body of fullmatch.matrix(), then pass the answers as arguments of .fullmatch()/.fullmatch.with.recovery().
  2. tol may or may not survive as an argument to fullmatch() and pairmatch(). If it doesn't (my pref) it's still likely to be a part of update.fullmatch()/update.pairmatch() (#167 ). So work done on this issue won't have been lost, particularly if the work tidies the code up in addition to fixing it. But that's a different task w/ different inputs (as we can presume a previously worked solution). Probably better to start from scratch.
  3. I think fullmatch() and pairmatch() should take a resolution, or grid.width, argument instead of tolerance. If so that can just supplant all of the tol- related calculations.
benthestatistician commented 3 years ago

I suspect this was closed inadvertently, so reopening. (The code listed above as being in need of change doesn't appear to have been updated in any relevant branch.) New work on this to happen in the proj1-nodeprices branch.

benthestatistician commented 3 years ago

For reference, here's how a tolerance should might be used to allocate an error tolerance budget among subproblems: There are two methods, a compatibility (with versions 0.9 and prior) method and a new method (for version 1.0).

  1. In the absence of a hint, survey the subproblems and estimate for each one the number of matches that a solution to it would involve. Under the compatibility method, use the estimate implied above, nrow + ncol -2 (ie n -2); under the new method, instead use nrow + ncol*omf -2 as indicated above. Divide error tolerance budget among subproblems in proportion to these numbers.
  2. In the presence of a hint, check to see whether problem parameters other than tolerance have changed. If so, proceed as in the absence of a hint, above.
  3. In the presence of a hint, with tolerance as the only aspect of the problem to have changed from the problem solved by the hint arguments other than tolerance all the same as in the solved problem provided with the hint: a. Perform initial allocation of given regret budget, allocating among among subproblems in proportion to their estimated numbers of matches, as above.
    b. Check to see which subproblems are already solved, according to this criterion: the hint problem records them as feasible, with a regret bound that's at or lower than the regret budget they're currently allocated. These don't get any error budget. These subproblems aren't passed down to the solver; we just use the solutions to them (primal and dual) that are contained within the hint. b. For the remaining subproblems, use the number of active arcs in the hint solution to estimate the number of active arcs we'll wind up with for that subproblem. Divide error budget in proportion to these. c. The combined regret/tolerance budget for as-yet unsolved problems is set to the given tolerance budget minus the sum of regret bounds for subproblems whose existing solutions we've accepted. The final allocation of regret to subproblems divides this budget among unsolved subproblems in proportion to their estimated numbers of matches. (Note that this procedure can furnish a new match in cases where the provided hint was the result of a fullmatch or pairmatch call with precisely the same arguments as in the present call.)
  4. To map subproblem tolerance allocations to subproblem values of epsilon: under the compatibility mode, calculate as is currently done in solve_reg_fm_prob(); under the new mode, total active arcs m across the subproblem and set epsilon to tolerance/(1.5 * m). (Why 1.5? See these notes: minorization.pdf.)

(Side note: markmfredrickson and I have been considering ...active arcs.) (Edited & updated, Aug 2024.)

benthestatistician commented 3 months ago

The solution proposed just above calls for counting active arcs, perhaps also active nodes by subproblem. @adamrauh is working on code that would do these calcs shortly after calling findSubproblems(), so it has a list of subproblems & corresponding distances available. Presently these subproblem-specific distances might either be ISMs or DenseMatrix-es. For current purposes, perhaps also others, it would be better only to have to deal with one or the other.

Proposal: have findSubproblems() always return an ISM for each subproblem, even if was given a DenseMatrix or optmatch.dlist. Do you recall a reason not to do this, @markmfredrickson? Other comments, Mark or @josherrickson ?

benthestatistician commented 3 months ago

I spoke with Mark who talked me out of the proposal in the above comment. Specifically, he persuaded me that the thing to do is to define an internal generic function, "num_allowable_pairings()" or similar, that counts arcs in ISMs and counts finite entries in DenseMatrices. There are other such helper functions defined in utilities.R. @adamrauh

benthestatistician commented 3 months ago

NB: Just now I rewrote my 2021 "For reference here's how" comment above. The rewrite somewhat obviates the two comments I made between it and this one.

benthestatistician commented 3 months ago

Inside of fmatch(), the integer ncol*omf is referenced as "n.mc". In the nodes table that it builds and returns, -n.mc is the supply associated with the node labelled "(_Sink_)" (usually the last node listed). We might grab this quantity from there when we need it, rather than storing it as a separate entry in the subproblem info table (as Adam and I discussed in a ftf.) However, this would call for us to more consistently return an MCFInfo out of fmatch(). See this comment to no 187.