Open benthestatistician opened 10 years ago
Presently we have (in fullmatch.matrix()
):
total.n <- sum(dim(x))
TOL <- tol * total.n
tol.frac <- (nrow + ncol - 2)/(total.n - 2 * np)
where x is the distance "matrix", np
is the number of suproblems and (nrow
, ncol
) describe a current subproblem. This does seem a poor formula for problems with a small number of available controls: The supposition seems to be that the sum of row and column dimensions is a good stand-in for the total number of matches, either by subproblem or overall; whereas of course it isn't.
If omf
is an omit fraction between 0 and 1, then a better proxy for the number of in the subproblem matches would be the largest number of matches that full matching might make in for that subproblem, namely nrow + ncol*omf -2
. We'd make a corresponding adjustment for cases where the problem is "flipped," i.e. this block of fullmatch.matrix()
is invoked:
{
maxc <- min(1/mnctl, ncol)
minc <- max(1/mxctl, 1/nrow)
omf.calc <- -1 * omf
d <- t(d)
}
Notes about implementation of this change:
fullmatch.matrix
and .fullmatch()
. Better to allocate tolerances to subproblems all in one go, within the body of fullmatch.matrix()
, then pass the answers as arguments of .fullmatch()
/.fullmatch.with.recovery()
. tol
may or may not survive as an argument to fullmatch()
and pairmatch()
. If it doesn't (my pref) it's still likely to be a part of update.fullmatch()
/update.pairmatch()
(#167 ). fullmatch()
and pairmatch()
should take a resolution
, or grid.width
, argument instead of tolerance. If so that can just supplant all of the tol
- related calculations. I suspect this was closed inadvertently, so reopening. (The code listed above as being in need of change doesn't appear to have been updated in any relevant branch.) New work on this to happen in the proj1-nodeprices branch.
For reference, here's how a tolerance
should might be used to allocate an error tolerance budget among subproblems: There are two methods, a compatibility (with versions 0.9 and prior) method and a new method (for version 1.0).
nrow + ncol -2
(ie n -2); under the new method, instead use nrow + ncol*omf -2
as indicated above. Divide error tolerance budget among subproblems in proportion to these numbers.tolerance
have changed. If so, proceed as in the absence of a hint, above. tolerance
as the only aspect of the problem to have changed from the problem solved by the hinttolerance
all the same as in the solved problem provided with the hint:
a. Perform initial allocation of given regret budget, allocating among among subproblems in proportion to their estimated numbers of matches, as above.solve_reg_fm_prob()
; under the new mode, total active arcs m
across the subproblem and set epsilon to tolerance/(1.5 * m)
. (Why 1.5? See these notes: minorization.pdf.)(Side note: markmfredrickson and I have been considering ...active arcs.)
(Edited & updated, Aug 2024.)
The solution proposed just above calls for counting active arcs, perhaps also active nodes by subproblem. @adamrauh is working on code that would do these calcs shortly after calling findSubproblems()
, so it has a list of subproblems & corresponding distances available. Presently these subproblem-specific distances might either be ISMs or DenseMatrix-es. For current purposes, perhaps also others, it would be better only to have to deal with one or the other.
Proposal: have findSubproblems()
always return an ISM for each subproblem, even if was given a DenseMatrix or optmatch.dlist. Do you recall a reason not to do this, @markmfredrickson? Other comments, Mark or @josherrickson ?
I spoke with Mark who talked me out of the proposal in the above comment. Specifically, he persuaded me that the thing to do is to define an internal generic function, "num_allowable_pairings()
" or similar, that counts arcs in ISMs and counts finite entries in DenseMatrices. There are other such helper functions defined in utilities.R. @adamrauh
NB: Just now I rewrote my 2021 "For reference here's how" comment above. The rewrite somewhat obviates the two comments I made between it and this one.
Inside of fmatch()
, the integer ncol*omf
is referenced as "n.mc
". In the nodes table that it builds and returns, -n.mc
is the supply associated with the node labelled "(_Sink_)
" (usually the last node listed). We might grab this quantity from there when we need it, rather than storing it as a separate entry in the subproblem info table (as Adam and I discussed in a ftf.) However, this would call for us to more consistently return an MCFInfo out of fmatch()
. See this comment to no 187.
SubDivStrat
seems to draw no distinction between problems in which we're matching all available controls and problems where we're matching a small number of them. I have the impression that this is gumming up problems where we're selecting a small fraction of controls from a large reservoir: they take a long time, and if you look atsummary
applied to the result you're often told that the error of the optimization routine is much larger in magnitude than the eventual sum of matched distances.The first task is to think through what's going on here to decide if it makes any sense, and how it might be improved. Quite likely the follow up will include (a) extending users the option of setting per-subproblem
tol
parameters, (b) rewriting this part of the code.