markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

factor not character cols in MCFSolutions objects #166

Closed benthestatistician closed 3 years ago

benthestatistician commented 5 years ago

As of this writing ( [issue54-hinting fff2597] ), columns titled groups, upstream, downstream, start and end are to be character. But it turns out we'd have greater compatibility with setup inside of fmatch() if these were factors instead. That's probably much better memory-wise as well. (Cf. "MCFSolutions" files in R/, tests/testthat/ and vignettes/, on issue54-hinting branch.)

It could become trickier to combine the objects: I expect the "c" methods will need updating. I'd also be worried about inadvertent collapsing of levels; there should be tests of this.

Something to ponder before starting: how to handle situations in which the same unit name (level of upstream, downstream, start or end) appears in distinct subproblems (groups)? The main reason I had spec'd these as character was to postpone resolving such concerns, until well past the moment of combining subproblems.

markmfredrickson commented 5 years ago

Here's some quick testing of factor behavior that shows for data.frames at least, they work how we would expect. I can make up some similar for the MFCSolutions.

> a <- as.factor(sample(c("a", "b", "c"), 20, replace = TRUE))
> class(a)
[1] "factor"
> b <- as.factor(sample(c("x", "b", "c"), 30, replace = TRUE))
> table(a)
a
a b c 
7 5 8 
> table(b)
b
 b  c  x 
12 13  5 
> both <- c(a,b)
> table(both)
both
 1  2  3 
19 18 13 
> class(both)
[1] "integer"
> dfa <- data.frame(x = a)
> dfb <- data.frame(x = b)
> dfboth <- rbind(dfa, dfb)
> class(dfboth$x)
[1] "factor"
> levels(dfboth$x)
[1] "a" "b" "c" "x"
> table(dfboth$x)

 a  b  c  x 
 7 17 21  5 
markmfredrickson commented 5 years ago

I've made updates to use factors in 1284f656. One class I didn't touch was SubProbInfo as here groups seemed to be unique per character (so factors wouldn't matter). Anywhere I saw a "name" column, I also left it as character.

I've added some tests for the c() methods that check to make sure that if one object has factor levels "a" and "b"; and another has "b" and "c", then the result will have "a", "b", "c".

benthestatistician commented 5 years ago

lgtm! a60bfcf updates the vignette to reflect.

benthestatistician commented 5 years ago

Mark says:

we might want to rethink factors vs. characters. dplyr does not like to join on two factors that don't have identical label sets. It may be that when we create the factors we could make the label sets identical, but this might end up being a big PITA.... something to revisit before we release.

(For now I'll leave the issue closed.)

benthestatistician commented 5 years ago

I had some thoughts that I think will, among other benefits, make dplyr happier about these joins. Perhaps not immediately, as it may be that a corresponding change is necessary to the "edgelist" concept/implementation in order to fully realize the benefit. (Or, better yet, to the ISM spec, with the operations in complementarySlackness.R modified to work with ISMs, not reshaped copies of ISMs.) Anyway I created a branch off of the issue54-hinting branch to record them, namely issue166-factors-in-MCFSolutions. So far w/ a single commit, a106861, that just touches the planning vignette MCFSolutions.pdf. Main changes it calls for are in the MCFSolutions class:

Cf. #161 .

benthestatistician commented 5 years ago

a552fb4 simplifies the plan summarized in last comment:

Once that's done, we can get rid of the suppressWarnings() calls in R/complementarySlackness.R as follows:

setMethod("edgelist", c(x = "InfinitySparseMatrix", y="character"), function(x) {
    row_crosswalk <- match(x@rownames, y)
    col_crosswalk <- match(x@colnames, y)
    i <- factor(row_crosswalk, levels=1L:length(y), labels=y)
    j - factor(col_crosswalk, levels=1L:length(y), labels=y)
    return(data.frame(i,  j, dist = x@.Data))
})
anyflipped  <- any(solution@subproblems[, "flipped"])

Then the if (flipped) {<...>} else {<...>} alternations used lower in these functions for left_join() ops can be simplified to the alternative currently invoked under the !flipped eventuality, provided that we replace

 eld <- edgelist(distances)

with

 eld <- edgelist(distances)
if (anyflipped)
    eld <- rbind(eld, data.frame(i=eld$j, j=eld$i, dist=eld$dist))

In this way we'd take care of the main outstanding piece of #164, making evaluate_* functions work just as well with a (distance, MCFSolutions) pair combining multiple subproblems, some flipped & some not.

benthestatistician commented 4 years ago

Minor amendment to the above: in next iteration of the plan (MCFSolutions.pdf at b1b3efc on issue166-factors-in-MCFSolutions, yet to be pushed up), default location for common levels-set of @arcs@matches@upstream, ..., @arcs@bookkeeping$end -- the "node labels" -- is to be the row names of the corresponding @nodes table. In this way @nodes$name can carry provided names, even when not distinct, while when c()-ing MCFSolutions objects, de-duplication of node labels can be passed off to base::rbind.data.frame().

benthestatistician commented 4 years ago

All implemented now in i166-factors-in-MCFsolutions branch. I expect that I'll merge that branch with the issue54-hinting branch next, but first I'd like to resolve one lingering minor incompatibility (the merge at c62e982 left out one test that seemed to be failing b/c the test itself needed updates).
More changes than I had expected were necessary to make this work; ideally I'll turn to it later with an eye to testing what was added. In particular, I have yet to add tests of matching problems combining flipped and non-flipped subproblems: are the node prices being handled appropriately in those cases?

benthestatistician commented 4 years ago

Test coverage as of 61526e8:

>  covr::package_coverage()
optmatch Coverage: 72.54%                   R/abs.optmatch.dlist.R: 0.00%
R/boxplotMethods.R: 0.00%                   R/deprecated.R: 0.00%
R/mdist.R: 0.00%                            R/min.controls.cap.R: 0.00%
R/Ops.optmatch.dlist.R: 0.00%               R/print.optmatch.dlist.R: 0.00%
R/relaxinfo.R: 0.00%                        src/cuseful.cc: 0.00%
R/zzz.R: 25.00%                             R/makedist.R: 45.60%
R/summary.ism.R: 48.70%                     R/utilities.R: 56.36%
R/summary.optmatch.R: 56.84%                src/relax4s.f: 71.51%
R/scores.R: 72.73%                          R/caliper.R: 80.00%
R/fill.NAs.R: 80.88%                        R/complementarySlackness.R: 83.45%
R/stratumStructure.R: 84.13%                R/fmatch.R: 84.94%
src/map.cc: 85.71%                          R/zzzDistanceSpecification.R: 86.25%
R/MCFSolutions.R: 86.82%                    R/max.controls.cap.R: 87.76%
R/fullmatch.R: 88.89%                       R/pairmatch.R: 89.69%
R/print.optmatch.R: 90.91%                  R/InfinitySparseMatrix.R: 91.58%
src/r_smahal.cc: 91.67%                     R/feasible.R: 91.76%
R/match_on.R: 92.57%                        R/edgelist.R: 94.29%
R/matched.distances.R: 94.44%               R/solve_reg_fm_prob.R: 95.62%
src/smahal.cc: 95.81%                       R/distUnion.R: 96.77%
R/exactMatch.R: 97.01%                      R/optmatchS3.R: 97.06%
R/DenseMatrix.R: 100.00%                    R/matched.R: 100.00%
R/matchfailed.R: 100.00%                    src/distances.cc: 100.00%
src/ism.cc: 100.00%                         src/optmatch_init.c: 100.00%
src/subsetInfSparseMatrix.cc: 100.00%
benthestatistician commented 4 years ago

3faee0c is a very minor increment on testing, introducing a simple full matching problem. Next steps: flip the problem, verify CS calculations are as they should be; join the flipped version w/ an ordinary subproblem & see that the CS calculations continue to make sense.

benthestatistician commented 3 years ago

As of 4843963 and e840009, this may well be ready to go: I've done the things noted in this issue as having needed to be done.

Another look-through wouldn't hurt, even if by me, but I think it's about ready to merge into master. So that it's definitively no longer a blocker for other issues in the node prices Project (e.g. #164, #167 or #178; also relevant to #161).

benthestatistician commented 3 years ago

I ran /vignettes/performance/matching.R against this branch and against the master branch. master looks quite a bit better at the moment. Master: performance-011.pdf The issue166 branch, at 4843963: performance-011.pdf

The first performance hit comes from a call to cbind() within fmatch(). Looking further, the performance deficit is traceable to operations having to do with inspecting row names for duplication.

The majority of the deficit comes from the evaluate_* functions.

benthestatistician commented 3 years ago

24424a2 addresses the row names to-do's in the last comment. To-do's relating to evaluate_* remain.

benthestatistician commented 3 years ago

NB: If we move the evaluate_* functions out of the fullmatch() call stack and into summary.optmatch(), somewhere in the fullmatch() call stack we'll need to generate subproblem-specific regret estimates of some other type if we want to address #167. (Because without the evaluate_* calculations we won't have access to the duality-based regret estimates when updating an optmatch object in light of a decreased tolerance.) In the current code base there are "crude" estimates of "exceedance" being generated, and it would appear that they could fill the role I have in mind w/r/t/ #167. But I've lost faith in those -- we'd need to replace them with something we can trust. Comments to #85 are a likely place for further later notes about this.

To summarize: quite likely the issue I'm flagging would be addressed in the process of addressing #85. If so, then moving the evaluate_* material out of the fullmatch call stack and into summary routines would not create a blocker for #167. If not, then moving the evaluate_* calculations into summary routines may mean giving up on #167, making hinting less effective. Perhaps #85 should have higher priority than this issue for the moment, at least pending sufficient progress there to decide whether #167 can be addressed without a duality-based regret bound.

benthestatistician commented 3 years ago

The outstanding matters in this comment thread falling better under other issues (#164, #173, #85), I'm closing this one. Also removing from GH the branch associated with this issue; the new branch proj1-nodeprices being derived from it.