factor not character cols in MCFSolutions objects

benthestatistician commented 5 years ago

As of this writing ( [issue54-hinting fff2597] ), columns titled groups, upstream, downstream, start and end are to be character. But it turns out we'd have greater compatibility with setup inside of fmatch() if these were factors instead. That's probably much better memory-wise as well. (Cf. "MCFSolutions" files in R/, tests/testthat/ and vignettes/, on issue54-hinting branch.)

It could become trickier to combine the objects: I expect the "c" methods will need updating. I'd also be worried about inadvertent collapsing of levels; there should be tests of this.

Something to ponder before starting: how to handle situations in which the same unit name (level of upstream, downstream, start or end) appears in distinct subproblems (groups)? The main reason I had spec'd these as character was to postpone resolving such concerns, until well past the moment of combining subproblems.

markmfredrickson commented 5 years ago

Here's some quick testing of factor behavior showing that, for data.frames at least, factors combine the way we would expect. I can put together something similar for MCFSolutions.

> a <- as.factor(sample(c("a", "b", "c"), 20, replace = TRUE))
> class(a)
[1] "factor"
> b <- as.factor(sample(c("x", "b", "c"), 30, replace = TRUE))
> table(a)
a
a b c 
7 5 8 
> table(b)
b
 b  c  x 
12 13  5 
> both <- c(a,b)
> table(both)
both
 1  2  3 
19 18 13 
> class(both)
[1] "integer"
> dfa <- data.frame(x = a)
> dfb <- data.frame(x = b)
> dfboth <- rbind(dfa, dfb)
> class(dfboth$x)
[1] "factor"
> levels(dfboth$x)
[1] "a" "b" "c" "x"
> table(dfboth$x)

 a  b  c  x 
 7 17 21  5

markmfredrickson commented 5 years ago

I've made updates to use factors in 1284f656. One class I didn't touch was SubProbInfo, since there each groups value seemed to appear only once (so factors wouldn't matter). Anywhere I saw a "name" column, I also left it as character.

I've added some tests for the c() methods that check that if one object has factor levels "a" and "b" and another has "b" and "c", then the result has levels "a", "b" and "c".
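A minimal sketch of that kind of check, in base R only. Here combine_factors() is a hypothetical stand-in for the package's c() methods, not the code from 1284f656; it is spelled out by hand because, as the transcript above shows, c() on factors can return bare integer codes.

```r
# Hypothetical helper: combine two factors while keeping the union of their
# levels (a stand-in for what a c() method on MCFSolutions pieces must do).
combine_factors <- function(f, g) {
  lev <- union(levels(f), levels(g))
  factor(c(as.character(f), as.character(g)), levels = lev)
}

f <- factor(c("a", "b", "a"))
g <- factor(c("b", "c"))
fg <- combine_factors(f, g)

# No levels were collapsed, and the values survive the round trip.
stopifnot(identical(levels(fg), c("a", "b", "c")))
stopifnot(identical(as.character(fg), c("a", "b", "a", "b", "c")))
```

Going through as.character() sidesteps the integer codes entirely, at the cost of a temporary character copy.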

benthestatistician commented 5 years ago

lgtm! a60bfcf updates the vignette to reflect.

benthestatistician commented 5 years ago

Mark says:

we might want to rethink factors vs. characters. dplyr does not like to join on two factors that don't have identical label sets. It may be that when we create the factors we could make the label sets identical, but this might end up being a big PITA.... something to revisit before we release.
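One way to make the label sets identical, sketched in base R under assumptions: the two toy tables and their price and flow columns are invented for illustration, and merge() stands in for the dplyr join so the example has no package dependencies.

```r
# Two toy tables keyed by a factor whose level sets differ.
nodes <- data.frame(name = factor(c("a", "b", "c")), price = c(0, 1.5, 2))
arcs  <- data.frame(upstream = factor(c("b", "c", "x")), flow = c(1L, 0L, 1L))

# Give both key columns one shared level set before joining.
shared <- union(levels(nodes$name), levels(arcs$upstream))
nodes$name    <- factor(nodes$name, levels = shared)
arcs$upstream <- factor(arcs$upstream, levels = shared)
stopifnot(identical(levels(nodes$name), levels(arcs$upstream)))

# Base-R analogue of a left join on the harmonized key.
joined <- merge(arcs, nodes, by.x = "upstream", by.y = "name", all.x = TRUE)
```

Doing this once, at the point where the factors are created, would avoid repeating the harmonization before every join.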

(For now I'll leave the issue closed.)

benthestatistician commented 5 years ago

I had some thoughts that I think will, among other benefits, make dplyr happier about these joins. Perhaps not immediately, as it may be that a corresponding change is necessary to the "edgelist" concept/implementation in order to fully realize the benefit. (Or, better yet, to the ISM spec, with the operations in complementarySlackness.R modified to work with ISMs, not reshaped copies of ISMs.) Anyway I created a branch off of the issue54-hinting branch to record them, namely issue166-factors-in-MCFSolutions. So far w/ a single commit, a106861, that just touches the planning vignette MCFSolutions.pdf. Main changes it calls for are in the MCFSolutions class:

MatchablesInfo objects go away
Instead, each MCFSolutions carries around an additional character vector slot, @nodelevels.
This @nodelevels serves as the level set for each of the several factor variables used in @arcs and @nodes to describe nodes.
MCFSolutions methods, in particular c(), adjusted accordingly.

Cf. #161 .

benthestatistician commented 5 years ago

a552fb4 simplifies the plan summarized in last comment:

MatchablesInfo objects go away
not to be replaced by anything (no @nodelevels/@nodenames slot)
Instead, MCFSolutions validity checks enforce that each of @nodes$name, @arcs@matches@upstream, @arcs@matches@downstream, @arcs@bookkeeping$start and @arcs@bookkeeping$end have precisely the same levels.
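A rough sketch of what such a check could look like. The helper same_levels() is hypothetical, not the package's validity method; it follows the S4 validity convention of returning TRUE on success and a character message on failure, so it could be called from inside a validity function.

```r
# Hypothetical check: TRUE if all supplied factors share one level set
# (same labels, same order), otherwise a message describing the failure.
same_levels <- function(...) {
  fs <- list(...)
  if (!all(vapply(fs, is.factor, logical(1))))
    return("all node-label columns must be factors")
  ref <- levels(fs[[1]])
  ok <- vapply(fs, function(f) identical(levels(f), ref), logical(1))
  if (all(ok)) TRUE else "node-label columns must share identical levels"
}

lev <- c("t1", "t2", "c1")
upstream   <- factor(c("t1", "t2"), levels = lev)
downstream <- factor(c("c1", "c1"), levels = lev)
stopifnot(isTRUE(same_levels(upstream, downstream)))
stopifnot(is.character(same_levels(upstream, factor("c1"))))
```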

Once that's done, we can get rid of the suppressWarnings() calls in R/complementarySlackness.R as follows:

edgelist() becomes a function of 2 arguments, the second being the level set of @arcs@matches@upstream etc. Its return value continues to be a data frame with columns i, j and dist, but now i and j are factors not character vectors, sharing the level set of @arcs@matches@upstream et al.
To achieve this, the primary edgelist method might look like:

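setMethod("edgelist", c(x = "InfinitySparseMatrix", y = "character"), function(x, y) {
    # positions of the row and column names within the shared level set y
    row_crosswalk <- match(x@rownames, y)
    col_crosswalk <- match(x@colnames, y)
    # one entry per stored distance, indexed via the ISM's @rows/@cols slots
    i <- factor(row_crosswalk[x@rows], levels = seq_along(y), labels = y)
    j <- factor(col_crosswalk[x@cols], levels = seq_along(y), labels = y)
    return(data.frame(i, j, dist = x@.Data))
})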

Now the suppressWarnings() calls shouldn't be needed. (They muffle a complaint about having to coerce factors to character b/c their level sets don't match.) Perhaps there's some speedup as well, particularly in larger problems. More important, however...
At this point the R/complementarySlackness.R functions could handle setups with flipped subproblems via mechanisms such as the following. Instead of a flag flipped, define

anyflipped  <- any(solution@subproblems[, "flipped"])

Then the if (flipped) {<...>} else {<...>} alternations used lower in these functions for left_join() ops can be simplified to the alternative currently invoked under the !flipped eventuality, provided that we replace

eld <- edgelist(distances)

with

eld <- edgelist(distances)
if (anyflipped)
    eld <- rbind(eld, data.frame(i = eld$j, j = eld$i, dist = eld$dist))

In this way we'd take care of the main outstanding piece of #164, making evaluate_* functions work just as well with a (distance, MCFSolutions) pair combining multiple subproblems, some flipped & some not.
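To make the mechanics concrete, here is a self-contained toy version in base R. The function toy_edgelist() and its arguments are hypothetical: it mimics the proposed two-argument edgelist() on plain vectors rather than on an InfinitySparseMatrix, and the node labels and distances are made up.

```r
# Toy stand-in for edgelist(): sparse entries given as integer row/column
# positions plus names, mapped onto one shared level set `nodelabels`.
toy_edgelist <- function(rows, cols, rownames, colnames, dist, nodelabels) {
  i <- factor(rownames[rows], levels = nodelabels)
  j <- factor(colnames[cols], levels = nodelabels)
  data.frame(i, j, dist)
}

nodelabels <- c("t1", "t2", "c1", "c2")
eld <- toy_edgelist(rows = c(1L, 1L, 2L), cols = c(1L, 2L, 2L),
                    rownames = c("t1", "t2"), colnames = c("c1", "c2"),
                    dist = c(0.5, 1.2, 0.3), nodelabels = nodelabels)

# If any subproblem is flipped, add each arc in the reverse direction too.
anyflipped <- TRUE
if (anyflipped)
  eld <- rbind(eld, data.frame(i = eld$j, j = eld$i, dist = eld$dist))

# i and j still share the level set, so joins on them need no coercion.
stopifnot(identical(levels(eld$i), nodelabels),
          identical(levels(eld$j), nodelabels),
          nrow(eld) == 6L)
```

Because i and j carry the same levels, the reversed copy can be appended with rbind() without dropping or reordering any level.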

benthestatistician commented 4 years ago

Minor amendment to the above: in next iteration of the plan (MCFSolutions.pdf at b1b3efc on issue166-factors-in-MCFSolutions, yet to be pushed up), default location for common levels-set of @arcs@matches@upstream, ..., @arcs@bookkeeping$end -- the "node labels" -- is to be the row names of the corresponding @nodes table. In this way @nodes$name can carry provided names, even when not distinct, while when c()-ing MCFSolutions objects, de-duplication of node labels can be passed off to base::rbind.data.frame().
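The row-name behavior relied on here can be seen with two toy tables. This is only a sketch: the tables and their price column are invented, and real @nodes tables carry other columns.

```r
# Two toy @nodes-like tables; row names act as node labels, while the
# `name` column keeps the user-supplied names (which may repeat).
nodes1 <- data.frame(name = c("a", "b"), price = c(0, 1),
                     row.names = c("a", "b"), stringsAsFactors = FALSE)
nodes2 <- data.frame(name = c("b", "c"), price = c(2, 3),
                     row.names = c("b", "c"), stringsAsFactors = FALSE)

combined <- rbind(nodes1, nodes2)

# rbind() made the row names unique; the name column is untouched.
stopifnot(anyDuplicated(rownames(combined)) == 0L)
stopifnot(identical(combined$name, c("a", "b", "b", "c")))
```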

benthestatistician commented 4 years ago

All implemented now in i166-factors-in-MCFsolutions branch. I expect that I'll merge that branch with the issue54-hinting branch next, but first I'd like to resolve one lingering minor incompatibility (the merge at c62e982 left out one test that seemed to be failing b/c the test itself needed updates).
More changes than I had expected were necessary to make this work; ideally I'll turn to it later with an eye to testing what was added. In particular, I have yet to add tests of matching problems combining flipped and non-flipped subproblems: are the node prices being handled appropriately in those cases?

benthestatistician commented 4 years ago

Test coverage as of 61526e8:

>  covr::package_coverage()
optmatch Coverage: 72.54%                   R/abs.optmatch.dlist.R: 0.00%
R/boxplotMethods.R: 0.00%                   R/deprecated.R: 0.00%
R/mdist.R: 0.00%                            R/min.controls.cap.R: 0.00%
R/Ops.optmatch.dlist.R: 0.00%               R/print.optmatch.dlist.R: 0.00%
R/relaxinfo.R: 0.00%                        src/cuseful.cc: 0.00%
R/zzz.R: 25.00%                             R/makedist.R: 45.60%
R/summary.ism.R: 48.70%                     R/utilities.R: 56.36%
R/summary.optmatch.R: 56.84%                src/relax4s.f: 71.51%
R/scores.R: 72.73%                          R/caliper.R: 80.00%
R/fill.NAs.R: 80.88%                        R/complementarySlackness.R: 83.45%
R/stratumStructure.R: 84.13%                R/fmatch.R: 84.94%
src/map.cc: 85.71%                          R/zzzDistanceSpecification.R: 86.25%
R/MCFSolutions.R: 86.82%                    R/max.controls.cap.R: 87.76%
R/fullmatch.R: 88.89%                       R/pairmatch.R: 89.69%
R/print.optmatch.R: 90.91%                  R/InfinitySparseMatrix.R: 91.58%
src/r_smahal.cc: 91.67%                     R/feasible.R: 91.76%
R/match_on.R: 92.57%                        R/edgelist.R: 94.29%
R/matched.distances.R: 94.44%               R/solve_reg_fm_prob.R: 95.62%
src/smahal.cc: 95.81%                       R/distUnion.R: 96.77%
R/exactMatch.R: 97.01%                      R/optmatchS3.R: 97.06%
R/DenseMatrix.R: 100.00%                    R/matched.R: 100.00%
R/matchfailed.R: 100.00%                    src/distances.cc: 100.00%
src/ism.cc: 100.00%                         src/optmatch_init.c: 100.00%
src/subsetInfSparseMatrix.cc: 100.00%

benthestatistician commented 4 years ago

3faee0c is a very minor increment on testing, introducing a simple full matching problem. Next steps: flip the problem, verify CS calculations are as they should be; join the flipped version w/ an ordinary subproblem & see that the CS calculations continue to make sense.