Closed benthestatistician closed 3 years ago
Here's some quick testing of factor behavior that shows for data.frame
s at least, they work how we would expect. I can make up some similar for the MFCSolutions.
> a <- as.factor(sample(c("a", "b", "c"), 20, replace = TRUE))
> class(a)
[1] "factor"
> b <- as.factor(sample(c("x", "b", "c"), 30, replace = TRUE))
> table(a)
a
a b c
7 5 8
> table(b)
b
b c x
12 13 5
> both <- c(a,b)
> table(both)
both
1 2 3
19 18 13
> class(both)
[1] "integer"
> dfa <- data.frame(x = a)
> dfb <- data.frame(x = b)
> dfboth <- rbind(dfa, dfb)
> class(dfboth$x)
[1] "factor"
> levels(dfboth$x)
[1] "a" "b" "c" "x"
> table(dfboth$x)
a b c x
7 17 21 5
I've made updates to use factors in 1284f656. One class I didn't touch was SubProbInfo
as here groups seemed to be unique per character (so factors wouldn't matter). Anywhere I saw a "name" column, I also left it as character.
I've added some tests for the c()
methods that check to make sure that if one object has factor levels "a" and "b"; and another has "b" and "c", then the result will have "a", "b", "c".
lgtm! a60bfcf updates the vignette to reflect.
Mark says:
we might want to rethink factors vs. characters. dplyr does not like to join on two factors that don't have identical label sets. It may be that when we create the factors we could make the label sets identical, but this might end up being a big PITA.... something to revisit before we release.
(For now I'll leave the issue closed.)
I had some thoughts that I think will, among other benefits, make dplyr happier about these joins. Perhaps not immediately, as it may be that a corresponding change is necessary to the "edgelist" concept/implementation in order to fully realize the benefit. (Or, better yet, to the ISM spec, with the operations in complementarySlackness.R modified to work with ISMs, not reshaped copies of ISMs.) Anyway I created a branch off of the issue54-hinting branch to record them, namely issue166-factors-in-MCFSolutions. So far w/ a single commit, a106861, that just touches the planning vignette MCFSolutions.pdf. Main changes it calls for are in the MCFSolutions class:
@nodelevels
.@nodelevels
serves as the levels set for each of the several factors variables used in @arcs
and @nodes
to describe nodes. c()
, adjusted accordingly.Cf. #161 .
a552fb4 simplifies the plan summarized in last comment:
@nodelevels
/@nodenames
slot)@nodes$name
, @arcs@matches@upstream
, @arcs@matches@downstream
, @arcs@bookkeeping$start
and @arcs@bookkeeping$end
have precisely the same levels. Once that's done, we can get rid of the suppressWarnings()
calls in R/complementarySlackness.R as follows:
edgelist()
becomes a function of 2 arguments, the second being the level set of @arcs@matches@upstream
etc. Its return value continues to be a data frame with columns i
, j
and dist
, but now i
and j
are factors not character vectors, sharing the level set of @arcs@matches@upstream
et al. edgelist
method might look likesetMethod("edgelist", c(x = "InfinitySparseMatrix", y="character"), function(x) {
row_crosswalk <- match(x@rownames, y)
col_crosswalk <- match(x@colnames, y)
i <- factor(row_crosswalk, levels=1L:length(y), labels=y)
j - factor(col_crosswalk, levels=1L:length(y), labels=y)
return(data.frame(i, j, dist = x@.Data))
})
suppressWarnings()
-es shouldn't be needed. (They muffle a complaint about having to coerce factors to character b/c their level sets don't match.) Perhaps there's some speedup as well, particularly in larger problems. More important, however...flipped
, defineanyflipped <- any(solution@subproblems[, "flipped"])
Then the if (flipped) {<...>} else {<...>}
alternations used lower in these functions for left_join()
ops can be simplified to the alternative currently invoked under the !flipped
eventuality, provided that we replace
eld <- edgelist(distances)
with
eld <- edgelist(distances)
if (anyflipped)
eld <- rbind(eld, data.frame(i=eld$j, j=eld$i, dist=eld$dist))
In this way we'd take care of the main outstanding piece of #164, making evaluate_* functions work just as well with a (distance, MCFSolutions) pair combining multiple subproblems, some flipped & some not.
Minor amendment to the above: in next iteration of the plan (MCFSolutions.pdf at b1b3efc on issue166-factors-in-MCFSolutions, yet to be pushed up), default location for common levels-set of @arcs@matches@upstream
, ..., @arcs@bookkeeping$end
-- the "node labels" -- is to be the row names of the corresponding @nodes
table. In this way @nodes$name
can carry provided names, even when not distinct, while when c()
-ing MCFSolutions objects, de-duplication of node labels can be passed off to base::rbind.data.frame()
.
All implemented now in i166-factors-in-MCFsolutions branch. I expect that I'll merge that branch with the issue54-hinting branch next, but first I'd like to resolve one lingering minor incompatibility (the merge at c62e982 left out one test that seemed to be failing b/c the test itself needed updates).
More changes than I had expected were necessary to make this work; ideally I'll turn to it later with an eye to testing what was added. In particular, I have yet to add tests of matching problems combining flipped and non-flipped subproblems: are the node prices being handled appropriately in those cases?
Test coverage as of 61526e8:
> covr::package_coverage()
optmatch Coverage: 72.54% R/abs.optmatch.dlist.R: 0.00%
R/boxplotMethods.R: 0.00% R/deprecated.R: 0.00%
R/mdist.R: 0.00% R/min.controls.cap.R: 0.00%
R/Ops.optmatch.dlist.R: 0.00% R/print.optmatch.dlist.R: 0.00%
R/relaxinfo.R: 0.00% src/cuseful.cc: 0.00%
R/zzz.R: 25.00% R/makedist.R: 45.60%
R/summary.ism.R: 48.70% R/utilities.R: 56.36%
R/summary.optmatch.R: 56.84% src/relax4s.f: 71.51%
R/scores.R: 72.73% R/caliper.R: 80.00%
R/fill.NAs.R: 80.88% R/complementarySlackness.R: 83.45%
R/stratumStructure.R: 84.13% R/fmatch.R: 84.94%
src/map.cc: 85.71% R/zzzDistanceSpecification.R: 86.25%
R/MCFSolutions.R: 86.82% R/max.controls.cap.R: 87.76%
R/fullmatch.R: 88.89% R/pairmatch.R: 89.69%
R/print.optmatch.R: 90.91% R/InfinitySparseMatrix.R: 91.58%
src/r_smahal.cc: 91.67% R/feasible.R: 91.76%
R/match_on.R: 92.57% R/edgelist.R: 94.29%
R/matched.distances.R: 94.44% R/solve_reg_fm_prob.R: 95.62%
src/smahal.cc: 95.81% R/distUnion.R: 96.77%
R/exactMatch.R: 97.01% R/optmatchS3.R: 97.06%
R/DenseMatrix.R: 100.00% R/matched.R: 100.00%
R/matchfailed.R: 100.00% src/distances.cc: 100.00%
src/ism.cc: 100.00% src/optmatch_init.c: 100.00%
src/subsetInfSparseMatrix.cc: 100.00%
3faee0c is a very minor increment on testing, introducing a simple full matching problem. Next steps: flip the problem, verify CS calculations are as they should be; join the flipped version w/ an ordinary subproblem & see that the CS calculations continue to make sense.
As of 4843963 and e840009, this may well be ready to go: I've done the things noted in this issue as having needed to be done.
Another look-through wouldn't hurt, even if by me, but I think it's about ready to merge into master. So that it's definitively no longer a blocker for other issues in the node prices Project (e.g. #164, #167 or #178; also relevant to #161).
I ran /vignettes/performance/matching.R against this branch and against the master branch. master looks quite a bit better at the moment. Master: performance-011.pdf The issue166 branch, at 4843963: performance-011.pdf
The first performance hit comes from a call to cbind()
within fmatch()
. Looking further, the performance deficit is traceable to operations having to do with inspecting row names for duplication.
distance
, the EdgeList
defined within fmatch()
. data.frame(...,row.names=seq_len(narcs))
instead of cbind()
. The majority of the deficit comes from the evaluate_*
functions.
summary.optmatch()
. evaluate_lagrangian()
and evaluate_dual()
. 24424a2 addresses the row names to-do's in the last comment. To-do's relating to evaluate_*
remain.
NB: If we move the evaluate_*
functions out of the fullmatch()
call stack and into summary.optmatch()
, somewhere in the fullmatch()
call stack we'll need to generate subproblem-specific regret estimates of some other type if we want to address #167. (Because without the evaluate_*
calculations we won't have access to the duality-based regret estimates when updating an optmatch object in light of a decreased tolerance.) In the current code base there are "crude" estimates of "exceedance" being generated, and it would appear that they could fill the role I have in mind w/r/t/ #167. But I've lost faith in those -- we'd need to replace them with something we can trust. Comments to #85 are a likely place for further later notes about this.
To summarize: quite likely the issue I'm flagging would be addressed in the process of addressing #85. If so, then moving the evaluate_*
material out of the fullmatch call stack and into summary routines would not create a blocker for #167. If not, then moving the evaluate_*
calculations into summary routines may mean giving up on #167, making hinting less effective. Perhaps #85 should have higher priority than this issue for the moment, at least pending sufficient progress there to decide whether #167 can be addressed without a duality-based regret bound.
The outstanding matters in this comment thread falling better under other issues (#164, #173, #85), I'm closing this one. Also removing from GH the branch associated with this issue; the new branch proj1-nodeprices being derived from it.
As of this writing ( [issue54-hinting fff2597] ), columns titled
groups
,upstream
,downstream
,start
andend
are to be character. But it turns out we'd have greater compatibility with setup inside offmatch()
if these were factors instead. That's probably much better memory-wise as well. (Cf. "MCFSolutions" files in R/, tests/testthat/ and vignettes/, on issue54-hinting branch.)It could become trickier to combine the objects: I expect the "c" methods will need updating. I'd also be worried about inadvertent collapsing of levels; there should be tests of this.
Something to ponder before starting: how to handle situations in which the same unit name (level of
upstream
,downstream
,start
orend
) appears in distinct subproblems (groups
)? The main reason I had spec'd these as character was to postpone resolving such concerns, until well past the moment of combining subproblems.