On the subject of dropped nodes in the node.data object, can we just NA the relevant data for these? Sounds like that might make things a little easier. There is some code in the full- and pairmatch functions to get and keep a data object. Previously, this was mostly about making sure that the final optmatch object was in the same order as the original data (as otherwise the order of the results was not guaranteed). Would it be useful to push this data argument along in order to initialize the node.data object?
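A minimal sketch of what I mean by NA-ing, assuming node.data ends up as a data frame with (hypothetical) name, price, contrast.group, and subproblem columns:

```r
# Hypothetical helper: pad node.data with NA rows for units that were
# dropped from the match, so every unit in the original data has a row.
add_dropped_nodes <- function(node.data, all.units) {
  dropped <- setdiff(all.units, node.data$name)
  if (length(dropped) == 0) return(node.data)
  na.rows <- data.frame(name = dropped,
                        price = NA_real_,
                        contrast.group = NA,
                        subproblem = NA_integer_)
  rbind(node.data, na.rows)
}
```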
I wouldn't worry too much about getter and setter stuff just yet (unless Ben has said otherwise). That's generally good programming, but most of this functionality will be for internal use for now.
If the match function can also be updated to preserve orderings (example in getZfromMatch here), that would be very nice.

The test

```r
s4 <- stratumStructure(fm.psd.cal <- fullmatch(psd/(psd<.25), data=nuclearplants))
expect_equal(names(s4), c("1:0", "3:1", "2:1", "1:2", "1:7", "0:1"))
expect_equal(as.vector(s4), c(3,1,1,1,1,11))
expect_true(all.equal(attr(s4, "comparable.num.matched.pairs"), 5.9166666))
```

from test.stratumStructure.old.R, along with the first test case in test.summary.optmatch.R, should be useful in demonstrating the problem. If you step through these, you should see my hack-job solutions in stratumStructure and summary get called into action -- specifically, the changes from this commit, along with the changes from around line 212 here.
I'm thinking of picking this up, starting with what I have listed as tasks 2 and 4 -- ironing out some bugs and finishing the new warm start functionality. It shouldn't be too hard to update those down the line to make their implementations align with any further decisions about the node.data questions discussed above. Any thoughts on better places to start, or does that seem reasonable?
@adamrauh, additions of the type you mention would be welcome. I agree with your takeaway from the node.data discussion: there's a good chance we'll stick with precisely the approach you had been taking; in the unlikely event that we don't, we'll be further along toward whatever the modified goal is if we have a more complete implementation of the current concept.
Regarding failing tests, if you've got questions you can't resolve yourself about whether this or that test will continue to be necessary, don't hesitate to ask.
I was looking at testing a form of the improvements related to warm starts yesterday, which led me to some scattered conclusions.
The bad news is that, in the form I have it now, the new code is significantly slower than the master branch version. I haven't looked into this too much yet, as this is something I just stumbled into last night. Based on some brief investigation/profiling, it looks like there is one particular chunk of code causing a pretty large bottleneck, and I don't think it should be too awful to come up with a workaround. This is my next order of business, I think.
That said, some other code that is intended to generate warm starts for new nodes introduced into a problem seems to be working as hoped. Unfortunately, any impact of this functionality is lost at this point because of the other code bottlenecks introduced with the refactoring. I think I ought to spend a bit of time looking to see if there are obvious pieces of newly introduced code that can be optimized, to at least get things in the ballpark of the current master version. However, I would also like to get some indication as to whether or not the new warm start improvements actually appear to do anything, since it's been a question that I've been dealing with for a while (as @benthestatistician can attest); more practically, it would also be helpful in prioritizing subsequent tasks. I can see some ways to carry over specific pieces of the new framework into the old codebase to come up with some tests, but I'm on the fence as to how worthwhile that would be.
> some other code that is intended to generate warm starts for new nodes introduced into a problem seems to be working as hoped. Unfortunately, any impact of this functionality is lost at this point because of the other code bottlenecks introduced with the refactoring.
At a number of points along the way we've had the experience that the first version of a change intended to improve performance made it worse. In all cases I can think of, it ultimately made things better....
> I think I ought to spend a bit of time looking to see if there are obvious pieces of newly introduced code that can be optimized, to at least get things in the ballpark of the current master version.
One way bottlenecks get exposed, in my experience, is to step through a relatively hard problem and see where it lags. Sometimes this happens at unexpected points, and once you identify them you can readily fix them. In other cases, however, a more formal profiling process is needed. I think we're going to want to do this en route to processing this merge. Do you have test cases demonstrating the slowdowns?
> However, I would also like to get some indication as to whether or not the new warm start improvements actually appear to do anything, since it's been a question that I've been dealing with for a while (as @benthestatistician can attest).... I'm on the fence as to how worthwhile that would be.
Au contraire! I recall clear evidence of improvement. Not always as dramatic an improvement as we were hoping for, but that's consistent with the presence of as-yet undiscovered bottlenecks. Identifying those test cases is a next step.
I've been using our old friend, the SAT coaching data, to test. The test case isn't particularly elaborate:
```r
w <- match_on(Coach ~ psatv + psatm + presatv + presatm + I23 + I24 + I25, data = satcoach)
res <- fullmatch(w, data = satcoach)
```
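For what it's worth, the profiling mentioned below can be reproduced with something along these lines (assuming the profvis package is installed):

```r
library(profvis)

# Profile the full matching call to see where time is spent; the
# interactive output highlights hot spots such as helper functions.
profvis({
  res <- fullmatch(w, data = satcoach)
})
```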
Based on the profiling, it looks like one major problem is in the .determine_group helper function in nodecode.R -- mostly because of all the calls to all() and %in%. I think figuring out a way to take these out should yield some notable improvement just with that change. This looked like a pretty serious bottleneck in the profvis output, and doesn't really have a counterpart in the master branch. That's definitely my immediate next move.
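To illustrate the kind of change I have in mind (a sketch only -- .determine_group's actual signature isn't shown here, so the names and arguments below are assumptions): per-node membership tests inside a loop can usually be collapsed into a single vectorized lookup.

```r
# Slow pattern: one membership test per node, each call scanning the
# treatment vector again.
determine_group_slow <- function(node.names, treated.names) {
  vapply(node.names,
         function(nm) nm %in% treated.names,
         logical(1))
}

# Faster pattern: one vectorized lookup over all nodes at once.
determine_group_fast <- function(node.names, treated.names) {
  !is.na(match(node.names, treated.names))
}
```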
My initial test didn't suggest consistent improvement -- but that was one test, with a small data set and largely untested code in certain parts, so I'm not particularly concerned yet.
Rather than merging this into the older "hinting" branch, I've created a new branch, issue54-hinting, and merged it over there. (If no objections emerge I may well remove the hinting branch.) We can refer back to this pull request for the comments and outstanding to-dos, but I'm going to go ahead and close it out for now.
[ ] Write tests for solveMatches. solveMatches is intended to replace (and hopefully improve on) subDivStrat. It reorganizes that code to make a bit more sense, and also integrates some "warm start" functionality and works with the new S4 data structures. Tests need to be written to make sure things are working according to expectation -- I don't expect too many problems here, because problems would almost certainly show up in some of the other tests related to fullmatch/pairmatch.
[ ] Fix, or otherwise adjust, tests for remaining bugs. There are a handful of tests that still fail. Some of these are just subDivStrat tests, which are going to be deprecated. I think there are two remaining bugs with summary.optmatch, which I can look into; I suspect these are issues with the tests themselves (arising from the S4 changes) rather than with code inside the function.
[ ] Establish expectations for unused nodes (that is, units from a data set that do not end up in matches) in the node.data structure. Currently that structure leaves them out completely -- there is no corresponding row for units not used in a match. This creates some problems in stratumStructure and in summary.optmatch. I currently have something of a hacked-together solution that gets around the problem, but it certainly shouldn't act as a permanent solution: I create a second data frame called "dropped.nodes" and add it as an attribute to the node.data slot, where it gets accessed at the appropriate times. Long term, I would propose incorporating the missing/dropped nodes into the node.data slot itself. This will break the current warm start functionality (and code in a few other places), because it changes the expected dimensions of node.data. However, I think some functions could be written to access various pieces of node-related information that would solve this -- for instance, a get.contrast_group function that would pull this information in the expected order within stratumStructure and summary.optmatch, or a get.warm.nodes function that extracts information about nodes from a previously existing problem wherever possible. This second function is roughly the idea I am currently following for the warm start functionality.
[ ] Finish adding "warm start" functionality. There are a few different cases that this was intended to cover:
1) Problems that differ only by tolerance (or perhaps some other parameter)
2) Problems with nodes that were either previously unused or were not present in the previously solved problem
Code handling (1) should, for the most part, already exist and be in reasonably stable shape, although some adjustments will be needed depending on the decisions about how to handle the node.data/missing nodes situation. I was initially intending to tackle (2), but ultimately spent the majority of my time on the S3-to-S4 refactoring work and the necessary debugging. The code I have for (2) is actually fairly far along and seemed to work well in initial testing, but it still needs more work before being fully integrated. I'll write up some more documentation elsewhere on the processes I have outlined/partially implemented for (2). Simply put, my goal was to generate node prices that satisfy complementary slackness, holding some other pieces of the problem constant and relaxing a few constraints. Progress so far toward (2) can be found in prep_warm_nodes.
[ ] Write tests for any code involving warm start functionality.
[ ] Write more tests specific to the S4 Optmatch object. Some basic tests exist, but they should preferably be expanded; probably not an immediate priority.
[ ] Write "getter" style functions for Optmatch objects, rather than forcing users to make use of slots directly.
[ ] General code refinement -- better comments, removing commented-out code and notes to self, updating docs where necessary, etc. (I'm planning on doing this ASAP, but wanted to make the request anyway.)
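As a sketch of what these getters might look like (the slot layout and names here are assumptions about the in-progress class, not a settled API):

```r
# Hypothetical S4 accessor: retrieve the treatment/control indicator
# for each unit without callers touching slots directly.
setGeneric("get.contrast_group",
           function(x, ...) standardGeneric("get.contrast_group"))

setMethod("get.contrast_group", "Optmatch", function(x, ...) {
  nd <- x@node.data
  # Order by the stored unit names so callers see a stable ordering.
  nd$contrast.group[order(nd$name)]
})
```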
Notable changes in this pull request, put briefly:
1) Deprecation of subDivStrat, in favor of the refactored solveMatches
2) Some functionality for generating and using "warm start" values to speed up matches (in progress)
3) Refactoring of the S3 optmatch class into an S4 Optmatch class (also in progress)
Overview of notable changes in S4 Optmatch class
node.data data frame
With node prices/reduced costs now being associated with the nodes returned from problems, it made sense to create a dedicated structure for storing information about nodes all together. This is a data frame with node names, prices, treatment/control group information, and information about which subproblem the nodes/units were in. It offers a relatively nice, compact way of storing this type of information. It also contains information about the end and sink nodes from each subproblem, and is useful for extracting and storing information about node prices in procedures related to the warm start functionality.
prob.data data frame
This is a data frame with information about subproblems -- exceedance values, tolerance values, and min/mean/max numbers of controls. These are tied to individual subproblems via ids that should match the subproblem ids in the node.data structure. This could potentially help with improvements down the line related to adjusting parameters on a subproblem level.
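As a rough illustration of how the two structures might fit together (the column names here are assumptions based on the descriptions above, not the exact slot contents):

```r
# Hypothetical layout: each node.data row carries the id of the
# subproblem it belongs to, which keys into one prob.data row.
node.data <- data.frame(name = c("t1", "c1", "c2"),
                        price = c(1.5, -0.5, 0.0),
                        contrast.group = c(TRUE, FALSE, FALSE),
                        subproblem = c(1L, 1L, 1L))

prob.data <- data.frame(subproblem = 1L,
                        exceedance = 0.25,
                        tolerance = 0.001,
                        min.controls = 0,
                        max.controls = Inf)

# Per-subproblem parameters can then be looked up for each node:
merge(node.data, prob.data, by = "subproblem")
```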
Other aspects of the old optmatch class were transferred pretty directly into the new version, like the "names" and "call" slots.
At the moment, I think the top priority is ironing out the problems with the S4 version of the class.