markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch

S4 Optmatch class, warm start functionality, other updates #153

Closed adamrauh closed 5 years ago

adamrauh commented 6 years ago

1) Problems that differ only by tolerance (or perhaps some other parameter)

2) Problems with nodes that were either previously unused or were not in the previously solved problem

Code handling [1] should, for the most part, already exist and be in reasonably stable shape, although some adjustments will be needed depending on the decisions about how to handle the node.data/missing-nodes situation. I initially intended to tackle 2, but ultimately spent the majority of my time on the S3-to-S4 refactoring work and the debugging it required. The code I have for 2 is actually fairly far along and seemed to work well in initial testing, but it still needs more work before being fully integrated. I'll write up more documentation elsewhere on the processes I have outlined/partially implemented for 2. Simply put, my goal was to generate node prices that satisfy complementary slackness, holding some other pieces of the problem constant and relaxing a few constraints. Progress so far toward 2 can be found in prep_warm_nodes.
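For reference, the complementary-slackness condition the warm-start prices are meant to satisfy can be checked directly. Here is a minimal base-R sketch; the arcs, flows, and prices are illustrative stand-ins, not data from the optmatch codebase:

```r
# Sketch of the complementary-slackness check for a min-cost flow problem.
# For an arc (i, j) with cost c and node prices p, the reduced cost is
# c - p[i] + p[j]; complementary slackness requires that any arc carrying
# flow has reduced cost 0, and any arc with positive reduced cost has flow 0.
arcs <- data.frame(
  from = c("t1", "t1", "t2"),
  to   = c("c1", "c2", "c2"),
  cost = c(2, 5, 1),
  flow = c(1, 0, 1),
  stringsAsFactors = FALSE
)
prices <- c(t1 = 3, t2 = 2, c1 = 1, c2 = 1)

reduced_cost <- arcs$cost - prices[arcs$from] + prices[arcs$to]
slack_ok <- all(arcs$flow == 0 | reduced_cost == 0) &&
            all(reduced_cost >= 0)
```

Generating prices that pass a check like this for a modified problem, while holding the rest of the solution fixed, is the essence of the warm-start idea.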

Notable changes in this pull request, put briefly:

1) Deprecation of subDivStrat, in favor of refactored solveMatches

2) Some functionality for generating and using "warm start" values to speed up matches (in progress)

3) Refactoring of the S3 optmatch class into an S4 Optmatch class (also in progress)

Overview of notable changes in S4 Optmatch class

node.data data frame

Now that node prices/reduced costs are associated with the nodes returned from solved problems, it made sense to create a dedicated structure for storing node information in one place. This is a data frame recording node names, prices, treatment/control assignment, and which subproblem each node/unit belonged to, which offers a relatively compact way of storing this type of information. It also contains the end and sink nodes from each subproblem, and it is useful for extracting and storing node prices in the procedures related to the warm start functionality.
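As a rough illustration, the node.data layout described above might look like the following sketch; the column names and values are hypothetical, not necessarily those used in the actual class:

```r
# Hypothetical sketch of a node.data-style table (illustrative names/values).
node.data <- data.frame(
  name       = c("plant1", "plant2", "plant3", "(_End_)", "(_Sink_)"),
  price      = c(10, 12, 8, 0, 0),   # node prices / reduced-cost duals
  group      = c(1, 0, 0, NA, NA),   # 1 = treatment, 0 = control, NA = bookkeeping node
  subproblem = c(1, 1, 1, 1, 1)      # which subproblem the node belonged to
)

# e.g. pull the prices for one subproblem's nodes:
p1 <- node.data$price[node.data$subproblem == 1]
```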

prob.data data frame

This is a data frame with information about subproblems -- exceedance values, tolerance values, and min/mean/max numbers of controls. These are tied to individual subproblems and should match the subproblem IDs in the node.data structure. This could potentially help with improvements down the line related to adjusting parameters at the subproblem level.
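A sketch of how such a prob.data table could link back to node.data through a shared subproblem ID (again, column names are illustrative assumptions, not the actual slots):

```r
# Hypothetical sketch of a prob.data-style table keyed by subproblem ID.
prob.data <- data.frame(
  subproblem    = 1:2,
  exceedance    = c(0.25, 0.10),
  tolerance     = c(0.001, 0.001),
  min.controls  = c(1, 1),
  mean.controls = c(1.5, 2),
  max.controls  = c(3, 4)
)

# Per-subproblem parameters can then be looked up by the same subproblem ID
# that appears in the node.data table:
tol1 <- prob.data$tolerance[prob.data$subproblem == 1]
```

Keeping the parameters in their own table, one row per subproblem, is what would make later per-subproblem parameter adjustment straightforward.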

Other aspects of the old optmatch class were transferred pretty directly into the new version, like the "names" and "call" slots.

I think the top priority is ironing out the problems with the S4 version of the class, at the moment.

markmfredrickson commented 6 years ago

On the subject of dropped nodes in the node.data object, can we just NA the relevant data for these? Sounds like that might make things a little easier. There is some stuff in the full- and pairmatch functions to get and keep a data object. Previously, this was mostly about making sure that the final optmatch object was in the same order as the original data (as otherwise the order of the results was not guaranteed). Would it be useful to push this data argument along in order to initialize the node.data object?
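The NA-filling suggested above could be done with a plain left join against the full set of unit names; a base-R sketch with hypothetical names and columns:

```r
# Sketch: align a node.data-like table against the full set of unit names,
# so units dropped from the problem appear as NA rows (hypothetical data).
all.names <- c("a", "b", "c", "d")
node.data <- data.frame(name  = c("a", "c"),
                        price = c(5, 7),
                        stringsAsFactors = FALSE)

aligned <- merge(data.frame(name = all.names, stringsAsFactors = FALSE),
                 node.data, by = "name", all.x = TRUE)
aligned <- aligned[match(all.names, aligned$name), ]  # restore original order
```

After the join, `aligned` has one row per unit in the original data, with NA prices for the dropped units, which would also keep the table in the same order as the data argument.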

I wouldn't worry too much about getter and setter stuff just yet (unless Ben has said otherwise). That's generally good programming, but most of this functionality will be internal usage for now.

benthestatistician commented 6 years ago
  1. A procedural question: @adamrauh , will you be able to find, say, a couple hours a week for this over the next several weeks? If so would you be willing to use some of that time to implement localized code suggestions along the lines of @markmfredrickson's comments on R/OptmatchS4.R and R/nodecode.R over the past day or so (which appear in the GH thread of discussion about this merge request)? (If the answer to either is no then just say so; partly I'm trying to figure out how we'll need to divide the work in order to make progress.)
  2. A procedural suggestion about this review-and-merge: I propose that we use the checkboxes in Adam's pull request to record when a matter has been farmed out into one or more self-standing issues or closed/dismissed. Then we should be about ready to merge this work (into hinting, not master) when all the checkboxes are checked.
  3. On the specific matter of dropped nodes in node.data, my high-level assessment is that Adam's approach of simply leaving dropped nodes out of the node.data table is the correct one, that resulting failures of alignment between the node.data table and the factor vector recording matches call for remedies other than populating the node.data table with idle entries. (Here "high-level" = "as of yet uninformed by direct engagement w/ code proposal".)
  4. @adamrauh , where should I look for the "problems in stratumStructure -- and in summary.optmatch" that are being caused by the absence of unmatched nodes in the node.data table? Once we've got a clear sense of that I expect we'll be able to spin the "Establish expectations for unused nodes" bullet point off into a specific issue or two, pertaining to those helper functions rather than to dropped nodes and the node.data table. (That's assuming that "my high-level assessment" immediately above withstands the scrutiny of other contributors to this thread, not necessarily a given.)
  5. On the specific matter of "getter" and "setter" functions, I agree with @markmfredrickson that this doesn't sound like a near-term action item. Once we approach the point of having users get and set node prices and related properties, we'll want to have gotten those functions set up; but our present goals of enabling users to do warm starts in certain scenarios don't seem to necessitate direct manipulation of node prices on the user's part.
adamrauh commented 6 years ago
  1. Yes-- I probably can't commit to anything long term, but I'm happy to contribute where needed for sure. EDIT: the latest commit should remove some of the debugging code
  2. This sounds fine to me... I believe I originally started working from a copy of the master branch (as opposed to the hinting branch), so all of my additions were built on top of that rather than on hinting-branch code. I'm not sure whether that makes anything easier or harder. (Edit: just saw that this indeed makes things more complicated!)
  3. I don't really have a great solution to this issue-- nothing obvious jumps out at me. Leaving them out certainly simplifies the data structure, but it creates some extra work and perhaps the need to keep additional bookkeeping about these dropped nodes somewhere. If we could work out a solution that would also eliminate having to use the match function to preserve orderings (example in getZfromMatch here), that would be very nice.
  4. To get better acquainted with what I'm referring to with this "dropped nodes" problem @markmfredrickson @benthestatistician :
    s4 <- stratumStructure(fm.psd.cal <- fullmatch(psd/(psd<.25), data=nuclearplants))
    expect_equal(names(s4), c("1:0", "3:1", "2:1", "1:2", "1:7", "0:1"))
    expect_equal(as.vector(s4), c(3,1,1,1,1,11))
    expect_true(all.equal(attr(s4, "comparable.num.matched.pairs"), 5.9166666))

    from test.stratumStructure.old.R

and the first test case in test.summary.optmatch.R should be useful in demonstrating the problem. If you step through these, you should see my hack-job solutions in stratumStructure and summary get called into action-- specifically, the changes from this commit, along with the changes from around line 212 here

  5. Agreed that this is not a high priority.
adamrauh commented 6 years ago

I'm thinking of picking this up, starting with what I have listed as tasks 2 and 4 -- ironing out some bugs and finishing the new warm start functionality. It shouldn't be too hard to update those down the line to make their implementations align with any further decisions about the node.data questions discussed above. Any thoughts on better places to start, or does that seem reasonable?

benthestatistician commented 6 years ago

@adamrauh , additions of the type you mention would be welcome. I agree w/ your takeaway from the node.data discussion: good chance we'll stick with precisely the approach you had been taking; in the unlikely event that we don't, we'll be further along to whatever the modified goal is if we have a more complete implementation of the current concept.

Regarding failing tests, if you've got questions you can't resolve yourself about whether this or that test will continue to be necessary, don't hesitate to ask.

adamrauh commented 6 years ago

I was looking at testing a form of the improvements related to warm starts yesterday, which led me to some scattered conclusions.

The bad news is that, in the form I have it now, the new code is significantly slower than the master branch version. I haven't looked into this too much yet, as this is something I just stumbled into last night. Based on some brief investigation/profiling, it looks like there is one particular chunk of code causing a pretty large bottleneck, and I don't think it should be too awful to come up with a workaround. This is my next order of business, I think.

That said, some other code intended to generate warm starts for new nodes introduced into a problem seems to be working as hoped. Unfortunately, any impact of this functionality is currently masked by the other bottlenecks introduced with the refactoring. I think I ought to spend a bit of time looking for obvious pieces of newly introduced code that can be optimized, to at least get things into the ballpark of the current master version. However, I would also like some indication of whether the new warm start improvements actually do anything, since it's been an open question I've been dealing with for a while (as @benthestatistician can attest) and, more practically, it would help in prioritizing subsequent tasks. I can see some ways to carry specific pieces of the new framework over into the old codebase to come up with such tests, but I'm on the fence as to how worthwhile that would be.

benthestatistician commented 6 years ago

some other code that is intended to generate warm starts for new nodes introduced into a problem seems to be working as hoped. Unfortunately, any impact of this functionality is lost at this point because of the other code bottlenecks introduced with the refactoring.

At a number of points along the way we've had the experience that the first version of a change intended to improve performance actually made it worse. In every case I can think of, it ultimately made things better....

I think I ought to spend a bit of time looking to see if there are obvious pieces of newly introduced code that can be optimized to at least get things in the ballpark of the current master version.

One way bottlenecks get exposed, in my experience, is to step through a relatively hard problem and see where it lags. Sometimes this happens at unexpected points, and once you identify them you can readily fix them. In other cases, however, a more formal profiling process is needed. I think we're going to want to do this en route to processing this merge. Do you have test cases demonstrating the slowdowns?
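For the more formal route, base R's Rprof gives the same sampling data that profvis visualizes; a sketch, with a placeholder workload standing in for the actual matching call:

```r
# Sketch of a formal profiling pass with base R's Rprof; the loop below is a
# placeholder workload standing in for a fullmatch() call on a hard problem.
Rprof("match-profile.out", interval = 0.005)
x <- 0
for (i in 1:5e6) x <- x + sqrt(i)  # stand-in for the slow matching code
Rprof(NULL)

# Time spent per function, sorted by self time -- bottlenecks float to the top.
head(summaryRprof("match-profile.out")$by.self)
```

profvis wraps the same mechanism in an interactive flame-graph view, which is handy for spotting unexpected hot spots like the one described below in the thread.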

However, I would also like to get some indication as to whether or not the new warm start improvements actually appear to do anything, since its been a question that I've been dealing with for awhile (as @benthestatistician can attest).... I'm on the fence as to how worthwhile that would be.

Au contraire! I recall clear evidence of improvement. Not always as dramatic an improvement as we were hoping for - but that's consistent with the presence of as-yet undiscovered bottlenecks. Identifying those test cases is a next step.

adamrauh commented 6 years ago

I've been using our old friend, the SAT coaching data, to test. The test case isn't particularly elaborate:

w <- match_on(Coach ~ psatv + psatm + presatv + presatm + I23 + I24 + I25, data = satcoach)
res <- fullmatch(w, data = satcoach)

Based on the profiling, it looks like one major problem is the .determine_group helper function in nodecode.R -- mostly because of all the calls to all() and %in%. I think figuring out a way to take these out should yield a notable improvement on its own. This looked like a serious bottleneck in the profvis output, and it doesn't really have a counterpart in the master branch. That's definitely my immediate next move.
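The general shape of that fix is usually to replace a per-node `%in%` test with a single vectorized lookup. A base-R sketch of the idea -- the data and the per-node function here are hypothetical stand-ins, not the actual .determine_group:

```r
# Hypothetical stand-in for the kind of change proposed: instead of one
# %in% test per node, do a single match() against the treatment set.
nodes     <- c("a", "b", "c", "d")
treatment <- c("a", "c")

# Per-node version (the slow pattern): one %in% call per node
group_slow <- vapply(nodes, function(n) n %in% treatment, logical(1))

# Vectorized version: one match() call covering all nodes at once
group_fast <- !is.na(match(nodes, treatment))

identical(unname(group_slow), group_fast)  # same answer, far fewer calls
```

On a few dozen nodes the difference is invisible, but inside a loop over subproblems the repeated `%in%`/`all()` calls add up, which is consistent with what the profvis output showed.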

My initial test didn't suggest consistent improvement-- but that was on one test, with a small data set and largely untested code in certain parts, so I'm not particularly concerned yet.

benthestatistician commented 5 years ago

Rather than merging this into the older "hinting" branch, I've created a new branch issue54-hinting and merged it over there. (If no objections emerge I may well remove the hinting branch.) We can refer back to this pull request for the comments and outstanding to-do's, but I'm going to go ahead and close it out for now.