imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Tree-wise case.weights #142

Open dwmmclaughlin opened 7 years ago

dwmmclaughlin commented 7 years ago

Hi,

This is not so much an issue as a suggestion.

I have been using ranger for a remote sensing problem, and it has been working great. However, I want to be able to further select a random subset of geographies (e.g. counties, administrative areas) and/or years from which to randomly sample observations for each decision tree in the forest. The primary reason for this addition is to increase the out-of-sample predictive power of the model (predicting out of geography or out of year). The process, in theory, could go as:

  1. Randomly subset data by geography (m counties from the universe M)
  2. Randomly sample observations from n counties (n observations from m counties).
  3. Randomly subset variables (k variables out of universe K variables).
  4. Grow a decision tree.
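Steps 1 and 2 above could be sketched in base R like this (a hedged illustration only: the `county` column, the toy data, and the sizes `m` and `n` are made up, not from any real dataset; steps 3 and 4 are what ranger already does internally via mtry and tree growing):

```r
# Hedged sketch of steps 1-2 with made-up data
set.seed(42)
dat <- data.frame(county = rep(LETTERS[1:10], each = 20), y = rnorm(200))

m <- 4   # counties drawn per tree (from a universe of 10)
n <- 50  # observations drawn per tree

counties_in <- sample(unique(dat$county), m)        # step 1: subset geographies
eligible    <- which(dat$county %in% counties_in)   # rows in the drawn counties
rows_in     <- sample(eligible, n, replace = TRUE)  # step 2: sample observations
```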

Is there any way to incorporate this into the ranger package? Ideally, it would be nice to include an argument which is a character vector of variables which will be used to randomly subset the groups used in growing each decision tree.

Best,

Dave Grad Student at UC Berkeley

mnwright commented 7 years ago

If I understand correctly, that should be possible if you could specify weights for each observation and each variable per tree. Then you would generate these two matrices before growing the forest to induce the first level of randomization, e.g., using 0 for non-used observations/variables and equal weights for all others. The second level is then the usual bootstrapping and split variable selection.

For the variables this is already possible with split.select.weights (see ranger R help). For the observation there is case.weights, but it's not accepting per-tree weights yet.

Would that work?
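The per-tree variable weighting mentioned above can be sketched as follows (hedged: the drop-one-variable-per-tree scheme is just an illustration; per the ranger help, split.select.weights accepts a list of length num.trees with one weight vector per tree, weights between 0 and 1):

```r
# Hedged sketch: one weight vector per tree; a zero weight excludes
# that variable from split selection for that tree.
set.seed(1)
p <- 4          # iris has 4 predictors
num.trees <- 5
sw <- replicate(num.trees, {
  w <- rep(1 / p, p)
  w[sample(p, 1)] <- 0  # drop one randomly chosen variable for this tree
  w
}, simplify = FALSE)
# rf <- ranger::ranger(Species ~ ., iris, num.trees = num.trees,
#                      split.select.weights = sw)
```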

dwmmclaughlin commented 7 years ago

Thanks mnwright. Case weights could be a way to randomly sample from a subset of the full set of observations, but right now I think it applies to the whole forest. I am definitely thinking about per tree weights, where each tree is grown on a random sample of a different subset of observations, where the subset used for each tree is defined by excluding a certain geography. I think this would improve out of sample predictions.

dwmmclaughlin commented 7 years ago

Another way to do this would be to grow a small forest on a subset of the data, and then add trees to the forest, each time changing the subset of observations used to grow them. This would achieve the same result. Can additional trees be added to a ranger forest object?

Best,

Dave

rfcv commented 7 years ago

@dwmmclaughlin As it stands, no. However, because all trees in a random forest are independent of each other, and the final forest is the average of these trees' predictions, you could build multiple forests (models) and then average their predictions.
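That averaging strategy could look like this for regression (a hedged sketch: the ranger calls are commented out, and `preds` stands in for the per-model prediction matrix they would produce):

```r
# Hedged sketch: combine independently trained forests by averaging
# their predictions row-wise (regression case).
# models <- lapply(row_subsets, function(idx)
#   ranger::ranger(Sepal.Length ~ ., iris[idx, ], num.trees = 100))
# preds  <- sapply(models, function(m) predict(m, iris)$predictions)
preds <- cbind(m1 = c(5.1, 6.0), m2 = c(5.3, 5.8), m3 = c(5.2, 6.1))
ensemble <- rowMeans(preds)  # one averaged prediction per row
```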

sheffe commented 6 years ago

Looks like this is already on the long-term roadmap, but I'll add that it would be a very useful adaptation of ranger for some problems beyond spatial sampling. We could treat anything with time-series dependence or hierarchical sampling this way too, which I often run into in combination with spatial dependence.

One note on @dwmmclaughlin 's suggested implementation above:

I am definitely thinking about per tree weights, where each tree is grown on a random sample of a different subset of observations, where the subset used for each tree is defined by excluding a certain geography. I think this would improve out of sample predictions.

I believe this would capture almost every type of space/time/spacetime/hierarchical/etc case I have run into, just using a matrix of weights N rows by num.trees columns. One extension of that idea -- with no idea about how difficult this kind of implementation would be -- would be to allow passing in a function that defines this matrix on the fly. I tend to wrap ranger inside the caret package for tuning, where there would ideally be a 2-step process: (1) a first resampling step that is repeated for tuning parameter selection, and (2) the tree-specific case weights are applied using a weights matrix generated from the arbitrary sample taken in the first loop.

So far I've been using caret with a custom train/test index list to do spacetime or hierarchical splits, and then I run normal ranger inside it to look at tuning param performance. Then for final models I followed @rfcv 's strategy of training many small ranger models and averaging predictions. It isn't quite the same process, and whether the tuning param step works comparably to the final model depends on how strong the dependence is. Often for space-time problems, I do get fairly different results.

(My first time writing in on an issue for ranger -- let me also say, you've all saved me hundreds of hours and EC2 dollars with this package!)

mnwright commented 5 years ago

I've added an argument inbag to perform the sampling manually outside of ranger. Example:

library(ranger)
inbag <- replicate(5, round(runif(nrow(iris), 0, 5)), simplify = FALSE)
rf <- ranger(Species ~ ., iris, num.trees = 5, inbag = inbag)

With this, all kinds of stratified sampling can simply be done in R.
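For instance, a class-stratified bootstrap could be built like this (a hedged example: the 30-per-class draw is arbitrary, and tabulate() is just one way to turn drawn row indices into the count vector inbag expects):

```r
# Hedged sketch: stratified per-tree bootstrap built outside ranger,
# drawing 30 rows per Species for each of 5 trees.
set.seed(2)
inbag <- replicate(5, {
  drawn <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                         sample, size = 30, replace = TRUE))
  tabulate(drawn, nbins = nrow(iris))  # row counts, length nrow(iris)
}, simplify = FALSE)
# rf <- ranger::ranger(Species ~ ., iris, num.trees = 5, inbag = inbag)
```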

sheffe commented 5 years ago

@mnwright this solution looks both elegant and pretty much complete for the applications I had in mind above. Thanks, as always, for all your work here!

dwmmclaughlin commented 5 years ago

Hi Marvin and John,

This is a great implementation! Thanks so much for making this happen.

Best,

Dave


sheffe commented 5 years ago

@mnwright after some time to work with it, I'm finding this feature very useful on practical problems with block bootstrap designs. Thanks again. I have a question about the implementation to make sure I'm understanding some finer points on inbag and downstream OOB error calculations, and perhaps a proposed change.

Quoting your illustration code here:

library(ranger)
inbag <- replicate(5, round(runif(nrow(iris), 0, 5)), simplify = FALSE)
rf <- ranger(Species ~ ., iris, num.trees = 5, inbag = inbag)

which gives us a data structure like:

(screenshot: the resulting inbag list, one integer count vector per tree)

This is how I understand inbag works now. (Corrections welcome!)

If that's all correct, here are a few ideas -- if you'd like to give some direction on the approach you like best, I'd be happy to attempt a PR myself in the next 7-10 days.

As far as I can see, one extra piece of information is needed to make OOB calculations correct, which boils down to an indicator for "should this value be treated as OOB, even if n = 0 in the inbag list?" That could be done in a few ways I've considered:

  1. Permitting meaningful NAs in addition to integer counts in the inbag list elements. n = 0 retains the meaning "OOB" and n = NA means don't include in the tree or in OOB. This would probably work fine but I think it's tricky to document/communicate.
  2. An option to include a parallel oobag list with specific inclusion criteria. It would also be of length num.trees, each element a vector of length nrow(data), containing 0/1 weights for the OOB calculation for that tree. This could give users an extra chance to make mistakes, e.g. including 1s for OOB where a count>0 is sampled for that tree, but I like the opportunity to be explicit. (A related thought: by permitting nonnegative continuous weights when specifying the oobag list, this setup could also enable case weighting of OOB predictions for weighted MSE calculations, but that's a whole separate feature.)
  3. caret has a conceptually related feature that I use often -- this setup might be worth a look if you haven't used it. With caret::trainControl (docu link), it's possible to use the arguments index and indexOut to do conceptually similar things to specifying inbag and oobag. Each argument takes a named list of integer vectors, which contain the row-indices of data that should be in a specific training/evaluation set pair -- index for in-sample training rows, indexOut for calculating model performance metrics. If e.g. a specific data row had an in-bag count of 4, that row index would be included 4 times. (You can also duplicate rows in the validation set to weight error calculations this way.) This approach would represent two changes to ranger instead of one, but I thought the documentation was worth including.
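The caret analogue in idea 3 can be sketched like this (hedged: the row numbers are made up, and the trainControl call is commented out; each named element defines one resample, and repeating an index up-weights that row):

```r
# Hedged sketch of caret's index/indexOut with made-up row numbers.
folds_in  <- list(Resample1 = c(1, 1, 2, 3, 5),  # row 1 counted twice in training
                  Resample2 = c(2, 4, 6, 8))
folds_out <- list(Resample1 = c(10, 11, 12),     # rows scored for performance
                  Resample2 = c(13, 14, 15))
# ctrl <- caret::trainControl(index = folds_in, indexOut = folds_out)
```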

One other idea, rejected before I wrote in, was simply requiring in my outside-ranger sampling setup that all rows from sampled groups have >0 counts. That did make up/down-sampling for imbalanced classes fairly difficult.

From my perspective after a few months of using it, the best aspect of your inbag implementation is that it's infinitely flexible and lets users write any kind of custom stratified sampling they can dream up, and ranger never takes on those (hundreds of) edge cases. I hope those proposals keep that spirit. I am starting to think about a ranger companion package that implements common stratified samplers, though -- I've found so many ways to use this already.

BTW - I found the attached paper (and especially this diagram) helpful in understanding the types, value, and complexities of block bootstrap designs. It's in an ecology journal that might be off the beaten path for your field; hope it's useful.

(screenshot: block bootstrap design diagram from the attached paper)

Roberts_et_al-2017-Ecography-Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure.pdf

mnwright commented 5 years ago

I'm not sure I understand. With inbag we don't do any sampling in ranger. Simply all observations with n > 0 for a tree are used n times in that tree and all observations with n = 0 are OOB.

sheffe commented 5 years ago

Hi @mnwright thanks for the response / sorry for the reply latency! It’s a tricky concept. Here’s more context and two examples.

At a high level: inbag allows us to do any kind of block bootstrapping inside a random forest. Block bootstraps usually happen in two steps.

Step 1 — draw groups of observations according to their dependence structure. If it’s spatial dependence, a block arrangement could be “cities A, B, and C are in bag; cities D, E, F are out of bag.” If the problem has time dependence, we could block by this rule: “observations with year <= 2002 are in; observations after 2006 are out; we discard observations in the middle.”

Step 2 — sample observations within the blocks defined in step 1. This could happen using an ordinary random sample with replacement, an RSWR using some row-level weights, or no subsampling (we take all rows in a selected group with count=1).

Both of these steps happen outside ranger, which is great because the complexity is endless, and any analyst can build her own sampling rules.
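The two steps can be sketched as an inbag builder (hedged: make_inbag and the toy `city` data are illustrative, not ranger API; the result is a list ready to pass as inbag):

```r
# Hedged sketch: two-step block bootstrap -> per-tree inbag count vectors.
set.seed(1)
dat <- data.frame(city = rep(c("A", "B", "C", "D", "E", "F"), each = 100))

make_inbag <- function(data, group, n_groups_in, num.trees) {
  groups <- unique(data[[group]])
  replicate(num.trees, {
    g_in <- sample(groups, n_groups_in)               # step 1: draw blocks
    idx  <- which(data[[group]] %in% g_in)
    draw <- sample(idx, length(idx), replace = TRUE)  # step 2: RSWR in blocks
    tabulate(draw, nbins = nrow(data))
  }, simplify = FALSE)
}

inbag <- make_inbag(dat, "city", n_groups_in = 3, num.trees = 10)
```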

ranger fits the RF accurately when using inbag — for each tree, extract >=1 copies of the rows in data as specified in the inbag count vector for that tree. That part is working smoothly.

The edge case that seems incorrect occurs when ranger calculates OOB performance metrics, after the tree is fit. Here’s where that could go wrong.

Example 1: we left some groups entirely out of the dataset. This happened in the Step 1 example of time series dependence, where we fit the model on rows with year <= 2002, and discarded rows between years 2003-2006. This needs to happen if you’re using a label that has autocorrelation, like a time-shifted outcome.

What happens in the current implementation: because rows with years 2003-2006 had a zero count in inbag, they weren't used in training (correct). However, because a zero count in inbag is the definition of out-of-bag, the out-of-bag MSE reported by ranger does include the observations we wanted to discard because of their autocorrelation with the training data. It makes our MSE look a little better; how much better depends on how much autocorrelation there is.

Example 2: if we use a random sample of any kind for the second step, sampling observations within the chosen blocks, we can have the following problem occur. We have 6 cities (A, B, C, D, E, F) total in data with 10k rows corresponding to each. For a single block configuration in step 1, we selected three cities (A, B, C) and left out three (D, E, F). In step 2, we did a random sample with replacement of the 10k rows in cities A, B, C, which leaves some rows with counts >1 and some rows with count 0.

Again, the RF fit happens correctly, but the OOB performance figure is biased. Rows in A, B, C that had count 0 (our rf never saw them) can still share spatial dependence with training rows where count >= 1, because they were in the same city. That makes our OOB MSE look better than it should be.

The overall principle — when we need block bootstrapping, it’s because one row can give you information about other rows because of proximity or shared group membership (time, space, hierarchy, etc). When we fit a tree, observations should be in one of three categories:

  1. In-bag: the row's group is in-bag (before 2002, city A) and it was randomly sampled with count>=1
  2. Discarded: the row's group is in-bag, but it was not randomly sampled; or the row was otherwise deliberately discarded.
  3. Out-of-bag: the row shares no dependence with in-bag rows based on the group membership.

So far, I have a workaround written as a wrapper around ranger. This is what happens:

  1. Create an inbag object as described above, by (1) sampling groups and then (2) sampling rows within in-bag groups.
  2. Create an oobag list, containing one element per row of the data frame; each element is the vector of tree numbers where the row was truly OOB.
  3. After the model fit, use predict(rf, data = training_data, predict.all = TRUE) to get the matrix of predictions per tree.
  4. Use the oobag list to pull out the columns of the matrix where the corresponding row was truly OOB, and average those to create a custom OOB prediction.
  5. Recalculate performance metrics like MSE etc. on those custom predictions.

That workaround is super hacky. In addition to being slow, it requires you to carry around two large lists in memory -- the inbag list required by ranger and the oobag list required to reconstruct true OOB performance after the fit. My first instinct was to keep these modifications outside ranger entirely, but I haven't thought of a way for keeping it outside ranger without incurring those performance hits.
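The reconstruction step of that workaround could look roughly like this (hedged: `pred_all` stands in for the n x num.trees matrix that predict(rf, data, predict.all = TRUE) would return, and `oobag` is the hypothetical truly-OOB bookkeeping list):

```r
# Hedged sketch of the custom-OOB reconstruction with stand-in data.
set.seed(3)
n <- 6; num.trees <- 4
pred_all <- matrix(rnorm(n * num.trees), n, num.trees)  # stand-in predictions
y <- rnorm(n)                                           # stand-in outcomes

# oobag[[i]]: tree indices where row i was truly OOB (empty = never OOB)
oobag <- list(c(1, 3), 2, integer(0), c(1, 2, 4), c(3, 4), c(2, 3))

oob_pred <- vapply(seq_len(n), function(i) {
  if (length(oobag[[i]]) == 0) NA_real_ else mean(pred_all[i, oobag[[i]]])
}, numeric(1))

oob_mse <- mean((y - oob_pred)^2, na.rm = TRUE)  # custom OOB error
```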

mnwright commented 5 years ago

OK, I think I understand. We could use something like -1 in the inbag argument to completely discard observations for a tree. We would have -1 for discard, 0 for OOB and >=1 for inbag. Would that work? Also sounds a little hacky but should be easy to implement.

sheffe commented 5 years ago

Agreed -- that would work well for everything I can think up. In this framing, a row can be (a) used for trees, (b) used for performance estimation, or (c) ignored. The exact scheme is left to the user. I like that simplicity.

(If it's all the same to you, I think specifying completely-discarded rows as NA is a bit more intuitive than -1, but that's a vanishingly minor point.)

sebrauschert commented 4 years ago

@sheffe I am facing a similar "issue" at the moment. This post is ~ 2 years old now, but I was wondering if you had your "hacky" workaround uploaded somewhere:

So far, I have a workaround written as a wrapper around ranger. This is what happens:

  1. Create an inbag object as described above, by (1) sampling groups and then (2) sampling rows within in-bag groups.
  2. Create an oobag list, containing one element per row of the data frame; each element is the vector of tree numbers where the row was truly OOB.
  3. After the model fit, use predict(rf, data = training_data, predict.all = TRUE) to get the matrix of predictions per tree.
  4. Use the oobag list to pull out the columns of the matrix where the corresponding row was truly OOB, and average those to create a custom OOB prediction.
  5. Recalculate performance metrics like MSE etc. on those custom predictions.

(As soon as I get a chance to clean up that code, I'll extract it out and post here/in a gist if useful.)

Thanks!

gse-cc-git commented 3 years ago

Hello @sheffe @Hobbeist @mnwright, I'm interested in this as well!

This post is ~ 2 years old now, but I was wondering if you had your "hacky" workaround uploaded somewhere

So far, testing this crashes R:

We would have -1 for discard, 0 for OOB and >=1 for inbag.

Tested with ranger v0.12.1. Is there an implementation available somewhere?

This would allow, for example, tuning on OOB error with tuneRanger (if my understanding is correct).

Thank you all !

gyanakiev commented 1 month ago

Great discussion. Using inbag with -1 or NA is a natural way to deal with temporal/spatial blocking. I wasn't sure whether it was implemented and tried both -1 and NA just now; RStudio crashed on both. Can you confirm that this hasn't been implemented yet? Is there any plan to?

Thank you rangers

mnwright commented 3 weeks ago

Sorry, this is not implemented yet.