harrysouthworth / gbm

Gradient boosted models

New Parameter - mFeature #20

Closed Neil-Schneider closed 10 years ago

Neil-Schneider commented 10 years ago

mFeature is similar to mtry in the randomForest package. It is an integer giving the number of features to consider at each node. This is unlike randomForest, which considers a set of features for an entire tree.

It is not necessary to consider every feature at every node. This will increase the variability of each tree and may require an increase in n.trees to find the optimal number. The time saved by reducing the number of features tested at each node generally outweighs the time lost by requiring more trees.

In theory, there could be improved prediction from these more randomized trees, but the biggest advantage is the reduction in time required without a loss of predictive power.
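For illustration, a call might look like the sketch below. This assumes the fork exposes mFeature as a top-level gbm() argument, which the thread does not spell out, and the data names (train, y) are placeholders:

```r
library(gbm)  # the fork carrying the mFeature enhancement

# Hypothetical call: mFeature = 5 would test only 5 randomly chosen
# features at each node instead of all of them, so each tree is noisier
# and a larger n.trees may be needed.
fit <- gbm(y ~ ., data = train,
           distribution = "bernoulli",
           n.trees = 3000,
           shrinkage = 0.01,
           interaction.depth = 3,
           mFeature = 5)
best.iter <- gbm.perf(fit, method = "OOB")  # pick the optimal tree count
```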

gravesee commented 10 years ago

Hi, does this work with var.monotone? If so, I am very interested in seeing this PR merged into master. Great work by all on the phenomenal GBM package!

harrysouthworth commented 10 years ago

I started work on the merge but it wasn't straightforward because other changes had happened since the fork.

I then stopped working on the merge.

I do hope to complete it, but have been busy and can only find so much time to give to gbm. Sorry not to be faster.

Harry


Neil-Schneider commented 10 years ago

My fork now includes all commits posted after the last time I merged in your master branch. This should make it easier to merge.

@Zelazny7 This should have no effect on var.monotone. It just lets the user try fewer features at each node of the GBM.

gravesee commented 10 years ago

Hi, I've been using your GBM package with the mFeature enhancement and it really does speed things along. I am a pretty decent R/Python programmer, but know little to nothing about C++. I was wondering if it would be difficult to add another randomization feature that selects a random subset of candidate break-points for continuous features? I have a lot of continuous data in my applications and I think GBM is slowing down quite a bit when it considers every potential break point. If I was more familiar with C++ I would do this myself. Just curious how much of an effort it would take.

Thanks again for your work!

harrysouthworth commented 10 years ago

The answer is almost certainly "too much". The underlying C code dates back to 2003 and has been modified and extended by many authors since. It's in serious need of a huge overhaul. An alternative would be to start from scratch. The gbt package I uploaded to GitHub might be a starting point, but I'm not sure. I need to find time to deal with all this. Sorry not to be more helpful.

Harry

Neil-Schneider commented 10 years ago

@Zelazny7 This was the next feature I was going to work on in my free time. I was hoping to add the functionality you described for continuous features and to explore a similar technique for categorical features.

For the continuous feature logic, I know where to begin, but I have not looked at the problem in depth yet.
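As a rough sketch of the break-point idea in plain R (the real change would live in the package's C/C++ tree code, and the squared-gradient gain below is a stand-in for whatever split criterion gbm actually uses):

```r
# Choose the best split for a continuous feature from a random subsample
# of candidate break-points instead of from every unique value.
best_split_sampled <- function(x, grad, k = 20) {
  cuts <- unique(x)
  if (length(cuts) > k) cuts <- sample(cuts, k)  # the randomization step
  gain <- function(cut) {
    left  <- grad[x <= cut]
    right <- grad[x >  cut]
    # stand-in least-squares gain; guards avoid division by zero
    sum(left)^2 / max(length(left), 1) + sum(right)^2 / max(length(right), 1)
  }
  gains <- vapply(cuts, gain, numeric(1))
  list(cut = cuts[which.max(gains)], gain = max(gains))
}
```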

The categorical feature logic would be much more complicated. There are two schools of thought, and maybe both can be implemented:

  1. One vs. many: instead of testing all possible combinations of levels, test each single level against the rest (a sketch of this follows the list).
  2. Keep the current methodology of looking at combinations of levels, but instead of testing them all, test a random sub-sample of combinations.
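
A toy R sketch of option 1 (an illustration of the idea, not code from this PR; the gain function is the same stand-in criterion as above):

```r
# "One vs. many": score each single level against the rest, reducing the
# 2^(k-1) - 1 subset search over k levels to just k comparisons.
one_vs_many <- function(x, grad) {
  stopifnot(is.factor(x))
  gain <- function(lvl) {
    left  <- grad[x == lvl]
    right <- grad[x != lvl]
    sum(left)^2 / max(length(left), 1) + sum(right)^2 / max(length(right), 1)
  }
  gains <- vapply(levels(x), gain, numeric(1))
  list(level = names(gains)[which.max(gains)], gain = max(gains))
}
```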

This task is second on my "free time" GBM priority list. My first task is to seed the gbms in our Amazon cluster so that we get consistent results back. I am not sure whether static seeding is needed when gbm runs with n.cores, but I will find that out when I begin. While these are priorities of mine, I wouldn't expect anything functional until the end of the year.
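On the seeding point, a minimal single-core illustration (whether one set.seed() call is enough when n.cores > 1 is exactly the open question above; train and y are placeholders):

```r
# On one core, fixing the RNG seed before the fit makes the bag.fraction
# subsampling, and hence the fitted model, repeatable across runs.
set.seed(20140623)
fit1 <- gbm(y ~ ., data = train, distribution = "gaussian",
            n.trees = 500, bag.fraction = 0.5, n.cores = 1)
set.seed(20140623)
fit2 <- gbm(y ~ ., data = train, distribution = "gaussian",
            n.trees = 500, bag.fraction = 0.5, n.cores = 1)
identical(fit1$train.error, fit2$train.error)  # TRUE on a single core
```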

harrysouthworth commented 10 years ago

Sorry for the delay. I've merged, but done zero testing. Once I've done some of that and passed R CMD check, I'll do a release.

Thanks, Harry