mlandry22 / kaggle

Public Kaggle Code and Info

Request for tips on Train.csv for GBM #5

Open The-Peaceful-Learner opened 7 years ago

The-Peaceful-Learner commented 7 years ago

Hi mlandry22, First of all, I see that you're quite an advanced programmer from your fluent code in R, Python, and MATLAB :+1:

Can you please guide me on how I can create or obtain the Train.csv and other input files that you used in your GBM code(s)?

P.S.: I'm new to GBM, R and GitHub repository. Thanks.

mlandry22 commented 7 years ago

Hi @The-Peaceful-Learner. Most of the things you are seeing are paired with competition data sets on Kaggle. They tend to host the data sets for a while after a competition ends, so you can probably still obtain many of them. For example, here are a few:

https://www.kaggle.com/c/inria-bci-challenge/data
https://www.kaggle.com/c/how-much-did-it-rain/data
https://www.kaggle.com/c/avazu-ctr-prediction

Between the comments in the code, the directory names, and Kaggle's competition listings, you can hopefully work out where each piece of code came from.

I haven't put much in here lately, but GBM is still the dominant algorithm and a good one to learn. It has been actively developed in H2O and XGBoost, so there are better ways to run it now than were available when I wrote most of this code (most importantly, column sampling); see the sketch below. This isn't a very useful repository for learning GitHub, though--it's just a collection of independent code chunks shown in one place. My more recent repositories are each specific to a single problem.
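To make "column sampling" concrete, here is a minimal sketch of what it looks like in the xgboost R package; the feature matrix `X`, response `y`, and the specific parameter values are placeholders, not anything from this repository:

```r
library(xgboost)

# Sketch only: `X` is a numeric feature matrix and `y` a numeric response,
# both placeholders for your own data.
dtrain <- xgb.DMatrix(data = X, label = y)

fit <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eta = 0.05,
    max_depth = 6,
    subsample = 0.8,         # row sampling per tree
    colsample_bytree = 0.8   # column sampling per tree -- the feature mentioned above
  ),
  data = dtrain,
  nrounds = 500
)
```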

Best,

Mark

The-Peaceful-Learner commented 7 years ago

Thank you very much for explaining it so clearly, Mark. Could you please tell me, or link to, which competition the following file refers to: GBM_talk_Austin_R_Users_20140724.R?

Is it the Blue Book for Bulldozers competition from 4 years ago? The competition link is: https://www.kaggle.com/c/bluebook-for-bulldozers/data

What, in your opinion, is a good starting point for learning GBM?

Thanks and Regards, Nishant

mlandry22 commented 7 years ago

Yes, you got it right. And as you noticed, it's quite old now, so my earlier comment about GBM progressing over time certainly applies here. The gbm package in R is the one used there, and I think it has not changed as much as the others have. If that is correct, I would certainly recommend using one of the implementations that has progressed lately (e.g., h2o or xgboost).
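For reference, here is a minimal sketch of the classic R gbm interface that the talk file is based on; the `SalePrice ~ .` formula, the `train` data frame, and the parameter values are placeholders, not the actual code from GBM_talk_Austin_R_Users_20140724.R:

```r
library(gbm)

# Sketch only: `train` is a placeholder data frame with a numeric SalePrice column.
fit <- gbm(
  SalePrice ~ .,
  data = train,
  distribution = "gaussian",
  n.trees = 2000,
  interaction.depth = 6,
  shrinkage = 0.05,
  n.minobsinnode = 10,
  train.fraction = 0.8        # hold out the last 20% of rows as a validation set
)

best_iter <- gbm.perf(fit, method = "test")  # tree count with the lowest held-out error
```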

As for learning it, it depends, of course. I find that decision tree fundamentals help me understand how GBM is working. Those fundamentals are quite intuitive, but it's worth thinking through their limitations, how the GBM parameters control the individual decision trees, and again the limitations of those choices (limitations are sometimes a good thing--GBM improves on standard decision trees through the regularization those constraints provide). There isn't a lot to learn about GBM really, which is why it's a great algorithm and has gotten so popular lately.

I would make sure you understand the performance curve of GBM. It will continue to fit forever, so you will eventually overfit on almost all data sets. Therefore, you need to be aware of your performance on an external validation set (one the GBM did not use to fit any part of its model) and of when to stop. H2O and XGBoost both include parameters to help you with that, right there in the model fit function. This is no different from how machine learning should always be done, but it's as important as ever with GBM; random forest, by comparison, tends to never overfit. Again, H2O and XGBoost both have early stopping criteria so that this is automatic, but if you're just getting started, try throwing way too many trees in and watching how your train/validation error starts to diverge more and more (see the sketch below). Train error will continue to improve, but validation (sometimes called test) error will get worse. I just look at validation error and don't get too worried about the divergence--I'm only looking for the point (number of trees) at which validation error starts to get worse.
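As one way to watch that curve, here is a minimal sketch of early stopping with h2o in R; `train_hex`, `valid_hex`, and the `"target"` column are placeholders for your own split, and the stopping values are just illustrative:

```r
library(h2o)
h2o.init()

# Sketch only: `train_hex` and `valid_hex` are assumed to be H2O frames you
# have already split, with a response column named "target" (all placeholders).
gbm_fit <- h2o.gbm(
  x = setdiff(names(train_hex), "target"),
  y = "target",
  training_frame = train_hex,
  validation_frame = valid_hex,
  ntrees = 5000,              # deliberately far too many trees
  learn_rate = 0.05,
  stopping_rounds = 5,        # stop once the validation metric stops improving
  stopping_tolerance = 1e-4
)

# Training vs. validation error by number of trees -- watch where they diverge.
h2o.scoreHistory(gbm_fit)
```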

One last bit: from when I did a talk on Blue Book for Bulldozers, I remember that the R gbm package requires you to handle categoricals yourself once a factor surpasses 1024 levels. In H2O that limitation is not present, and in XGBoost you have to encode them numerically (most people go with one-hot encoding/binarizing). I like that GBM is hands-off. And while I may be a bit biased, H2O (in R or Python, or others) is definitely the simplest way to get going with a GBM since you don't have to handle anything yourself. While you may want to re-encode some of your data, you never have to, which I like so I can work on problem solving rather than model fitting (within reason).
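If you do go the XGBoost route, here is a minimal sketch of that one-hot encoding step; the data frame `df` and its `target` column are placeholders:

```r
library(Matrix)
library(xgboost)

# Sketch only: `df` is a placeholder data.frame whose factor columns get
# expanded into 0/1 indicator columns; `target` is its numeric response.
X <- sparse.model.matrix(target ~ . - 1, data = df)   # one-hot encode, no intercept
dtrain <- xgb.DMatrix(data = X, label = df$target)
```

(With H2O you can skip this step entirely, as noted above.)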

Specifically, I found the Wikipedia page on gradient boosting pretty helpful for the basics, especially its section on hyperparameter tuning, since it mentions common values found through practice (citing Elements of Statistical Learning, I think, which was written by H2O's advisors). If you found this repository helpful, perhaps some of the items discussed in my own talk on GBM and random forest would help: https://www.youtube.com/watch?v=9wn1f-30_ZY&t=1s

And here is a similar discussion from the author of the scikit-learn GBM implementation that I found useful: https://www.youtube.com/watch?v=IXZKgIsZRm0

Good luck!

Mark