greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Gradient Boosted Trees (XGBoost) #146

traversc opened this issue 8 years ago

traversc commented 8 years ago

In the same line of thought as issue #144, on algorithms that claim to match or have beaten deep learning methods, Gradient Boosted Trees is one of them.

http://xgboost.readthedocs.io/en/latest/model.html

XGBoost is short for “Extreme Gradient Boosting”, where the term “Gradient Boosting” is proposed in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. XGBoost is based on this original model.
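To make the boosting idea concrete, here is a minimal sketch (not from the Friedman paper or the XGBoost codebase; it uses squared-error loss, for which the negative gradient reduces to the residual, and illustrative hyperparameters):

```python
# Minimal sketch of Friedman-style gradient boosting with squared-error
# loss: each new shallow tree is fit to the residuals (the negative
# gradient) of the current ensemble. Hyperparameters are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())  # start from a constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction  # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict_boosted(base, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```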

This method has won a number of recent machine learning competitions (http://www.kdnuggets.com/2016/03/xgboost-implementing-winningest-kaggle-algorithm-spark-flink.html)

It has also recently been applied to EHR data (http://www.aclweb.org/anthology/W/W16/W16-29.pdf#page=13)

It is similar to Random Forest in that it is an ensemble of trees. However, unlike RF, which grows each tree independently on bootstrapped samples with random feature subsets, XGBoost builds trees sequentially, fitting each new tree to the gradient of the loss of the current ensemble ("gradient boosting"). It purports to be faster and to achieve equivalent or better performance with far fewer trees.
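For comparison, here is a hedged usage sketch of the two methods side by side (assuming the Python `xgboost` and `scikit-learn` packages; the synthetic dataset and all hyperparameters are placeholders, not a benchmark):

```python
# Side-by-side sketch: bagged independent trees (Random Forest) vs.
# sequentially boosted trees (XGBoost). Settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: each tree grown independently on a bootstrap sample
# with random feature subsets; typically uses many trees.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# XGBoost: trees added sequentially, each fit to the gradient of the
# loss of the current ensemble; often competitive with fewer trees.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb.fit(X_train, y_train)

print("RF accuracy:     ", rf.score(X_test, y_test))
print("XGBoost accuracy:", xgb.score(X_test, y_test))
```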

akundaje commented 8 years ago

A few clarifications

traversc commented 8 years ago

Thanks for your insight; I really appreciate it as I learn about new algorithms. I didn't mean to imply that #144 used boosted trees; I was trying to continue the discussion of algorithms that achieve state-of-the-art performance but are not based on deep learning.

According to the second article I linked, XGBoost was used in "more than half of the winning solutions in machine learning challenges" on Kaggle. I am not entirely sure why, but I suspect contributing factors include time/computational constraints and the types of datasets used in those competitions.