Hyperparameter Tuning for xgboost in R

by 医学和生信笔记

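As a boosting algorithm developed from GBDT (Gradient Boosting Decision Tree), xgboost needs little introduction. The previous post covered its basic use in R; this post explains what its main parameters mean and how to tune its hyperparameters.

Because xgboost is an ensemble algorithm built on gradient boosted trees (GBDT), its parameters fall into three groups: those of the ensemble itself, those of the weak learners (decision trees) being combined, and those for other parts of the process.

Preparing the data

We use the Pima Indians diabetes dataset.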

rm(list = ls())
library(MASS)
library(xgboost)

data("Pima.tr")
data("Pima.te")
pima <- rbind(Pima.tr, Pima.te)

set.seed(502)
ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
pima.train <- pima[ind == 1,]
pima.test <- pima[ind == 2,]

dim(pima.train)
## [1] 385   8
dim(pima.test)
## [1] 147   8

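The training set has 385 rows and 8 columns, and the test set has 147 rows and 8 columns. The type column is the outcome: Yes means the patient has diabetes and No means the patient does not.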
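
Because sample() assigns each row to a group independently, the split is only approximately 70/30 (here 385 of 532 rows, about 72%, land in the training set). It can be worth confirming that both outcome classes appear in similar proportions in the two parts. A minimal sketch that repeats the split above; the exact counts depend on the random seed:

```r
library(MASS)

pima <- rbind(Pima.tr, Pima.te)

set.seed(502)
ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
pima.train <- pima[ind == 1, ]
pima.test  <- pima[ind == 2, ]

# Proportion of "No" and "Yes" in each part
round(prop.table(table(pima.train$type)), 3)
round(prop.table(table(pima.test$type)), 3)
```

If the proportions differed markedly, a stratified split would be one option to consider.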

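Parameters explained

xgboost() is a simple wrapper around xgb.train(), which is the advanced interface for training xgboost models. xgboost has a great many parameters; we cover some of them here, and a few were already introduced in the previous post.

nrounds: the maximum number of boosting iterations (the number of trees in the final model).
early_stopping_rounds: a positive integer K; training stops if performance on the validation set has not improved for K consecutive rounds.
print_every_n: when verbose > 0, print evaluation messages every n-th iteration.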
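
To illustrate how these three arguments interact, here is a minimal sketch using xgb.cv(), which accepts the same nrounds, early_stopping_rounds, and print_every_n arguments and builds its own validation folds. The parameter values and the choice of 5 folds are assumptions for illustration, not tuned settings, and the column name test_error_mean assumes eval_metric = "error":

```r
library(MASS)
library(xgboost)

pima <- rbind(Pima.tr, Pima.te)
dtrain <- xgb.DMatrix(data = as.matrix(pima[, 1:7]),
                      label = ifelse(pima$type == "Yes", 1, 0))

set.seed(1)
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "error",
                           eta = 0.3,
                           max_depth = 3),
             data = dtrain,
             nrounds = 200,               # upper bound on boosting rounds
             nfold = 5,                   # 5-fold cross-validation
             early_stopping_rounds = 10,  # stop after 10 rounds without improvement
             print_every_n = 10,          # log every 10th round
             verbose = 1)

# Round with the lowest mean validation error across folds
eval_log <- as.data.frame(cv$evaluation_log)
best_round <- which.min(eval_log$test_error_mean)
best_round
```

With early stopping, training usually ends well before the nrounds upper bound, so nrounds can be set generously.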

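params is the most important argument of xgb.train(). It takes a list that can hold a large number of parameters, which fall into three main groups and are the ones to focus on when tuning:

General parameters

booster: the booster type, gbtree (the default) or gblinear. gbtree works better in most cases, but gblinear may do better when the predictors have a clearly linear relationship with the outcome. This is not guaranteed, and the developers suggest trying both.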

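Booster parameters

Tree booster parameters:

eta: the learning rate η, which shrinks each tree's contribution to the final model; default 0.3.
gamma: the minimum loss reduction required to make a further split on a leaf node.
max_depth: the maximum depth of a single tree.
min_child_weight: the minimum sum of instance weights required in a child node; default 1.
subsample: the fraction of the training observations sampled for each tree; default 1 (100%).
colsample_bytree: the fraction of features randomly sampled when building each tree; default 1 (100% of the features).
lambda: the L2 regularization term on the weights (ridge-type, not lasso); default 1.
alpha: the L1 regularization term on the weights (lasso-type); default 0.
...

Linear booster parameters:

...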
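
For the tree booster, it may help to see where gamma, lambda, and alpha enter. As a sketch in the usual notation (the symbols below are not from the original post): with T the number of leaves of a tree f and w_j the weight of leaf j, the penalty added to the training loss for each tree can be written as

```latex
\mathrm{Obj} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \left| w_j \right|
```

So gamma penalizes each additional leaf, lambda shrinks leaf weights through a squared (ridge-type) penalty, and alpha shrinks them through an absolute-value (lasso-type) penalty.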

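Task parameters

objective: specifies the learning task and the corresponding objective function; custom functions are supported. The built-in options cover regression, classification, survival analysis, ranking, and more, including:

reg:squarederror: regression with squared error loss (the default).
reg:squaredlogerror: regression with squared log error loss.
reg:logistic: logistic regression.
reg:pseudohubererror: regression with the Pseudo-Huber loss.
binary:logistic: logistic regression for binary classification; outputs probabilities.
binary:logitraw: logistic regression for binary classification; outputs the score before the logistic transformation.
binary:hinge: hinge loss for binary classification; outputs 0 or 1.
count:poisson: Poisson regression for count data.
survival:cox: Cox regression for right-censored survival data; predictions are returned on the hazard ratio (HR) scale.
survival:aft: accelerated failure time model.
...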

base_score: the initial prediction score of all instances (the global bias).
eval_metric: the evaluation metric for validation data.

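Next we fit a model with a baseline set of parameter values (mostly the defaults, with max_depth = 3 and gamma = 0.5 set by hand) to see how it performs, and along the way learn how to prepare these parameters.

Note that all predictors must be numeric (this follows from the input format xgboost expects, described earlier: the matrix must be entirely numeric), so categorical variables need to be converted first, for example with dummy variables or one-hot encoding.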
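
As a minimal sketch of that conversion, base R's model.matrix() can turn factor columns into numeric indicator columns. The data frame below is a made-up toy example with hypothetical variables (the Pima predictors are already numeric and need no such step):

```r
# Hypothetical toy data with two categorical predictors
df <- data.frame(
  age    = c(34, 51, 47, 29),
  sex    = factor(c("F", "M", "M", "F")),
  smoker = factor(c("never", "former", "current", "never"))
)

# Dummy coding: one reference level per factor is dropped;
# [, -1] removes the intercept column
x_dummy <- model.matrix(~ ., data = df)[, -1]
colnames(x_dummy)

# One-hot encoding: keep an indicator column for every level
factor_cols <- names(df)[sapply(df, is.factor)]
x_onehot <- model.matrix(
  ~ . - 1, data = df,
  contrasts.arg = lapply(df[factor_cols], contrasts, contrasts = FALSE)
)
colnames(x_onehot)
```

Either matrix is fully numeric and can be passed to xgb.DMatrix(). For tree boosters the two encodings usually behave similarly, so the choice is largely a matter of convenience.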

# Choose the parameter values
param <- list(objective = "binary:logistic",
              booster = "gbtree",
              eval_metric = "error",
              eta = 0.3,
              max_depth = 3,
              subsample = 1,
              colsample_bytree = 1,
              gamma = 0.5)

# Prepare the predictors and the outcome (1 = Yes, 0 = No)
x <- as.matrix(pima.train[, 1:7])
y <- ifelse(pima.train$type == "Yes", 1, 0)

# Wrap them in xgboost's dedicated data format
train.mat <- xgb.DMatrix(data = x, label = y)
train.mat
## xgb.DMatrix  dim: 385 x 7  info: label  colnames: yes

The parameters and data are now ready, so we can start training.

set.seed(1)
xgb.fit <- xgb.train(params = param, 
                     data = train.mat, 
                     nrounds = 100)

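With this fitted model you can inspect variable importance, view the details of each tree, obtain predicted class probabilities, draw a ROC curve, and so on. See the previous post for details; they are not repeated here.

Hyperparameter tuning

Next we tune these parameters, using caret for the demonstration.

caret is a classic all-in-one machine learning package for R and is very easy to use. We have also written a detailed series of tutorials on it.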

library(caret)

# Define the grid of candidate values (2 x 3 x 2 x 2 = 24 combinations)
grid <- expand.grid(nrounds = c(75, 100),
                    colsample_bytree = 1,
                    min_child_weight = 1,
                    eta = c(0.01, 0.1, 0.3),
                    gamma = c(0.5, 0.25),
                    subsample = 0.5,
                    max_depth = c(2, 3))

# Resampling control settings: 5-fold cross-validation
cntrl <- trainControl(method = "cv",
                      number = 5,
                      verboseIter = FALSE,
                      returnData = FALSE,
                      returnResamp = "final")

# Run the tuning
set.seed(1)
train.xgb <- train(x = pima.train[, 1:7],
                   y = pima.train[, 8],
                   trControl = cntrl,
                   tuneGrid = grid,
                   method = "xgbTree")

train.xgb
## eXtreme Gradient Boosting 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 308, 309, 308, 307, 308 
## Resampling results across tuning parameters:
## 
##   eta   max_depth  gamma  nrounds  Accuracy   Kappa    
##   0.01  2          0.25    75      0.7865932  0.4700801
##   0.01  2          0.25   100      0.7971195  0.5003081
##   0.01  2          0.50    75      0.7971528  0.4945394
##   0.01  2          0.50   100      0.8101407  0.5317302
##   0.01  3          0.25    75      0.7971537  0.5018986
##   0.01  3          0.25   100      0.7893948  0.4854910
##   0.01  3          0.50    75      0.8050476  0.5221906
##   0.01  3          0.50   100      0.7997853  0.5136332
##   0.10  2          0.25    75      0.7896980  0.5002401
##   0.10  2          0.25   100      0.7921605  0.5073063
##   0.10  2          0.50    75      0.7947579  0.5167947
##   0.10  2          0.50   100      0.7791717  0.4828563
##   0.10  3          0.25    75      0.7922612  0.5044835
##   0.10  3          0.25   100      0.7896297  0.5011073
##   0.10  3          0.50    75      0.7845023  0.4923603
##   0.10  3          0.50   100      0.7766418  0.4779629
##   0.30  2          0.25    75      0.7557592  0.4362591
##   0.30  2          0.25   100      0.7609198  0.4504284
##   0.30  2          0.50    75      0.7635864  0.4475264
##   0.30  2          0.50   100      0.7740093  0.4729577
##   0.30  3          0.25    75      0.7636881  0.4478178
##   0.30  3          0.25   100      0.7637547  0.4534616
##   0.30  3          0.50    75      0.7583574  0.4403953
##   0.30  3          0.50   100      0.7427721  0.3975159
## 
## Tuning parameter 'colsample_bytree' was held constant at a value of 1
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## 
## Tuning parameter 'subsample' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 2, eta
##  = 0.01, gamma = 0.5, colsample_bytree = 1, min_child_weight = 1 and
##  subsample = 0.5.

The output reports the best hyperparameters: nrounds = 100, max_depth = 2, eta = 0.01, gamma = 0.5, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5.
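
caret also stores these values in the bestTune element of the fitted object (train.xgb$bestTune), so they do not have to be copied by hand. A minimal sketch of turning that one-row data frame into a params list; the best_tune object below is a stand-in with the values from the output above, so that the example runs on its own:

```r
# Stand-in for train.xgb$bestTune (values copied from the caret output)
best_tune <- data.frame(nrounds = 100, max_depth = 2, eta = 0.01, gamma = 0.5,
                        colsample_bytree = 1, min_child_weight = 1,
                        subsample = 0.5)

# nrounds is an argument of xgb.train() itself, not an entry of params
booster_args <- as.list(best_tune[, setdiff(names(best_tune), "nrounds")])

param <- c(list(objective = "binary:logistic",
                booster = "gbtree",
                eval_metric = "error"),
           booster_args)
best_nrounds <- best_tune$nrounds

str(param)
```

The resulting param and best_nrounds can then be passed to xgb.train() as params and nrounds.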

There is a lot to explore and visualize in this result, for example:

plot(train.xgb)

ggplot2 is also supported.

ggplot(train.xgb)

For more methods, see our caret tutorial series.

Fitting the model

Next we refit the model with the best hyperparameters.

# Use the best parameter values
param <- list(objective = "binary:logistic",
              booster = "gbtree",
              eval_metric = "error",
              eta = 0.01,
              max_depth = 2,
              subsample = 0.5,
              colsample_bytree = 1,
              gamma = 0.5)


# Fit the model
set.seed(1)
xgb.fit <- xgb.train(params = param, 
                     data = train.mat, 
                     nrounds = 100)

Now draw a ROC curve: first compute the predicted probabilities on the training set, then plot the curve. It is straightforward:

pred_train <- predict(xgb.fit, newdata = train.mat)
head(pred_train)
## [1] 0.3161996 0.3288908 0.2387750 0.5719132 0.5012382 0.5890657

library(ROCR)
pred <- prediction(pred_train, pima.train$type)
perf <- performance(pred, "tpr", "fpr")
auc <- round(performance(pred, "auc")@y.values[[1]],digits = 4)

plot(perf, 
     main = paste("ROC curve (", "AUC = ",auc,")"), 
     col = 2, 
     lwd = 2)
abline(0,1, lty = 2, lwd = 2)
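
As a cross-check that does not depend on ROCR, the AUC equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, which can be computed from ranks (the Mann-Whitney statistic). A sketch with made-up probabilities and labels rather than the model's output; auc_by_rank is a hypothetical helper name:

```r
# Hypothetical helper: AUC from ranks (ties receive average ranks)
auc_by_rank <- function(prob, is_positive) {
  r  <- rank(prob)
  n1 <- sum(is_positive)    # number of positive cases
  n0 <- sum(!is_positive)   # number of negative cases
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up predicted probabilities and true labels
prob  <- c(0.32, 0.33, 0.24, 0.57, 0.50, 0.59, 0.81, 0.12)
truth <- c("No", "No", "Yes", "Yes", "No", "No", "Yes", "No")

auc_by_rank(prob, truth == "Yes")
```

In this toy data, 10 of the 15 positive/negative pairs are ordered correctly, so the result is 10/15.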

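For the confusion matrix and related metrics, see the previous post on xgboost; it is simply a matter of converting the probabilities into hard class labels.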
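
A minimal sketch of that conversion in base R, using made-up probabilities and labels rather than the model's output. The 0.5 cutoff is an assumption; another threshold may suit a given application better:

```r
# Made-up predicted probabilities and true labels
prob  <- c(0.32, 0.33, 0.24, 0.57, 0.50, 0.59, 0.81, 0.12)
truth <- factor(c("No", "No", "Yes", "Yes", "No", "No", "Yes", "No"),
                levels = c("No", "Yes"))

# Convert probabilities to hard classes at a 0.5 cutoff
pred_class <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are true classes
cm <- table(Predicted = pred_class, Actual = truth)
cm

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])
specificity <- cm["No", "No"] / sum(cm[, "No"])
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With the model in this post, prob would be pred_train or pred_test and truth would be the type column of the matching data set.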

Test set

First, convert the test set to the required format.

# Wrap the test data in xgboost's dedicated format
test.mat <- xgb.DMatrix(data = as.matrix(pima.test[, 1:7]), 
                         label = ifelse(pima.test$type == "Yes", 1, 0))

pred_test <- predict(xgb.fit, newdata = test.mat)
head(pred_test)
## [1] 0.2405169 0.6161979 0.6299443 0.2367187 0.5688603 0.5854614

library(ROCR)
pred <- prediction(pred_test, pima.test$type)
perf <- performance(pred, "tpr", "fpr")
auc <- round(performance(pred, "auc")@y.values[[1]],digits = 4)

plot(perf, 
     main = paste("ROC curve (", "AUC = ",auc,")"), 
     col = 2, 
     lwd = 2)
abline(0,1, lty = 2, lwd = 2)

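Easy!

Some metrics are based on predicted probabilities and others on predicted classes. With the binary:logistic objective used here, xgboost returns predicted probabilities, so we convert them to classes ourselves in order to compute the class-based metrics.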
