topepo / caret

caret (Classification And Regression Training): an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Adding XGBOOST to Caret Model List #147

Closed: larry77 closed this issue 9 years ago

larry77 commented 9 years ago

I cannot be of much help here apart from suggesting that the xgboost library would be a very good tool to add to caret.

zachmayer commented 9 years ago

I'm already looking into it! xgboost models are a little funky, as the R object is little more than an external pointer to a C++ object. Also, xgboost isn't on CRAN, which adds more issues.

That said, we can write the modeling functions, and leave the package installation up to the user (I think devtools::install_github works)
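
For reference, a minimal sketch of that installation route (the repository path and subdirectory are assumptions; check the xgboost README for the current instructions):

# Hedged sketch: the R sources are assumed to live in the R-package
# subdirectory of the main xgboost repository.
devtools::install_github("dmlc/xgboost", subdir = "R-package")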

tqchen commented 9 years ago

The problem of saving/loading the R object has already been fixed, so I assume that obstacle is now cleared.
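
For context, the xgboost R package ships dedicated helpers for this, since a plain save()/load() of the handle alone used to lose the underlying C++ object. A minimal sketch (object and file names are placeholders):

# bst is a fitted booster; write it to disk and restore it later
xgb.save(bst, "xgboost.model")
bst2 <- xgb.load("xgboost.model")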

The CRAN issue was a bit of a pain for us. It was our mistake: we were not responsive to the update requests in time and fell out of sync. We are still working to resolve this. The quality of the tool and its cross-platform guarantees, however, have steadily improved over time and will keep improving.

larry77 commented 9 years ago

Thanks for the prompt reply. So, it looks like there is a strong push in the direction of integrating xgboost into the caret framework. I wonder: when this is done (even at an experimental level and outside CRAN at the beginning), will interested users be notified automatically (for instance by email)? Cheers

Lorenzo


zachmayer commented 9 years ago

This is probably the best place to keep track of the task. You can also check the CRAN page for updates and look at the changelog.

But this thread should email you when there's an update.


topepo commented 9 years ago

Here is some code to test for trees. I'll get the linear model version soon:

modelInfo <- list(label = "eXtreme Gradient Boosting",
                  library = c("xgboost", "plyr"),
                  type = c("Regression", "Classification"),
                  parameters = data.frame(parameter = c('nrounds', 'max_depth', 'eta'),
                                          class = rep("numeric", 3),
                                          label = c('# Boosting Iterations', 'Max Tree Depth', 
                                                    'Shrinkage')),
                  grid = function(x, y, len = NULL) expand.grid(max_depth = seq(1, len),
                                                                nrounds = floor((1:len) * 50),
                                                                eta = .3),
                  loop = function(grid) {     
                    loop <- ddply(grid, c("eta", "max_depth"),
                                  function(x) c(nrounds = max(x$nrounds)))
                    submodels <- vector(mode = "list", length = nrow(loop))
                    for(i in seq(along = loop$nrounds)) {
                      index <- which(grid$max_depth == loop$max_depth[i] & 
                                       grid$eta == loop$eta[i])
                      trees <- grid[index, "nrounds"] 
                      submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
                    }    
                    list(loop = loop, submodels = submodels)
                  },
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    if(is.factor(y)) {
                      if(length(lev) == 2) {
                        y <- ifelse(y == lev[1], 1, 0) 
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(eta = param$eta, 
                                              max_depth = param$max_depth), 
                                         data = dat,
                                         nrounds = param$nrounds,
                                         objective = "binary:logistic",
                                         ...)
                      } else {
                        y <- as.numeric(y) - 1
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(eta = param$eta, 
                                              max_depth = param$max_depth), 
                                         data = dat,
                                         num_class = length(lev),
                                         nrounds = param$nrounds,
                                         objective = "multi:softprob",
                                         ...)
                      }     
                    } else {
                      dat <- xgb.DMatrix(as.matrix(x), label = y)
                      out <- xgb.train(list(eta = param$eta, 
                                            max_depth = param$max_depth), 
                                       data = dat,
                                       nrounds = param$nrounds,
                                       objective = "reg:linear",
                                       ...)
                    }
                    out

                  },
                  predict = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(modelFit$problemType == "Classification") {
                      if(length(modelFit$obsLevels) == 2) {
                        out <- ifelse(out >= .5, 
                                      modelFit$obsLevels[1], 
                                      modelFit$obsLevels[2])
                      } else {
                        out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                        out <- modelFit$obsLevels[apply(out, 1, which.max)]
                      }
                    }

                    if(!is.null(submodels)) {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      for(j in seq(along = submodels$nrounds)) {
                        tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
                        if(modelFit$problemType == "Classification") {
                          if(length(modelFit$obsLevels) == 2) {
                            tmp_pred <- ifelse(tmp_pred >= .5, 
                                               modelFit$obsLevels[1], 
                                               modelFit$obsLevels[2])
                          } else {
                            tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
                            tmp_pred <- modelFit$obsLevels[apply(tmp_pred, 1, which.max)]
                          }
                        }
                        tmp[[j+1]]  <- tmp_pred
                      }
                      out <- tmp
                    }
                    out  
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(length(modelFit$obsLevels) == 2) {
                      out <- cbind(out, 1 - out)
                      colnames(out) <- modelFit$obsLevels
                    } else {
                      out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                      colnames(out) <- modelFit$obsLevels
                    }
                    out <- as.data.frame(out)

                    if(!is.null(submodels)) {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      for(j in seq(along = submodels$nrounds)) {
                        tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
                        if(length(modelFit$obsLevels) == 2) {
                          tmp_pred <- cbind(tmp_pred, 1 - tmp_pred)
                          colnames(tmp_pred) <- modelFit$obsLevels
                        } else {
                          tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
                          colnames(tmp_pred) <- modelFit$obsLevels
                        }
                        tmp_pred <- as.data.frame(tmp_pred)
                        tmp[[j+1]]  <- tmp_pred
                      }
                      out <- tmp
                    }
                    out  
                  },
                  predictors = function(x, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    x$xNames[x$xNames %in% imp$Feature]
                  },
                  varImp = function(object, numTrees = NULL, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    imp <- as.data.frame(imp)[, 1:2]
                    rownames(imp) <- as.character(imp[,1])
                    imp <- imp[,2,drop = FALSE]
                    colnames(imp) <- "Overall"
                    imp   
                  },
                  levels = function(x) x$obsLevels,
                  tags = c("Tree-Based Model", "Boosting", "Ensemble Model", "Implicit Feature Selection"),
                  sort = function(x) {
                    # This is a toss-up, but the # trees probably adds
                    # complexity faster than number of splits
                    x[order(x$nrounds, x$max_depth, x$eta),] 
                  })
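
The loop element is what keeps large grids affordable: for each (eta, max_depth) pair, caret fits a single model at the largest nrounds and recovers predictions for the smaller values via ntreelimit. A rough illustration of the collapse (grid values invented for the example):

grid <- expand.grid(nrounds = c(50, 100), max_depth = 1:2, eta = .3)
# loop would reduce this to two fits (nrounds = 100 at max_depth 1 and 2);
# the submodels list holds nrounds = 50 for each fit, to be predicted
# from the fitted boosters with ntreelimit instead of refitting.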

For example (SLC14_1 and twoClassSim are caret's built-in data simulation functions):

library(caret)
library(pROC)

grid <- expand.grid(nrounds = seq(1, 201, by = 25),
                    max_depth = 1:6,
                    eta = (1:4)/10)

set.seed(1)
reg1 <- SLC14_1(200)
reg2 <- SLC14_1(200)

set.seed(2)
mod1 <- train(y ~ ., data = reg1, 
              method = modelInfo,
              tuneGrid = grid,
              trControl = trainControl(method = "cv"))
postResample(predict(mod1, reg2), reg2$y)

set.seed(3)
class1 <- twoClassSim(200)
class2 <- twoClassSim(200)

set.seed(4)
mod2 <- train(Class ~ ., data = class1, 
              method = modelInfo,
              tuneGrid = grid,
              trControl = trainControl(method = "cv"))
confusionMatrix(predict(mod2, class2), class2$Class)

set.seed(4)
mod3 <- train(Class ~ ., data = class1, 
              method = modelInfo,
              tuneGrid = grid,
              metric = "ROC",
              trControl = trainControl(method = "cv", 
                                       summaryFunction = twoClassSummary,
                                       classProbs = TRUE))
probs <- predict(mod3, class2, type = "prob")
roc(class2$Class, probs[, "Class1"], levels = rev(levels(class2$Class)))
topepo commented 9 years ago

Here is some code for the linear models:

modelInfo <- list(label = "eXtreme Gradient Boosting",
                  library = c("xgboost"),
                  type = c("Regression", "Classification"),
                  parameters = data.frame(parameter = c('nrounds', 'lambda', 'alpha'),
                                          class = rep("numeric", 3),
                                          label = c('# Boosting Iterations', 'L2 Regularization', 
                                                    'L1 Regularization')),
                  grid = function(x, y, len = NULL) 
                    expand.grid(nrounds = floor((1:len) * 50),
                                lambda = c(0, 10 ^ seq(-1, -4, length = len - 1)),
                                alpha = c(0, 10 ^ seq(-1, -4, length = len - 1))),
                  loop = NULL,
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    if(is.factor(y)) {
                      if(length(lev) == 2) {
                        y <- ifelse(y == lev[1], 1, 0) 
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                    out <- xgb.train(list(booster = "gblinear",
                                          lambda = param$lambda, 
                                          alpha = param$alpha), 
                                         data = dat,
                                         nrounds = param$nrounds,
                                         objective = "binary:logistic",
                                         ...)
                      } else {
                        y <- as.numeric(y) - 1
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(booster = "gblinear",
                                              lambda = param$lambda, 
                                              alpha = param$alpha), 
                                         data = dat,
                                         num_class = length(lev),
                                         nrounds = param$nrounds,
                                         objective = "multi:softprob",
                                         ...)
                      }     
                    } else {
                      dat <- xgb.DMatrix(as.matrix(x), label = y)
                      out <- xgb.train(list(booster = "gblinear",
                                            lambda = param$lambda, 
                                            alpha = param$alpha), 
                                       data = dat,
                                       nrounds = param$nrounds,
                                       objective = "reg:linear",
                                       ...)
                    }
                    out
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(modelFit$problemType == "Classification") {
                      if(length(modelFit$obsLevels) == 2) {
                        out <- ifelse(out >= .5, 
                                      modelFit$obsLevels[1], 
                                      modelFit$obsLevels[2])
                      } else {
                        out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                        out <- modelFit$obsLevels[apply(out, 1, which.max)]
                      }
                    }
                    out  
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(length(modelFit$obsLevels) == 2) {
                      out <- cbind(out, 1 - out)
                      colnames(out) <- modelFit$obsLevels
                    } else {
                      out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                      colnames(out) <- modelFit$obsLevels
                    }
                    as.data.frame(out)
                  },
                  predictors = function(x, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    x$xNames[x$xNames %in% imp$Feature]
                  },
                  varImp = function(object, numTrees = NULL, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    imp <- as.data.frame(imp)[, 1:2]
                    rownames(imp) <- as.character(imp[,1])
                    imp <- imp[,2,drop = FALSE]
                    colnames(imp) <- "Overall"
                    imp   
                  },
                  levels = function(x) x$obsLevels,
                  tags = c("Linear Classifier Models", 
                           "Linear Regression Models",
                           "L1 Regularization Models",
                           "L2 Regularization Models",
                           "Boosting", "Ensemble Model", "Implicit Feature Selection"),
                  sort = function(x) {
                    # This is a toss-up, but the number of boosting
                    # iterations probably adds complexity faster than
                    # the amount of regularization
                    x[order(x$nrounds, x$alpha, x$lambda),] 
                  })
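
A usage sketch analogous to the examples in the first code post, reusing the class1 and class2 data simulated there (the tuning values are arbitrary):

grid <- expand.grid(nrounds = c(50, 100, 150),
                    lambda = c(0, 10 ^ seq(-1, -4, length = 3)),
                    alpha = c(0, 10 ^ seq(-1, -4, length = 3)))

set.seed(5)
mod4 <- train(Class ~ ., data = class1, 
              method = modelInfo,
              tuneGrid = grid,
              trControl = trainControl(method = "cv"))
confusionMatrix(predict(mod4, class2), class2$Class)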
larry77 commented 9 years ago

Thanks for the excellent work! If I install the development version of caret, are these functions already incorporated? Cheers

Lorenzo


topepo commented 9 years ago

No, you would just source the two model lists and use the code shown at the bottom of the first post.

They will eventually be part of the package but not before I write tests and some documentation.


rsuhada commented 9 years ago

Thank you very much for this quick implementation!

It seems to work well, but unfortunately I can't seem to coax it into doing CV with a log loss in a multi-class classification problem.

If I run it with trainControl method='none' I get the expected result, but if I set it to CV, it switches from mlogloss to Accuracy. If I write my own summaryFunction with the log-loss calculation, it crashes because the summaryFunction receives only class labels, not class probabilities. I've tried blindly adding type='prob' to all the predict calls in your modelInfo, but no luck...

By any chance, do you happen to know what I'm doing wrong? Thank you very much!

zachmayer commented 9 years ago

You probably need to add classProbs = TRUE to your trainControl function.


rsuhada commented 9 years ago

Yes, I have classProbs = TRUE too (and it works as long as method = 'none'...).

topepo commented 9 years ago

Can you send us the code and an example for testing?


rsuhada commented 9 years ago

A small grid for testing purposes:

## small grid for test
grid.0 <- expand.grid(nrounds = c(10, 20),
                      max_depth = 6,
                      eta = 0.3)

Case A:

## train with cv (number/repeats low just for testing speed)
train.ctrl <- trainControl(method = 'cv', number = 3, repeats = 1,
                           classProbs = TRUE,
                           verboseIter=TRUE)

This training defaults to accuracy instead of the required log loss: The metric "mlogloss" was not in the result set. Accuracy will be used instead.

Case B: I define a summary function:

mc_logloss <- function(observed, predicted, model, eps=1e-15){
    ## multi-class log-loss
    predicted <- pmin(pmax(predicted, eps), 1-eps)
    out <- -1.0*sum(observed*log(predicted))/nrow(observed)
    names(out) <- 'mlogloss'
    return(out)
}

The corresponding trainControl:

train.ctrl <- trainControl(method = 'cv', number = 3, repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = mc_logloss,
                           verboseIter=TRUE)

This leads to a crash because the probabilities are not being passed in: Error in log(predicted) (from #4) : non-numeric argument to mathematical function

The train function for Case A and Case B is the same:

xgb0 <- train(target ~ ., data = train,
              type = "Classification",
              metric = 'mlogloss',
              objective = 'multi:softprob',
              num.class = 9,
              nthread = 4,
              method = modelInfo,
              tuneGrid = grid.0,
              trControl = train.ctrl)

As mentioned, method='none' trains on mlogloss and returns class probabilities as required; it's just that with CV I can't get it to do both. Thank you for your support!

topepo commented 9 years ago

The class probabilities are there, but your function is referencing them incorrectly. See the custom models page of the caret documentation for more details. Basically, the summary function receives a data frame with one column of probabilities per class, and the pred column holds the predicted class (a factor).
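
In other words, a working multi-class log loss has to follow caret's summaryFunction signature. A minimal sketch, assuming the class levels arrive in lev and match the probability column names:

mc_logloss <- function(data, lev = NULL, model = NULL, eps = 1e-15) {
    ## one probability column per class; clip to avoid log(0)
    probs <- as.matrix(data[, lev, drop = FALSE])
    probs <- pmin(pmax(probs, eps), 1 - eps)
    ## pick the probability assigned to the observed class in each row
    idx <- cbind(seq_len(nrow(data)), match(data$obs, lev))
    out <- -mean(log(probs[idx]))
    names(out) <- "mlogloss"
    out
}

Since smaller is better here, the train() call would also need maximize = FALSE next to metric = 'mlogloss'.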

Also, you can have xgb.train use the multinomial log loss during boosting. See ?xgb.train for more details.
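
For the latter, the metric can be requested in the parameter list passed to xgb.train. A hedged sketch based on the multi-class branch of the tree code above; note that this changes what xgboost reports while boosting, not the metric caret uses to pick a model:

out <- xgb.train(list(eta = param$eta, 
                      max_depth = param$max_depth,
                      eval_metric = "mlogloss"), 
                 data = dat,
                 num_class = length(lev),
                 nrounds = param$nrounds,
                 objective = "multi:softprob",
                 ...)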

larry77 commented 9 years ago

Hello, I am not sure this is relevant, but the following code

###########################################################

library(readr)
library(xgboost)
library(caret)

train <- read_csv("train.csv")

x <- subset(train, select = -c(target))
x <- as.matrix(x)
x <- matrix(as.numeric(x), nrow(x), ncol(x))  ## now x is a matrix of purely numerical values

y <- train$target
y <- gsub('Class_', '', y)
y <- as.integer(y) - 1  ## xgboost takes class labels in [0, numOfClass)

In other words: y is now a numeric vector in 0...(numOfClass-1)

grid <- expand.grid(nrounds = seq(1, 601, by = 25),
                    max_depth = 5:17,
                    eta = seq(0.01, 0.31, by = 0.05))

mod1 <- train(x, y,
              method = modelInfo,
              tuneGrid = grid,
              nthread = 10,
              verbose = TRUE,
              trControl = trainControl(method = "repeatedcv",
                                       number = 10,
                                       repeats = 10,
                                       classProbs = TRUE))

##############################################################

where train.csv is taken from a Kaggle competition:

http://bit.ly/1I2bauw

returns a segmentation fault (see the end of the email). I used the tree-based model code. Am I making a mistake somewhere? Cheers

Lorenzo

##########################################################

*** caught segfault ***
address 0x190, cause 'memory not mapped'

Traceback:
1: .Call("XGBoosterUpdateOneIter_R", booster, as.integer(iter), dtrain, PACKAGE = "xgboost")
2: xgb.iter.update(bst$handle, dtrain, i - 1, obj)
3: xgb.train(list(eta = param$eta, max_depth = param$max_depth), data = dat, nrounds = param$nrounds, objective = "reg:linear", ...)
4: method$fit(x = x, y = y, wts = wts, param = tuneValue, lev = obsLevels, last = last, classProbs = classProbs, ...)
5: createModel(x = x[modelIndex, , drop = FALSE], y = y[modelIndex], wts = wts[modelIndex], method = method, tuneValue = info$loop[parm, , drop = FALSE], obsLevels = lev, pp = ppp, classProbs = ctrl$classProbs, ...)
6: doTryCatch(return(expr), name, parentenv, handler)
7: tryCatchOne(expr, names, parentenv, handlers[[1L]])
8: tryCatchList(expr, classes, parentenv, handlers)
9: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
10: try(createModel(x = x[modelIndex, , drop = FALSE], y = y[modelIndex], wts = wts[modelIndex], method = method, tuneValue = info$loop[parm, , drop = FALSE], obsLevels = lev, pp = ppp, classProbs = ctrl$classProbs, ...), silent = TRUE)
11: eval(expr, envir, enclos)
12: eval(xpr, envir = envir)
13: doTryCatch(return(expr), name, parentenv, handler)
14: tryCatchOne(expr, names, parentenv, handlers[[1L]])
15: tryCatchList(expr, classes, parentenv, handlers)
16: tryCatch(eval(xpr, envir = envir), error = function(e) e)
17: doTryCatch(return(expr), name, parentenv, handler)
18: tryCatchOne(expr, names, parentenv, handlers[[1L]])
19: tryCatchList(expr, classes, parentenv, handlers)
20: tryCatch({ repeat { args <- nextElem(it) if (obj$verbose) { cat(sprintf("evaluation # %d:\n", i)) print(args) } for (a in names(args)) assign(a, args[[a]], pos = envir, inherits = FALSE) r <- tryCatch(eval(xpr, envir = envir), error = function(e) e) if (obj$verbose) { cat("result of evaluating expression:\n") print(r) } tryCatch(accumulator(list(r), i), error = function(e) { cat("error calling combine function:\n") print(e) NULL }) i <- i + 1 }}, error = function(e) { if (!identical(conditionMessage(e), "StopIteration")) stop(simpleError(conditionMessage(e), expr))})
21: e$fun(obj, substitute(ex), parent.frame(), e$data)
22: foreach(iter = seq(along = resampleIndex), .combine = "c", .verbose = FALSE, .packages = pkgs, .errorhandling = "stop") %:% foreach(parm = 1:nrow(info$loop), .combine = "c", .verbose = FALSE, .packages = pkgs, .errorhandling = "stop") %op% { testing <- FALSE if (!(length(ctrl$seeds) == 1 && is.na(ctrl$seeds))) set.seed(ctrl$seeds[[iter]][parm]) library(caret) if (ctrl$verboseIter) progress(printed[parm, , drop = FALSE], names(resampleIndex), iter) if (names(resampleIndex)[iter] != "AllData") { modelIndex <- resampleIndex[[iter]] holdoutIndex <- ctrl$indexOut[[iter]] } else { modelIndex <- 1:nrow(x) holdoutIndex <- modelIndex } if (testing) cat("pre-model\n") if (is.null(info$submodels[[parm]]) || nrow(info$submodels[[parm]]) > 0) { submod <- info$submodels[[parm]] } else submod <- NULL mod <- try(createModel(x = x[modelIndex, , drop = FALSE], y = y[modelIndex], wts = wts[modelIndex], method = method, tuneValue = info$loop[parm, , drop = FALSE], obsLevels = lev, pp = ppp, classProbs = ctrl$classProbs, ...), silent = TRUE) if (class(mod)[1] != "try-error") { predicted <- try(predictionFunction(method = method, modelFit = mod$fit, newdata = x[holdoutIndex, , drop = FALSE], preProc = mod$preProc, param = submod), silent = TRUE) if (class(predicted)[1] == "try-error") { wrn <- paste(colnames(printed[parm, , drop = FALSE]), printed[parm, , drop = FALSE], sep = "=", collapse = ", ") wrn <- paste("predictions failed for ", names(resampleIndex)[iter], ": ", wrn, " ", as.character(predicted), sep = "") if (ctrl$verboseIter) cat(wrn, "\n") warning(wrn) rm(wrn) nPred <- length(holdoutIndex) if (!is.null(lev)) { predicted <- rep("", nPred) predicted[seq(along = predicted)] <- NA } else { predicted <- rep(NA, nPred) } if (!is.null(submod)) { tmp <- predicted predicted <- vector(mode = "list", length = nrow(info$submodels[[parm]]) + 1) for (i in seq(along = predicted)) predicted[[i]] <- tmp rm(tmp) } } } else { wrn <- paste(colnames(printed[parm, , drop = FALSE]), printed[parm, , drop = FALSE], sep = "=", collapse = ", ") wrn <- paste("model fit failed for ", names(resampleIndex)[iter], ": ", wrn, " ", as.character(mod), sep = "") if (ctrl$verboseIter) cat(wrn, "\n") warning(wrn) rm(wrn) nPred <- length(holdoutIndex) if (!is.null(lev)) { predicted <- rep("", nPred) predicted[seq(along = predicted)] <- NA } else { predicted <- rep(NA, nPred) } if (!is.null(submod)) { tmp <- predicted predicted <- vector(mode = "list", length = nrow(info$submodels[[parm]]) + 1) for (i in seq(along = predicted)) predicted[[i]] <- tmp rm(tmp) } } if (testing) print(head(predicted)) if (ctrl$classProbs) { if (class(mod)[1] != "try-error") { probValues <- probFunction(method = method, modelFit = mod$fit, newdata = x[holdoutIndex, , drop = FALSE], preProc = mod$preProc, param = submod) } else { probValues <- as.data.frame(matrix(NA, nrow = nPred, ncol = length(lev))) colnames(probValues) <- lev if (!is.null(submod)) { tmp <- probValues probValues <- vector(mode = "list", length = nrow(info$submodels[[parm]]) + 1) for (i in seq(along = probValues)) probValues[[i]] <- tmp rm(tmp) } } if (testing) print(head(probValues)) } if (is.numeric(y)) { if (is.logical(ctrl$predictionBounds) && any(ctrl$predictionBounds)) { if (is.list(predicted)) { predicted <- lapply(predicted, trimPredictions, mod_type = "Regression", bounds = ctrl$predictionBounds, limits = ctrl$yLimits) } else { predicted <- trimPredictions(mod_type = "Regression", bounds = ctrl$predictionBounds, limits = ctrl$yLimit, pred = predicted) } } else { if (is.numeric(ctrl$predictionBounds) && any(!is.na(ctrl$predictionBounds))) { if (is.list(predicted)) { predicted <- lapply(predicted, trimPredictions, mod_type = "Regression", bounds = ctrl$predictionBounds, limits = ctrl$yLimits) } else { predicted <- trimPredictions(mod_type = "Regression", bounds = ctrl$predictionBounds, limits = ctrl$yLimit, pred = predicted) } } } } if (!is.null(submod)) { allParam <- expandParameters(info$loop[parm, , drop = FALSE], info$submodels[[parm]]) allParam <- allParam[complete.cases(allParam), , drop = FALSE] predicted <- lapply(predicted, function(x, y, wts, lv) { if (!is.factor(x) & is.character(x)) x <- factor(as.character(x), levels = lv) out <- data.frame(pred = x, obs = y, stringsAsFactors = FALSE) if (!is.null(wts)) out$weights <- wts out }, y = y[holdoutIndex], wts = wts[holdoutIndex], lv = lev) if (testing) print(head(predicted)) if (ctrl$classProbs) { for (k in seq(along = predicted)) predicted[[k]] <- cbind(predicted[[k]], probValues[[k]]) } if (ctrl$savePredictions) { tmpPred <- predicted for (modIndex in seq(along = tmpPred)) { tmpPred[[modIndex]]$rowIndex <- holdoutIndex tmpPred[[modIndex]] <- merge(tmpPred[[modIndex]], allParam[modIndex, , drop = FALSE], all = TRUE) } tmpPred <- rbind.fill(tmpPred) tmpPred$Resample <- names(resampleIndex)[iter] } else tmpPred <- NULL thisResample <- lapply(predicted, ctrl$summaryFunction, lev = lev, model = method) if (testing) print(head(thisResample)) if (length(lev) > 1) { cells <- lapply(predicted, function(x) flatTable(x$pred, x$obs)) for (ind in seq(along = cells)) thisResample[[ind]] <- c(thisResample[[ind]], cells[[ind]]) } thisResample <- do.call("rbind", thisResample) thisResample <- cbind(allParam, thisResample) } else { if (is.factor(y)) predicted <- factor(as.character(predicted), levels = lev) tmp <- data.frame(pred = predicted, obs = y[holdoutIndex], stringsAsFactors = FALSE) names(tmp)[1] <- "pred" if (!is.null(wts)) tmp$weights <- wts[holdoutIndex] if (ctrl$classProbs) tmp <- cbind(tmp, probValues) if (ctrl$savePredictions) { tmpPred <- tmp tmpPred$rowIndex <- holdoutIndex tmpPred <- merge(tmpPred, info$loop[parm, , drop = FALSE], all = TRUE) tmpPred$Resample <- names(resampleIndex)[iter] } else tmpPred <- NULL thisResample <- ctrl$summaryFunction(tmp, lev = lev, model = method) if (length(lev) > 1) thisResample <- c(thisResample, flatTable(tmp$pred, tmp$obs)) thisResample <- as.data.frame(t(thisResample)) thisResample <- cbind(thisResample, info$loop[parm, , drop = FALSE]) } thisResample$Resample <- names(resampleIndex)[iter] if (ctrl$verboseIter) progress(printed[parm, , drop = FALSE], names(resampleIndex), iter, FALSE) list(resamples = thisResample, pred = tmpPred) }
23: nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, method = models, ppOpts = preProcess, ctrl = trControl, lev = classLevels, ...)
24: train.default(x, y, method = modelInfo, tuneGrid = grid, nthread = 10, verbose = T, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, classProbs = TRUE))
25: train(x, y, method = modelInfo, tuneGrid = grid, nthread = 10, verbose = T, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, classProbs = TRUE))
26: eval(expr, envir, enclos)
27: eval(ei, envir)
28: withVisible(eval(ei, envir))
29: source("otto-xgboost-caret.R")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
aborting ...
Segmentation fault


topepo commented 9 years ago

I have seen a few segfaults using the linear boosting code from that library. It doesn't seem to be dependent on caret.

BrianMiner commented 9 years ago

Is there not support in caret for the 'missing' parameter in the xgboost matrix? This is a pretty important element, but I can't seem to find it. Thanks!

topepo commented 9 years ago

The ... can only effectively point to one internal function here, and it goes to xgb.train. You can overwrite the fitting code using a custom model and pass that option to xgb.DMatrix.
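
For example, starting from the tree-based definition above, only the fit element needs to change. A sketch of just the two-class branch, assuming missing values are coded as NA; note that the predict and prob elements build their own xgb.DMatrix and would need the same argument:

xgbMissing <- modelInfo  # copy the tree-based definition
xgbMissing$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  y <- ifelse(y == lev[1], 1, 0)  # two-class case only, for brevity
  dat <- xgb.DMatrix(as.matrix(x), label = y, missing = NA)
  xgb.train(list(eta = param$eta, max_depth = param$max_depth),
            data = dat,
            nrounds = param$nrounds,
            objective = "binary:logistic",
            ...)
}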

wwdxfa commented 8 years ago

First, thank you very much for this wonderful work, but I have something to check. When I ran the varImp function in modelInfo to get the variable importances, something went wrong. Shouldn't the line "imp <- xgb.importance(x$xNames, model = x)" be "imp <- xgb.importance(object$xNames, model = object)"? @topepo
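
That correction looks right: the function's argument is named object, so x is undefined inside its body. The fixed element would read:

varImp = function(object, numTrees = NULL, ...) {
  imp <- xgb.importance(object$xNames, model = object)
  imp <- as.data.frame(imp)[, 1:2]
  rownames(imp) <- as.character(imp[, 1])
  imp <- imp[, 2, drop = FALSE]
  colnames(imp) <- "Overall"
  imp
}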