Closed: opened by larry77, closed 9 years ago
I'm already looking into it! xgboost models are a little funky, as the R object is little more than an external pointer to a C++ object. Also, xgboost isn't on CRAN, which adds more issues.
That said, we can write the modeling functions and leave the package installation up to the user (I think devtools::install_github works).
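Something along these lines should work (the repository path and subdir argument are my assumption about the xgboost repo layout):

library(devtools)
install_github("dmlc/xgboost", subdir = "R-package")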
The problem of saving/loading the R object has already been fixed, so I assume that obstacle is now cleared.
The CRAN issue was a bit of a pain for us. It was our mistake: we were not responsive to the required updates because of timeline issues and fell out of sync. We are still working to resolve this. The quality of the tool and its cross-platform support, however, have steadily improved over time and will keep improving.
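For reference, a minimal sketch of the save/load round trip using xgboost's own serializers (the demo data ships with the package; the file name is arbitrary):

library(xgboost)
data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nrounds = 2,
               objective = "binary:logistic")
xgb.save(bst, "xgboost.model")     # write the booster to disk
bst2 <- xgb.load("xgboost.model")  # restore it in a later session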
Thanks for the prompt reply. So it looks like there is a strong push toward integrating xgboost into the caret framework. I wonder: when this is done (even at an experimental level and outside CRAN at the beginning), will interested users be notified automatically (for instance by email)? Cheers
Lorenzo
This is probably the best place to keep track of the task. You can also check the CRAN page for updates and look at the changelog.
But this thread should email you when there's an update.
Here is some code to test for trees. I'll get the linear model version soon:
modelInfo <- list(label = "eXtreme Gradient Boosting",
library = c("xgboost", "plyr"),
type = c("Regression", "Classification"),
parameters = data.frame(parameter = c('nrounds', 'max_depth', 'eta'),
class = rep("numeric", 3),
label = c('# Boosting Iterations', 'Max Tree Depth',
'Shrinkage')),
grid = function(x, y, len = NULL) expand.grid(max_depth = seq(1, len),
nrounds = floor((1:len) * 50),
eta = .3),
loop = function(grid) {
loop <- ddply(grid, c("eta", "max_depth"),
function(x) c(nrounds = max(x$nrounds)))
submodels <- vector(mode = "list", length = nrow(loop))
for(i in seq(along = loop$nrounds)) {
index <- which(grid$max_depth == loop$max_depth[i] &
grid$eta == loop$eta[i])
trees <- grid[index, "nrounds"]
submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
}
list(loop = loop, submodels = submodels)
},
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
if(is.factor(y)) {
if(length(lev) == 2) {
y <- ifelse(y == lev[1], 1, 0)  # xgboost wants 0/1 labels; the first level is coded as 1
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
nrounds = param$nrounds,
objective = "binary:logistic",
...)
} else {
y <- as.numeric(y) - 1  # multi:softprob expects class labels in 0:(num_class - 1)
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
num_class = length(lev),
nrounds = param$nrounds,
objective = "multi:softprob",
...)
}
} else {
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
nrounds = param$nrounds,
objective = "reg:linear",
...)
}
out
},
predict = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
out <- ifelse(out >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)  # softprob returns a flat vector; reshape to one row per sample
out <- modelFit$obsLevels[apply(out, 1, which.max)]
}
}
if(!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) + 1)
tmp[[1]] <- out
for(j in seq(along = submodels$nrounds)) {
tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
tmp_pred <- ifelse(tmp_pred >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
tmp_pred <- modelFit$obsLevels[apply(tmp_pred, 1, which.max)]
}
}
tmp[[j+1]] <- tmp_pred
}
out <- tmp
}
out
},
prob = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(length(modelFit$obsLevels) == 2) {
out <- cbind(out, 1 - out)
colnames(out) <- modelFit$obsLevels
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(out) <- modelFit$obsLevels
}
out <- as.data.frame(out)
if(!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) + 1)
tmp[[1]] <- out
for(j in seq(along = submodels$nrounds)) {
tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
if(length(modelFit$obsLevels) == 2) {
tmp_pred <- cbind(tmp_pred, 1 - tmp_pred)
colnames(tmp_pred) <- modelFit$obsLevels
} else {
tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(tmp_pred) <- modelFit$obsLevels
}
tmp_pred <- as.data.frame(tmp_pred)
tmp[[j+1]] <- tmp_pred
}
out <- tmp
}
out
},
predictors = function(x, ...) {
imp <- xgb.importance(x$xNames, model = x)
x$xNames[x$xNames %in% imp$Feature]
},
varImp = function(object, numTrees = NULL, ...) {
imp <- xgb.importance(object$xNames, model = object)
imp <- as.data.frame(imp)[, 1:2]
rownames(imp) <- as.character(imp[,1])
imp <- imp[,2,drop = FALSE]
colnames(imp) <- "Overall"
imp
},
levels = function(x) x$obsLevels,
tags = c("Tree-Based Model", "Boosting", "Ensemble Model", "Implicit Feature Selection"),
sort = function(x) {
# This is a toss-up, but the # trees probably adds
# complexity faster than number of splits
x[order(x$nrounds, x$max_depth, x$eta),]
})
For example:
library(caret)
library(pROC)
grid <- expand.grid(nrounds = seq(1, 201, by = 25),
max_depth = 1:6,
eta = (1:4)/10)
set.seed(1)
reg1 <- SLC14_1(200)
reg2 <- SLC14_1(200)
set.seed(2)
mod1 <- train(y ~ ., data = reg1,
method = modelInfo,
tuneGrid = grid,
trControl = trainControl(method = "cv"))
postResample(predict(mod1, reg2), reg2$y)
set.seed(3)
class1 <- twoClassSim(200)
class2 <- twoClassSim(200)
set.seed(4)
mod2 <- train(Class ~ ., data = class1,
method = modelInfo,
tuneGrid = grid,
trControl = trainControl(method = "cv"))
confusionMatrix(predict(mod2, class2), class2$Class)
set.seed(4)
mod3 <- train(Class ~ ., data = class1,
method = modelInfo,
tuneGrid = grid,
metric = "ROC",
trControl = trainControl(method = "cv",
summaryFunction = twoClassSummary,
classProbs = TRUE))
probs <- predict(mod3, class2, type = "prob")
roc(class2$Class, probs[, "Class1"], levels = rev(levels(class2$Class)))
Here is some code for the linear models:
modelInfo <- list(label = "eXtreme Gradient Boosting",
library = c("xgboost"),
type = c("Regression", "Classification"),
parameters = data.frame(parameter = c('nrounds', 'lambda', 'alpha'),
class = rep("numeric", 3),
label = c('# Boosting Iterations', 'L2 Regularization',
'L1 Regularization')),
grid = function(x, y, len = NULL)
expand.grid(nrounds = floor((1:len) * 50),  # the grid must supply every tuning parameter
lambda = c(0, 10 ^ seq(-1, -4, length = len - 1)),
alpha = c(0, 10 ^ seq(-1, -4, length = len - 1))),
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
if(is.factor(y)) {
if(length(lev) == 2) {
y <- ifelse(y == lev[1], 1, 0)
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(lambda = param$lambda,
alpha = param$alpha),
data = dat,
nrounds = param$nrounds,
objective = "binary:logistic",
...)
} else {
y <- as.numeric(y) - 1
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(lambda = param$lambda,
alpha = param$alpha),
data = dat,
num_class = length(lev),
nrounds = param$nrounds,
objective = "multi:softprob",
...)
}
} else {
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(lambda = param$lambda,
alpha = param$alpha),
data = dat,
nrounds = param$nrounds,
objective = "reg:linear",
...)
}
out
},
predict = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
out <- ifelse(out >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
out <- modelFit$obsLevels[apply(out, 1, which.max)]
}
}
out
},
prob = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(length(modelFit$obsLevels) == 2) {
out <- cbind(out, 1 - out)
colnames(out) <- modelFit$obsLevels
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(out) <- modelFit$obsLevels
}
as.data.frame(out)
},
predictors = function(x, ...) {
imp <- xgb.importance(x$xNames, model = x)
x$xNames[x$xNames %in% imp$Feature]
},
varImp = function(object, numTrees = NULL, ...) {
imp <- xgb.importance(object$xNames, model = object)
imp <- as.data.frame(imp)[, 1:2]
rownames(imp) <- as.character(imp[,1])
imp <- imp[,2,drop = FALSE]
colnames(imp) <- "Overall"
imp
},
levels = function(x) x$obsLevels,
tags = c("Linear Classifier Models",
"Linear Regression Models",
"L1 Regularization Models",
"L2 Regularization Models",
"Boosting", "Ensemble Model", "Implicit Feature Selection"),
sort = function(x) {
# Boosting iterations probably add complexity faster than
# the regularization penalties take it away
x[order(x$nrounds, x$alpha, x$lambda),]
})
Thanks for the excellent work! If I install the development version of caret, have these functions already been incorporated? Cheers
Lorenzo
No, you would just source the two model lists and use the code shown at the bottom of the first post.
They will eventually be part of the package but not before I write tests and some documentation.
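For example (the file name here is hypothetical):

## save each modelInfo list posted above to its own file, then:
source("xgb_tree.R")  # defines the tree-based modelInfo
mod <- train(Class ~ ., data = class1,
             method = modelInfo,
             trControl = trainControl(method = "cv"))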
Thank you very much for this quick implementation!
It seems to work well, but unfortunately I can't coax it into doing CV with a log loss in a multi-class classification problem.
If I run it with trainControl method='none' I get the expected result, but if I set it to CV, it switches from mlogloss to Accuracy. If I write my own summaryFunction with the log-loss calculation, it crashes because the summaryFunction receives only class labels, not class probabilities. I've tried blindly adding type='prob' to all predict calls in your modelInfo, but no luck...
By any chance do you happen to know what I'm doing wrong? Thank you very much!
You probably need to add classProbs = TRUE to your trainControl function.
Yes, I have classProbs=TRUE too (and it works as long as method='none'...).
Can you send us the code and an example for testing?
A small grid for testing purposes:
## small grid for test
grid.0 <- expand.grid(nrounds = c(10, 20),
max_depth = 6,
eta = 0.3)
Case A:
## train with cv (number/repeats low just for testing speed)
train.ctrl <- trainControl(method = 'cv', number = 3, repeats = 1,
classProbs = TRUE,
verboseIter=TRUE)
This training defaults to Accuracy instead of the required log loss:
The metric "mlogloss" was not in the result set. Accuracy will be used instead.
Case B: I define a summary function:
mc_logloss <- function(observed, predicted, model, eps=1e-15){
## multi-class log-loss
predicted <- pmin(pmax(predicted, eps), 1-eps)
out <- -1.0*sum(observed*log(predicted))/nrow(observed)
names(out) <- 'mlogloss'
return(out)
}
The corresponding trainControl:
train.ctrl <- trainControl(method = 'cv', number = 3, repeats = 1,
classProbs = TRUE,
summaryFunction = mc_logloss,
verboseIter=TRUE)
This leads to a crash because the probabilities are not passed:
Error in log(predicted) (from #4) : non-numeric argument to mathematical function
The train function for Case A and Case B is the same:
xgb0 <- train(target ~ ., data = train,
type = "Classification",
metric = 'mlogloss',
objective = 'multi:softprob',
num.class = 9,
nthread = 4,
method = modelInfo,
tuneGrid = grid.0,
trControl = train.ctrl)
As mentioned, with method='none' it trains on mlogloss and returns class probabilities as required; it's just that with CV I can't get it to do both. Thank you for your support!
The class probabilities are there but your function is referencing them incorrectly. See this page for more details. Basically, the class probabilities come in as different columns and the predicted column is the predicted class (a factor).
Also, you can have xgb.train use the multinomial log loss during boosting. See ?xgb.train for more details.
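For instance, a sketch of how mc_logloss could be rewritten against caret's (data, lev, model) signature; the per-class probabilities show up as columns of data named after the levels (the one-hot encoding step is illustrative, not the package's own code):

mc_logloss <- function(data, lev = NULL, model = NULL, eps = 1e-15) {
  probs <- as.matrix(data[, lev])              # one probability column per class level
  probs <- pmin(pmax(probs, eps), 1 - eps)     # clip away from 0 and 1
  obs <- model.matrix(~ obs - 1, data = data)  # one-hot encode observed classes (level order)
  out <- -sum(obs * log(probs)) / nrow(data)
  c(mlogloss = out)
}

Pairing this with trainControl(method = 'cv', classProbs = TRUE, summaryFunction = mc_logloss) and train(..., metric = 'mlogloss', maximize = FALSE) should then make the CV loop minimize the log loss.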
Hello, I am not sure this is relevant, but the following code
###########################################################
library(readr)
library(xgboost)
library(caret)
train <- read_csv("train.csv")
x <- subset(train, select = -c(target))
x <- as.matrix(x)
x <- matrix(as.numeric(x), nrow(x), ncol(x))  ## now x is a matrix of purely numerical values
y <- train$target
y <- gsub('Class_', '', y)
y <- as.integer(y) - 1  # xgboost takes class labels in [0, numOfClass)
grid <- expand.grid(nrounds = seq(1, 601, by = 25),
                    max_depth = 5:17,
                    eta = seq(0.01, 0.31, by = 0.05))
mod1 <- train(x, y,
              method = modelInfo,
              tuneGrid = grid,
              nthread = 10,
              verbose = TRUE,
              trControl = trainControl(method = "repeatedcv",
                                       number = 10, repeats = 10,
                                       classProbs = TRUE))
##############################################################
where train.csv is taken from a kaggle competition
returns a segmentation fault (see the end of the email). I used the modelInfo for trees. Am I making a mistake somewhere? Cheers
Lorenzo
##########################################################
*** caught segfault ***
address 0x190, cause 'memory not mapped'
Traceback:
1: .Call("XGBoosterUpdateOneIter_R", booster, as.integer(iter), dtrain, PACKAGE = "xgboost")
2: xgb.iter.update(bst$handle, dtrain, i - 1, obj)
3: xgb.train(list(eta = param$eta, max_depth = param$max_depth), data = dat, nrounds = param$nrounds, objective = "reg:linear", ...)
4: method$fit(x = x, y = y, wts = wts, param = tuneValue, lev = obsLevels, last = last, classProbs = classProbs, ...)
5: createModel(x = x[modelIndex, , drop = FALSE], y = y[modelIndex], wts = wts[modelIndex], method = method, tuneValue = info$loop[parm, , drop = FALSE], obsLevels = lev, pp = ppp, classProbs = ctrl$classProbs, ...)
6: doTryCatch(return(expr), name, parentenv, handler)
7: tryCatchOne(expr, names, parentenv, handlers[[1L]])
8: tryCatchList(expr, classes, parentenv, handlers)
9: tryCatch(expr, error = function(e) { ... })
10: try(createModel(x = x[modelIndex, , drop = FALSE], y = y[modelIndex], wts = wts[modelIndex], method = method, tuneValue = info$loop[parm, , drop = FALSE], obsLevels = lev, pp = ppp, classProbs = ctrl$classProbs, ...), silent = TRUE)
11: eval(expr, envir, enclos)
12: eval(xpr, envir = envir)
13: doTryCatch(return(expr), name, parentenv, handler)
14: tryCatchOne(expr, names, parentenv, handlers[[1L]])
15: tryCatchList(expr, classes, parentenv, handlers)
16: tryCatch(eval(xpr, envir = envir), error = function(e) e)
17: doTryCatch(return(expr), name, parentenv, handler)
18: tryCatchOne(expr, names, parentenv, handlers[[1L]])
19: tryCatchList(expr, classes, parentenv, handlers)
20: tryCatch({ repeat { ... } }, error = function(e) { ... })
21: e$fun(obj, substitute(ex), parent.frame(), e$data)
22: foreach(iter = seq(along = resampleIndex), .combine = "c", ...) %:% foreach(parm = 1:nrow(info$loop), .combine = "c", ...) %op% { ... }
23: nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, method = models, ppOpts = preProcess, ctrl = trControl, lev = classLevels, ...)
24: train.default(x, y, method = modelInfo, tuneGrid = grid, nthread = 10, verbose = T, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, classProbs = TRUE))
25: train(x, y, method = modelInfo, tuneGrid = grid, nthread = 10, verbose = T, trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, classProbs = TRUE))
26: eval(expr, envir, enclos)
27: eval(ei, envir)
28: withVisible(eval(ei, envir))
29: source("otto-xgboost-caret.R")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
aborting ...
Segmentation fault
I have seen a few segfaults using the linear boosting code from that library. It doesn't seem to be dependent on caret.
Is there no support in caret for the 'missing' parameter in the xgboost matrix? This is a pretty important element but I can't seem to find it. Thanks!
The ... argument can only effectively point to one internal function here, and it goes to xgb.train. You can overwrite the fitting code using a custom model and pass that option to xgb.DMatrix.
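A hypothetical sketch of that override, shown for the regression branch only (the classification branches would need the same one-line change; modelInfoNA is an arbitrary name):

modelInfoNA <- modelInfo  # start from the tree-based list defined above
modelInfoNA$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  ## build the DMatrix with an explicit `missing` marker
  dat <- xgb.DMatrix(as.matrix(x), label = y, missing = NA)
  xgb.train(list(eta = param$eta, max_depth = param$max_depth),
            data = dat, nrounds = param$nrounds,
            objective = "reg:linear", ...)
}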
First, thank you very much for this wonderful job, but I have something to check. When I ran the varImp function in modelInfo to get the variable importances, something went wrong. Should the line "imp <- xgb.importance(x$xNames, model = x)" be "imp <- xgb.importance(object$xNames, model = object)"? @topepo
I cannot be of much help here apart from suggesting that the xgboost library would be a very good tool to add to caret.