Closed harrysouthworth closed 9 years ago
Unfortunately, I'm having a similar issue. Did you find a fix, besides the gbm.more workaround?
I didn't have time to dig in more deeply, but it looks like there are some problems with the multinomial distribution, and possibly an issue with gbmentry.
Here is the Valgrind output from running this quick example on Mint 17. On Linux this doesn't result in a segfault, but it does produce the following error:
```
Error in gbm.more(q, n.new.trees = 100) :
  Observations are not in order. gbm() was unable to build an index for
  the design matrix. Could be a bug in gbm or an unusual data type in data.
```
This issue was moved to gbm-developers/gbm#14
Sent to me by email.
I am testing `gbm` on some new data, using gbm v2.1-05 and R 3.0.3 (with the GUI) on Mac OS X 10.9.2. I've attached a .rds file with the test data. My response variable is a factor with >40 levels; the predictors are a mix of categorical and numeric/integer columns. I get an error (actually, R crashes) using `gbm.more`. I can replicate it with the following code:
```
> require(gbm)
Loading required package: gbm
Loading required package: survival
Loading required package: splines
Loading required package: lattice
Loading required package: parallel
Loaded gbm 2.1-05
```
```r
dset = readRDS("test.rds")  # Attached .rds file

# This will complete in ~90 seconds (no errors)
q = gbm(puma00 ~ year + loc + hincp + age + race + educ + hstat + rent + mortgage,
        distribution = "multinomial", data = dset, interaction.depth = 4,
        shrinkage = 0.01, n.minobsinnode = max(30, ceiling(nrow(dset) * 0.001)),
        n.cores = 1, n.trees = 100)
```
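As an aside on the call above, the `n.minobsinnode` expression scales the minimum terminal-node size with the data. A quick worked example (with hypothetical row counts, since test.rds is not reproduced here) shows what it evaluates to:

```r
# n.minobsinnode = max(30, ceiling(nrow(dset) * 0.001)):
# a floor of 30 observations per node, rising to 0.1% of the rows
# once the data has more than 30,000 of them.
max(30, ceiling(50000 * 0.001))  # 50 for a hypothetical 50,000-row data set
max(30, ceiling(10000 * 0.001))  # 30 (the floor wins for smaller data)
```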
Not sure what this error is when printing `q` -- not fatal, just FYI:
```
> q
gbm(formula = puma00 ~ year + loc + hincp + age + race + educ +
    hstat + rent + mortgage, distribution = "multinomial", data = dset,
    n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
        ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 9 predictors of which 9 had non-zero influence.
Error in apply(x$cv.fitted, 1, function(x, labels) { :
  dim(X) must have a positive length
```
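For what it's worth, that print error looks consistent with `apply()` being handed an object without dimensions -- my guess (an assumption, not verified against the gbm source) is that `x$cv.fitted` is absent or a plain vector when no cross-validation folds were requested. A minimal base-R sketch reproduces the same message:

```r
# apply(X, 1, ...) requires dim(X): a matrix works, but a bare vector
# fails with exactly the "dim(X) must have a positive length" error.
m <- matrix(c(0.1, 0.9, 0.8, 0.2), nrow = 2)
apply(m, 1, which.max)                             # fine: rows exist to sweep over
v <- c(0.1, 0.9)
res <- try(apply(v, 1, which.max), silent = TRUE)  # reproduces the error
```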
HERE is the real problem: attempting to add 100 trees crashes R.

```r
q2 = gbm.more(q, n.new.trees = 100)
```
```
 *** caught segfault ***
address 0x1c74a8000, cause 'memory not mapped'

Traceback:
 1: .Call("gbm", Y = as.double(y), Offset = as.double(offset), X = as.double(x),
        X.order = as.integer(x.order), weights = as.double(w), Misc = as.double(Misc),
        cRows = as.integer(cRows), cCols = as.integer(cCols),
        var.type = as.integer(object$var.type), var.monotone = as.integer(object$var.monotone),
        distribution = as.character(distribution.call.name), n.trees = as.integer(n.new.trees),
        interaction.depth = as.integer(object$interaction.depth),
        n.minobsinnode = as.integer(object$n.minobsinnode),
        n.classes = as.integer(object$num.classes), shrinkage = as.double(object$shrinkage),
        bag.fraction = as.double(object$bag.fraction), train.fraction = as.integer(nTrain),
        fit.old = as.double(object$fit), n.cat.splits.old = as.integer(length(object$c.splits)),
        n.trees.old = as.integer(object$n.trees), verbose = as.integer(verbose),
        PACKAGE = "gbm")
 2: gbm.more(q, n.new.trees = 100)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
```
INTERESTINGLY, if I switch to a numeric response variable, I get different issues:
```r
q = gbm(hincp ~ year + loc + age + race + educ + hstat + rent + mortgage,
        distribution = "gaussian", data = dset, interaction.depth = 4,
        shrinkage = 0.01, n.minobsinnode = max(30, ceiling(nrow(dset) * 0.001)),
        n.cores = 1, n.trees = 100)
```
No predictors are included...
```
> q
gbm(formula = hincp ~ year + loc + age + race + educ + hstat +
    rent + mortgage, distribution = "gaussian", data = dset,
    n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
        ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with gaussian loss function.
100 iterations were performed.
There were 8 predictors of which 0 had non-zero influence.

Summary of cross-validation residuals:
  0%  25%  50%  75% 100%
  NA   NA   NA   NA   NA

Cross-validation pseudo R-squared: 1
```
```
> summary(q)
Error in plot.window(xlim, ylim, log = log, ...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
```
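Those warnings are base R's documented behaviour on empty input, which then propagates into `plot.window()` as non-finite axis limits -- presumably because all the relative-influence values being plotted are missing. A minimal reproduction without gbm:

```r
# min()/max() on a zero-length vector warn and return Inf/-Inf;
# plot.window() then rejects them as non-finite 'xlim' values.
empty <- numeric(0)
suppressWarnings(min(empty))  # Inf
suppressWarnings(max(empty))  # -Inf
```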
BUT `gbm.more` does not cause R to crash...
```r
q2 = gbm.more(q, n.new.trees = 100)  # No error
```
Any ideas? Maybe a pointer issue when distribution = "multinomial"?
Many thanks, Kevin