harrysouthworth / gbm

Gradient boosted models

segfault #13

Closed harrysouthworth closed 9 years ago

harrysouthworth commented 10 years ago

Sent to me by email.

I am testing `gbm` on some new data, using gbm v2.1-05 and R 3.0.3 (via the GUI) on Mac OS 10.9.2. I've attached a .rds file with the test data. My response variable is a factor with more than 40 levels; the predictors are a mix of categorical and numeric/integer variables. I get an error (actually, R crashes) when using `gbm.more`. I can replicate it with the following code:

require(gbm)
Loading required package: gbm
Loading required package: survival
Loading required package: splines
Loading required package: lattice
Loading required package: parallel
Loaded gbm 2.1-05

dset = readRDS("test.rds") # Attached .rds file

The following completes in ~90 seconds with no errors:

q = gbm(puma00 ~ year + loc + hincp + age + race + educ + hstat + rent + mortgage,
        distribution = "multinomial", data = dset, interaction.depth = 4,
        shrinkage = 0.01, n.minobsinnode = max(30, ceiling(nrow(dset) * 0.001)),
        n.cores = 1, n.trees = 100)

I am not sure what this error is when printing `q` -- it is not fatal, just FYI:

q
gbm(formula = puma00 ~ year + loc + hincp + age + race + educ +
    hstat + rent + mortgage, distribution = "multinomial", data = dset,
    n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
    ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 9 predictors of which 9 had non-zero influence.
Error in apply(x$cv.fitted, 1, function(x, labels) { :
  dim(X) must have a positive length
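For what it's worth, "dim(X) must have a positive length" is the generic error `apply()` raises when its first argument is a dimensionless vector rather than a matrix; a plausible guess (not verified against the gbm source) is that `x$cv.fitted` is not the matrix the print method expects when no cross-validation folds were requested. A minimal base-R reproduction of the message itself:

```r
# apply() requires its first argument to carry a dim attribute; a plain
# vector has none, which triggers exactly this error message.
x <- 1:5
result <- try(apply(x, 1, sum), silent = TRUE)
print(attr(result, "condition")$message)
# "dim(X) must have a positive length"
```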

HERE is the real problem: attempting to add 100 more trees:

q2 = gbm.more(q, n.new.trees=100)

 *** caught segfault ***
address 0x1c74a8000, cause 'memory not mapped'

Traceback:
 1: .Call("gbm", Y = as.double(y), Offset = as.double(offset), X = as.double(x),
        X.order = as.integer(x.order), weights = as.double(w), Misc = as.double(Misc),
        cRows = as.integer(cRows), cCols = as.integer(cCols),
        var.type = as.integer(object$var.type), var.monotone = as.integer(object$var.monotone),
        distribution = as.character(distribution.call.name), n.trees = as.integer(n.new.trees),
        interaction.depth = as.integer(object$interaction.depth),
        n.minobsinnode = as.integer(object$n.minobsinnode),
        n.classes = as.integer(object$num.classes), shrinkage = as.double(object$shrinkage),
        bag.fraction = as.double(object$bag.fraction), train.fraction = as.integer(nTrain),
        fit.old = as.double(object$fit), n.cat.splits.old = as.integer(length(object$c.splits)),
        n.trees.old = as.integer(object$n.trees), verbose = as.integer(verbose),
        PACKAGE = "gbm")
 2: gbm.more(q, n.new.trees = 100)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
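One workaround sketch, untested on this data and assuming a full refit is affordable: skip `gbm.more()` entirely and fit the combined number of trees in a single `gbm()` call, since the initial fit above completes without crashing.

```r
# Instead of q2 = gbm.more(q, n.new.trees = 100), refit once with 200 trees total.
q200 <- gbm(puma00 ~ year + loc + hincp + age + race + educ + hstat + rent + mortgage,
            distribution = "multinomial", data = dset, interaction.depth = 4,
            shrinkage = 0.01, n.minobsinnode = max(30, ceiling(nrow(dset) * 0.001)),
            n.cores = 1, n.trees = 200)
```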

INTERESTINGLY, if I switch to a numeric response variable, I get different issues:

q = gbm(hincp ~ year + loc + age + race + educ + hstat + rent + mortgage,
        distribution = "gaussian", data = dset, interaction.depth = 4,
        shrinkage = 0.01, n.minobsinnode = max(30, ceiling(nrow(dset) * 0.001)),
        n.cores = 1, n.trees = 100)

No predictors are included...

q
gbm(formula = hincp ~ year + loc + age + race + educ + hstat +
    rent + mortgage, distribution = "gaussian", data = dset,
    n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
    ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with gaussian loss function.
100 iterations were performed.
There were 8 predictors of which 0 had non-zero influence.

Summary of cross-validation residuals:
 0% 25% 50% 75% 100%
 NA  NA  NA  NA   NA

Cross-validation pseudo R-squared: 1

summary(q)
Error in plot.window(xlim, ylim, log = log, ...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
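This plotting failure looks like a downstream symptom: with all relative influences at zero, `plot.window()` gets no finite x-limits. `summary.gbm()` takes a `plotit` argument, so the influence table can at least be inspected without triggering the plot (a sketch for inspection only, not a fix for the underlying fit):

```r
# Skip the barplot and just return the relative-influence table.
summary(q, plotit = FALSE)
# Or compute the influences directly:
relative.influence(q, n.trees = 100)
```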

BUT `gbm.more` does not cause R to crash...

q2 = gbm.more(q, n.new.trees=100) # No error

Any ideas? Maybe a pointer issue when distribution = "multinomial"?

Many thanks, Kevin

wilkinsonjason commented 9 years ago

Unfortunately, I am having a similar issue. Did you find a fix, besides the gbm.more workaround?

grantbrown commented 9 years ago

I didn't have time to dig in more deeply, but it looks like there are some problems with the multinomial distribution, and possibly an issue with gbmentry.

Here is the Valgrind output from running this quick example on Mint 17. On Linux this doesn't result in a segfault, but it does produce the following error:

Error in gbm.more(q, n.new.trees = 100) : 
  Observations are not in order. gbm() was unable to build an index for the design matrix. Could be a bug in gbm or an unusual data type in data.
harrysouthworth commented 9 years ago

This issue was moved to gbm-developers/gbm#14