suiji / Arborist

Scalable decision tree training and inference.

version 0.1-17 keeps crashing R session in windows #46

Closed csetzkorn closed 4 years ago

csetzkorn commented 5 years ago

The following example code keeps crashing my R session in windows. Any ideas?

library(dplyr)
library(ggplot2)
library(Rborist)
library(datasets)

rm(list = ls())
set.seed(42)

data(iris)

temp <- iris %>% select(Sepal.Width, Petal.Length, Petal.Width, Sepal.Length)

x <- subset(temp, select = -c(Sepal.Length))

model <- Rborist(x, temp$Sepal.Length,
                 quantVec = seq(0.1, 1.0, by = 0.1),
                 quantiles = TRUE,
                 minNode = 5,
                 regMono = c(-1.0, 0.0, 0.0))

prediction <- predict(model, x, quantVec = seq(0.6))
temp$prediction <- prediction$yPred
temp$residuals <- temp$Sepal.Length - temp$prediction

ggplot(temp, aes(x = prediction, y = adjusted_damages_paid_100)) + geom_point(color = "black")

suiji commented 5 years ago

Thank you for distilling this to a minimal test case.

The crash does not reproduce with the development version, 0.2-0 on Linux. I will attempt to test it on a Windows system over the weekend.

Do you recall whether the failure was precipitated by the predict() call, or does it happen as early as training (i.e., Rborist() call)?

suiji commented 5 years ago

Yes, this also reproduces with 0.1-17 on Linux. Will attempt to identify a work-around, as 0.2-0 is not yet CRAN-ready.

csetzkorn commented 5 years ago

Thanks!!!! We isolated this to predict. Also noticed that it produces the same predictions for different quantiles when it works ...

csetzkorn commented 5 years ago

Could I install from GitHub to get something working?

suiji commented 5 years ago

Apologies for the delay. I had been away on conference duty.

As you are only observing the problem in prediction, this is looking like another report of the zero-length vector problem reported a few weeks ago. We had been relying on undocumented behaviour in Rcpp which, rightfully so, changed under our feet. Assuming that was indeed the problem, then 0.2-0 should be satisfactory.

Yes, you should be able to clone the Github project and build a tarball, as described in or near the front page. Please do get back with questions, as needed.

As for bad quantile reports, let me do some more digging. Sorry to make you a guinea pig in this process. That said, I appreciate your taking the time to log the bugs.

csetzkorn commented 5 years ago

How can I install from Github under Windows? I tried:

library(devtools)
install_github("suiji/Rborist")

and

library(githubinstall)
githubinstall("suiji/Rborist")

suiji commented 5 years ago

The Github source tree is structured to facilitate building multiple tools in multiple languages. It is not organized in a form suitable for building R packages directly. We may in the not-too-distant future generate a CRAN-friendly mirror of the project to make it easier to rebuild using the 'devtools' package.

That said, the easiest way to build the development version is as follows:

i) Clone the Github project.
ii) Run the bash script 'Rborist.dev.sh' in the Rborist/Package subdirectory.
iii) Install the resulting tarball, in this case 'Rborist_0.2-0.tar.gz', as you would any other package.

I can assist further, if these instructions are unfamiliar.

Quantiles appear to be laid out correctly in 0.2-0. Quantile handling has been streamlined significantly in the development version. I do not completely trust its behaviour in the presence of ties, though, which is another reason the package is not yet ready for CRAN.

suiji commented 5 years ago

If you don't want to clone the project, I can instead just send you a tarball. A few people request this from time to time. Just let me know.

csetzkorn commented 5 years ago

Yes that would be fine. Do you need my email? Or simply attach here? Thanks!

suiji commented 5 years ago

A tarball of the latest source snapshot has been placed in a new "Tarballs" directory. This should be a useful feature for those wishing to install packages directly.

The thread will stay open for a while, in case you notice additional problems.

csetzkorn commented 5 years ago

Thanks. I just tried:

install.packages("C:/SomeDirectory/Rborist_0.2-0.tar.gz", repos=NULL, type="source")

but get:

*** arch - i386
Error in .shlib_internal(args) : C++14 standard requested but CXX14 is not defined

suiji commented 5 years ago

This has to do with the version of R you are using. If your R instance does not recognize CXX14, then it should also not have been able to install version 0.1-17. There are ways around this problem but I'm curious: are you attempting to install using the same version of R under which you installed 0.1-17?
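One way to check what the comment above describes is to ask the R installation itself which C++14 compiler it has configured. The sketch below is illustrative only: `query_cxx14` is a hypothetical helper name, and it assumes that `R CMD config` can be run from within the session (on Windows this typically needs the Rtools shell utilities on the PATH).

```r
# Hypothetical helper: ask this R installation for its CXX14 setting.
# Returns whatever `R CMD config CXX14` prints, or NA if the command
# could not be run at all.
query_cxx14 <- function() {
  r_bin <- file.path(R.home("bin"), "R")
  tryCatch(
    suppressWarnings(
      system2(r_bin, c("CMD", "config", "CXX14"), stdout = TRUE, stderr = TRUE)
    ),
    error = function(e) NA_character_
  )
}

R.version.string   # the R version the session is running
query_cxx14()      # an empty result or an error message suggests CXX14 is unset
```

A non-empty compiler name suggests that source builds requesting C++14 have at least a chance of working; anything else points back at the R version or the local Makevars settings.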

csetzkorn commented 5 years ago

yes. maybe I should switch to the latest version?

suiji commented 5 years ago

Installing a more recent version of R should result in CXX14 being defined. It's possible that an older compiler will have problems actually recognizing the C++14 extensions, but this is usually treatable by fiddling with Makevars definitions.
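As a rough illustration of what "fiddling with Makevars definitions" can look like, the sketch below appends three C++14 settings to a Makevars file. Everything specific here is an assumption rather than the maintainer's recommendation: the helper name `write_cxx14_makevars` is hypothetical, the `$(BINPREF)` and `$(M_ARCH)` variables assume an Rtools-based Windows toolchain, and `-std=c++1y` assumes an older g++ that only knows C++14 under that name. R reads per-user settings from `~/.R/Makevars` (`Makevars.win` on Windows); the demo writes to a temporary file so that nothing in the home directory is touched.

```r
# Hypothetical helper: append assumed C++14 settings to a Makevars file.
write_cxx14_makevars <- function(path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  settings <- c(
    "CXX14 = $(BINPREF)g++ $(M_ARCH)",  # assumed Rtools compiler invocation
    "CXX14STD = -std=c++1y",            # assumed flag for older g++ releases
    "CXX14FLAGS = -O2"
  )
  con <- file(path, open = "a")         # append, do not overwrite
  on.exit(close(con))
  writeLines(settings, con)
  invisible(path)
}

# Demonstration against a throwaway file rather than ~/.R/Makevars.win.
demo_path <- file.path(tempdir(), "Makevars.win")
write_cxx14_makevars(demo_path)
readLines(demo_path)
```

To apply the settings for real, the same lines would go into the per-user Makevars file, after checking that the compiler named there actually exists on the machine.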

It's still puzzling that you were able to install 0.1-17, which also specifies CXX14. Do you recall whether you built from source or, instead, downloaded a pre-built binary?

csetzkorn commented 5 years ago

Sorry about confusion. I installed the package from CRAN but it is not working ... that's why I am here haha.

suiji commented 5 years ago

Right - CRAN gives you the option of installing pre-built Windows binaries, so you never encountered the CXX14 warning. Now that you are attempting to build from source, however, installation itself is failing.

csetzkorn commented 5 years ago

What is the timeline for putting the latest version on CRAN?

suiji commented 5 years ago

We really want to get something up in the next 3 to 4 weeks. The plan was to include importance testing and hooks for multi-socket parallelization, both of which have been requested by users. We may just release 0.2-0 as an infrastructure and quality update, introducing the new features as follow-ups.

csetzkorn commented 5 years ago

Do you reckon there is someone who could build me a package for windows 7? I would really love to use your package for a model refresh in September ensuring monotonic decrease of the dependent variable given one particular IV.

suiji commented 5 years ago

Let me try building some binaries for you on Windows 10.

suiji commented 5 years ago

There is at least one construction in the code preventing the public server from completing a Windows build. This probably stems from its using an older compiler. Once these constructions have been flushed out, it should be possible to offer a Windows binary.

suiji commented 5 years ago

The public server now builds 0.2-0 successfully for R 3.6. There is a zipped binary available.

csetzkorn commented 5 years ago

Thanks - amazing! Where is it please?

suiji commented 5 years ago

It's on the WinBuilder project's server, file Rborist_0.2-0.zip, under a randomly-generated directory name having a life expectancy of 2-3 days:

https://win-builder.r-project.org/Ww7mv42yq0iY/

Pointing you here seems like a better option than posting binaries and tarballs on Github. Github's journaling facility is not really appropriate for these types of files.

This version, or something very similar, is heading to CRAN shortly. I am going over the quantile results now, but would appreciate any feedback you might offer before freezing the code. Please note that invoking the quantile feature produces a histogram of all the regression means, viz., it does not (yet) perform actual quantile regression.

csetzkorn commented 5 years ago

Thanks a lot. This works now. The only issue I have, is that the predictions are the same for different quantiles - see example code below. Maybe I do something wrong or this dataset is too simple?

library(dplyr)
library(ggplot2)
library(Rborist)
library(datasets)
library(reshape2)
library(reshape)

rm(list = ls())

setwd("C:/Data")
set.seed(42)

data(iris)

temp <- iris %>% select(Sepal.Width, Petal.Length, Petal.Width, Sepal.Length)

x <- subset(temp, select = -c(Sepal.Length))

model <- Rborist(x, temp$Sepal.Length,
                 quantVec = seq(0.1, 1.0, by = 0.1),
                 quantiles = TRUE,
                 minNode = 5,
                 regMono = c(-1.0, 0.0, 0.0))

temp$prediction_50 <- predict(model, x, quantVec = seq(0.5))$yPred
temp$prediction_60 <- predict(model, x, quantVec = seq(0.6))$yPred
temp$prediction_90 <- predict(model, x, quantVec = seq(0.9))$yPred

write.csv(temp, "temp.csv", row.names=FALSE)

temp1 = melt(temp, id.vars = c("Sepal.Length"), measure.vars = c("prediction_50", "prediction_60", "prediction_90"))

ggplot(temp1, aes(x = value, y = Sepal.Length, color=variable, group=variable)) + geom_point()

csetzkorn commented 5 years ago

I just checked this on a more realistic dataset and the predictions for different quantiles are also the same. However, maybe I am doing something wrong in the above code (e.g. how I obtain the predictions for different quantiles and/or how I fit the model)?

suiji commented 5 years ago

There are a couple of problems in the test code:

i) You probably want, for example, "seq(0.6, 1)" instead of "seq(0.6)". The latter defaults to an integer value of 1. In the case of quantiles, this would always yield the maximum prediction: 100th percentile.

ii) You definitely want "$qPred" and not "$yPred". The latter is simply giving you the predicted y-value.
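The first point can be checked in plain R without Rborist. The short sketch below shows what each `seq()` call evaluates to; the variable names are illustrative only.

```r
# With a single numeric argument, seq(n) is evaluated as 1:n.
# For n = 0.6 that is the descending sequence 1:0.6, which holds only 1.
q_single <- seq(0.6)

# With two arguments, the sequence starts at 0.6 and steps by 1,
# so it holds only 0.6 (the next value, 1.6, would exceed the limit 1).
q_pair <- seq(0.6, 1)

# Several quantiles are most plainly written as an explicit vector.
q_explicit <- c(0.5, 0.6, 0.9)

print(q_single)    # 1
print(q_pair)      # 0.6
print(q_explicit)  # 0.5 0.6 0.9
```

So `seq(0.5)`, `seq(0.6)` and `seq(0.9)` all request the same quantile, 1, whereas writing `0.6` directly (or `c(0.5, 0.6, 0.9)`) states the intent without relying on `seq()` defaults.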

Finally, yesterday's characterization of quantile prediction was not quite precise. In lieu of true quantile regression, what is actually being computed for a given observation, is a sample-weighted mean of response values over binned ranks. The ranks are binned from all terminal node predictions in the (possibly unbagged) forest.

csetzkorn commented 5 years ago

Thanks. I will try this or is it safer to wait until the official CRAN release, which is imminent?

csetzkorn commented 5 years ago

Tried the code below but results do not make much sense.

library(dplyr)
library(ggplot2)
library(Rborist)
library(datasets)
library(reshape2)
library(reshape)

rm(list = ls())

setwd("C:/Data")
set.seed(42)

data(iris)

temp <- iris %>% select(Sepal.Width, Petal.Length, Petal.Width, Sepal.Length)

x <- subset(temp, select = -c(Sepal.Length))

model <- Rborist(x, temp$Sepal.Length,
                 quantVec = seq(0.1, 1.0, by = 0.1),
                 quantiles = TRUE,
                 #minNode = 5,
                 regMono = c(-1.0, 0.0, 0.0))

temp$prediction_50 <- predict(model, x, quantVec = seq(0.5, 1))$qPred
temp$prediction_60 <- predict(model, x, quantVec = seq(0.6, 1))$qPred
temp$prediction_90 <- predict(model, x, quantVec = seq(0.9, 1))$qPred

write.csv(temp, "temp.csv", row.names=FALSE)

temp1 = melt(temp, id.vars = c("Sepal.Length"), measure.vars = c("prediction_50", "prediction_60", "prediction_90"))

ggplot(temp1, aes(x = value, y = Sepal.Length, color=variable, group=variable)) + geom_point()

suiji commented 5 years ago

I reran your example, but passed 'oob = TRUE' to the prediction method. The first fifty or so quantile ranges look "good", in that the 'yPred' value is close to the median. Somewhere beyond row fifty, however, things go south (or, better perhaps, north): 'yPred' tends to hover above the 90th percentile. I concur that an error lurks here.

Let me get back to you, as this is gating the CRAN release.

Thanks again.

suiji commented 4 years ago

Some, possibly all, of this issue has been addressed by the most recent changes posted to Github. I will get back to you with a link to the binaries, as it becomes available.

Your test isolated some poor behaviour not revealed by our own tests, probably because of the interplay with monotonicity.

csetzkorn commented 4 years ago

Thanks - great stuff!

suiji commented 4 years ago

The link just became available. As before, it should be good for about three days. If it expires before you get a chance to download, please post a note on this thread to arrange a private transfer.

https://win-builder.r-project.org/cSByzxbK28J3/

Thank you for your attention to this. Quantiles have not received a lot of interest recently.

csetzkorn commented 4 years ago

Thanks, amazing! I ran the code below. It appears to produce more sensible results. Just curious, am I accessing the predictions for different quantiles correctly? I am also wondering, would it be possible to obtain the package build for R 3.5.x? Unfortunately, upgrading my deployment environment won't be straightforward )-:

library(dplyr)
library(ggplot2)
library(Rborist)
library(datasets)
library(reshape2)
library(reshape)

library(installr)

uninstall.packages("Rborist")

rm(list = ls())

setwd("C:/Data")
set.seed(42)

data(iris)

temp <- iris %>% select(Sepal.Width, Petal.Length, Petal.Width, Sepal.Length)

x <- subset(temp, select = -c(Sepal.Length))

model <- Rborist(x, temp$Sepal.Length,
                 quantVec = seq(0.1, 1.0, by = 0.1),
                 quantiles = TRUE,
                 #minNode = 5,
                 regMono = c(-1.0, 0.0, 0.0))

temp$prediction_50 <- predict(model, x, quantVec = seq(0.5, 1))$qPred
temp$prediction_60 <- predict(model, x, quantVec = seq(0.6, 1))$qPred
temp$prediction_90 <- predict(model, x, quantVec = seq(0.9, 1))$qPred

write.csv(temp, "temp.csv", row.names=FALSE)

temp1 = melt(temp, id.vars = c("Sepal.Length"), measure.vars = c("prediction_50", "prediction_60", "prediction_90"))

ggplot(temp1, aes(x = value, y = Sepal.Length, color=variable, group=variable)) + geom_point()

suiji commented 4 years ago

What you're doing is not incorrect, although it is not especially concise. You may want to try:

temp$prediction <- predict(model, x, quantVec=c(0.5, 0.6, 0.9))

Then temp$prediction$qPred will hold a 150 x 3 array bearing the quantile estimates for each row. Further, temp$prediction$yPred will contain the y-value estimates, facilitating direct comparison. Finally, temp$prediction$qEst will display the quantiles of the y-estimates themselves.
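A minimal sketch of the single-call approach described above, on the same iris data. It assumes the 0.2-0 interface as discussed in this thread, in particular that the prediction object exposes `yPred` and `qPred`; other releases of the package may name or nest these fields differently. Storing the result in its own variable `pred`, rather than in a column of `temp`, is an illustrative choice. The `requireNamespace()` guard only lets the snippet be skipped cleanly where Rborist is not installed.

```r
if (requireNamespace("Rborist", quietly = TRUE)) {
  library(Rborist)
  set.seed(42)

  # Predictors and response as in the examples earlier in the thread.
  x <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
  y <- iris$Sepal.Length

  # Train with quantile support and a monotone-decreasing constraint
  # on the first predictor, as in the original report.
  model <- Rborist(x, y, quantiles = TRUE, regMono = c(-1.0, 0.0, 0.0))

  # One prediction call covering all three quantiles of interest.
  pred <- predict(model, x, quantVec = c(0.5, 0.6, 0.9))

  # Assumed layout: one row per observation, one column per quantile.
  print(dim(pred$qPred))
  print(head(cbind(yPred = pred$yPred, pred$qPred)))
}
```

Placing `yPred` next to the quantile columns makes the comparison suggested above immediate: for each row, the predicted mean can be read against its estimated 50th, 60th and 90th percentiles.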

WinBuilder currently hosts R 3.6 and R 3.7 versions. For 3.5.x, your best bets may either be trying to build locally or waiting until CRAN offers a 0.2-0 "oldrel" package.

csetzkorn commented 4 years ago

Thanks.

What would be the best way to install from Github? I tried this:

library(devtools)
install_github("suiji/Arborist")

but get a 404 error. So I do not even get the address correct ...

suiji commented 4 years ago

It has to build from a tarball, as the Github source tree does not have the structure expected by devtools. You would need to obtain a tarball, then fiddle with the flags defined in .R/Makevars.

The windows "oldrel" version of 0.2-0 should be available from CRAN in the next day or two. Half of the packages have already been installed.

suiji commented 4 years ago

It looks like the 'oldrel' version has posted on CRAN.

Closing this thread, but please feel free to reopen as needed.