imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Predict performance #519

Open kadyb opened 4 years ago

kadyb commented 4 years ago

I compared the speed of the predict method from the ranger and randomForest packages and noticed a significant difference: with 1 million observations, randomForest is more than four times faster. Where does this difference come from?

library("randomForest")
library("ranger")

iris_data = iris

mdl_randomForest = randomForest(Species ~ ., iris_data)
mdl_ranger = ranger(Species ~ ., iris_data)

data_gen = data.frame(
  Sepal.Length = rnorm(1000000, mean(iris_data[, 1]), sd(iris_data[, 1])),
  Sepal.Width = rnorm(1000000, mean(iris_data[, 2]), sd(iris_data[, 2])),
  Petal.Length = rnorm(1000000, mean(iris_data[, 3]), sd(iris_data[, 3])),
  Petal.Width = rnorm(1000000, mean(iris_data[, 4]), sd(iris_data[, 4]))
)

start_time = Sys.time()
x1 = predict(mdl_ranger, data_gen, verbose = FALSE)
Sys.time() - start_time
#> Time difference of 1.068823 mins

start_time = Sys.time()
x2 = predict(mdl_randomForest, data_gen)
Sys.time() - start_time
#> Time difference of 14.75926 secs
Session Info

```r
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ranger_0.12.1       randomForest_4.6-14

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6    lattice_0.20-41 digest_0.6.25   crayon_1.3.4
 [5] IRdisplay_0.7.0 grid_3.6.3      repr_1.1.0      jsonlite_1.6.1
 [9] evaluate_0.14   pillar_1.4.4    rlang_0.4.6     uuid_0.1-4
[13] Matrix_1.2-18   IRkernel_1.1    tools_3.6.3     compiler_3.6.3
[17] base64enc_0.1-3 htmltools_0.4.0 pbdZMQ_0.3-3
```

mnwright commented 4 years ago

Prediction is not well optimized. We should do some profiling! See also #133 and #500.
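
One possible first pass at that profiling from the R side, assuming base R's sampling profiler (`Rprof` and `summaryRprof` from utils) is good enough to start with. It only attributes time to R-level calls, so time spent inside ranger's compiled code is lumped under the R function that called it; looking inside the C++ part would need a native profiler. The forest size and row count are arbitrary values picked for illustration.

```r
library(ranger)

set.seed(1)
mdl <- ranger(Species ~ ., data = iris, num.trees = 500)

# Synthetic prediction data; n is an arbitrary illustrative size
n <- 100000
new_data <- data.frame(
  Sepal.Length = rnorm(n, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
  Sepal.Width  = rnorm(n, mean(iris$Sepal.Width),  sd(iris$Sepal.Width)),
  Petal.Length = rnorm(n, mean(iris$Petal.Length), sd(iris$Petal.Length)),
  Petal.Width  = rnorm(n, mean(iris$Petal.Width),  sd(iris$Petal.Width))
)

# Sample the R call stack while predict() runs
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
pred <- predict(mdl, data = new_data, verbose = FALSE)
Rprof(NULL)

# R-level functions ranked by total time (including callees)
prof <- summaryRprof(prof_file)
head(prof$by.total, 10)
```

If most of the time lands in R-level data handling rather than in the compiled call, that points at the wrapper rather than at the tree traversal itself.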

toddmmorley commented 4 years ago

When I call predict() passing it thousands of high-dimensional data points in a data frame, it computes all of the predictions in about one second. But when I call predict() passing it one data point at a time (one row in the data frame), the performance is abysmal--about two seconds per row. I want to use my ranger random forest for streaming analytics in a context where I need to predict about once per second. That means I need predict() to return a single prediction in around one tenth of a second. Can you provide an efficient single-point prediction function?
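
A small sketch for measuring the gap described above: it times one `predict()` call over a whole data frame against one call per row. The forest and data here (iris, 500 trees, 200 rows) are stand-ins chosen only for illustration and are far smaller than the high-dimensional case in the comment, so the absolute numbers will differ.

```r
library(ranger)

set.seed(1)
mdl <- ranger(Species ~ ., data = iris, num.trees = 500)

# 200 rows of predictors, recycled from iris (illustrative only)
new_data <- iris[rep(seq_len(nrow(iris)), length.out = 200), 1:4]

# One predict() call for all rows
t_batch <- system.time(
  p_batch <- predict(mdl, data = new_data, verbose = FALSE)$predictions
)[["elapsed"]]

# One predict() call per row
t_rows <- system.time(
  p_rows <- vapply(
    seq_len(nrow(new_data)),
    function(i) {
      row <- new_data[i, , drop = FALSE]
      as.character(predict(mdl, data = row, verbose = FALSE)$predictions)
    },
    character(1)
  )
)[["elapsed"]]

c(
  batch_total    = t_batch,
  per_row_total  = t_rows,
  per_row_mean   = t_rows / nrow(new_data)
)
```

The `per_row_mean` value approximates the fixed cost of a single call, which is the figure that matters for a streaming setup.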

mnwright commented 4 years ago

In the current setup, the tree structure needs to be converted from R to C++ when calling predict() in R. If you call several times, this has to be done each time... We should think about an alternative tree structure.
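
Since the conversion cost is paid once per `predict()` call, one workaround on the caller's side is to buffer incoming rows and predict them together. The sketch below assumes the application can tolerate waiting for a batch to fill; `make_batch_predictor` and its arguments are hypothetical names for illustration, not part of ranger.

```r
library(ranger)

# Hypothetical helper: buffers rows and calls predict() once per full batch.
# Returns NULL while the buffer is still filling.
make_batch_predictor <- function(model, batch_size) {
  buffer <- list()
  function(row = NULL, flush = FALSE) {
    if (!is.null(row)) buffer[[length(buffer) + 1L]] <<- row
    if (length(buffer) == 0L || (!flush && length(buffer) < batch_size)) {
      return(NULL)
    }
    batch <- do.call(rbind, buffer)
    buffer <<- list()
    predict(model, data = batch, verbose = FALSE)$predictions
  }
}

set.seed(1)
mdl <- ranger(Species ~ ., data = iris, num.trees = 100)
push <- make_batch_predictor(mdl, batch_size = 10)

# Simulate 25 rows arriving one at a time
results <- list()
for (i in 1:25) {
  out <- push(iris[i, 1:4])
  if (!is.null(out)) results[[length(results) + 1L]] <- out
}

# Predict whatever is left in the buffer (5 rows here)
results[[length(results) + 1L]] <- push(flush = TRUE)
lengths(results)
```

This trades latency for throughput, so it does not help when every single record needs an immediate answer.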

toddmmorley commented 4 years ago

Thanks for your kind reply. It certainly would be a boon for streaming-analytics applications to have a predict() call for a single record run about as fast as predict() operates for each of N records, when we pass a whole data frame to predict() in one go. For my current project we've had to reduce our forest by an order of magnitude, to get acceptable runtime performance.

I hope you'll let me know if you come up with something.

Brilliant library!
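
For exploring the forest-size trade-off without retraining, `predict()` for ranger objects accepts a `num.trees` argument that uses only the first trees of the forest. The sketch below times single-row predictions for several tree counts. Whether this also lowers the per-call conversion overhead discussed in this thread is an assumption to verify by measuring; the iris model and the tree counts are illustrative only.

```r
library(ranger)

set.seed(1)
mdl <- ranger(Species ~ ., data = iris, num.trees = 500)
one_row <- iris[1, 1:4]

# Average elapsed time per single-row predict() call for each tree count
tree_counts <- c(50, 100, 250, 500)
reps <- 20
timings <- sapply(tree_counts, function(k) {
  system.time(
    for (j in seq_len(reps)) {
      predict(mdl, data = one_row, num.trees = k, verbose = FALSE)
    }
  )[["elapsed"]] / reps
})

data.frame(num.trees = tree_counts, seconds_per_call = timings)
```

Any speedup should be weighed against the accuracy lost by using fewer trees, which needs checking on held-out data.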
