Open kadyb opened 4 years ago
Prediction is not well optimized yet. We should do some profiling! See also #133 and #500.
When I call predict() and pass it thousands of high-dimensional data points in a data frame, it computes all of the predictions in about one second. But when I call predict() with one data point at a time (one row of the data frame), the performance is abysmal: about two seconds per row. I want to use my ranger random forest for streaming analytics in a context where I need to predict about once per second. That means I need predict() to return a single prediction in around one tenth of a second. Can you provide an efficient single-point prediction function?
In the current setup, the tree structure needs to be converted from R to C++ when calling predict() in R. If you call it several times, the conversion has to be done each time. We should think about an alternative tree structure.
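The fixed per-call cost described above can be made visible by timing one batched predict() call against one call per row. This is only a minimal sketch: the iris data, the forest size, and all variable names are assumptions chosen for illustration and do not come from this thread, and absolute timings will depend on the model and machine.

```r
library(ranger)

# Illustrative model; iris and num.trees = 500 are assumptions for this sketch.
set.seed(1)
fit <- ranger(Species ~ ., data = iris, num.trees = 500)

# Rows to score (all predictor columns, response dropped).
new_data <- iris[1:50, -5]

# One call for all rows: the forest is passed to the C++ side once.
batch_time <- system.time(
  batch_pred <- predict(fit, data = new_data)$predictions
)

# One call per row: the per-call setup cost is paid for every single row.
row_time <- system.time(
  row_pred <- vapply(
    seq_len(nrow(new_data)),
    function(i) {
      p <- predict(fit, data = new_data[i, , drop = FALSE])$predictions
      as.character(p)
    },
    character(1)
  )
)

# Elapsed seconds for each strategy.
print(c(batch = batch_time[["elapsed"]], per_row = row_time[["elapsed"]]))
```

If the application can tolerate a small delay, buffering incoming rows and scoring them in small batches is one way to amortize that fixed cost until a faster single-row path exists.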
Thanks for your kind reply. It certainly would be a boon for streaming-analytics applications to have a predict() call for a single record run about as fast as predict() operates for each of N records, when we pass a whole data frame to predict() in one go. For my current project we've had to reduce our forest by an order of magnitude, to get acceptable runtime performance.
I hope you'll let me know if you come up with something.
Brilliant library!
Todd Morley
I compared the speed of the predict method from the ranger and randomForest packages and noticed a significant difference: with 1 million observations, randomForest was more than four times faster. Where does this difference come from?
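A comparison like this could be reproduced along the following lines. This is a hedged sketch, not the benchmark that was actually run: the synthetic data, the helper `make_data()`, the number of trees, and the reduced test size are all assumptions made to keep the example quick.

```r
library(ranger)
library(randomForest)

set.seed(42)
n_train <- 1000
n_test <- 100000  # the comparison above used 1 million rows; reduced here for speed

# Hypothetical helper: five Gaussian predictors (V1..V5) and a noisy linear response.
make_data <- function(n) {
  x <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
  x$y <- x$V1 + 2 * x$V2 + rnorm(n)
  x
}
train <- make_data(n_train)
test <- make_data(n_test)

# Fit both forests with the same number of trees.
rf_ranger <- ranger(y ~ ., data = train, num.trees = 100)
rf_classic <- randomForest(y ~ ., data = train, ntree = 100)

# Time prediction only, not training.
t_ranger <- system.time(
  p_ranger <- predict(rf_ranger, data = test)$predictions
)
t_classic <- system.time(
  p_classic <- predict(rf_classic, newdata = test)
)

# Elapsed seconds for each package.
print(c(ranger = t_ranger[["elapsed"]], randomForest = t_classic[["elapsed"]]))
```

The result may vary with the number of threads ranger uses (the `num.threads` argument of predict()), the forest size, and the data dimensions, so it is worth reporting those alongside the timings.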
Session Info
```r
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ranger_0.12.1       randomForest_4.6-14

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6    lattice_0.20-41 digest_0.6.25   crayon_1.3.4
 [5] IRdisplay_0.7.0 grid_3.6.3      repr_1.1.0      jsonlite_1.6.1
 [9] evaluate_0.14   pillar_1.4.4    rlang_0.4.6     uuid_0.1-4
[13] Matrix_1.2-18   IRkernel_1.1    tools_3.6.3     compiler_3.6.3
[17] base64enc_0.1-3 htmltools_0.4.0 pbdZMQ_0.3-3
```