The time for the first prediction includes conversion to FIL format. The FIL-converted tree is cached after the first call. If you have a tiny dataset and don't want to pay the setup cost, you can use the CPU-based predict call, which is much lower throughput but lower latency.
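The convert-on-first-call-then-cache behavior described here can be illustrated with a toy sketch (pure Python; `LazyFILModel` is a made-up name for illustration, not cuML's API, and `time.sleep` stands in for the conversion cost):

```python
import time

class LazyFILModel:
    """Toy stand-in for cuML's behavior: pay the RF->FIL conversion cost
    on the first predict call, then reuse the cached converted model."""

    def __init__(self):
        self._fil = None  # no FIL model yet

    def _convert(self):
        time.sleep(0.05)  # stands in for the expensive RF->FIL conversion
        return object()   # stands in for the converted FIL model

    def predict(self):
        if self._fil is None:        # first call pays the conversion cost
            self._fil = self._convert()
        return "prediction"          # subsequent calls hit the cache

m = LazyFILModel()
t0 = time.perf_counter(); m.predict(); first = time.perf_counter() - t0
t0 = time.perf_counter(); m.predict(); second = time.perf_counter() - t0
assert first > second  # the cached call skips the conversion entirely
```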
I'm going to rephrase this as a feature request to speed up FIL translation and leave it in the FEA queue.
Thanks @JohnZed ! That sounds great
Note that #2263 was merged yesterday, which speeds up serialization of RF objects. We should run the benchmark again to obtain the new measurement for RF->FIL conversion.
@hcho3 looks like the serialization changes made a meaningful improvement. The below example is from the 2020-07-01 nightly as of 3 PM EDT. Looks to be about 4 seconds shaved off, or about 1/3 of the time.
```python
import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

X, y = make_classification(
    n_samples=1000000,
    n_features=20,
    n_informative=18,
    n_classes=2,
    random_state=0,
)

n_trees = 300

X = X.astype("float32")
y = y.astype("int32")
gX, gy = cp.asarray(X), cp.asarray(y)

clf1 = gpu_rf(n_estimators=n_trees)
clf1.fit(gX, gy)
%time clf1.predict(gX)
%time clf1.predict(gX)
```

```
CPU times: user 7.57 s, sys: 1.21 s, total: 8.77 s
Wall time: 7.87 s
CPU times: user 878 ms, sys: 345 ms, total: 1.22 s
Wall time: 1.22 s
array([1, 0, 1, ..., 0, 1, 1], dtype=int32)
```
The conversion slowdown appears strongly related to the number of features. While not surprising, it's interesting to see it play out. I wonder if there is an inflection point.
```python
import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

n_trees = 300

for nfeat in [5, 10, 15, 20]:
    X, y = make_classification(
        n_samples=1000000,
        n_features=nfeat,
        n_informative=nfeat - 2,
        n_classes=2,
        random_state=0,
    )
    X = X.astype("float32")
    y = y.astype("int32")
    gX, gy = cp.asarray(X), cp.asarray(y)

    clf1 = gpu_rf(n_estimators=n_trees)
    clf1.fit(gX, gy)

    print(f"{nfeat} Features")
    %time clf1.predict(gX)
    print()
```

```
5 Features
CPU times: user 1.33 s, sys: 35.9 ms, total: 1.36 s
Wall time: 404 ms

10 Features
CPU times: user 5.45 s, sys: 687 ms, total: 6.14 s
Wall time: 5.24 s

15 Features
CPU times: user 6.19 s, sys: 785 ms, total: 6.97 s
Wall time: 6.43 s

20 Features
CPU times: user 7.23 s, sys: 570 ms, total: 7.8 s
Wall time: 6.88 s
```
Paying the one-time cost is probably more impactful in a cross-validation workflow, in which potentially many unique models call predict over their lifecycle. We'd end up with a linear lower bound on total time of num_models x RF->FIL conversion time. With that said, if it's still faster than other approaches, we're still coming out ahead.
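That lower bound is simple arithmetic, but it adds up quickly. A quick illustration with hypothetical fold and candidate counts (the 7.87 s conversion time is the first-call wall time measured in the benchmark above):

```python
# Hypothetical cross-validation workload: 5 folds x 12 hyperparameter candidates
n_folds, n_candidates = 5, 12
conversion_time_s = 7.87  # first-call wall time from the benchmark above

num_models = n_folds * n_candidates
lower_bound_s = num_models * conversion_time_s  # linear lower bound on total time
assert num_models == 60
assert lower_bound_s > 470  # nearly 8 minutes spent on conversion alone
```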
Update: Pickle protocol 5 speeds up RF -> FIL conversion further. It uses a technique called "out-of-band serialization" to speed up conversion between NumPy arrays and bytes.
Benchmark setup: `make_dataset.py` and `rf_benchmark.py`, adapted to the newer cuML source directory, with n_gpus=2, n_gb=2, n_features=20, depth=25, n_estimators=10. This leads to a forest consisting of 10 depth-25 trees, and we run through 25 million data rows.

| Setup | End-to-end time for prediction (sec) |
|---|---|
| Before #2263 | 269.0 |
| Most recent commit (489a7d80908271245152d7d0be7a32f4faf68928) | 108.4 |
| Most recent commit (489a7d80908271245152d7d0be7a32f4faf68928) + Pickle5 | 73.4 |
As noted in #2263, most of the run time is consumed by RF->FIL conversion.
There are two options: use Python 3.8, where the built-in `pickle` module will use Pickle protocol 5; or install `cloudpickle`, `pickle5`, and `distributed` by running the following commands:
```shell
conda install -c rapidsai -c nvidia -c rapidsai-nightly -c conda-forge cloudpickle pickle5

# Install development version of Dask and Distributed
conda remove --force distributed dask
git clone https://github.com/dask/dask.git
cd dask
python -m pip install .
cd ..
git clone https://github.com/dask/distributed.git
cd distributed
python setup.py install
```
Special thanks to @jakirkham who brought Pickle 5 to Dask.
As a note, the Distributed change ( https://github.com/dask/distributed/pull/3849 ) will be part of the 2.21.0 release.
Awesome benchmark and summary @hcho3 . Do you have a sense of how 73 seconds for prediction compares to sklearn's random forest on the same data?
> The time for the first prediction includes conversion to FIL format. The FIL-converted tree is cached after the first call. If you have a tiny dataset and don't want to pay the setup cost, you can use the CPU-based predict call, which is much lower throughput but lower latency.
I saw that only one core was used during the conversion. Maybe a multiprocess task could speed up the inference?
Probably not. Most of the savings here is avoiding copies. Once that is done, which I believe is already the case here, we are just passing pointers and metadata around until it goes over the wire. Though feel free to correct me if I'm missing anything here Philip 🙂
We need more perf benchmarks before we can conclusively say what's causing the slowdown.
> Probably not. Most of the savings here is avoiding copies. Once that is done, which I believe is already the case here, we are just passing pointers and metadata around until it goes over the wire. Though feel free to correct me if I'm missing anything here Philip 🙂
Okay. I have another question: is it possible to run inference without the conversion? In some financial scenarios, we compare the labels and predictions only a few times (sometimes just once).
This conversion step is an essential part of our inference code at the moment, but speeding it up is currently my focus and number one priority. The short version of my profiling findings is that our use of `unordered_map` is the bottleneck. Lookups are slow, and data locality is notoriously poor with `unordered_map` due to its use of a linked list for the hash-table buckets. I was able to make things a bit more efficient by eliminating some unnecessary calls to slow `unordered_map` methods, but I'm working on a proof-of-concept to replace the `unordered_map` entirely with a `vector` storing nodes in a cache-friendly layout. I'll continue to update this thread as the work progresses.
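The map-to-flat-storage idea can be sketched in a few lines (a Python stand-in for the C++ change, assuming node ids are dense integers, which the RF walk produces):

```python
# Hash-map style storage: each lookup hashes the key and chases pointers
# through bucket lists, which is what makes unordered_map slow here.
nodes_by_id = {}
for i in range(8):
    nodes_by_id[i] = ("node", i)

# Flat, vector-style storage: contiguous, cache-friendly, and lookup is a
# single index computation -- valid because the node ids are dense integers.
flat_nodes = [None] * 8
for i in range(8):
    flat_nodes[i] = ("node", i)

assert flat_nodes[3] == nodes_by_id[3]  # same contents, cheaper access path
```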
A brief update on profiling and where we're at with this:
I'm currently profiling the single-GPU case only; I'll be looking at the MNMG class independently later. All reported results below are for randomly-generated (but consistent) data with 100,000 samples and 20 features. An RF classifier is trained with 300 trees, and prediction is run once on the same data used for training. I've done some investigation with other parameters, but I will stick to reporting these (apparently fairly representative) results unless otherwise noted.
The relevant method for this issue is `_obtain_treelite_handle`, and it is indeed the single greatest contributor to runtime for our `predict` call (the other being `load_using_treelite_handle`). Results below are all focused on `_obtain_treelite_handle` for the moment, though we may also investigate `load_using_treelite_handle` shortly.

On current Treelite mainline, `_obtain_treelite_handle` took about 68 seconds of the 79 seconds used for prediction, with almost the entirety of the remaining 11 seconds going to `load_using_treelite_handle`. Within `_obtain_treelite_handle`, the following methods had the greatest overall contribution to runtime:
- `TreeliteTreeBuilderSetNumericalTestNode`: 16.86 s
- `TreeliteTreeBuilderCreateNode`: 16.28 s
- `TreeliteTreeBuilderCreateValue`: 13.44 s
- `TreeliteDeleteModelBuilder`: 6.82 s
- `TreeliteTreeBuilderSetLeafNode`: 3.85 s

For the moment, we will ignore the deletion method. Breaking the other methods down further, it became clear that `unordered_map` methods were by far the greatest contributor to overall runtime across all of these methods. Not only is `unordered_map` inherently slow due to its linked-list bucket implementation, but some safety checks were being performed redundantly, causing the map to find an entry and then immediately find it again rather than performing both the check and retrieval together. The most problematic use of `unordered_map` is for mapping node ids to nodes, but there is also a small (but surprisingly significant) slowdown due to its use for mapping operator names to operators via the `optable` variable.
- node-id `unordered_map` `count`: 8.08 s
- `optable` `unordered_map` `count`: 3.25 s
- node-id `unordered_map` `operator[]`: 13.42 s
- `optable` `unordered_map` `at`: 1.63 s

With this profiling data available, I created a proof-of-concept PR for Treelite that moves from an `unordered_map` to `FastMap`, a hash table with open addressing and linear local probing, for tracking the node id mappings. The inefficient double-lookups have not been eliminated yet because my iterator implementation for `FastMap` is inefficient enough that the double-lookups are actually cheaper than using the iterator. This can be improved on with later refinements. I also shifted from a table to a simple lookup function for `optable`.
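For readers unfamiliar with the technique, here is a minimal Python sketch of open addressing with linear probing (no resizing or deletion; an illustration of the idea, not FastMap's actual implementation):

```python
class FastMapSketch:
    """Open-addressing hash table with linear probing: keys and values live in
    flat arrays, so a lookup is hashing plus a short contiguous scan, with no
    per-bucket linked lists or pointer chasing."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=16):
        self._keys = [self._EMPTY] * capacity
        self._vals = [None] * capacity

    def _slot(self, key):
        cap = len(self._keys)
        i = hash(key) % cap
        # Linear probe: on collision, walk forward to the next slot.
        while self._keys[i] is not self._EMPTY and self._keys[i] != key:
            i = (i + 1) % cap
        return i

    def __setitem__(self, key, val):
        i = self._slot(key)
        self._keys[i], self._vals[i] = key, val

    def __getitem__(self, key):
        i = self._slot(key)
        if self._keys[i] is self._EMPTY:
            raise KeyError(key)
        return self._vals[i]

    def count(self, key):
        return 0 if self._keys[self._slot(key)] is self._EMPTY else 1

m = FastMapSketch()
m[10] = "root"
m[26] = "leaf"  # 26 % 16 == 10: collides with key 10, probes to slot 11
assert m[10] == "root" and m[26] == "leaf"
assert m.count(7) == 0
```

Because the probe walks adjacent array slots, collided entries stay in the same cache lines, which is where the locality win over `unordered_map`'s bucket lists comes from.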
Reviewing the same data presented for mainline, we have:
- `TreeliteTreeBuilderSetNumericalTestNode`: 5.60 s (3.01x speedup)
- `TreeliteTreeBuilderCreateNode`: 7.35 s (2.21x speedup)
- `TreeliteTreeBuilderCreateValue`: 10.31 s (1.30x speedup)
- `TreeliteDeleteModelBuilder`: 2.90 s (2.35x speedup)
- `TreeliteTreeBuilderSetLeafNode`: 1.64 s (2.35x speedup)

And for the low-level breakdown:

- `FastMap` `count`: 0.12 s (67.33x speedup)
- `optable` function calls: 0.32 s (15.25x speedup)
- `FastMap` `[]`: 1.22 s (11.0x speedup)

The moral of the story is that linked lists are evil, and hence so are stdlib maps ;). The overall runtime for `_obtain_treelite_handle` was approximately 38 s, for a total speedup of about 1.79x.
With profiling results from the PoC, we see that `TreeliteTreeBuilderCreateValue` is now the worst bottleneck, followed by `TreeliteTreeBuilderCreateNode`. This seems to be the result of performing many small memory allocations. I will investigate this further, along with the possibility of other larger structural changes that would either eliminate the need to pass through Treelite or do a faster Treelite build by leveraging the stronger guarantees of node ordering provided by our RF conversion step.
Another update based on the same profiling setup. My general approach has been to develop cheap/quick PoCs exploring different possible avenues for conversion speedup and using them to guide further exploration and long-term development decisions. A quick summary of a few areas of investigation and the runtime for a single prediction (`predict` call) with the corresponding PoC:
EDIT: Removing earlier results based on build against incorrect Treelite library to avoid confusion.
Based on offline discussion today, I'll be looking at parallelization of the TL->FIL conversion and ensuring that at least the FastMap + RF->TL parallelization + TL->FIL pre-allocation approach makes it into 0.18. I'll then turn my attention to migrating the direct RF->FIL conversion from a PoC to a more robust and optimized implementation. This may make it into 0.18 as an experimental feature, but it may also get pushed back to 0.19.
Closed by #3395 - there is more room for optimization but this is by far the most important speedup we need
The brief version of the final speedup we obtained was that we got about a 21.23x speedup relative to baseline for the parameters described above. I'll do one more "matrix" of runs with a variety of tree depths, number of features etc., and I'll post that table here for final comparison.
In today's nightly (cuml commit f1f1c7f6a), the `predict` method of the random forest classifier takes quite a bit of time the first time it's called on a 1M-row binary classification problem, but is much faster the second time. Perhaps this could be related to #1922?

After this, I added a print statement in the `predict` method to see if it's using the GPU path, which it appears to be.

https://github.com/rapidsai/cuml/blob/4b3213d9dac68c9dce4600f82305c2989ce67c2d/python/cuml/ensemble/randomforestclassifier.pyx#L869-L871