rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Speed up RF -> FIL conversion for inference #2399

Closed beckernick closed 3 years ago

beckernick commented 4 years ago

In today's nightly (cuml commit f1f1c7f6a), the predict method of the random forest classifier takes quite a bit of time the first time it's called on 1M rows of binary classification data, but is much faster the second time. Perhaps this could be related to #1922?

import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

X, y = make_classification(
    n_samples=1000000,
    n_features=20,
    n_informative=18,
    n_classes=2,
    random_state=0,
)

n_trees = 300

X = X.astype("float32")
y = y.astype("int32")

gX, gy = cp.asarray(X), cp.asarray(y)

clf1 = gpu_rf(n_estimators=n_trees)
clf1.fit(gX, gy)

%time clf1.predict(gX)
%time clf1.predict(gX)
CPU times: user 11.4 s, sys: 3.08 s, total: 14.5 s
Wall time: 13.6 s
CPU times: user 2.14 s, sys: 397 ms, total: 2.54 s
Wall time: 2.53 s
array([1., 0., 1., ..., 0., 1., 1.], dtype=float32)

After this, I added a print statement in the predict method to see if it's using the GPU path, which it appears to be.

https://github.com/rapidsai/cuml/blob/4b3213d9dac68c9dce4600f82305c2989ce67c2d/python/cuml/ensemble/randomforestclassifier.pyx#L869-L871

import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf
X, y = make_classification(
    n_samples=1000000,
    n_features=20,
    n_informative=18,
    n_classes=2,
    random_state=0,
)
n_trees = 300
X = X.astype("float32")
y = y.astype("int32")
gX, gy = cp.asarray(X), cp.asarray(y)
clf1 = gpu_rf(n_estimators=n_trees)
clf1.fit(gX, gy, convert_dtype=False)
%time clf1.predict(gX)
%time clf1.predict(gX)
using GPU
CPU times: user 11.4 s, sys: 1.34 s, total: 12.7 s
Wall time: 11.7 s
using GPU
CPU times: user 1.71 s, sys: 259 ms, total: 1.97 s
Wall time: 1.97 s
clf1 = gpu_rf(n_estimators=n_trees)
clf1.fit(gX, gy, convert_dtype=True)
%time clf1.predict(gX)
%time clf1.predict(gX)
using GPU
CPU times: user 11.2 s, sys: 1.21 s, total: 12.4 s
Wall time: 11.4 s
using GPU
CPU times: user 1.64 s, sys: 312 ms, total: 1.95 s
Wall time: 1.95 s
JohnZed commented 4 years ago

The time for the first prediction includes conversion to FIL format. The FIL-converted tree is cached after the first call. If you have a tiny dataset and don't want to pay the setup cost, you can use the CPU-based predict call, which is much lower throughput but lower latency.
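For reference, a minimal sketch of that trade-off, assuming the predict_model argument exposed by cuML's RandomForestClassifier.predict() (verify against the docs for the release you're on); the parameter values here are illustrative only:

import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

X, y = make_classification(n_samples=10000, n_features=20, n_classes=2,
                           random_state=0)
gX = cp.asarray(X.astype("float32"))
gy = cp.asarray(y.astype("int32"))

clf = gpu_rf(n_estimators=100)
clf.fit(gX, gy)

# Default GPU path: the first call pays the RF -> FIL conversion,
# which is then cached for later calls.
preds_gpu = clf.predict(gX)

# CPU path: no FIL conversion, lower throughput but lower first-call latency.
preds_cpu = clf.predict(gX, predict_model="CPU")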

JohnZed commented 4 years ago

I'm going to rephrase this as a feature request to speed up FIL translation and leave it in the FEA queue.

beckernick commented 4 years ago

Thanks @JohnZed ! That sounds great

hcho3 commented 4 years ago

Note that #2263 was merged yesterday, which speeds up serialization of RF objects. We should run the benchmark again to obtain the new measurement for RF->FIL conversion.

beckernick commented 4 years ago

@hcho3 looks like the serialization changes made a meaningful improvement. The below example is from the 2020-07-01 nightly as of 3 PM EDT. Looks to be about 4 seconds shaved off, or about 1/3 of the time.

import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

X, y = make_classification(
    n_samples=1000000,
    n_features=20,
    n_informative=18,
    n_classes=2,
    random_state=0,
)

n_trees = 300

X = X.astype("float32")
y = y.astype("int32")

gX, gy = cp.asarray(X), cp.asarray(y)

clf1 = gpu_rf(n_estimators=n_trees)
clf1.fit(gX, gy)

%time clf1.predict(gX)
%time clf1.predict(gX)
CPU times: user 7.57 s, sys: 1.21 s, total: 8.77 s
Wall time: 7.87 s
CPU times: user 878 ms, sys: 345 ms, total: 1.22 s
Wall time: 1.22 s
array([1, 0, 1, ..., 0, 1, 1], dtype=int32)

The conversion slowdown appears strongly related to the number of features. While not surprising, it's interesting to see it play out. I wonder if there is an inflection point.

import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

n_trees = 300

for nfeat in [5, 10, 15, 20]:
    X, y = make_classification(
        n_samples=1000000,
        n_features=nfeat,
        n_informative=nfeat-2,
        n_classes=2,
        random_state=0,
    )

    X = X.astype("float32")
    y = y.astype("int32")

    gX, gy = cp.asarray(X), cp.asarray(y)

    clf1 = gpu_rf(n_estimators=n_trees)
    clf1.fit(gX, gy)
    print(f"{nfeat} Features")
    %time clf1.predict(gX)
    print()
5 Features
CPU times: user 1.33 s, sys: 35.9 ms, total: 1.36 s
Wall time: 404 ms

10 Features
CPU times: user 5.45 s, sys: 687 ms, total: 6.14 s
Wall time: 5.24 s

15 Features
CPU times: user 6.19 s, sys: 785 ms, total: 6.97 s
Wall time: 6.43 s

20 Features
CPU times: user 7.23 s, sys: 570 ms, total: 7.8 s
Wall time: 6.88 s

Paying the one-time cost is probably more impactful in a cross-validation workflow, where potentially many distinct models call predict over the workflow's lifecycle. The total time then has a linear lower bound of num_models x RF->FIL conversion time. With that said, if it's still faster than other approaches, we're still coming out ahead.
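A hedged sketch of that cross-validation scenario (the fold structure and parameter values here are illustrative, not from the benchmark above): each fold trains a fresh model, so each fold's first predict call pays the RF -> FIL conversion once.

import cupy as cp
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from cuml.ensemble import RandomForestClassifier as gpu_rf

X, y = make_classification(
    n_samples=1000000, n_features=20, n_informative=18,
    n_classes=2, random_state=0,
)
X = X.astype("float32")
y = y.astype("int32")

for train_idx, test_idx in KFold(n_splits=5).split(X):
    gX_train, gy_train = cp.asarray(X[train_idx]), cp.asarray(y[train_idx])
    gX_test = cp.asarray(X[test_idx])

    clf = gpu_rf(n_estimators=300)
    clf.fit(gX_train, gy_train)
    # The first predict on each freshly trained model triggers the RF -> FIL
    # conversion, so total runtime is bounded below by
    # n_splits * (per-model conversion time).
    preds = clf.predict(gX_test)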

hcho3 commented 4 years ago

Update: Pickle protocol 5 speeds up RF -> FIL conversion further. It uses a technique called "out-of-band serialization" to speed up conversion between NumPy arrays and bytes.
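A minimal, standalone sketch of what out-of-band serialization (PEP 574) does, using only the standard pickle module and NumPy rather than cuML's actual serialization code: large buffers are handed to a callback instead of being copied into the pickle byte stream.

import pickle  # protocol 5 is built in on Python 3.8+
import numpy as np

arr = np.arange(10_000_000, dtype=np.float32)

# In-band: the array's bytes are copied into the pickle stream.
in_band = pickle.dumps(arr, protocol=5)

# Out-of-band: large buffers are passed to buffer_callback as
# PickleBuffer objects instead of being copied into the stream.
buffers = []
out_of_band = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(out_of_band, buffers=buffers)
assert np.array_equal(arr, restored)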

Benchmark setup

Benchmark Results

End-to-end time for prediction (sec)

  Before #2263: 269.0 sec
  Most recent commit (489a7d80908271245152d7d0be7a32f4faf68928): 108.4 sec
  Most recent commit (489a7d80908271245152d7d0be7a32f4faf68928) + Pickle5: 73.4 sec

As noted in #2263, most of the run time is consumed by RF->FIL conversion.

How to opt into Pickle 5

There are two options (a quick check that protocol 5 is actually available is sketched after this list):

  1. Upgrade to Python 3.8, where the built-in pickle module supports Pickle protocol 5. OR
  2. Install the latest versions of cloudpickle, pickle5, and distributed by running the following commands:
    conda install -c rapidsai -c nvidia -c rapidsai-nightly -c conda-forge cloudpickle pickle5
    # Install development version of Dask and Distributed
    conda remove --force distributed dask
    git clone https://github.com/dask/dask.git
    cd dask
    python -m pip install .
    cd ..
    git clone https://github.com/dask/distributed.git
    cd distributed
    python setup.py install
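As referenced above, a quick way to check that protocol 5 is actually available in your environment (a sketch, independent of the install steps):

import pickle

try:
    import pickle5 as pickle  # backport used on Python < 3.8
except ImportError:
    pass

# Prints 5 when Pickle protocol 5 is available.
print(pickle.HIGHEST_PROTOCOL)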

Special thanks to @jakirkham who brought Pickle 5 to Dask.

jakirkham commented 4 years ago

As a note, the Distributed change ( https://github.com/dask/distributed/pull/3849 ) will be part of the 2.21.0 release.

beckernick commented 4 years ago

Awesome benchmark and summary, @hcho3! Do you have a sense of how 73 seconds for prediction compares to sklearn's random forest on the same data?

ysq151944 commented 3 years ago

The time for the first prediction includes conversion to FIL format. The FIL-converted tree is cached after the first call. If you have a tiny dataset and don't want to pay the setup cost, you can use the CPU-based predict call, which is much lower throughput but lower latency.

I saw that only one core was used during the conversion; maybe a multiprocess approach could speed up the inference?

jakirkham commented 3 years ago

Probably not. Most of the savings here is avoiding copies. Once that is done, which I believe is already the case here, we are just passing pointers and metadata around until it goes over the wire. Though feel free to correct me if I'm missing anything here Philip 🙂

hcho3 commented 3 years ago

We need more performance benchmarks before we can conclusively say what's causing the slowdown.

ysq151944 commented 3 years ago

Probably not. Most of the savings here is avoiding copies. Once that is done, which I believe is already the case here, we are just passing pointers and metadata around until it goes over the wire. Though feel free to correct me if I'm missing anything here Philip 🙂

okay...

I have another question: is it possible to run inference without the conversion?

In some financial scenarios, we compare the labels and predictions only a few times (sometimes just once).

wphicks commented 3 years ago

This conversion step is an essential part of our inference code at the moment, but speeding it up is currently my focus and number one priority. The short version of my profiling findings is that our use of unordered_map is the bottleneck. Lookups are slow, and data locality is notoriously poor with unordered_map due to its use of a linked list for the hash-table buckets. I was able to make things a bit more efficient by eliminating some unnecessary calls to slow unordered_map methods, but I'm working on a proof-of-concept to replace the unordered_map entirely with a vector storing nodes in a cache-friendly layout. I'll continue to update this thread as the work progresses.
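As a language-agnostic illustration of the layout change described above (the real code is C++ inside Treelite/cuML; the Node type and ids below are purely illustrative): when node ids are dense integers, a flat array indexed by id replaces hashing and bucket traversal with a single direct lookup.

from collections import namedtuple

Node = namedtuple("Node", ["left", "right", "feature", "threshold"])

raw_nodes = [
    Node(1, 2, 0, 0.5),     # node 0: internal split on feature 0
    Node(-1, -1, -1, 0.0),  # node 1: leaf
    Node(-1, -1, -1, 0.0),  # node 2: leaf
]

# unordered_map-style layout: id -> node through a hash table
nodes_by_id = {node_id: node for node_id, node in enumerate(raw_nodes)}
left_child = nodes_by_id[nodes_by_id[0].left]

# vector-style layout: the id is the index, so lookup needs no hashing
# (and in the C++ version the nodes also sit contiguously in memory)
nodes = raw_nodes
left_child = nodes[nodes[0].left]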

wphicks commented 3 years ago

A brief update on profiling and where we're at with this:

Profiling Setup

I'm currently profiling the single-GPU case only; I'll be looking at the MNMG class independently later. All reported results below are for randomly-generated (but consistent) data with 100,000 samples and 20 features. An RF classifier is trained with 300 trees, and prediction is run once on the same data used for training. I've done some investigation with other parameters, but I will stick to reporting these (apparently fairly representative) results unless otherwise noted.

The relevant method for this issue is _obtain_treelite_handle, and it is indeed the single greatest contributor to runtime for our predict call (the other major contributor being load_using_treelite_handle). Results below are all focused on _obtain_treelite_handle for the moment, though we may also investigate load_using_treelite_handle shortly.

Treelite mainline Results

On current Treelite mainline, _obtain_treelite_handle took about 68 seconds of the 79 seconds used for prediction, with almost the entirety of the remaining 11 seconds going to load_using_treelite_handle. Within _obtain_treelite_handle, the following methods had the greatest overall contribution to runtime:

For the moment, we will ignore the deletion method. Breaking the other methods down further, it became clear that unordered_map methods were by far the greatest contributor to overall runtime across all of these methods. Not only is unordered_map inherently slow due to its linked-list bucket implementation, but some safety checks were being performed redundantly, causing the map to find an entry and then immediately find it again rather than performing both the check and retrieval together. The most problematic use of unordered_map is for mapping node ids to nodes, but there is also a small (but surprisingly significant) slowdown due to its use for mapping operator names to operators via the optable variable.

Treelite POC with open-addressing/local-probing hash table

With this profiling data available, I created a proof-of-concept PR for Treelite that moves from an unordered_map to FastMap, a hash table with open addressing and linear probing, for tracking the node id mappings. The inefficient double-lookups have not been eliminated yet because my iterator implementation for FastMap is inefficient enough that the double-lookups are actually cheaper than using the iterator. This can be improved with later refinements. I also shifted from a table to a simple lookup function for optable.

Reviewing the same data presented for mainline, we have:

And for the low-level breakdown:

The moral of the story is that linked lists are evil, and hence so are stdlib maps ;). The overall runtime for _obtain_treelite_handle was approximately 38 s, for a total speedup of about 1.79x.

Still to be done

With profiling results from the POC, we see that TreeliteTreeBuilderCreateValue is now the worst bottleneck, followed by TreeliteTreeBuilderCreateNode. This seems to be the result of performing many small memory allocations. I will investigate this further along with the possibility of other larger structural changes that would either eliminate the need to pass through Treelite or do a faster Treelite build by leveraging the stronger guarantees of node ordering provided by our RF conversion step.

wphicks commented 3 years ago

Another update based on the same profiling setup. My general approach has been to develop cheap/quick PoCs exploring different possible avenues for conversion speedup and to use them to guide further exploration and long-term development decisions. A quick summary of a few areas of investigation and the runtime for a single prediction (predict call) with the corresponding PoC:

EDIT: Removing earlier results based on build against incorrect Treelite library to avoid confusion.

Based on offline discussion today, I'll be looking at parallelization of the TL->FIL conversion and ensuring that at least the FastMap + RF->TL parallelization + TL->FIL pre-allocation approach makes it into 0.18. I'll then turn my attention to migrating the direct RF->FIL conversion from a PoC to a more robust and optimized implementation. This may make it into 0.18 as an experimental feature, but it may also get pushed back to 0.19.

JohnZed commented 3 years ago

Closed by #3395. There is more room for optimization, but this was by far the most important speedup we needed.

wphicks commented 3 years ago

The brief version of the final result: we obtained about a 21.23x speedup relative to baseline for the parameters described above. I'll do one more "matrix" of runs with a variety of tree depths, numbers of features, etc., and post that table here for final comparison.