Open pseudotensor opened 3 years ago
A related problem is that on a different system with 2 GPUs, I have the opposite problem. The explainer part, doing the same work as on the other system, uses only 100% of 1 core and very little GPU during this time.
I have the explainer work on only about 100 rows, so it's very minimal work. I don't understand why one system would behave so differently from the other.
The fast system with uncontrollable CPU usage during the explainer is an i9 system with a 2080 GPU and 128GB of memory.
The one that is super slow, using only 100% of 1 core with very little GPU usage (like 1%-10%, jumping back and forth), is a 1080ti system with 2 GPUs, but it is only running one explainer at a time and is still very slow. This system is a dual Xeon with 256GB of memory.
Both use exactly the same installation of conda-based rapids and run Ubuntu 18.04.
https://github.com/rapidsai/cuml/issues/4046#issue-941656089 has some more installation details, although now I'm running both on rapids nightly as of 2 days ago.
Here's a repro of the slow case. It uses only 100% of 1 core for 99% of the explainer time, and no more than 10% GPU on the 2080ti and no more than 5% on the 1080ti.
import pickle
import time
# slowshap.pkl (linked below) contains the model class, hyperparameters, and data splits
model_class, params, X, y, X_s, y_s, valid_X_s, valid_y_s = pickle.load(open("slowshap.pkl", "rb"))
model = model_class(**params)
print(params)
print("X shape: %s" % str(X.shape))
t0 = time.time()
model.fit(X, y)
t1 = time.time()
print("fit duration: %g" % (t1 - t0))
from cuml.explainer import PermutationExplainer # ignore import error, just no init in cuml for explainer
t0 = time.time()
cu_explainer = PermutationExplainer(model=model.predict_proba, data=X_s)
print("X_s shape: %s" % str(X_s.shape))
print("valid_X_s shape: %s" % str(valid_X_s.shape))
cu_shap_values = cu_explainer.shap_values(valid_X_s, npermutations=3)
t1 = time.time()
print("explainer duration: %g" % (t1 - t0))
https://0xdata-public.s3.amazonaws.com/jon/slowshap.pkl.zip
gives
(base) jon@pseudotensor:~/$ time python slowimp3.py
/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py:794: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
return func(**kwargs)
/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py:794: UserWarning: The 'use_experimental_backend' parameter is deprecated and has no effect. It will be removed in 21.10 release.
return func(**kwargs)
/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py:794: UserWarning: The 'split_algo' parameter is deprecated and has no effect. It will be removed in 21.10 release.
return func(**kwargs)
{'handle': <cuml.raft.common.handle.Handle object at 0x7f050d264d70>, 'verbose': 2, 'output_type': 'numpy', 'n_estimators': 99, 'max_depth': 17, 'max_features': 'auto', 'n_bins': 157, 'split_algo': None, 'split_criterion': 1, 'min_samples_leaf': 334, 'min_samples_split': 46, 'min_impurity_decrease': 0.00033165436600697263, 'bootstrap': True, 'max_samples': 1.0, 'max_leaves': 286, 'accuracy_metric': None, 'use_experimental_backend': None, 'max_batch_size': 128, 'n_streams': 4, 'dtype': dtype('float32'), 'min_weight_fraction_leaf': None, 'n_jobs': None, 'max_leaf_nodes': None, 'min_impurity_split': None, 'oob_score': None, 'random_state': 883507612, 'warm_start': None, 'class_weight': None, 'criterion': None}
X shape: (91457, 162)
fit duration: 0.769819
X_s shape: (100, 162)
valid_X_s shape: (100, 162)
explainer duration: 110.549
88.28user 27.75system 1:54.29elapsed 101%CPU (0avgtext+0avgdata 2459904maxresident)k
0inputs+0outputs (0major+30911313minor)pagefaults 0swaps
This is quite odd given other cases with similar data and similar model parameters where, on the same system, the explainer is much faster, uses more GPU, and uses 800% CPU (i.e. 100% of 8 cores).
I can't see anything specific about the data or model parameters in this case that would lead to such a slowdown. There are only 99 trees, the leaves have a large number of samples, there are very few leaves, etc. It doesn't make sense.
This isn't just a one-off case. Once certain hyperparameters and data are hit, it seems to get into this state. So I'm sure it's something about the data's very specific details (again, the same data in a slightly different form is not slow) or very specific hyperparameters (randomly sweeping hyperparameters doesn't often hit this problem).
Using KernelExplainer is no better. It has the same behavior: very poor GPU usage on the order of 10% and only 100% of 1 core.
If I mess with nsamples in KernelExplainer, maybe something is enlightening. I see many warnings like:
/home/jon/minicondadai_py38/lib/python3.8/site-packages/sklearn/linear_model/_least_angle.py:615: ConvergenceWarning: Regressors in active set degenerate. Dropping a regressor, after 13 iterations, i.e. alpha=2.146e-05, with an active set of 13 regressors, and the smallest cholesky pivot element being 2.220e-16. Reduce max_iter or increase eps parameters.
when using
import pickle
import time
model_class, params, X, y, X_s, y_s, valid_X_s, valid_y_s = pickle.load(open("slowshap.pkl", "rb"))
model = model_class(**params)
print(params)
print("X shape: %s" % str(X.shape))
t0 = time.time()
model.fit(X, y)
t1 = time.time()
print("fit duration: %g" % (t1 - t0))
t0 = time.time()
#from cuml.explainer import PermutationExplainer # ignore import error, just no init in cuml for explainer
#cu_explainer = PermutationExplainer(model=model.predict_proba, data=X_s)
from cuml.explainer import KernelExplainer
cu_explainer = KernelExplainer(model=model.predict_proba, data=X_s, nsamples=100)
print("X_s shape: %s" % str(X_s.shape))
print("valid_X_s shape: %s" % str(valid_X_s.shape))
cu_shap_values = cu_explainer.shap_values(valid_X_s)#, npermutations=1)
t1 = time.time()
print("explainer duration: %g" % (t1 - t0))
This is all during the slow usage.
It seems like cuML is using the sklearn linear model instead of its own cuML linear model.
Only at the very end does it seem like the cuML linear model is used. E.g. if I run the above, it complains that:
Traceback (most recent call last):
File "slowimp3.py", line 21, in <module>
cu_shap_values = cu_explainer.shap_values(valid_X_s)#, npermutations=1)
File "cuml/explainer/kernel_shap.pyx", line 286, in cuml.explainer.kernel_shap.KernelExplainer.shap_values
File "cuml/explainer/base.pyx", line 283, in cuml.explainer.base.SHAPBase._explain
File "cuml/explainer/kernel_shap.pyx", line 423, in cuml.explainer.kernel_shap.KernelExplainer._explain_single_observation
File "cuml/explainer/kernel_shap.pyx", line 656, in cuml.explainer.kernel_shap._weighted_linear_regression
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/linear_model/linear_regression.pyx", line 241, in cuml.linear_model.linear_regression.LinearRegression.fit
That is, it appears that, at least for the KernelExplainer, most of the time is spent in a CPU linear model algorithm, not on the GPU.
I don't understand this error. valid_X_s is of course not just one column, as printed out above. So maybe this is a separate problem.
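One way to confirm where the time goes is a minimal sketch against the KernelExplainer reproducer above, using only the standard-library profiler (nothing cuML-specific is assumed here):

import cProfile
import pstats

# Profile just the explainer call from the KernelExplainer reproducer above,
# to confirm whether the time is going into sklearn's least-angle regression
# on the CPU rather than into GPU work.
prof = cProfile.Profile()
prof.enable()
cu_shap_values = cu_explainer.shap_values(valid_X_s)
prof.disable()
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)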
Going back to the permutation case, if I gdb attach to the slow process, then I see it in:
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa8e8a1335a in ML::fil::dense_forest::init(raft::handle_t const&, ML::fil::dense_node const*, ML::fil::forest_params_t const*, std::vector<float, std::allocator<float> > const&) ()
---Type <return> to continue, or q <return> to quit---
No matter how many times I gdb attach to the process, it's here. So it seems like there is some major overhead initializing things, I guess.
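If the cost really is repeated forest initialization, one workaround I have not verified would be to build the FIL forest once and hand that to the explainer instead of model.predict_proba. This assumes the cuML RF classifier exposes convert_to_fil_model() and that the returned FIL object has predict_proba, as named here:

# Unverified sketch: convert the fitted RF to a FIL model once, so the
# explainer's repeated predict calls don't pay dense_forest::init each time.
# Assumes convert_to_fil_model()/predict_proba() exist as named here.
fil_model = model.convert_to_fil_model(output_class=True)

cu_explainer = PermutationExplainer(model=fil_model.predict_proba, data=X_s)
cu_shap_values = cu_explainer.shap_values(valid_X_s, npermutations=3)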
Changing things like:
1) "output_type" to None (I was worried about internal switching) doesn't help.
2) n_bins to 32 doesn't help.
3) min_samples_leaf from 334 to 1 definitely helps a lot: time goes from 36s to 1s. Specifically, above a value of about 200, things get much slower. E.g. 100 still only takes 1 second for the explainer, but 200 takes 15s and 300 takes 30s. So the time seems to grow linearly once the value is above about 100 (see the timing sketch below).
4) min_impurity_decrease to 0 doesn't help.
5) max_leaves from 286 to -1 doesn't help.
So it seems there is a flaw in the explainer (not fit) path, with some kind of initialization overhead when min_samples_leaf is large. This seems like something that should be fixed.
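For reference, here is a minimal timing sweep over min_samples_leaf against the same pickled reproducer (only the sweep loop is new; absolute timings will differ with npermutations):

import pickle
import time

from cuml.explainer import PermutationExplainer

# Sweep min_samples_leaf and time only the explainer step, to show the
# roughly linear growth in explainer time once the value exceeds ~100.
model_class, params, X, y, X_s, y_s, valid_X_s, valid_y_s = pickle.load(open("slowshap.pkl", "rb"))

for msl in (1, 100, 200, 300):
    p = dict(params, min_samples_leaf=msl)
    model = model_class(**p)
    model.fit(X, y)
    explainer = PermutationExplainer(model=model.predict_proba, data=X_s)
    t0 = time.time()
    explainer.shap_values(valid_X_s, npermutations=3)
    print("min_samples_leaf=%d explainer duration: %g" % (msl, time.time() - t0))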
Another related problem is that, as you can see, I'm passing X_s to the explainer itself. This is because, for some reason, if I pass X then I get a GPU OOM, even though the fit has no problem.
i.e.
import pickle
import time
model_class, params, X, y, X_s, y_s, valid_X_s, valid_y_s = pickle.load(open("slowshap.pkl", "rb"))
params['min_samples_leaf'] = 200
model = model_class(**params)
print(params)
print("X shape: %s" % str(X.shape))
t0 = time.time()
model.fit(X, y)
t1 = time.time()
print("fit duration: %g" % (t1 - t0))
t0 = time.time()
from cuml.explainer import PermutationExplainer # ignore import error, just no init in cuml for explainer
cu_explainer = PermutationExplainer(model=model.predict_proba, data=X)  # note: full X as background data, unlike X_s above
print("X_s shape: %s" % str(X_s.shape))
print("valid_X_s shape: %s" % str(valid_X_s.shape))
cu_shap_values = cu_explainer.shap_values(valid_X_s, npermutations=1)
t1 = time.time()
print("explainer duration: %g" % (t1 - t0))
fails with
Traceback (most recent call last):
File "slowimp3.py", line 27, in <module>
cu_shap_values = cu_explainer.shap_values(valid_X_s, npermutations=1)
File "cuml/explainer/permutation_shap.pyx", line 250, in cuml.explainer.permutation_shap.PermutationExplainer.shap_values
File "cuml/explainer/base.pyx", line 274, in cuml.explainer.base.SHAPBase._explain
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cupy/_creation/basic.py", line 209, in zeros
a = cupy.ndarray(shape, dtype, order=order)
File "cupy/_core/core.pyx", line 164, in cupy._core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.alloc
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/rmm/rmm.py", line 212, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
File "rmm/_lib/device_buffer.pyx", line 84, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /home/jon/minicondadai_py38/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
I find this very strange, since it is the same data the model was originally fitted on.
I feel like I should pass X since that was the original training data, but it is not clear from the API what the implications of not passing X are.
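My understanding (from mainline shap semantics, not verified against cuML's docs) is that data is only a background/reference set used to integrate out features, so it does not have to be the full training matrix; a small subset sidesteps the OOM. A sketch continuing from the reproducer above, with the 100-row choice being arbitrary:

# Workaround sketch: use a small background subset instead of the full X.
# Assumes (as in mainline shap) that `data` is only a reference set used to
# simulate missing features, not something that must match the training matrix.
background = X[:100]  # or a random sample / summary of X

from cuml.explainer import PermutationExplainer
cu_explainer = PermutationExplainer(model=model.predict_proba, data=background)
cu_shap_values = cu_explainer.shap_values(valid_X_s, npermutations=1)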
Thanks for the issue and reproducers/models @pseudotensor, all of which are extremely helpful!
I did a quick run with the pickled model and can see some odd behavior on a 2070 Super (we’ll also be trying other systems and more debugging) so we’ll be diagnosing and solving these issues in the near future.
Besides debugging, one important aspect I wanted to mention (and the reason I hadn't tested RF/FIL much with the current first version of the explainers in cuML) is that GPUTreeSHAP will be orders of magnitude faster with better memory usage, no matter how many optimizations are done to the black-box model explainers (it is the same GPU implementation present in the mainline SHAP package as well as in XGBoost). If I'm not mistaken, all or almost all the needed components are now in place in cuML's RF to support GPUTreeSHAP, so I will triage with the team to add the support soon, and that will work much better for your use case.
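For reference, a rough sketch of what that path looks like today through XGBoost, which already ships the GPUTreeSHAP kernels (the hyperparameters and objective below are placeholders, not tuned for this dataset):

import xgboost as xgb

# Sketch of the existing GPUTreeSHAP path via XGBoost (the same underlying GPU
# kernels that cuML RF support would use). X, y, valid_X_s are the arrays from
# the reproducer above; hyperparameters/objective are placeholders.
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(
    {"tree_method": "gpu_hist", "predictor": "gpu_predictor",
     "objective": "binary:logistic", "max_depth": 8},
    dtrain,
    num_boost_round=100,
)
# pred_contribs=True returns one SHAP value per feature plus a bias column,
# computed on the GPU by GPUTreeSHAP.
shap_vals = booster.predict(xgb.DMatrix(valid_X_s), pred_contribs=True)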
@dantegd ,
Yes, I was going to ask about that. If GPUTreeSHAP could support cuML RF, that would be great and a sensible way forward.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
After a cuML random forest is fitted, one can use the explainer to get Shapley values:
https://docs.rapids.ai/api/cuml/stable/api.html#cuml.explainer.PermutationExplainer
As:
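from cuml.explainer import PermutationExplainer  # same call pattern as in the reproducers above

cu_explainer = PermutationExplainer(model=model.predict_proba, data=X)
cu_shap_values = cu_explainer.shap_values(X)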
where model was just fit on the GPU and is on the GPU, and X is GPU data. During fit, about 200% CPU (i.e. 2 full cores) is used, which is OK even though it can't be controlled, since it is fixed, finite, and small. E.g. xgboost uses only 100% of 1 core during GPU fitting.
However, I find that during the explainer run, while the GPU is used a bit, all CPU cores are used as well. There appears to be no way to control this, unlike with sklearn algorithms.
This is a major problem because if one has more than 1 GPU, each process competes and slows down the others due to overloading the system. So I see a significant slowdown even just running 2 GPUs.
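The only workaround I can think of is capping the process's CPU thread pools from the outside (a sketch; threadpoolctl is a separate package, and whether cuML's explainer actually routes its CPU work through these pools is exactly what is unclear):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before cuml is imported

from threadpoolctl import threadpool_limits
from cuml.explainer import PermutationExplainer

# model and X are the fitted cuML RF and the GPU data from above.
with threadpool_limits(limits=1):
    cu_explainer = PermutationExplainer(model=model.predict_proba, data=X)
    cu_shap_values = cu_explainer.shap_values(X)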
This is where the code is during that time: