Open ondrejhavlicek opened 5 years ago
Same issue
I am not a developer on this or any quantile regression package, but I've looked at the source code for both scikit-garden and quantRegForest/ranger, and I have a good idea of why the R versions are so much faster:
The basic idea of the skgarden `predict` function is to save all the `y_train` values corresponding to all of the leaves. Then, when predicting a new sample, you gather the relevant leaves and corresponding `y_train` values and compute the (weighted) quantile of that array. The R versions take a shortcut: they save only a single, randomly chosen `y_train` value per leaf node. This has two advantages. First, it makes gathering the relevant `y_train` values a lot simpler, since there is always exactly one value in every leaf node. Second, it makes the quantile calculation a lot simpler, since every leaf has exactly the same weight.
Since you only use a single (random) value per leaf instead of all of them, this is an approximation method. In my experience, if you have enough trees (at least 50-100 or so), this has very little effect on the result. However, I don't know enough about the math to say exactly how good the approximation is.
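To get a feel for the approximation, here is a toy simulation (not the actual forest code, just a hypothetical setup with synthetic "leaves"): it compares a quantile computed from all values in a set of leaves against one computed from a single randomly chosen value per leaf. With a couple hundred trees, the two estimates tend to land close together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 200 "trees", each contributing one leaf that
# holds between 5 and 19 training targets drawn from the same distribution.
n_trees = 200
leaves = [rng.normal(size=rng.integers(5, 20)) for _ in range(n_trees)]

# skgarden-style estimate: pool all training values from all relevant leaves.
all_values = np.concatenate(leaves)
exact_q = np.quantile(all_values, 0.9)

# R-style approximation: one randomly chosen value per leaf.
random_per_leaf = np.array([rng.choice(v) for v in leaves])
approx_q = np.quantile(random_per_leaf, 0.9)

print(exact_q, approx_q)  # the two estimates should be close
```

This is only a sketch of the statistical idea; in the real forests the leaves for a given test sample come from the trained trees, not from a random generator.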
Below is an implementation of the simpler R method of quantile prediction for a RandomForestQuantileRegressor model. Note that the first half of the function is the (one-time) process of selecting a random `y_train` value per leaf. If the author were to implement this method in skgarden, they would logically move this part to the `fit` method, leaving only the last 6 or so lines, which makes for a much faster `predict` method. Also, in my example I am using quantiles from 0 to 1, instead of from 0 to 100.
```python
import numpy as np

def predict_approx(model, X_test, quantiles=[0.05, 0.5, 0.95]):
    """
    Function to predict quantiles much faster than the default skgarden method.
    This is the same method that the ranger and quantRegForest packages in R use.
    Output is (n_samples, n_quantiles), or (n_samples,) if a scalar is given as quantiles.
    """
    # Begin one-time calculation of random_values. This only depends on model, so it could be saved.
    n_leaves = np.max(model.y_train_leaves_) + 1  # leaves run from 0 to max(leaf_number)
    random_values = np.zeros((model.n_estimators, n_leaves))
    for tree in range(model.n_estimators):
        for leaf in range(n_leaves):
            train_samples = np.argwhere(model.y_train_leaves_[tree, :] == leaf).reshape(-1)
            if len(train_samples) == 0:
                random_values[tree, leaf] = np.nan
            else:
                train_values = model.y_train_[train_samples]
                random_values[tree, leaf] = np.random.choice(train_values)
    # Optionally, save random_values as a model attribute for reuse later.

    # For each sample, get the random leaf values from all the leaves it lands in.
    X_leaves = model.apply(X_test)
    leaf_values = np.zeros((X_test.shape[0], model.n_estimators))
    for i in range(model.n_estimators):
        leaf_values[:, i] = random_values[i, X_leaves[:, i]]
    # For each sample, calculate the quantiles of the leaf_values.
    return np.quantile(leaf_values, np.array(quantiles), axis=1).transpose()
```
@fvermeij I tried the same implementation myself; the results are much worse than with the standard procedure. The original R implementation was as described in the paper, but the latest release is radically different (with the random-sample approach). My guess is that the R implementation has the same limits as the Python one when you have to deal with "not so small" datasets. If you try to replicate the results in the paper with the most recent R implementation, you will unfortunately get much worse results.
@fennovj Your code worked. It is very fast. Thanks a lot for sharing it. In the same way, could you please write code for quantile XGBoost regression and quantile gradient boosting regression?
@fennovj how can I replace `model.y_train_leaves_` and `model.y_train_`? I am trying out this function on the scikit-learn RandomForestRegressor model, which does not have the attributes mentioned above. Thank you in advance.
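One way to adapt the idea to a plain scikit-learn RandomForestRegressor (which lacks `y_train_leaves_` and `y_train_`) is to recover the leaf assignments yourself with `model.apply(X_train)`. The sketch below is an assumption-laden adaptation, not part of any library: the helper names `fit_leaf_values` and `predict_approx_sklearn` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def fit_leaf_values(model, X_train, y_train, seed=0):
    """One-time step: pick one random y_train value per leaf of every tree.

    Returns a list (one entry per tree) of arrays indexed by leaf node id.
    Leaf assignments are recovered via model.apply, since plain sklearn
    forests do not store y_train_leaves_ / y_train_.
    """
    rng = np.random.default_rng(seed)
    train_leaves = model.apply(X_train)  # shape (n_samples, n_estimators)
    leaf_values = []
    for t in range(model.n_estimators):
        n_nodes = model.estimators_[t].tree_.node_count
        values = np.full(n_nodes, np.nan)  # non-leaf nodes stay NaN
        for leaf in np.unique(train_leaves[:, t]):
            members = y_train[train_leaves[:, t] == leaf]
            values[leaf] = rng.choice(members)
        leaf_values.append(values)
    return leaf_values

def predict_approx_sklearn(model, leaf_values, X_test, quantiles=(0.05, 0.5, 0.95)):
    """Fast approximate quantile prediction, one random value per leaf."""
    X_leaves = model.apply(X_test)  # shape (n_test, n_estimators)
    per_tree = np.stack(
        [leaf_values[t][X_leaves[:, t]] for t in range(model.n_estimators)],
        axis=1,
    )
    return np.quantile(per_tree, np.array(quantiles), axis=1).T

# Toy usage on synthetic data (not from the original thread):
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=0).fit(X, y)
leaf_values = fit_leaf_values(rf, X, y)
preds = predict_approx_sklearn(rf, leaf_values, X)  # shape (300, 3)
```

Since test samples only ever land in leaf nodes, the NaN entries for internal nodes are never looked up at prediction time.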
FYI for those facing similar issues and looking specifically for a QRF implementation that is actively maintained, optimized, and compatible with the latest versions of Python and scikit-learn, consider checking out quantile-forest. The models in the package extend the scikit-learn ForestRegressor and by default use a similar optimization to that suggested above.
Prediction of quantiles for a few thousand new records (3000 rows, 3 quantiles, 41 predictors) using a RandomForestQuantileRegressor (e.g. `n_estimators=50, min_samples_split=10, min_samples_leaf=10, n_jobs=4`) takes an unreasonable amount of time, something like 6 minutes. The same operation takes at most a few seconds in the R implementation written by the author of the Quantile Regression Forests paper.
Other users have reported this and done a benchmark: https://stackoverflow.com/questions/51483951/quantile-random-forests-from-scikit-garden-very-slow-at-making-predictions
This is such a great package, so it is a real shame it is not usable in production. Can something be done about it, e.g. porting the logic from the R code?