scikit-garden / scikit-garden

A garden for scikit-learn compatible trees
http://scikit-garden.github.io/

Very slow QRF prediction #74

Open ondrejhavlicek opened 5 years ago

ondrejhavlicek commented 5 years ago

Prediction of quantiles for a few thousand new records (3000 rows, 3 quantiles, 41 predictors) using a RandomForestQuantileRegressor (e.g. n_estimators=50, min_samples_split=10, min_samples_leaf=10, n_jobs=4) takes an unreasonable amount of time, something like 6 minutes.

This same operation takes a few seconds at most in the R implementation written by the author of the Quantile Regression Forests paper.

Other users have reported this and done a benchmark: https://stackoverflow.com/questions/51483951/quantile-random-forests-from-scikit-garden-very-slow-at-making-predictions

This is such a great package that it is a real shame it is not usable in production. Can something be done about it, e.g. by porting the logic from the R code?

lmssdd commented 5 years ago

Same issue

fennovj commented 5 years ago

I am not a developer on this or any quantile regression package, but I've looked at the source code for both scikit-garden and quantRegForest/ranger, and I have a good idea of why the R versions are so much faster:

Explanation of the difference between the skgarden and quantRegForest/ranger methods

The basic idea of the skgarden predict function is to save all the y_train values corresponding to all of the leaves. Then, when predicting a new sample, you gather the relevant leaves and their corresponding y_train values and compute the (weighted) quantile of that array. The R versions take a shortcut: they save only a single, randomly chosen y_train value per leaf node. This has two advantages: first, gathering the relevant y_train values is much simpler, since there is always exactly one value per leaf node; second, the quantile calculation is much simpler, since every leaf has exactly the same weight.

Since you only use a single (random) value per leaf instead of all of them, this is an approximation method. In my experience, if you have enough trees, (at least 50-100 or so), this has very little effect on the result. However, I don't know enough about the math to say how good the approximation is exactly.
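
To make the cost concrete, here is a rough sketch of what the default approach amounts to (my own illustrative re-implementation, not the actual skgarden code; it assumes the y_train_ and y_train_leaves_ attributes that skgarden models expose):

import numpy as np

def predict_default_sketch(model, X_test, quantile=0.5):
    """
    Sketch of the default skgarden-style prediction: every y_train value in
    every matched leaf contributes, weighted by 1 / leaf size so that each
    tree carries equal total weight. The per-sample gather over all trees is
    what makes this slow for large test sets.
    """
    X_leaves = model.apply(X_test)  # shape (n_samples, n_estimators)
    preds = np.empty(X_test.shape[0])
    for i in range(X_test.shape[0]):
        values, weights = [], []
        for t in range(model.n_estimators):
            leaf_y = model.y_train_[model.y_train_leaves_[t, :] == X_leaves[i, t]]
            values.append(leaf_y)
            weights.append(np.full(leaf_y.shape, 1.0 / max(leaf_y.size, 1)))
        values, weights = np.concatenate(values), np.concatenate(weights)
        # Weighted empirical quantile: sort the values and find where the
        # cumulative weight first reaches the requested quantile.
        order = np.argsort(values)
        cum_weights = np.cumsum(weights[order]) / weights.sum()
        idx = min(np.searchsorted(cum_weights, quantile), values.size - 1)
        preds[i] = values[order][idx]
    return preds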

TL;DR: how to make skgarden predict faster

Below is an implementation of the simpler R-style method of quantile prediction for a RandomForestQuantileRegressor model. Note that the first half of the function is the (one-time) process of selecting a random y_train value per leaf. If this method were implemented in skgarden, that part would logically move to the fit method, leaving only the last few lines and making for a much faster predict method. Also, in my example I use quantiles from 0 to 1 instead of from 0 to 100.

import numpy as np

def predict_approx(model, X_test, quantiles=[0.05, 0.5, 0.95]):
    """
    Function to predict quantiles much faster than the default skgarden method
    This is the same method that the ranger and quantRegForest packages in R use
    Output is (n_samples, n_quantiles) or (n_samples, ) if a scalar is given as quantiles
    """
    # Begin one-time calculation of random_values. This only depends on model, so could be saved.
    n_leaves = np.max(model.y_train_leaves_) + 1  # leaves run from 0 to max(leaf_number)
    random_values = np.zeros((model.n_estimators, n_leaves))
    for tree in range(model.n_estimators):
        for leaf in range(n_leaves):
            train_samples = np.argwhere(model.y_train_leaves_[tree, :] == leaf).reshape(-1)
            if len(train_samples) == 0:
                random_values[tree, leaf] = np.nan
            else:
                train_values = model.y_train_[train_samples]
                random_values[tree, leaf] = np.random.choice(train_values)
    # Optionally, save random_values as a model attribute for reuse later

    # For each sample, get the random leaf values from all the leaves they land in
    X_leaves = model.apply(X_test)
    leaf_values = np.zeros((X_test.shape[0], model.n_estimators))
    for i in range(model.n_estimators):
        leaf_values[:, i] = random_values[i, X_leaves[:, i]]

    # For each sample, calculate the quantiles of the leaf_values
    return np.quantile(leaf_values, np.array(quantiles), axis=1).transpose()
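
A minimal usage sketch (assuming skgarden is installed and that X_train, y_train, X_test are arrays you already have):

from skgarden import RandomForestQuantileRegressor

model = RandomForestQuantileRegressor(n_estimators=100, min_samples_leaf=10, n_jobs=4)
model.fit(X_train, y_train)
preds = predict_approx(model, X_test, quantiles=[0.05, 0.5, 0.95])  # shape (n_samples, 3)
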
lmssdd commented 5 years ago

@fvermeij I tried the same implementation myself, and the results are much worse than with the standard procedure. The original R implementation was as described in the paper; the latest release is radically different (it uses the random-sample approach). My guess is that the R implementation has the same limits as the Python one when you have to deal with not-so-small datasets. If you try to replicate the results in the paper with the most recent R implementation, you will unfortunately get much worse results.

nikhilteja91 commented 2 years ago

@fennovj Your code worked, and it is very fast. Thanks a lot for sharing it. In the same way, could you please write code for quantile XGBoost regression and quantile gradient boosting regression?
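
For the gradient boosting half of that request, scikit-learn's GradientBoostingRegressor supports the quantile (pinball) loss directly, so no leaf-sampling trick is needed; a minimal sketch, fitting one model per quantile (X_train, y_train, X_test assumed):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

quantiles = [0.05, 0.5, 0.95]
models = [
    GradientBoostingRegressor(loss="quantile", alpha=q, n_estimators=100).fit(X_train, y_train)
    for q in quantiles
]
# Stack the per-quantile predictions into shape (n_samples, n_quantiles)
preds = np.stack([m.predict(X_test) for m in models], axis=1)

Recent XGBoost releases also ship a built-in quantile objective (reg:quantileerror); check the documentation for the version you use.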

tompetrossiann commented 2 years ago

@fennovj How can I replace model.y_train_leaves_ and model.y_train_? I am trying out this function on the scikit-learn RandomForestRegressor model, which does not have the attributes mentioned above. Thank you in advance.
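
One possible workaround, sketched under the assumption that you still have the original training data at hand (plain scikit-learn forests do not store it): both attributes can be rebuilt from a fitted RandomForestRegressor with apply(). Note this ignores skgarden's handling of the per-tree bootstrap, so the results will differ slightly:

import numpy as np

# Hypothetical stand-ins for the skgarden attributes, rebuilt from a fitted
# sklearn.ensemble.RandomForestRegressor `model` and the original training data:
y_train_ = np.asarray(y_train)
y_train_leaves_ = model.apply(X_train).T  # shape (n_estimators, n_train_samples)

You would then modify predict_approx to take these two arrays as arguments instead of reading them from the model.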

reidjohnson commented 7 months ago

FYI for those facing similar issues and looking specifically for a QRF implementation that is actively maintained, optimized, and compatible with the latest versions of Python and scikit-learn, consider checking out quantile-forest. The models in the package extend the scikit-learn ForestRegressor and by default use a similar optimization to that suggested above.
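
A minimal usage sketch based on the quantile-forest documentation (X_train, y_train, X_test assumed):

from quantile_forest import RandomForestQuantileRegressor

qrf = RandomForestQuantileRegressor(n_estimators=100, n_jobs=4)
qrf.fit(X_train, y_train)
preds = qrf.predict(X_test, quantiles=[0.05, 0.5, 0.95])  # shape (n_samples, 3)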