rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Distributed Random Forest RMSE much worse than single GPU Random Forest on pseudo-randomly generated data #4429

Open · beckernick opened this issue 2 years ago

beckernick commented 2 years ago

Comparing multi-GPU dask.xgboost regressor training with single GPU xgboost training on the same sample dataset, I generally get similar RMSE results if I use the same number of boosting rounds. This doesn't entirely surprise me, as my understanding is that each gradient update forces a sync across workers.

However, when comparing multi-GPU random forest regressor training with standard single GPU random forest regressor training on the same sample dataset, I generally get significantly better results from the single GPU estimator when using the same total number of trees and what I believe to be the same configuration (max_depth defaults to -1 for cuml.dask.ensemble.RandomForestRegressor, so I set it to 1000 in the single GPU test).

Given the distributed RF implementation (embarrassingly parallel tree construction, with each worker building trees from the portion of the data it holds locally), is this expected behavior? Does the per-worker data handling differ from how XGBoost handles it?
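
For context, a rough back-of-the-envelope of the per-tree data budget in the example below, assuming each distributed tree is grown only from its worker's local partition (my reading, not verified against the implementation):

# Back-of-the-envelope for the reproducer below, assuming each distributed
# tree only ever samples from its worker's local partition (my reading of
# the docs, not verified against the implementation).
n_samples = 100_000
n_workers = 2          # CUDA_VISIBLE_DEVICES="0,1"
n_estimators = 1000

rows_per_distributed_tree = n_samples // n_workers   # 50,000 local rows
rows_per_single_gpu_tree = n_samples                 # bootstrap over all rows
trees_per_worker = n_estimators // n_workers         # 500

print(rows_per_distributed_tree, rows_per_single_gpu_tree, trees_per_worker)
# 50000 100000 500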

With pseudo-randomly generated data, I wouldn't initially expect data skew/ordering to be significant here.

This result can be reproduced with the following example:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import cuml
from cuml.dask.datasets import make_regression
from cuml.dask.ensemble import RandomForestRegressor
import xgboost as xgb

dxgb_gpu_params = {
    "tree_method": "gpu_hist",
    "objective": "reg:squarederror"
}

def create_data():
    X, y = make_regression(
        n_samples=100000,
        n_features=10,
        n_informative=10,
        n_parts=10
    )
    return X, y

def fit_predict_rmse_xgb(client, X, y, use_dask=True):
    # Train with either the distributed dask XGBoost API or the single-GPU
    # sklearn wrapper, then compute RMSE on the training data.
    if use_dask:
        dtrain = xgb.dask.DaskDMatrix(
            client, X, y
        )

        bst = xgb.dask.train(
            client, dxgb_gpu_params, dtrain,
            num_boost_round=1000,
        )

        preds = xgb.dask.predict(client, bst, X).compute()
    else:
        clf = xgb.sklearn.XGBRegressor(
            tree_method="gpu_hist",
            n_estimators=1000
        )
        clf.fit(X.compute(), y.compute())
        preds = clf.predict(X.compute())

    rmse = cuml.metrics.mean_squared_error(
        y.compute(),
        preds
    ) ** 0.5
    return rmse

def fit_predict_rmse_rf(client, X, y, use_dask=True):
    # Train either the multi-GPU dask Random Forest or the single-GPU
    # cuml Random Forest, then compute RMSE on the training data.
    if use_dask:
        clf = RandomForestRegressor(
            n_estimators=1000,
            n_streams=1,
            ignore_empty_partitions=True
        )

        clf.fit(X, y)
        preds = clf.predict(X).compute()
    else:
        clf = cuml.ensemble.RandomForestRegressor(
            n_estimators=1000,
            max_depth=1000,
        )
        clf.fit(X.compute(), y.compute())
        preds = clf.predict(X.compute())

    rmse = cuml.metrics.mean_squared_error(
        y.compute(),
        preds
    ) ** 0.5
    return rmse

if __name__ == "__main__":
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1", dashboard_address=':8889')
    client = Client(cluster)

    X, y = create_data()
    X, y = cuml.dask.common.utils.persist_across_workers(
        client, [X, y]
    )

    print("XGBoost Comparison:")
    print(fit_predict_rmse_xgb(client, X, y, use_dask=True))
    print(fit_predict_rmse_xgb(client, X, y, use_dask=False))
    print("Random Forest Comparison:")
    print(fit_predict_rmse_rf(client, X, y, use_dask=True))
    print(fit_predict_rmse_rf(client, X, y, use_dask=False))
$ python rf-vs-xgb-test.py
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
XGBoost Comparison:
[23:06:30] task [xgboost.dask]:tcp://127.0.0.1:45529 got new rank 0
[23:06:30] task [xgboost.dask]:tcp://127.0.0.1:33411 got new rank 1
17.537464
17.996174
Random Forest Comparison:
77.21135
37.694645
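
If the per-worker partitioning is the explanation, a single GPU forest trained on only half of the rows should degrade in a similar way. A minimal check along those lines (hypothetical, not run as part of this report; it assumes the imports and the persisted X, y from the script above):

# Hypothetical diagnostic, not run for this report: approximate what one
# worker's trees see by fitting the single-GPU RF on half of the rows,
# then scoring on the full dataset. Assumes cuml and the persisted X, y
# from the script above.
def fit_predict_rmse_rf_half_data(X, y):
    X_local = X.compute()
    y_local = y.compute()
    half = X_local.shape[0] // 2

    clf = cuml.ensemble.RandomForestRegressor(
        n_estimators=1000,
        max_depth=1000,
    )
    clf.fit(X_local[:half], y_local[:half])
    preds = clf.predict(X_local)

    return cuml.metrics.mean_squared_error(y_local, preds) ** 0.5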
github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.