rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Dask RF accuracy reduces on increase in number of gpus and partitions_per_worker #2437

Open Salonijain27 opened 4 years ago

Salonijain27 commented 4 years ago

This test example was taken from test/dask/test_random_forest.py and modified to scale the number of samples and the number of estimators with the number of GPUs.

import cudf
import dask_cudf
import numpy as np
import pandas as pd

from cuml.dask.ensemble import RandomForestRegressor as cuRFR_mg
from cuml.dask.common import utils as dask_utils

from dask.array import from_array
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

def _prep_training_data(c, X_train, y_train, partitions_per_worker):
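    # Split the training data into partitions_per_worker dask_cudf partitions
    # per worker and persist them across the Dask workers.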
    workers = c.has_what().keys()
    n_partitions = partitions_per_worker * len(workers)
    X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
    X_train_df = dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)

    y_cudf = cudf.Series(y_train)
    y_train_df = \
        dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)

    print(" X_train : ", X_train_df)

    X_train_df, \
        y_train_df = dask_utils.persist_across_workers(c,
                                                       [X_train_df,
                                                        y_train_df],
                                                       workers=workers)
    return X_train_df, y_train_df

cluster = LocalCUDACluster(threads_per_worker=1)
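# LocalCUDACluster starts one Dask worker per visible GPU, so n_workers below reflects the GPU count.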
print(" cluster : ", cluster)
c = Client(cluster)
partitions_per_worker = 1
try:
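    # Scale the dataset size and the number of trees with the number of workers.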
    n_workers = len(cluster.workers)
    X, y = make_regression(n_samples=10000*n_workers,
                           n_features=20,
                           n_informative=10, random_state=123)

    X = X.astype(np.float32)
    y = y.astype(np.float32)

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=1000*n_workers,
                                                        random_state=123)

    cu_rf_params = {
        'n_estimators': 50*n_workers,
        'max_depth': 16,
        'n_bins': 16,
    }

    workers = c.has_what().keys()
    n_partitions = partitions_per_worker * len(workers)

    X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
    X_train_df = \
        dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)

    y_cudf = cudf.Series(y_train)
    y_train_df = \
        dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)

    X_train_df, y_train_df = dask_utils.persist_across_workers(
        c, [X_train_df, y_train_df], workers=workers)
    X_test_dask_array = from_array(X_test)
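    # Train the distributed random forest on the persisted dask_cudf data.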
    cuml_mod = cuRFR_mg(**cu_rf_params)
    cuml_mod.fit(X_train_df, y_train_df)

    cuml_mod_predict = cuml_mod.predict(X_test, predict_model='CPU')
    cuml_mod_gpu = cuml_mod.predict(X_test_dask_array, predict_model='GPU')

    # r2_score expects (y_true, y_pred)
    acc_score = r2_score(y_test, cuml_mod_predict)
    acc_score_GPU = r2_score(y_test, cuml_mod_gpu.compute())

    print(cuml_mod.rfs)
    print(" acc_score with predict_model='CPU' : ", acc_score)
    print(" acc_score with predict_model='GPU' : ", acc_score_GPU)

    assert acc_score >= 0.67

finally:
    c.close()

Looking into it further

Salonijain27 commented 4 years ago

I compared the forests built by the Dask RF with those built by the non-Dask (single-GPU) RF implementation. When the non-Dask forests were built with the same seed values used by the Dask RF workers, they were almost identical to the Dask forests.

This issue could be due to the fact that the RF model accuracy varies considerably when the seed used to build the forest changes. Since each worker uses a different seed, the forest created on each worker is different, and this could be affecting the overall accuracy of the Dask RF model.
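A quick way to check this hypothesis is to train the single-GPU regressor several times on the same data, varying only the seed, and compare the R^2 scores. The sketch below is only illustrative: it assumes cuml.ensemble.RandomForestRegressor with a random_state argument (older releases call it seed), and mirrors the single-worker data and forest parameters of the reproducer above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from cuml.ensemble import RandomForestRegressor as cuRFR

# Same data shape as the reproducer above, single-worker sizes.
X, y = make_regression(n_samples=10000, n_features=20,
                       n_informative=10, random_state=123)
X = X.astype(np.float32)
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=123)

# Train the same forest with different seeds and record the spread in R^2.
scores = []
for seed in [0, 1, 2, 3, 4]:
    model = cuRFR(n_estimators=50, max_depth=16, n_bins=16,
                  random_state=seed)
    model.fit(X_train, y_train)
    scores.append(r2_score(y_test, model.predict(X_test)))

print("R^2 per seed :", scores)
print("spread       :", max(scores) - min(scores))

If the spread here is comparable to the accuracy drop seen in the multi-GPU run, per-worker seeding is the likely cause rather than the distributed training itself.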

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.