rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] RFRegressor: Largely inconsistent performance across GPU models #4279

Open Oleg-dM opened 3 years ago

Oleg-dM commented 3 years ago

Hello,

I've been using cuML for 3-4 months now and recently noticed something strange: RFRegressor performance is inconsistent from one GPU model to another.

This seems to have worsened in the latest release, as detailed below.

Test description

The test (adapted from a cuML documentation example) consists of running a regression 15 times using make_regression and a RFRegressor, then averaging the mean_squared_error (from both cuML and sklearn) over the 15 runs. Something to note is that the issue worsens as the dataset gets bigger: here 25k samples / 100 features, while performance is aligned on smaller datasets of e.g. 10k samples / 50 features.

Test results (detailed tests results below)

Release 21.06 seems to perform better than 21.10 in this test

------------- Release 21.06 ------------

MSE is roughly half on the GTX 1070 Ti compared to the GTX 1050 (per-card screenshots omitted; see the detailed result tables below).

------------- Release 21.10 ------------

MSE is comparable between the RTX 3060 Ti and the GTX 1050, while significantly worse on the RTX 4000 (per-card screenshots omitted; see the detailed result tables below).

Test reproduction

System used is Ubuntu 20.04 & Cuda 11.2

The code used for the test is slightly adapted from a cuML documentation example (found here)

%env CUDA_VISIBLE_DEVICES=2

%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import cuml
import cupy as cp
from cuml.datasets.regression import make_regression
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestRegressor as cuRF
from sklearn.metrics import mean_squared_error

# synthetic dataset dimensions
n_samples = 25000
n_features = 100

acc_cu = []
acc_sk = []
for i in range (0, 15):

    # generate synthetic data [ regression task ]
    X, y = make_regression (n_features = n_features,
                         n_samples = n_samples,
                         random_state = 0 )

    X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 0 )

    model = cuRF( max_features=1.0,
                  max_depth = 5000,
                  n_estimators = 125,
                  n_bins = 128,
                  random_state  = 0,
                  min_samples_split = 2)

    trained_RF = model.fit ( X_train, y_train )
    predictions = model.predict ( X_test )

    cu_score = cuml.metrics.regression.mean_squared_error( y_test, predictions )
    sk_score = mean_squared_error( cp.asnumpy( y_test ), cp.asnumpy( predictions ) )

    print( " cuml accuracy: ", cu_score )
    acc_cu.append(cu_score.get())
    print( " sklearn accuracy : ", sk_score )
    acc_sk.append(sk_score)

np.mean(acc_cu), np.mean(acc_sk)

Detailed tests results

------------- Release 21.06 ------------ Server 1: Ubuntu 20.04, Cuda 11.2

GTX 1050

    acc_cu      acc_sk
0   18446.268   22.441614
1   18632.25    24.513664
2   18218.719   22.330492
3   18669.227   19.781389
4   18642.93    24.818087
5   18278.35    19.682056
6   18050.219   27.107809
7   18412.809   23.255785
8   18207.566   24.568937
9   18678.936   22.203779
10  18747.725   22.353735
11  18479.879   21.011339
12  18648.385   24.909163
13  18529.021   22.107994
14  18518.441   22.790295

avg: (18467.385, 22.203711)

GTX 1070ti

    acc_cu      acc_sk
0   10483.717   11.845588
1   10324.178   12.974723
2   10487.438   10.379570
3   10574.077   11.708376
4   10447.6     12.256223
5   10311.188   7.706892
6   10115.344   7.960054
7   10234.686   9.588173
8   10318.748   12.676580
9   10414.839   11.744102
10  10321.573   10.217284
11  10254.622   9.216642
12  10443.75    14.032436
13  10458.478   9.085072
14  10215.521   7.663937

avg: (10360.385, 10.603711)

------------- Release 21.10 ------------ Server 2: Ubuntu 20.04, Cuda 11.2

GTX 1050

    acc_cu  acc_sk
0   18779.246   19.910265
1   18700.316   23.518761
2   18195.303   21.492762
3   18927.07    23.784945
4   18753.113   19.355326
5   18939.809   27.984495
6   18412.676   26.754286
7   18711.05    29.116203
8   18549.498   26.827385
9   18435.504   21.944023
10  18616.049   23.043634
11  18971.78    23.360584
12  18596.293   24.489685
13  18919.521   21.008913
14  18789.465   20.625677

avg: (18686.445, 23.547796)

RTX 3060ti

    acc_cu      acc_sk
0   17883.842   20.258152
1   17780.596   20.380213
2   17799.564   21.685471
3   18261.98    23.380175
4   17631.197   22.533108
5   17543.809   22.828203
6   18100.209   25.829378
7   17565.72    17.602024
8   18282.281   25.763174
9   17783.22    24.113585
10  17648.469   23.089861
11  17586.965   20.813580
12  18006.36    25.121824
13  17763.885   24.430454
14  17535.637   23.898523

avg: (17811.584, 22.781849)

RTX 4000

    acc_cu      acc_sk
0   29558.697   37.993656
1   28248.191   32.322979
2   28997.182   37.677021
3   28337.803   34.355915
4   28540.05    32.718040
5   28851.107   36.112766
6   27995.83    32.956284
7   28931.229   31.140060
8   29303.906   41.207539
9   28321.346   31.981977
10  28515.256   34.392998
11  28181.438   33.853584
12  28810.854   35.071644
13  28687.695   35.905521
14  27819.627   33.025421

avg: (28606.68, 34.714363)
dantegd commented 3 years ago

Tagging @RAMitchell @venkywonka @vinaydes for input on the issue

venkywonka commented 3 years ago

Thank you for the bug report @Oleg-dM! This doesn't address the main problem, but the large discrepancy between cuml.metrics.regression.mean_squared_error and sklearn.metrics.mean_squared_error is (I think) a bug in the way cuML deals with arrays.

Will find root cause for why this is the case and file a bug 👍🏾

Oleg-dM commented 3 years ago

Thank you for the bug report @Oleg-dM! This doesn't address the main problem, but the large discrepancy between cuml.metrics.regression.mean_squared_error and sklearn.metrics.mean_squared_error is (I think) a bug in the way cuML deals with arrays.

* so, from the above script, `y_test.shape` is (6250, 1) while `predictions.shape` is (6250,).

* By using `cp.ravel(y_test)` instead of `y_test`, both metrics become equivalent.

Will find root cause for why this is the case and file a bug 👍🏾

Thank you Venkat - how serious do you think the main issue is?

One thing I didn't mention is that version 21.06, on which the test was run, was a source build for the GTX architecture (CC 6.1) - could that explain the good performance of the GTX 1070 Ti compared to all other models?

Thank you in advance for keeping us up to date!
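To illustrate the shape mismatch described above, here is a plain-NumPy sketch (a stand-in for the cuML arrays, not necessarily the exact code path inside cuml.metrics): comparing a (6250, 1) target against a (6250,) prediction broadcasts the difference to a (6250, 6250) matrix, which inflates the "MSE" enormously.

```python
import numpy as np

# Hypothetical stand-in for the arrays in the thread: y_test has shape
# (n, 1) while predictions has shape (n,).
rng = np.random.default_rng(0)
n = 6250
y_test = rng.normal(size=(n, 1))
predictions = y_test[:, 0] + rng.normal(scale=0.1, size=n)

# (n, 1) - (n,) broadcasts to an (n, n) matrix, so the "MSE" averages
# squared differences over all pairs instead of matched pairs only.
mse_broadcast = float(np.mean((y_test - predictions) ** 2))

# Raveling y_test first restores the intended elementwise comparison.
mse_correct = float(np.mean((np.ravel(y_test) - predictions) ** 2))

print(mse_broadcast, mse_correct)
```

This matches the observation in the thread that `cp.ravel(y_test)` makes the two metrics agree.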

venkywonka commented 3 years ago
Oleg-dM commented 3 years ago

My concern is the highly different MSE across GPU models:

If you look at the detailed test results, the RTX 4000 performs consistently worse than the RTX 3060 Ti: MSE is always around 35 for the former and around 23 for the latter.

That is the inconsistency mentioned in the issue title - reproducibility is not a concern afaic

I ran this particular test to make the issue easier to understand, but in our much more complex system we see the same thing, i.e. the different cards do not produce the same results.

This is a real issue! Outputs vary from card to card and cannot be considered reliable.

venkywonka commented 3 years ago

I understand your concern @Oleg-dM, but from what I discovered, the noticeably different MSEs for different cards are due to different X_train, X_test, y_train and y_test being given to the models. If you make sure the inputs to the model are identical, the MSEs come out very close to each other (the small variations are attributable to floating-point arithmetic). Doing the same experiment for a random forest classifier using sklearn's make_classification gives EXACT matches, as no floating-point arithmetic is involved there.
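The floating-point caveat above can be seen even without a GPU: addition is not associative, so reductions performed in a different order (as different GPU architectures may do) can yield slightly different results on identical inputs. A minimal sketch:

```python
# Floating-point addition is not associative: summing the same three
# values in a different grouping gives a slightly different result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c     # 0.6000000000000001
right = a + (b + c)    # 0.6
print(left == right)   # False
print(abs(left - right))  # tiny, but nonzero
```

The same effect, accumulated over millions of additions in a tree ensemble, accounts for small per-card differences even with identical inputs.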


Running your above script using sklearn's make_regression and train_test_split, I get the following average MSEs:

TU104 (Tesla T4, but same chip as Quadro RTX 4000):

```
cuml accuracy:  3971.882485676104
sklearn accuracy :  3971.882485676104
3972.0471872687635 3972.0471872687635
```

GA104 (RTX 3070 Ti, but same chip as RTX 3060 Ti):

```
cuml accuracy:  3971.5524583229244
sklearn accuracy :  3971.5524583229244
3971.9694753795507 3971.9694753795507
```
The modified script:

```python
# %env CUDA_VISIBLE_DEVICES=2
# %load_ext autoreload
# %autoreload 2

import numpy as np
import pandas as pd
import cuml
import cupy as cp
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from cuml.ensemble import RandomForestRegressor as cuRF
from sklearn.metrics import mean_squared_error

# synthetic dataset dimensions
n_samples = 25000
n_features = 100

acc_cu = []
acc_sk = []
for i in range(0, 15):

    # generate synthetic data [ regression task ]
    X, y = make_regression(n_features=n_features,
                           n_samples=n_samples,
                           random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = cuRF(max_features=1.0,
                 max_depth=5000,
                 n_estimators=125,
                 n_bins=128,
                 random_state=0,
                 min_samples_split=2)

    trained_RF = model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    cu_score = cuml.metrics.regression.mean_squared_error(cp.ravel(cp.asarray(y_test)),
                                                          cp.asarray(predictions))
    sk_score = mean_squared_error(cp.asnumpy(y_test), cp.asnumpy(predictions))

    print(" cuml accuracy: ", cu_score)
    acc_cu.append(cu_score.get())
    print(" sklearn accuracy : ", sk_score)
    acc_sk.append(sk_score)

print(np.mean(acc_cu), np.mean(acc_sk))
```
Oleg-dM commented 3 years ago

I understand your concern @Oleg-dM, but from what I discovered, the noticeably different mse for different cards are due to different X_train, X_test, y_train and y_test being given to the models.

The MSE is averaged over 15 runs and even so differs significantly across cards - we are talking about a 50% deviation between the RTX 3060 Ti and RTX 4000 results.

Variations in the make_regression or train_test_split output cannot explain that.

I will test the issue using the exact same dataset for all cards to double check

EDIT: I tested with a fixed dataset on different cards and got the same MSE.

So make_regression and train_test_split are dependent on the card architecture?

Sorry for the confusion
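For anyone wanting to repeat the fixed-dataset check, one way (a sketch using a NumPy stand-in for make_regression; the file name is illustrative) is to generate the data once on the host, persist it, and reload it on every machine:

```python
import numpy as np

# Generate one dataset on the CPU and persist it; every GPU then trains
# on byte-identical inputs, taking data-generation RNG out of the comparison.
rng = np.random.default_rng(0)
X = rng.normal(size=(25000, 100)).astype(np.float32)
coef = rng.normal(size=100).astype(np.float32)
y = X @ coef  # simple linear target as a stand-in for make_regression

np.savez("fixed_dataset.npz", X=X, y=y)

# On each test machine:
data = np.load("fixed_dataset.npz")
X_loaded, y_loaded = data["X"], data["y"]
print(np.array_equal(X_loaded, X), np.array_equal(y_loaded, y))
```

With identical inputs, any remaining per-card MSE differences can be attributed to the model itself rather than to data generation.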

dantegd commented 3 years ago

Note: for the issue with make_regression and train_test_split, a PR to CuPy was merged recently (https://github.com/cupy/cupy/pull/5838) that fixes this, so the non-deterministic behavior there should be fixed soon as well.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.