
[BUG] cuml multi GPU Ridge and Linear Regression fails with cudaErrorMisalignedAddress:misaligned address #3965

Closed akaanirban closed 3 years ago

akaanirban commented 3 years ago

**Describe the bug**
Trying to fit a Linear Regression or a Ridge Regression model from `cuml.dask.linear_model` on a `dask_cudf` dataframe created from a set of parquet files fails with a `cudaErrorMisalignedAddress: misaligned address` error when using multiple GPUs.

The actual error:

```bash
distributed.worker - WARNING - Compute Failed
Function:  _func_fit
args:      (RidgeMG(), < could not convert arg to str >, 2855640, 7, [(5, 273983), (1, 237932), (3, 266759), (4, 256895), (5, 268670), (2, 263836), (7, 276552), (0, 241991), (4, 261025), (6, 248808), (2, 259189)], 5)
kwargs:    {}
Exception: RuntimeError("CUDA error encountered at: file=/opt/conda/envs/rapids/conda-bld/libcumlprims_1619021037911/work/cpp/src_prims_opg/linalg/svd.cu line=76: call='cudaStreamSynchronize(streams[i])', Reason=cudaErrorMisalignedAddress:misaligned address
Obtained 42 stack frames
#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f0d384d5076]
#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f0d384d57d9]
#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcumlprims.so(_ZN8MLCommon6LinAlg3opg11svdEig_implIfEEvRKN4raft8handle_tERKSt6vectorIPNS_6Matrix4DataIT_EESaISC_EERKNS8_14PartDescriptorERSE_PSA_SL_PP11CUstream_sti+0xcd2) [0x7f0d522475a2]
#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML5Ridge3opg8ridgeEigIfEEvRN4raft8handle_tERKSt6vectorIPN8MLCommon6Matrix4DataIT_EESaISC_EERKNS8_14PartDescriptorESG_PKSA_iPSA_PP11CUstream_stib+0x399) [0x7f0d38a1bda9]
#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML5Ridge3opg8fit_implIfEEvRN4raft8handle_tERSt6vectorIPN8MLCommon6Matrix4DataIT_EESaISC_EERNS8_14PartDescriptorESF_PSA_iSI_SI_bbiPP11CUstream_stib+0x1b9) [0x7f0d38a1c379]
#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML5Ridge3opg8fit_implIfEEvRN4raft8handle_tERSt6vectorIPN8MLCommon6Matrix4DataIT_EESaISC_EERNS8_14PartDescriptorESF_PSA_iSI_SI_bbib+0x163) [0x7f0d38a1ca73]
#6 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/linear_model/ridge_mg.cpython-38-x86_64-linux-gnu.so(+0x26b78) [0x7f0d484d8b78]
#7 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x255) [0x5555be6322b5]
#8 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x21c1) [0x5555be6dede1]
#9 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x5555be6bd503]
#10 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x378) [0x5555be6be8d8]
#11 in /opt/conda/envs/rapids/bin/python(_PyObject_FastCallDict+0x2fd) [0x5555be62ce2d]
#12 in /opt/conda/envs/rapids/bin/python(_PyObject_Call_Prepend+0x63) [0x5555be637983]
#13 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/linear_model/base_mg.cpython-38-x86_64-linux-gnu.so(+0x23b4c) [0x7f0d48609b4c]
#14 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/linear_model/base_mg.cpython-38-x86_64-linux-gnu.so(+0x27b77) [0x7f0d4860db77]
#15 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x255) [0x5555be6322b5]
#16 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x21c1) [0x5555be6dede1]
#17 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x5555be6bd503]
#18 in /opt/conda/envs/rapids/bin/python(+0x1b2007) [0x5555be6bf007]
#19 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4ca3) [0x5555be6e18c3]
#20 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#21 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x5e) [0x5555be6320be]
#22 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x21c1) [0x5555be6dede1]
#23 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#24 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x5e) [0x5555be6320be]
#25 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x21c1) [0x5555be6dede1]
#26 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#27 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5555be6dd66b]
#28 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#29 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x5e) [0x5555be6320be]
#30 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x21c1) [0x5555be6dede1]
#31 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#32 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5555be6dd66b]
#33 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#34 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5555be6dd66b]
#35 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1a6) [0x5555be6be706]
#36 in /opt/conda/envs/rapids/bin/python(+0x1b1f91) [0x5555be6bef91]
#37 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x5e) [0x5555be6320be]
#38 in /opt/conda/envs/rapids/bin/python(+0x2566a9) [0x5555be7636a9]
#39 in /opt/conda/envs/rapids/bin/python(+0x1eedb4) [0x5555be6fbdb4]
#40 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0ef2a2d6db]
#41 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0ef1da971f]
")
```

**Steps/Code to reproduce bug**
The following code snippet reproduces the bug. The parquet file can be downloaded from this Google Drive link (https://drive.google.com/file/d/1eSiyu47Kunq2a55rLGBJNl9UgaN6JKTy/view?usp=sharing) and read from the local machine. Optionally, the commented-out part of the code can be uncommented to read the parquet files directly from Google Cloud Storage and create the `dask_cudf` dataframe. Note that you need to install the Python libraries `gcsfs` and `fsspec=2021.4.0` in order to read the parquet files remotely with RAPIDS 0.19.

Code to reproduce the error.

```python
from dask_cuda import LocalCUDACluster
import dask_cudf
import cuml
import numpy as np
from dask.distributed import Client, wait
from cuml.dask.linear_model import Ridge as RidgeRegression
from cuml.dask.linear_model import LinearRegression

cluster = LocalCUDACluster(dashboard_address=':8224', n_workers=4)
client = Client(cluster)

data = dask_cudf.read_parquet("./test.parquet")

# Persist the data
data = client.persist(data)
wait(data)

# # UNCOMMENT THIS PART TO DIRECTLY READ FROM GOOGLE FILE SYSTEM AND CREATE THE DATAFRAME
# def clean(df_part, remap, must_haves):
#     """
#     This function performs the various clean up tasks for the data
#     and returns the cleaned dataframe.
#     """
#     tmp = {col:col.strip().lower() for col in list(df_part.columns)}
#     df_part = df_part.rename(columns=tmp)
#     # rename using the supplied mapping
#     df_part = df_part.rename(columns=remap)
#     # iterate through columns in this df partition
#     for col in df_part.columns:
#         # drop anything not in our expected list
#         if col not in must_haves:
#             df_part = df_part.drop(col, axis=1)
#             continue
#         # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler
#         if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:
#             df_part[col] = df_part[col].astype('datetime64[ms]')
#             continue
#         # if column was read as a string, recast as float
#         if df_part[col].dtype == 'object':
#             df_part[col] = df_part[col].astype('float32')
#         else:
#             # downcast from 64bit to 32bit types
#             # Tesla T4 are faster on 32bit ops
#             if 'int' in str(df_part[col].dtype):
#                 df_part[col] = df_part[col].astype('int32')
#             if 'float' in str(df_part[col].dtype):
#                 df_part[col] = df_part[col].astype('float32')
#         df_part[col] = df_part[col].fillna(-1)
#     return df_part

# def taxi_parquet_data_loader(client, response_dtype=np.float32):
#     # list of column names that need to be re-mapped
#     remap = {}
#     remap['tpep_pickup_datetime'] = 'pickup_datetime'
#     remap['tpep_dropoff_datetime'] = 'dropoff_datetime'
#     remap['ratecodeid'] = 'rate_code'
#     #create a list of columns & dtypes the df must have
#     must_haves = {
#         'pickup_datetime': 'datetime64[ms]',
#         'dropoff_datetime': 'datetime64[ms]',
#         'passenger_count': 'int32',
#         'trip_distance': 'float32',
#         'pickup_longitude': 'float32',
#         'pickup_latitude': 'float32',
#         'rate_code': 'int32',
#         'dropoff_longitude': 'float32',
#         'dropoff_latitude': 'float32',
#         'fare_amount': 'float32'
#     }
#     # apply a list of filter conditions to throw out records with missing or outlier values
#     query_frags = [
#         'fare_amount > 0 and fare_amount < 500',
#         'passenger_count > 0 and passenger_count < 6',
#         'pickup_longitude > -75 and pickup_longitude < -73',
#         'dropoff_longitude > -75 and dropoff_longitude < -73',
#         'pickup_latitude > 40 and pickup_latitude < 42',
#         'dropoff_latitude > 40 and dropoff_latitude < 42'
#     ]
#     workers = client.has_what().keys()
#     taxi_parquet_path = "gs://anaconda-public-data/nyc-taxi/nyc.parquet/part.10*.parquet"
#     response_id = 'fare_amount'
#     fields = ['passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',
#               'dropoff_longitude', 'dropoff_latitude', 'fare_amount']
#     taxi_df = dask_cudf.read_parquet(taxi_parquet_path, npartitions=len(workers), chunksize=25e6)
#     taxi_df = clean(taxi_df, remap, must_haves)
#     taxi_df = taxi_df.query(' and '.join(query_frags))
#     taxi_df = taxi_df[fields]
#     #with dask.annotate(workers=set(workers)):
#     taxi_df = client.persist(collections=taxi_df)
#     wait(taxi_df)
#     X = taxi_df[taxi_df.columns.difference([response_id])].astype(np.float32)
#     y = taxi_df[response_id].astype(response_dtype)
#     return taxi_df, X, y

# # READ THE PARQUET FILE (and optionally write to disk)
# data, X, y = taxi_parquet_data_loader(client)
# # data.to_parquet("./test.parquet")

response_id = 'fare_amount'
X = data[data.columns.difference([response_id])].astype(np.float32)
y = data[response_id].astype(np.float32)

ridge = RidgeRegression(client=client)
ridge.fit(X, y)

# # Uncomment to check linear as well
# linear = LinearRegression(client=client)
# linear.fit(X, y)
```
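
For anyone without access to the parquet file, the following is a rough synthetic-data sketch of the same call pattern. The data here is randomly generated, so it is not guaranteed to trigger the crash (the failure may depend on the partition sizes and layout of the original dataset), and it assumes the same `client` as above:

```python
import cudf
import dask_cudf
import numpy as np
from cuml.dask.linear_model import Ridge as RidgeRegression

n_rows, n_features = 100_000, 7
rng = np.random.default_rng(42)

# Build a single-GPU cudf DataFrame of random float32 features plus a target column.
gdf = cudf.DataFrame(
    {f"f{i}": rng.random(n_rows, dtype=np.float32) for i in range(n_features)}
)
gdf["fare_amount"] = rng.random(n_rows, dtype=np.float32)

# Split it into a multi-partition dask_cudf DataFrame and persist it on the workers.
ddf = dask_cudf.from_cudf(gdf, npartitions=8)
ddf = client.persist(ddf)  # `client` is the Client created in the reproducer above

X = ddf[ddf.columns.difference(["fare_amount"])].astype(np.float32)
y = ddf["fare_amount"].astype(np.float32)

ridge = RidgeRegression(client=client)
ridge.fit(X, y)
```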

**Expected behavior**
Calling `model.fit` on the linear models should complete without errors.

**Environment details (please complete the following information):**

**Additional context**
While this fails with the `LocalCUDACluster` in the reproducer above, it also fails in the multi-node multi-GPU (MNMG) case when run with dask-kubernetes on a Kubernetes cluster.

hcho3 commented 3 years ago

Thanks for the example. We will try to reproduce it on our end.

hcho3 commented 3 years ago

I was able to reproduce the same error on my end.

hcho3 commented 3 years ago

I also got the same error when I used the latest cuML (from the Docker image `rapidsai-core:21.06-cuda11.2-runtime-ubuntu20.04-py3.8`).

@cjnolet Who is the best person to reach out to about this error in cumlprims?

cjnolet commented 3 years ago

@akaanirban, my apologies for being a little late to this discussion. The RMM allocator should be aligning any memory allocations, but it is not set as the CuPy allocator by default (and CuPy allocations are ultimately what back the Dask arrays).

Can you try inserting the following directly after creating the Dask client object, to make sure CuPy is using RMM, and see if it fixes your problem?

```python
import cupy as cp
import rmm

rmm.reinitialize()
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
```
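
Since the misaligned allocations would come from the worker processes rather than the client process, the same setup presumably needs to run on every Dask worker as well. A minimal sketch using `client.run` (assuming the `client` object from the reproducer above):

```python
import cupy as cp
import rmm

def set_rmm_cupy_allocator():
    # Route CuPy allocations through RMM in the calling process.
    rmm.reinitialize()
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

set_rmm_cupy_allocator()            # in the client process
client.run(set_rmm_cupy_allocator)  # and on every worker
```
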
Nanthini10 commented 3 years ago

@cjnolet This was reported by an intern who finished his internship a few weeks back. The error seems to have resolved itself even without the `set_allocator` code, but I am now facing a `cudaErrorMisalignedAddress` in https://github.com/rapidsai/cuml/issues/4199.

It looks like selecting a view causes the error to occur; any thoughts on why that happens?
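
For reference, one way a view selection can surface as a misaligned address further down the stack is when it produces a strided view whose base pointer is offset from the parent allocation. A minimal CuPy sketch of that situation (purely illustrative; whether this is the actual mechanism behind #4199 is speculation):

```python
import cupy as cp

a = cp.arange(70, dtype=cp.float32).reshape(10, 7)  # parent allocation
v = a[:, 1:4]                                       # column selection -> strided view

print(v.flags.c_contiguous)     # False: the view is not contiguous
print(v.data.ptr - a.data.ptr)  # 4: the view starts 4 bytes into the parent buffer,
                                # which can break kernels that assume wider alignment
```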

cjnolet commented 3 years ago

@Nanthini10,

My guess is that the error you are encountering in #4199 is related to, but not necessarily caused by, the same issue as this thread. I believe this thread may have been caused by CuPy using its own allocator outside of RMM, which wasn't aligning the addresses. I'll respond to #4199 on the issue itself.

Nanthini10 commented 3 years ago

Perfect, thanks! I'll go ahead and close this issue since I can't reproduce the original problem.

Unless you see a reason for it to stay open.

cjnolet commented 3 years ago

If it was not caused by the CuPy allocator issue I proposed above, it's also very possible that this resolved itself recently, as I believe our libcumlprims package had been using an out-of-date RAFT version for a period.

akaanirban commented 3 years ago

@cjnolet thanks for looking into this. Very sorry for the late reply. Unfortunately I do not have the resources to test this at the moment :( It's great that the issue is resolved. Thanks @Nanthini10!