rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] cudf-cuda11 not working in Databricks DBR 13.3 ML LTS on GPU instance #16041

Open jcampabadal-db opened 3 weeks ago

jcampabadal-db commented 3 weeks ago

Describe the bug

cudf-cuda11 does not use the GPU when running on Databricks DBR 13.3 ML LTS on a GPU instance.

Steps/Code to reproduce bug

Using DBR 14.3 ML with GPU fails with error:

Internal error message: Spark error: Driver down cause: java.lang.IllegalArgumentException: This RAPIDS Plugin build does not support Spark build 3.5.0-databricks. Supported Spark versions: 3.1.1 {buildver=311}, 3.1.2 {buildver=312}, 3.1.3 {buildver=313}, 3.2.0 {buildver=320}, 3.2.1 {buildver=321}, 3.2.1-cloudera-3.2.7171000 {buildver=321cdh}, 3.2.2 {buildver=322}, 3.2.3 {buildver=323}, 3.2.4 {buildver=324}, 3.3.0 {buildver=330}, 3.3.0-cloudera-3.3.7180 {buildver=330cdh}, 3.3.0-databricks {buildver=330db}, 3.3.1 {buildver=331}, 3.3.2 {buildver=332}, 3.3.2-cloudera-3.3.7190 {buildver=332cdh}, 3.3.2-databricks {buildver=332db}, 3.3.3 {buildver=333}, 3.3.4 {buildver=334}, 3.4.0 {buildver=340}, 3.4.1 {buildver=341}, 3.4.1-databricks {buildver=341db}, 3.4.2 {buildver=342}, 3.5.0 {buildver=350}, 3.5.1 {buildver=351}. Consult the Release documentation at https://nvidia.github.io/spark-rapids/docs/download.html
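The error text lists every Spark build the loaded plugin accepts, so the mismatch can be checked mechanically. This is a minimal sketch, assuming the message format shown above; the function names are hypothetical and the message is abbreviated to a few of its entries.

```python
import re

# Abbreviated copy of the error text above (only a few entries kept).
ERROR_TEXT = (
    "This RAPIDS Plugin build does not support Spark build 3.5.0-databricks. "
    "Supported Spark versions: 3.3.2-databricks {buildver=332db}, "
    "3.4.1 {buildver=341}, 3.4.1-databricks {buildver=341db}, "
    "3.5.0 {buildver=350}, 3.5.1 {buildver=351}."
)


def supported_spark_builds(message):
    """Return {spark_version: buildver} parsed from the plugin error text."""
    pairs = re.findall(r"(\d[\w.\-]*) \{buildver=(\w+)\}", message)
    return dict(pairs)


def rejected_spark_build(message):
    """Return the Spark build the plugin refused, or None if not found."""
    match = re.search(r"does not support Spark build (\S+?)\.\s", message)
    return match.group(1) if match else None


if __name__ == "__main__":
    builds = supported_spark_builds(ERROR_TEXT)
    rejected = rejected_spark_build(ERROR_TEXT)
    print(f"rejected: {rejected}, supported: {rejected in builds}")
    print("Databricks builds:", [v for v in builds if v.endswith("-databricks")])
```

In the full message, plain 3.5.0 is listed but 3.5.0-databricks is not, and 3.4.1-databricks is the newest Databricks build in the list.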

We are following these guides:

https://docs.rapids.ai/deployment/stable/platforms/databricks/

https://docs.nvidia.com/spark-rapids/user-guide/23.12/getting-started/databricks.html

Expected behavior

The cudf-cuda11 package should use the GPU to perform pandas operations.

Environment overview (please complete the following information)

Here I load cudf and confirm that printing pd shows <module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>.
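The same check can be scripted. This is a minimal sketch, assuming only that an accelerated module's repr contains "ModuleAccelerator" as in the output quoted above; both function names are hypothetical. It confirms that the proxy is installed, not that any given operation actually runs on the GPU.

```python
import importlib


def is_accelerated_repr(text):
    """True if a module repr looks like the cudf.pandas proxy shown above."""
    return "ModuleAccelerator" in text


def pandas_acceleration_status():
    """Describe how `pandas` currently resolves in this interpreter."""
    try:
        pd = importlib.import_module("pandas")
    except ImportError:
        return "pandas is not installed"
    if is_accelerated_repr(repr(pd)):
        return "pandas is proxied by cudf.pandas"
    return "pandas is the plain CPU library"


if __name__ == "__main__":
    print(pandas_acceleration_status())
```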

image

How can we debug why cuDF shows 0 per-GPU utilization and only per-GPU frame buffer utilization (bytes)? It seems to be using only the CPU. Please advise: cudf-cuda11 appears to support CUDA 11.2+, which the DBR release includes, and the library loads without errors.

We are using this NVIDIA notebook to test the RAPIDS cudf.pandas accelerator:

https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3

lithomas1 commented 2 weeks ago

Can you try using the cudf.pandas.profile magic? https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/#understanding-performance-the-cudf-pandas-profiler

I think this should tell you which operations are running on the GPU and which are running on CPU.

jcampabadal-db commented 2 weeks ago

Thank you @lithomas1, will check that

ericwong2965 commented 2 weeks ago

@lithomas1 I have been working with @jcampabadal-db on this. I observed very slow performance on the GPU on both Databricks DBR 13.3 ML (CUDA 11.7) and Databricks DBR 14.3 ML (CUDA 11.8), on an AWS EC2 g5.xlarge (A10G), running the same commands from https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/#understanding-performance-the-cudf-pandas-profiler

The output is below (it took several minutes to run). How can we work around or resolve this performance issue?

/databricks/python/lib/python3.10/site-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
  jitify._init_module()

                                       Total time elapsed: 225.300 seconds                                 
                                       3 GPU function calls in 224.665 seconds                               
                                        1 CPU function calls in 0.012 seconds                                

                                                        Stats                                                

┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function                ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ DataFrame               │ 1          │ 0.145       │ 0.145       │ 0          │ 0.000       │ 0.000       │
│ DataFrame.min           │ 1          │ 224.520     │ 224.520     │ 0          │ 0.000       │ 0.000       │
│ DataFrame.groupby       │ 1          │ 0.000       │ 0.000       │ 0          │ 0.000       │ 0.000       │
│ DataFrameGroupBy.filter │ 0          │ 0.000       │ 0.000       │ 1          │ 0.012       │ 0.012       │
└─────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘

Not all pandas operations ran on the GPU. The following functions required CPU fallback:
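A quick calculation on the table above shows where the time went, and a cold/warm timing helper gives one way to test whether the cost is one-time. This is a minimal sketch: the figures are copied from the profiler output, `time_twice` is a hypothetical helper, and the idea that the 224 seconds is the Jitify warm-up mentioned in the warning is an assumption the output does not confirm.

```python
import time

# Cumulative GPU times in seconds, copied from the profiler table above.
gpu_cumtime = {
    "DataFrame": 0.145,
    "DataFrame.min": 224.520,
    "DataFrame.groupby": 0.000,
}

total_gpu = sum(gpu_cumtime.values())
slowest = max(gpu_cumtime, key=gpu_cumtime.get)
share = gpu_cumtime[slowest] / total_gpu
print(f"{slowest}: {gpu_cumtime[slowest]:.3f}s of {total_gpu:.3f}s ({share:.1%})")


def time_twice(fn):
    """Run fn twice and return (first_seconds, second_seconds).

    A large gap between the two suggests a one-time cost such as JIT
    compilation or cache population rather than steady-state slowness.
    """
    durations = []
    for _ in range(2):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return tuple(durations)
```

DataFrame.min accounts for about 99.9% of the reported GPU time. Passing the slow call to `time_twice`, for example `time_twice(lambda: df.min(axis=1))` with `df` being the DataFrame from the profiled cell, would show whether a second run in the same session is still slow.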

ericwong2965 commented 2 weeks ago

Also, if I follow https://docs.nvidia.com/spark-rapids/user-guide/23.12/getting-started/databricks.html

I sometimes run into an OOM error even when loading a small dataset:

import cudf
import requests
from io import StringIO

url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode("utf-8")

tips_df = cudf.read_csv(StringIO(content))

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/cudf/cudf/python/cudf/build/cp310-cp310-linux_x86_64/_deps/rmm-src/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory
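An out-of-memory error on a small file may mean the device memory is already held by something else before cuDF allocates; that is an assumption, not something this thread confirms. A minimal sketch for checking used and total memory per GPU with the nvidia-smi CLI, assuming it is on the PATH of the driver node; the function names are hypothetical.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_rows(output):
    """Parse 'used, total' rows (MiB, one row per GPU) into int tuples.

    Assumes every field is numeric; a field such as '[N/A]' would raise.
    """
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Return [(used, total), ...] in MiB, or None if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_rows(result.stdout)


if __name__ == "__main__":
    print(gpu_memory_mib())
```

Running this just before the `cudf.read_csv` call would show how much memory is free at that moment.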

beckernick commented 2 days ago

Are you using cuDF Pandas alongside Spark RAPIDS in a single application/workflow or is this independent of Spark?

Would be curious to know if you experience this error when following only this guide https://docs.rapids.ai/deployment/stable/platforms/databricks/ (or if it's perhaps related to some combination).

ericwong2965 commented 2 days ago

@beckernick thanks for the reply. This ticket actually covers the issues we encountered after following the guide linked above. The latest RAPIDS release supports Databricks only up to 13.3 ML, as described in https://nvidia.github.io/spark-rapids/docs/download.html; otherwise the Databricks Spark cluster fails to boot.