rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Merge requires excessive memory #9906

Closed: madsbk closed this issue 2 years ago

madsbk commented 2 years ago

Describe the bug
Merging two 48 MB DataFrames requires an extra 132 GB of device memory!

Steps/Code to reproduce bug
The following code reproduces the issue. It first registers an RMM resource adapter that prints the size of the allocation on failure, and then merges two 48 MB DataFrames.

import cudf
from dask.utils import format_bytes
import rmm.mr

def oom(nbytes: int) -> bool:
    print(f"RMM allocation of {format_bytes(nbytes)} failed")
    return False

# Register a resource adapter that prints the size of the
# allocation on failure.
current_mr = rmm.mr.get_current_device_resource()
mr = rmm.mr.FailureCallbackResourceAdaptor(current_mr, oom)
rmm.mr.set_current_device_resource(mr)

# Perform a merge of two fairly modest sized DataFrames (48MB each)
df_a = cudf.datasets.randomdata(nrows=2000000)
df_b = cudf.datasets.randomdata(nrows=2000000)
print(f"Merging to DF of size {format_bytes(df_a.memory_usage().sum())}")
df_a.merge(df_b, on=["id"])

The output:

Merging two DFs of size 45.78 MiB each
RMM allocation of 132.89 GiB failed

Traceback (most recent call last):
  File "merge_oom.py", line 18, in <module>
    df_a.merge(df_b, on=["id"])
  File "/home/mkristensen/apps/miniconda3/envs/rap-1215/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/home/mkristensen/apps/miniconda3/envs/rap-1215/lib/python3.8/site-packages/cudf/core/dataframe.py", line 3703, in merge
    gdf_result = super()._merge(
  File "/home/mkristensen/apps/miniconda3/envs/rap-1215/lib/python3.8/site-packages/cudf/core/frame.py", line 3766, in _merge
    return merge(
  File "/home/mkristensen/apps/miniconda3/envs/rap-1215/lib/python3.8/site-packages/cudf/core/join/join.py", line 52, in merge
    return mergeobj.perform_merge()
  File "/home/mkristensen/apps/miniconda3/envs/rap-1215/lib/python3.8/site-packages/cudf/core/join/join.py", line 170, in perform_merge
    left_rows, right_rows = self._joiner(
  File "cudf/_lib/join.pyx", line 24, in cudf._lib.join.join
  File "cudf/_lib/join.pyx", line 30, in cudf._lib.join.join
MemoryError: std::bad_alloc: CUDA error at: /home/mkristensen/apps/miniconda3/envs/rap-1215/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory

Expected behavior
Perform the merge with a peak memory use of around 100-200 MB.

Environment
My workstation and DGX-15, nightly conda:

mamba create -n test -c rapidsai-nightly -c nvidia -c conda-forge rapids=21.12 python=3.8 cudatoolkit=11.4 \
    thrust dask-cuda ucx-proc=*=gpu ucx pytest pytest-asyncio cupy distributed numba "black==19.10b0" \
    isort flake8 automake make libtool pkg-config libhwloc setuptools "cython>=0.29.14,<3.0.0a0" ipython \
    gh graphviz python-graphviz oauth2client gspread spacy psutil asyncssh sphinx_rtd_theme mypy py-spy \
    cucim dask-sql

cc. @randerzander

jrhemstad commented 2 years ago

lolwat. That's crazy. This looks like a great place to try @shwina's Python memory profiler to see where that allocation is coming from.
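
For reference, and only as a sketch (this is not the profiler mentioned above): one low-tech way to see the size and timing of individual device allocations is RMM's LoggingResourceAdaptor, which records every allocation and deallocation to a CSV file. The log file name used here is arbitrary.

import rmm.mr

# Wrap the current device resource so every allocation/deallocation is
# appended to a CSV log (roughly: thread, timestamp, action, pointer,
# size, stream).
current_mr = rmm.mr.get_current_device_resource()
logging_mr = rmm.mr.LoggingResourceAdaptor(current_mr, "rmm_log.csv")
rmm.mr.set_current_device_resource(logging_mr)

# ... run the merge (or any workload under investigation) here ...

logging_mr.flush()  # make sure all records are written before reading the log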

beckernick commented 2 years ago

This looks like it increasingly approximates a large self-join on only a few keys, since n grows much larger than the number of distinct id values produced by the random dataset generator. Are we sure this isn't expected behavior with n=2e6? (See the fan-out estimate sketch after the output below.)

import cudf
from dask.utils import format_bytes

for n in (10000, 50000, 100000, 200000):
    df_a = cudf.datasets.randomdata(nrows=n)
    df_b = cudf.datasets.randomdata(nrows=n)
    res = df_a.merge(df_b, on=['id'])
    print(f"Output size: {format_bytes(res.memory_usage().sum())}")
    print(f"{n:,}", f"{len(res):,}")
    print()
Output size: 33.94 MiB
10,000 889,769

Output size: 848.07 MiB
50,000 22,231,725

Output size: 3.32 GiB
100,000 89,010,357

Output size: 13.27 GiB
200,000 356,134,437
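
As a quick sanity check (a sketch added for illustration; the column names n_a and n_b are arbitrary), the inner-join fan-out can be estimated up front from the per-key counts, since each distinct "id" value contributes count_a * count_b output rows.

import cudf

n = 200000
df_a = cudf.datasets.randomdata(nrows=n)
df_b = cudf.datasets.randomdata(nrows=n)

# Per-key row counts on each side; the join produces count_a * count_b
# rows for every distinct "id" value, so summing the products gives the
# expected output row count without materializing the join itself.
counts_a = df_a.groupby("id").size().rename("n_a").reset_index()
counts_b = df_b.groupby("id").size().rename("n_b").reset_index()
est = counts_a.merge(counts_b, on="id")
print(f"Estimated output rows: {int((est['n_a'] * est['n_b']).sum()):,}")
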
jrhemstad commented 2 years ago

Oh yeah, if the data is heavily skewed then this join can definitely blow up.

madsbk commented 2 years ago

@beckernick you are absolutely right, I assumed that the "id" values in randomdata were unique. Thanks for making this clear and sorry for the noise.