This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.
This issue has been labeled `inactive-90d` due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
I landed back on this issue while looking for UVM discussions in cuDF. How could we get an OOM error if UVM (Unified Virtual Memory) is enabled and working properly?
I've never seen an OOM with UVM, just full system hangs when host memory is exhausted.
Perhaps the original issue came out of UVM not working properly rather than a cuDF issue. @charlesbluca, would you please share any updates since your original filing?
Sorry for the delay, went ahead and retried the script with updated nightlies and this still seems to be an issue:
``` ○ → conda list List of packages in environment: "/home/charlesb/micromamba/envs/cudf-23.10" Name Version Build Channel ───────────────────────────────────────────────────────────────────────────────────────────────────── _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge asttokens 2.2.1 pyhd8ed1ab_0 conda-forge aws-c-auth 0.7.0 hf8751d9_2 conda-forge aws-c-cal 0.6.0 h93469e0_0 conda-forge aws-c-common 0.8.23 hd590300_0 conda-forge aws-c-compression 0.2.17 h862ab75_1 conda-forge aws-c-event-stream 0.3.1 h9599702_1 conda-forge aws-c-http 0.7.11 hbe98c3e_0 conda-forge aws-c-io 0.13.28 h3870b5a_0 conda-forge aws-c-mqtt 0.8.14 h2e270ba_2 conda-forge aws-c-s3 0.3.13 heb0bb06_2 conda-forge aws-c-sdkutils 0.1.11 h862ab75_1 conda-forge aws-checksums 0.1.16 h862ab75_1 conda-forge aws-crt-cpp 0.20.3 he9c0e7f_4 conda-forge aws-sdk-cpp 1.10.57 hbc2ea52_17 conda-forge backcall 0.2.0 pyh9f0ad1d_0 conda-forge backports 1.0 pyhd8ed1ab_3 conda-forge backports.functools_lru_cache 1.6.5 pyhd8ed1ab_0 conda-forge bzip2 1.0.8 h7f98852_4 conda-forge c-ares 1.19.1 hd590300_0 conda-forge ca-certificates 2023.7.22 hbcca054_0 conda-forge cachetools 5.3.1 pyhd8ed1ab_0 conda-forge cubinlinker 0.3.0 py310hfdf336d_0 rapidsai-nightly cuda-python 11.8.2 py310h01a121a_0 conda-forge cuda-version 11.8 h70ddcb2_2 conda-forge cudatoolkit 11.8.0 h4ba93d1_12 conda-forge cudf 23.10.00a cuda11_py310_230807_ge92de8113d_60 rapidsai-nightly cupy 12.1.0 py310h53f8385_1 conda-forge decorator 5.1.1 pyhd8ed1ab_0 conda-forge dlpack 0.5 h9c3ff4c_0 conda-forge executing 1.2.0 pyhd8ed1ab_0 conda-forge fastrlock 0.8 py310hd8f1fbe_3 conda-forge fmt 9.1.0 h924138e_0 conda-forge fsspec 2023.6.0 pyh1a96a4e_0 conda-forge gflags 2.2.2 he1b5a44_1004 conda-forge glog 0.6.0 h6f12383_0 conda-forge gmock 1.14.0 ha770c72_0 conda-forge gtest 1.14.0 h00ab1b0_0 conda-forge ipython 8.14.0 pyh41d4057_0 conda-forge jedi 0.19.0 pyhd8ed1ab_0 conda-forge keyutils 1.6.1 h166bdaf_0 conda-forge krb5 1.21.1 h659d440_0 conda-forge ld_impl_linux-64 2.40 h41732ed_0 conda-forge libabseil 20230125.3 cxx17_h59595ed_0 conda-forge libarrow 12.0.1 h657c46f_7_cpu conda-forge libblas 3.9.0 17_linux64_openblas conda-forge libbrotlicommon 1.0.9 h166bdaf_9 conda-forge libbrotlidec 1.0.9 h166bdaf_9 conda-forge libbrotlienc 1.0.9 h166bdaf_9 conda-forge libcblas 3.9.0 17_linux64_openblas conda-forge libcrc32c 1.1.2 h9c3ff4c_0 conda-forge libcudf 23.10.00a cuda11_230807_ge92de8113d_60 rapidsai-nightly libcufile 1.4.0.31 0 nvidia libcufile-dev 1.4.0.31 0 nvidia libcurl 8.2.1 hca28451_0 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libev 4.33 h516909a_1 conda-forge libevent 2.1.12 hf998b51_1 conda-forge libffi 3.4.2 h7f98852_5 conda-forge libgcc-ng 13.1.0 he5830b7_0 conda-forge libgfortran-ng 13.1.0 h69a702a_0 conda-forge libgfortran5 13.1.0 h15d22d2_0 conda-forge libgomp 13.1.0 he5830b7_0 conda-forge libgoogle-cloud 2.12.0 h840a212_1 conda-forge libgrpc 1.56.2 h3905398_0 conda-forge libkvikio 23.10.00a cuda11_230807_g0247ca6_6 rapidsai-nightly liblapack 3.9.0 17_linux64_openblas conda-forge libllvm14 14.0.6 hcd5def8_4 conda-forge libnghttp2 1.52.0 h61bc06f_0 conda-forge libnsl 2.0.0 h7f98852_0 conda-forge libnuma 2.0.16 h0b41bf4_1 conda-forge libopenblas 0.3.23 pthreads_h80387f5_0 conda-forge libprotobuf 4.23.3 hd1fb520_0 conda-forge librmm 23.10.00a cuda11_230807_gcd37245e_9 rapidsai-nightly libsqlite 3.42.0 h2797004_0 conda-forge libssh2 1.11.0 h0841786_0 conda-forge libstdcxx-ng 13.1.0 hfd8a6a1_0 conda-forge libthrift 0.18.1 h8fd135c_2 
conda-forge libutf8proc 2.8.0 h166bdaf_0 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libzlib 1.2.13 hd590300_5 conda-forge llvmlite 0.40.1 py310h1b8f574_0 conda-forge lz4-c 1.9.4 hcb278e6_0 conda-forge matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge ncurses 6.4 hcb278e6_0 conda-forge numba 0.57.1 py310h0f6aa51_0 conda-forge numpy 1.24.4 py310ha4c1d20_0 conda-forge nvcomp 2.6.1 h0800d71_2 conda-forge nvtx 0.2.5 py310h1fa729e_0 conda-forge openssl 3.1.2 hd590300_0 conda-forge orc 1.9.0 h385abfd_1 conda-forge packaging 23.1 pyhd8ed1ab_0 conda-forge pandas 1.5.3 py310h9b08913_1 conda-forge parso 0.8.3 pyhd8ed1ab_0 conda-forge pexpect 4.8.0 pyh1a96a4e_2 conda-forge pickleshare 0.7.5 py_1003 conda-forge pip 23.2.1 pyhd8ed1ab_0 conda-forge prompt-toolkit 3.0.39 pyha770c72_0 conda-forge prompt_toolkit 3.0.39 hd8ed1ab_0 conda-forge protobuf 4.23.3 py310hb875b13_0 conda-forge ptxcompiler 0.8.1 py310h01a121a_0 conda-forge ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge pyarrow 12.0.1 py310h0576679_7_cpu conda-forge pygments 2.16.1 pyhd8ed1ab_0 conda-forge python 3.10.12 hd12c33a_0_cpython conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python_abi 3.10 3_cp310 conda-forge pytz 2023.3 pyhd8ed1ab_0 conda-forge rdma-core 28.9 h59595ed_1 conda-forge re2 2023.03.02 h8c504da_0 conda-forge readline 8.2 h8228510_1 conda-forge rmm 23.10.00a cuda11_py310_230807_gcd37245e_9 rapidsai-nightly s2n 1.3.46 h06160fa_0 conda-forge setuptools 68.0.0 pyhd8ed1ab_0 conda-forge six 1.16.0 pyh6c4a22f_0 conda-forge snappy 1.1.10 h9fff704_0 conda-forge spdlog 1.11.0 h9b3ece8_1 conda-forge stack_data 0.6.2 pyhd8ed1ab_0 conda-forge tk 8.6.12 h27826a3_0 conda-forge traitlets 5.9.0 pyhd8ed1ab_0 conda-forge typing_extensions 4.7.1 pyha770c72_0 conda-forge tzdata 2023c h71feb2d_0 conda-forge ucx 1.14.1 h4a2ce2d_2 conda-forge wcwidth 0.2.6 pyhd8ed1ab_0 conda-forge wheel 0.41.1 pyhd8ed1ab_0 conda-forge xz 5.2.6 h166bdaf_0 conda-forge zstd 1.5.2 hfc55251_7 conda-forge ```
> Perhaps the original issue came out of UVM not working properly rather than a cuDF issue
Any additional information I could provide here to help narrow down the cause? 🙂
Could this be some WSL + UVM bug? CC @harrism
@charlesbluca I asked and it turns out that this is expected on WSL. The UVM support in the Windows Display Driver Model (WDDM) is a limited form of UVM that doesn't support oversubscription (or simultaneous CPU/GPU access). Pages are not migrated on WDDM.
That said, there is a limited form of oversubscription that is supported for regular cudaMalloc calls. Could you do me a favor and try your script with this line commented out so that it uses the default memory resource?
```python
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
```
I kind of expect this to change or move the failure, rather than solve it, but it will be interesting to see.
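For clarity, a minimal sketch of the two setups being compared (the `get_current_device_resource` check is just illustrative and wasn't in the original script):

```python
import rmm

# Managed memory (UVM) - the line under test from the script above:
# rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

# With that line commented out, allocations go through the default
# CudaMemoryResource (plain cudaMalloc), which WDDM can oversubscribe
# in a limited way, unlike managed memory.
print(type(rmm.mr.get_current_device_resource()))
```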
Tested the script below on my WSL2 (Windows 11 / Ubuntu 22.04) machine that has an RTX 4090:
```python
import os

import cupy as cp
import cudf


def generate_random_data(num_rows, num_columns):
    """Generate random numbers using CuPy and return a cuDF DataFrame."""
    # Generate a random CuPy array
    data = cp.random.rand(num_rows, num_columns)
    # Convert to cuDF DataFrame
    df = cudf.DataFrame(data, columns=[f'col_{i}' for i in range(num_columns)])
    return df


def generate_csv_gpu(target_size, filename='cudf_uvm_data_20gb.csv'):
    num_columns = 5
    row_estimate = 1000  # Initial guess for number of rows

    # Generate initial data to estimate bytes per row
    df = generate_random_data(row_estimate, num_columns)
    df.to_csv(filename, index=False)

    # Check file size and adjust
    current_size = os.path.getsize(filename)
    row_size = current_size / row_estimate
    total_rows_needed = int(target_size / row_size)

    # Generate the correct amount of data
    df = generate_random_data(total_rows_needed, num_columns)
    df.to_csv(filename, index=False, chunksize=1000000)

    # Report final file size
    final_size = os.path.getsize(filename)
    print(f"Targeted file size was {target_size} bytes.")
    print(f"Final file size is {final_size} bytes.")


# Usage example
generate_csv_gpu(20000000000)
```

```python
>>> input_data_path = 'cudf_uvm_data_20gb.csv'
>>> df = cudf.read_csv(input_data_path)
>>> len(df)
336309673
```
So no issues with a CSV that's ~18 GB against 24 GB of available GPU memory. In the Windows Task Manager's GPU perf tab we can see it needs to jump into the shared memory pool, but it does so successfully.
So I think this has been resolved at some point in WSL.
I wonder what "shared GPU Memory" is.
Confirmed internally that this means it is oversubscribing. OK, so the behavior with managed memory is expected, but this option is available.
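For anyone landing here later, a rough sketch of deliberately exercising that cudaMalloc oversubscription path through RMM and CuPy (assumes RMM 23.04+, where the `rmm.allocators.cupy` hook exists; the 1.2x sizing is purely illustrative):

```python
import cupy as cp
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

# Default resource, i.e. plain cudaMalloc under the hood - note there
# is no managed_memory=True here.
rmm.reinitialize()
cp.cuda.set_allocator(rmm_cupy_allocator)

free, total = cp.cuda.runtime.memGetInfo()

# Ask for ~1.2x total device memory. On WSL2/WDDM this may succeed by
# spilling into the "shared GPU memory" pool seen in Task Manager; on
# bare-metal Linux a cudaMalloc of this size would simply fail.
arr = cp.random.rand(int(total * 1.2) // 8)  # float64 elements
print(f"{arr.nbytes / 1e9:.1f} GB allocated")
```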
I think we can close this now.
**Describe the bug** While testing cuGraph's UVM notebook, I encountered an OOM error when trying to read a large (~26 GB) CSV dataset on an RTX 8000 (48 GB).
**Steps/Code to reproduce bug** Sorry for the lengthy reproducer - happy to switch over to a more readily available large dataset if possible:
The above fails unless `nrows` is set to something under ~100,000,000.

**Expected behavior** I would expect the CSV dataset to be read in its entirety - the notebook and this code succeed on a standard Linux machine with a V100 32GB (DGX1).
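To make that concrete, a hedged sketch of the failing vs. working calls (the file path is hypothetical; the real dataset was the ~26 GB CSV from the cuGraph notebook):

```python
import cudf

# Hypothetical path standing in for the ~26 GB dataset.
path = "large_dataset.csv"

df = cudf.read_csv(path, nrows=90_000_000)  # under ~100M rows: works
# df = cudf.read_csv(path)                  # full read: OOMs on the RTX 8000
```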
**Environment overview (please complete the following information)**

**Environment details**