rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] terminate called after throwing an instance of 'rmm::bad_alloc' what(): std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory #6772

Closed stromal closed 3 years ago

stromal commented 3 years ago

I am using Jupyter Lab on a pre-configured EC2 g4dn.4xlarge instance (64 GB RAM, 16 cores, NVIDIA T4 GPU).

I am just loading a 3 GB CSV with the following dimensions: 3 million rows, 500 columns.

import cudf

dataset = cudf.read_csv('data.csv')
dataset.head()

It prints out the head correctly, but when I run the next cell the whole kernel restarts: the cell execution count goes back to 1, so the dataset I loaded in the previous cell is gone because the whole environment restarted. There are no error messages in Jupyter Lab at all; I only see the following error message in the terminal:

[I 10:45:11.373 LabApp] Starting buffering for VERY_LONG_ID_HERE
[I 10:45:11.514 LabApp] Restoring connection for VERY_LONG_ID_HERE
terminate called after throwing an instance of 'rmm::bad_alloc'
  what():  std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
[I 10:45:53.221 LabApp] KernelRestarter: restarting kernel (1/5), keep random ports
stromal commented 3 years ago

I have solved it by strictly defining the data types:

col_types = [
    'uint8',
    # ... one entry per column ...
    'float64',
    'float64']

The smaller the type you can get away with, the better. Then load it in like this:

dataset = cudf.read_csv('data.csv', dtype=col_types)
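
For reference, a minimal sketch of how such a dtype list could be built programmatically rather than typed out by hand: read a small sample, inspect the inferred types, and map them to narrower ones. The nrows sample size and the target types ('float32', 'int32', 'str') are assumptions here, not part of the workaround above.

import cudf

# Read a small sample to discover column names and inferred types.
sample = cudf.read_csv('data.csv', nrows=10_000)

col_types = {}
for name, dtype in sample.dtypes.items():
    if dtype.kind == 'f':
        col_types[name] = 'float32'  # half the memory of float64
    elif dtype.kind == 'i':
        col_types[name] = 'int32'    # or 'uint8'/'int16' if the value range allows it
    else:
        col_types[name] = 'str'      # assumption: treat non-numeric columns as plain strings

# Load the full file with the narrower types (read_csv also accepts a dict of column -> dtype).
dataset = cudf.read_csv('data.csv', dtype=col_types)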

beckernick commented 3 years ago

3 million rows x 500 columns of float64 values would be 12 GB in memory (3e6 * 500 * 8 / 1e9). Glad you've got things working, but it's possible your data is larger in memory than it seems on disk.
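
For concreteness, the same back-of-the-envelope estimate written out in code (a small sketch using the figures quoted in this thread; the T4's 16 GB of GPU memory is the only number not already mentioned above):

rows = 3_000_000
cols = 500
bytes_per_float64 = 8

gpu_bytes = rows * cols * bytes_per_float64
print(f"~{gpu_bytes / 1e9:.0f} GB needed as float64")  # ~12 GB, tight on a 16 GB T4 once parsing overhead is added
print(f"~{gpu_bytes / 2 / 1e9:.0f} GB as float32")     # ~6 GB after downcasting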

stromal commented 3 years ago

> 3 million rows x 500 columns of float64 values would be 12 GB in memory (3e6 * 500 * 8 / 1e9). Glad you've got things working, but it's possible your data is larger in memory than it seems on disk.

Is there a library that measures/calculates data sizes?
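
One rough way to do this (a sketch, not an answer from the thread): cuDF, like pandas, exposes DataFrame.memory_usage(), so you can load a sample and extrapolate to the full row count. The sample size below is an assumption; the 3 million row count comes from the original post.

import cudf

# Load a sample and measure its in-memory footprint.
sample = cudf.read_csv('data.csv', nrows=100_000)
sample_bytes = sample.memory_usage(deep=True).sum()

# Extrapolate to the full dataset.
total_rows = 3_000_000
estimated_gb = sample_bytes / len(sample) * total_rows / 1e9
print(f"Estimated full in-memory size: {estimated_gb:.1f} GB")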