Closed: jangorecki closed this issue 3 years ago
Solving this issue allows cudf to process medium-size data (5 GB). AFAIK 50 GB was failing due to OOM (main memory), so I filed a new FR in cudf to handle such cases: https://github.com/rapidsai/cudf/issues/3740
@datametrician I would appreciate any clues about what the problem might be.
Since I switched to using managed memory I have started to get the following error:

```
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure
```

It causes the CUDA driver to hang (I assume); trying to use cudf in another session hangs that session as well. I cannot even kill the process (from the `nvidia-smi` list) using `kill -9`. I also tried `nvidia-smi -r`, but it gives:

```
GPU Reset couldn't run because GPU 00000000:02:00.0 is the primary GPU.
```

The only way out seems to be a hard reboot, which is not an option at the moment.
More complete output, @datametrician:

```
Traceback (most recent call last):
  File "./cudf/join-cudf.py", line 34, in <module>
    x = cu.read_csv(src_jn_x, header=0, dtype=['int32','int32','int32','str','str','str','float64'])
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/io/csv.py", line 82, in read_csv
    index_col=index_col,
  File "cudf/_lib/csv.pyx", line 41, in cudf._lib.csv.read_csv
  File "cudf/_lib/csv.pyx", line 205, in cudf._lib.csv.read_csv
RuntimeError: rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure
```
@jangorecki, can you join the RAPIDS Go-AI Slack channel? We do have a feature for this, `dask_cudf`, and I can show you how to use `dask_cudf` to get around this. Is there a reason why `dask_cudf` is insufficient for your benchmarks? Looking forward to chatting!
@jangorecki this issue looks like managed memory is eating up system memory to the point that the driver context is corrupted, where unfortunately the only option is to restart the machine. UVM only supports spilling to host memory because the migration from host --> GPU occurs via a page-fault mechanism that won't work with disks.
As Taurean pointed out, `dask-cudf` has a different mechanism for managing memory that involves chunking the workload, monitoring memory usage, and spilling from GPU --> host --> disk as needed. If your workload is larger than system memory I would highly recommend using `dask-cudf`.
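The tiered spilling described above (GPU --> host --> disk) can be sketched in plain Python, no GPU required. Everything here, including the `SpillingCache` name and the slot counts, is illustrative and not dask-cudf internals; the point is the mechanism: a small fast tier evicts its oldest chunks to a larger tier, which in turn evicts to disk, and reads fault spilled chunks back in.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class SpillingCache:
    """Toy two-tier cache: a 'device' tier evicts to a 'host' tier, which evicts to disk."""

    def __init__(self, device_slots=2, host_slots=4):
        self.device = OrderedDict()   # stands in for GPU memory
        self.host = OrderedDict()     # stands in for main memory
        self.device_slots = device_slots
        self.host_slots = host_slots
        self.spill_dir = tempfile.mkdtemp(prefix="spill_")

    def put(self, key, chunk):
        self.device[key] = chunk
        self.device.move_to_end(key)
        while len(self.device) > self.device_slots:   # device full: spill oldest to host
            k, v = self.device.popitem(last=False)
            self.host[k] = v
            while len(self.host) > self.host_slots:   # host full: spill oldest to disk
                hk, hv = self.host.popitem(last=False)
                with open(os.path.join(self.spill_dir, str(hk)), "wb") as f:
                    pickle.dump(hv, f)

    def get(self, key):
        if key in self.device:
            return self.device[key]
        if key in self.host:                          # promote host -> device
            chunk = self.host.pop(key)
        else:                                         # fault the chunk back in from disk
            with open(os.path.join(self.spill_dir, str(key)), "rb") as f:
                chunk = pickle.load(f)
        self.put(key, chunk)
        return chunk

cache = SpillingCache()
for i in range(8):                  # 8 chunks, only 2 fit in the "device" tier
    cache.put(i, list(range(i, i + 3)))
print(cache.get(0))                 # chunk 0 was spilled all the way to disk -> [0, 1, 2]
```

In real dask-cuda the eviction is driven by measured memory usage rather than slot counts, but the tiering idea is the same.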
@taureandyernv Thanks for your comment. It is not that dask-cudf is insufficient; I want to use dask-cudf in the benchmarks. The problem is that I found the documentation lacking for my use case (see "dask_cudf.read_csv docstring"). I know I could try to figure it out myself by asking on GH (which I actually did), or by reading existing GH comments, but
@kkraus14 Thanks for your comment. It helps a lot. It is quite bad that it is so easy to corrupt the driver context. IMO that is a good reason to warn users before they enable managed memory in cudf alone, though of course not in dask-cudf, as you explained. Hopefully I will move to dask-cudf soon.
Spilling to main memory cannot be done reliably without using dask-cudf. The currently implemented spilling was rolled back so we can still run the cudf benchmarks. Re-opening this issue to wait for dask-cudf support.
Hey @jangorecki, we use `dask_cudf` and RMM. Neither `dask_cudf` nor `cudf` by itself is designed to spill to main memory. Happy to show you an example
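A sketch of what the recommended dask-cudf setup might look like. `LocalCUDACluster` and `dask_cudf.read_csv` are the real entry points, but the file name, memory limit, and dtype list below are placeholders, and exact defaults vary by release; this requires a GPU, so treat it as a configuration sketch rather than a tested recipe.

```python
# Sketch: dask-cudf on a dask-cuda cluster that spills GPU -> host -> disk.
# The CSV path and memory limit are illustrative, not values from this issue.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# device_memory_limit sets the per-GPU threshold at which dask-cuda starts
# spilling device objects to host memory (and onward to disk if needed).
cluster = LocalCUDACluster(device_memory_limit="4GB")
client = Client(cluster)

df = dask_cudf.read_csv(
    "J1_1e8_NA_0_0.csv",  # placeholder file name
    dtype=["int32", "int32", "int32", "str", "str", "str", "float64"],
)
print(df.head())
```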
@taureandyernv Thanks for trying to help. Although spilling from cudf to main memory works, it is not reliable, because it can corrupt the driver context, after which the whole machine has to be rebooted. So I agree it only makes sense to use it with dask_cudf, which AFAIU is not affected by that issue. Your example is good, but it would be even better if you could contribute it to the cudf repository as documentation. I am now waiting for https://github.com/rapidsai/cudf/issues/2277 and https://github.com/rapidsai/cudf/issues/2288 (you are even mentioned there). If your example does not cover those cases, it won't help much to push this issue forward.
According to a comment in https://github.com/rapidsai/cudf/issues/2288#issuecomment-572290767, one could spill to main memory without actually using dask-cudf. Related: https://github.com/h2oai/db-benchmark/issues/126
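For reference, the cudf-only managed-memory route discussed in that comment might look roughly like this. It requires a GPU, the `rmm.reinitialize` signature has changed across RMM releases, and the file name and dtype list are placeholders, so this is a configuration sketch under those assumptions, not a tested recipe.

```python
# Sketch: enable CUDA managed (unified) memory so cudf allocations can
# oversubscribe the GPU and have pages spill to main memory, without dask-cudf.
# Parameter names follow recent RMM releases and may differ in the versions
# discussed in this thread.
import rmm
import cudf

rmm.reinitialize(managed_memory=True, pool_allocator=False)

# cudf allocations now go through managed memory; pages migrate between
# device and host on demand. Note that host -> GPU migration happens via
# page faults, which is why this scheme cannot extend to disk.
df = cudf.read_csv(
    "J1_1e8_NA_0_0.csv",  # placeholder path
    dtype=["int32", "int32", "int32", "str", "str", "str", "float64"],
)
```

As noted earlier in the thread, this route risks corrupting the driver context when system memory is exhausted, which is why dask-cudf remains the recommended approach.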