h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

cudf should spill to main memory when running out of gpu memory #129

Closed jangorecki closed 3 years ago

jangorecki commented 4 years ago

According to a comment in https://github.com/rapidsai/cudf/issues/2288#issuecomment-572290767, one can spill to main memory without actually using dask-cudf. Related: https://github.com/h2oai/db-benchmark/issues/126
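
A minimal sketch of what that comment suggests, assuming RMM exposes managed memory through rmm.reinitialize (the exact signature has varied across RMM releases):

import rmm
import cudf

# Switch RMM to CUDA managed (unified) memory so device allocations can be
# paged between GPU and host by the driver. Hedged: parameter names for
# rmm.reinitialize have changed across RMM releases.
rmm.reinitialize(managed_memory=True)

# A read larger than physical GPU memory may now succeed, at the cost of
# page-migration overhead. "data.csv" is a hypothetical file.
df = cudf.read_csv("data.csv")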

jangorecki commented 4 years ago

Solving this issue allows cudf to compute the medium-size data (5 GB). AFAIK 50 GB was failing due to OOM (main memory), so I filed a new FR in cudf to handle such cases: https://github.com/rapidsai/cudf/issues/3740

jangorecki commented 4 years ago

@datametrician I would appreciate it if you have any clues about what might be the problem.

Since I switched to using managed memory, I have started to get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

It causes the CUDA driver to hang (I assume); trying to use cudf in another session hangs that session as well. I cannot even kill the process (listed by nvidia-smi) using kill -9. I also tried nvidia-smi -r, but it gives:

GPU Reset couldn't run because GPU 00000000:02:00.0 is the primary GPU.

The only way out seems to be a hard reboot, which is not an option at the moment.

jangorecki commented 4 years ago

More complete output, @datametrician:

Traceback (most recent call last):
  File "./cudf/join-cudf.py", line 34, in <module>
    x = cu.read_csv(src_jn_x, header=0, dtype=['int32','int32','int32','str','str','str','float64'])
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/io/csv.py", line 82, in read_csv
    index_col=index_col,
  File "cudf/_lib/csv.pyx", line 41, in cudf._lib.csv.read_csv
  File "cudf/_lib/csv.pyx", line 205, in cudf._lib.csv.read_csv
RuntimeError: rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure

taureandyernv commented 4 years ago

@jangorecki, can you join the RAPIDS Go-AI Slack channel? We do have a feature for this, dask_cudf, and I can show you how to use it to get around this. Is there a reason why dask_cudf is insufficient for your benchmarks? Looking forward to chatting!

kkraus14 commented 4 years ago

@jangorecki this issue looks like managed memory eating up the system memory to the point that the driver context is corrupted, where unfortunately the only option is to restart the machine. UVM only supports spilling to host memory because the migration from host --> GPU occurs via a page-fault mechanism that won't work with disks.

As Taurean pointed out, dask-cudf has a different mechanism for managing memory that involves chunking the workload, monitoring memory usage, and spilling from GPU --> host --> disk as needed. If your workload is larger than system memory, I would highly recommend using dask-cudf.
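
A minimal sketch of that setup, assuming dask-cuda is installed; the 4 GB threshold and the file pattern are hypothetical:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# One worker per GPU; once a worker's device memory use crosses the
# threshold below, Dask spills chunks GPU -> host (-> disk if needed).
cluster = LocalCUDACluster(device_memory_limit="4GB")
client = Client(cluster)

# dask_cudf partitions the file, so the whole dataset never has to fit
# in GPU memory at once.
ddf = dask_cudf.read_csv("data-*.csv")
print(ddf.head())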

jangorecki commented 4 years ago

@taureandyernv Thanks for your comment. It is not that dask-cudf is insufficient; I want to use dask-cudf in the benchmarks. The problem is that I found the documentation lacking for my use case (see the dask_cudf.read_csv docstring; a sketch of the call in question follows the list below). I know I could try to figure it out myself by asking on GH (which I actually did), or by reading existing GH comments, but

  1. it takes much more time than just reading the documentation,
  2. it is not guaranteed to succeed, as some parts might not have been implemented yet,
  3. the API is not guaranteed to be stable; everything that is not in the documentation should be considered subject to change without notice, and adapting code to changes that don't have to be listed in the changelog as breaking changes is even more time-consuming.
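
For concreteness, a sketch of the use case in question, mirroring the cudf.read_csv call from the traceback above; whether dask_cudf.read_csv accepts the same arguments is exactly the kind of detail I could not find documented:

import dask_cudf

# Hypothetical file name; the dtype list mirrors the cudf.read_csv call
# from the traceback. Assumption: dask_cudf.read_csv forwards these
# keyword arguments to cudf.read_csv.
ddf = dask_cudf.read_csv(
    "join-data.csv",
    header=0,
    dtype=["int32", "int32", "int32", "str", "str", "str", "float64"],
)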

@kkraus14 Thanks for your comment. It helps a lot. It is quite bad that it is so easy to corrupt the driver context. IMO that is a good reason to warn users before they use managed memory with cudf alone (but of course not with dask-cudf, as you explained). Hopefully I will move to dask-cudf soon.

jangorecki commented 4 years ago

Spilling to main memory cannot be done reliably without using dask-cudf. The currently implemented spilling was rolled back so we can still run the cudf benchmarks. Re-opening this issue to wait for dask-cudf support.

taureandyernv commented 4 years ago

Hey @jangorecki, we use dask_cudf and RMM. Neither dask_cudf nor cudf by itself is designed to spill to main memory. Happy to show you an example.
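
For the record, a sketch of that combination, assuming a dask-cuda version whose cluster constructor exposes RMM options (parameter names have moved between releases):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Hedged: rmm_pool_size asks each GPU worker to pre-allocate an RMM memory
# pool, while device_memory_limit sets the spill threshold; both values
# here are hypothetical.
cluster = LocalCUDACluster(rmm_pool_size="8GB", device_memory_limit="4GB")
client = Client(cluster)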

jangorecki commented 4 years ago

@taureandyernv Thanks for trying to help. Although spilling cudf to main memory works, it is not reliable, because it can corrupt the driver context, after which the whole machine has to be rebooted. So I agree it only makes sense to use it with dask_cudf, which AFAIU is not affected by that issue. Your example is good, but it would be even better if you could contribute it to the cudf repository as documentation. I am now waiting for https://github.com/rapidsai/cudf/issues/2277 and https://github.com/rapidsai/cudf/issues/2288 (you are even mentioned there). If your example does not cover those cases, it won't help much to push this issue forward.