rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Importing cudf changes the time taken for CuPy array operations #12222

Closed arpan-das-astrophysics closed 1 year ago

arpan-das-astrophysics commented 1 year ago

Hello, just importing cudf makes the creation and manipulation of CuPy arrays a lot slower. For example, I used the following code:

import cupy as cp
from time import perf_counter

def bench():
    for i in range(10):
        tic = perf_counter()
        A = cp.ones((1000, 1000, 1000))  # ~8 GB of float64 ones on the GPU
        A = A * (i + 12.3456)
        toc = perf_counter()
        print(f"Time: {toc - tic:.6f} {A[999, 999, 999]:.3f}")

print("### BENCH no cudf")
bench()

print("### BENCH cudf")
import cudf
bench() 

And the output I am getting:

### BENCH no cudf
Time: 0.325103 12.346
Time: 0.000231 13.346
Time: 0.000088 14.346
Time: 0.000083 15.346
Time: 0.000074 16.346
Time: 0.000072 17.346
Time: 0.000072 18.346
Time: 0.000069 19.346
Time: 0.000070 20.346
Time: 0.000069 21.346
### BENCH cudf
Time: 0.997569 12.346
Time: 0.523170 13.346
Time: 0.448860 14.346
Time: 0.448881 15.346
Time: 0.448889 16.346
Time: 0.449504 17.346
Time: 0.449307 18.346
Time: 0.448872 19.346
Time: 0.448907 20.346
Time: 0.449003 21.346

I am converting an entire CPU-based code to GPU code using cudf, but it was making the code even slower, and while debugging we found that this is the issue.
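
A side note on reading these numbers: CuPy kernels launch asynchronously, so a variant of the benchmark that synchronizes before reading the clock (a sketch only; bench_synced is not part of the original report) measures the full GPU work per iteration rather than just allocation and launch overhead.

import cupy as cp
from time import perf_counter

def bench_synced():
    for i in range(10):
        tic = perf_counter()
        A = cp.ones((1000, 1000, 1000))
        A = A * (i + 12.3456)
        cp.cuda.Device().synchronize()  # wait for the GPU work to finish before timing
        toc = perf_counter()
        print(f"Time: {toc - tic:.6f} {A[999, 999, 999]:.3f}")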

bdice commented 1 year ago

@arpan-das-astrophysics Can you offer information on your GPU and its memory size? Importing cudf will load some code onto the GPU that takes some space. I wonder if that is related. Do you see the same behavior for smaller inputs than 1000x1000x1000? Finally, can you add del A; gc.collect() at the end of the benchmark loop?
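
For reference, a minimal sketch of the loop with that cleanup added (same benchmark as above; bench_with_cleanup is just an illustrative name):

import gc
import cupy as cp
from time import perf_counter

def bench_with_cleanup():
    for i in range(10):
        tic = perf_counter()
        A = cp.ones((1000, 1000, 1000))
        A = A * (i + 12.3456)
        toc = perf_counter()
        print(f"Time: {toc - tic:.6f} {A[999, 999, 999]:.3f}")
        # Suggested addition: drop the array and collect so the memory is
        # released before the next iteration.
        del A
        gc.collect()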

arpan-das-astrophysics commented 1 year ago

> @arpan-das-astrophysics Can you offer information on your GPU and its memory size? Importing cudf will load some code onto the GPU that takes some space. I wonder if that is related. Do you see the same behavior for smaller inputs than 1000x1000x1000? Finally, can you add del A; gc.collect() at the end of the benchmark loop?

Yes, sure. It is an A100 GPU with 40 GB of memory. For a smaller array size, e.g. 200x200x200, the times are still different:

### BENCH no cudf
Time: 0.298966 12.346
Time: 0.000226 13.346
Time: 0.000080 14.346
Time: 0.000075 15.346
Time: 0.000068 16.346
Time: 0.000079 17.346
Time: 0.000065 18.346
Time: 0.000063 19.346
Time: 0.000062 20.346
Time: 0.000062 21.346
### BENCH cudf
Time: 0.943010 12.346
Time: 0.001211 13.346
Time: 0.001048 14.346
Time: 0.001061 15.346
Time: 0.001066 16.346
Time: 0.001065 17.346
Time: 0.001058 18.346
Time: 0.001059 19.346
Time: 0.001055 20.346
Time: 0.001059 21.346 

I used del A; gc.collect() and it didn't change anything.

bdice commented 1 year ago

I suspect the difference comes from setting the RMM allocator: https://github.com/rapidsai/cudf/blob/d49e4123dd329d65f067deb4ffd8b100d84cf46a/python/cudf/cudf/__init__.py#L95

Can you compare a case where you don't import cudf, but only set the RMM allocator? https://docs.rapids.ai/api/rmm/stable/basics.html#using-rmm-with-cupy

import rmm
import cupy
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)

bdice commented 1 year ago

If that shows a difference, you might try enabling the pool allocator:

import rmm
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2**30,
    maximum_pool_size=2**32
)
rmm.mr.set_current_device_resource(pool)

https://docs.rapids.ai/api/rmm/stable/basics.html#memoryresource-objects
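
Putting the two snippets together, a minimal sketch of the combined setup (using the rmm.rmm_cupy_allocator hook from the RMM docs linked above):

import cupy
import rmm

# Serve allocations from a pre-reserved pool instead of calling
# cudaMalloc/cudaFree on every CuPy allocation.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2**30,   # reserve 1 GiB up front
    maximum_pool_size=2**32,   # allow the pool to grow to 4 GiB
)
rmm.mr.set_current_device_resource(pool)

# Route CuPy allocations through RMM so they also come from the pool.
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)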

beckernick commented 1 year ago

CuPy uses a memory pool by default. Using the RMM pool should resolve this issue.
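
For context, a small sketch of what this means: cupy.get_default_memory_pool() exposes the caching pool CuPy uses out of the box, which importing cudf effectively swaps (via the allocator hook above) for a non-pooled RMM resource, so every allocation goes back to cudaMalloc.

import cupy as cp

# CuPy's default allocator is a caching memory pool: repeated allocations
# of the same size are served from cached blocks without new cudaMalloc calls.
pool = cp.get_default_memory_pool()

a = cp.ones((1000, 1000))
print(pool.used_bytes(), pool.total_bytes())

del a
# The block stays cached in the pool for reuse rather than being freed.
print(pool.used_bytes(), pool.total_bytes())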

arpan-das-astrophysics commented 1 year ago

> If that shows a difference, you might try enabling the pool allocator:
>
> import rmm
> pool = rmm.mr.PoolMemoryResource(
>     rmm.mr.CudaMemoryResource(),
>     initial_pool_size=2**30,
>     maximum_pool_size=2**32
> )
> rmm.mr.set_current_device_resource(pool)
>
> https://docs.rapids.ai/api/rmm/stable/basics.html#memoryresource-objects

Indeed, that was the issue. Enabling the pool allocator resolves the problem. Thank you so much. Is there a way to determine the initial_pool_size and maximum_pool_size based on the GPU memory? How do I know which value to use?

davidwendt commented 1 year ago

> Is there a way to determine the initial_pool_size and maximum_pool_size based on the GPU memory? How do I know which value to use?

I don't think there are any standard rules. These two parameters are optional. I would just start with the defaults and go from there: https://docs.rapids.ai/api/rmm/stable/api.html#rmm.mr.PoolMemoryResource
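
If you do want to derive the sizes from the device rather than use the defaults, one possible approach (an illustration only; the fractions below are arbitrary choices, not a cuDF or RMM recommendation) is to query free memory and pass a fraction of it:

import cupy
import rmm

def align_down(nbytes, alignment=256):
    # Keep pool sizes a multiple of 256 bytes, which RMM's pool allocator expects.
    return (nbytes // alignment) * alignment

# Free and total device memory in bytes, via the CUDA runtime.
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=align_down(free_bytes // 2),       # e.g. half of free memory up front
    maximum_pool_size=align_down(free_bytes * 9 // 10),  # allow growth to ~90% of free memory
)
rmm.mr.set_current_device_resource(pool)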