Closed arpan-das-astrophysics closed 1 year ago
@arpan-das-astrophysics Can you offer information on your GPU and its memory size? Importing cudf loads some code onto the GPU that takes some space; I wonder if that is related. Do you see the same behavior for inputs smaller than 1000x1000x1000? Finally, can you add `del A; gc.collect()` at the end of the benchmark loop?
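Since the original benchmark code is not shown in the thread, here is a minimal sketch of what such a loop with the suggested cleanup might look like (the helper name, array shape, and use of `cupy.zeros` are assumptions):

```python
import gc
import time

def bench(make_array, n_iter=10):
    """Time n_iter array constructions, freeing each array between runs."""
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        a = make_array()
        times.append(time.perf_counter() - t0)
        # Suggested cleanup: drop the array and force garbage collection
        # so memory is returned before the next iteration is timed.
        del a
        gc.collect()
    return times

# On a GPU this might be driven like (hypothetical shape from the thread):
#   import cupy
#   for t in bench(lambda: cupy.zeros((200, 200, 200))):
#       print(f"Time: {t:.6f}")
```

The first iteration typically includes one-time setup cost, so comparisons should focus on the steady-state iterations.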
Yes, sure. It is an A100 GPU with 40GB of memory. For a smaller array, e.g. 200x200x200, the times are still different:
### BENCH no cudf

```
Time: 0.298966 12.346
Time: 0.000226 13.346
Time: 0.000080 14.346
Time: 0.000075 15.346
Time: 0.000068 16.346
Time: 0.000079 17.346
Time: 0.000065 18.346
Time: 0.000063 19.346
Time: 0.000062 20.346
Time: 0.000062 21.346
```

### BENCH cudf

```
Time: 0.943010 12.346
Time: 0.001211 13.346
Time: 0.001048 14.346
Time: 0.001061 15.346
Time: 0.001066 16.346
Time: 0.001065 17.346
Time: 0.001058 18.346
Time: 0.001059 19.346
Time: 0.001055 20.346
Time: 0.001059 21.346
```
I used `del A; gc.collect()` and it didn't change anything.
I suspect the difference comes from setting the RMM allocator: https://github.com/rapidsai/cudf/blob/d49e4123dd329d65f067deb4ffd8b100d84cf46a/python/cudf/cudf/__init__.py#L95
Can you compare a case where you don't import cudf, but only set the RMM allocator? https://docs.rapids.ai/api/rmm/stable/basics.html#using-rmm-with-cupy
```python
import rmm
import cupy

cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```
If that shows a difference, you might try enabling the pool allocator:
```python
import rmm

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2**30,
    maximum_pool_size=2**32,
)
rmm.mr.set_current_device_resource(pool)
```
https://docs.rapids.ai/api/rmm/stable/basics.html#memoryresource-objects
CuPy uses a memory pool by default. Using the RMM pool should resolve this issue.
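To illustrate why a pool makes repeated allocations fast, here is a toy pure-Python sketch of the caching idea (this is an illustration of the concept, not CuPy's or RMM's actual implementation): freed buffers are kept and handed back on the next request of the same size, so only the first allocation pays the slow raw-allocation cost.

```python
class ToyPool:
    """Toy sketch of a caching memory pool (conceptual only)."""

    def __init__(self):
        self.free_blocks = {}  # size -> list of cached buffers
        self.raw_allocs = 0    # counts slow "driver" allocations

    def alloc(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            # Fast path: reuse a previously freed buffer of this size.
            return cached.pop()
        # Slow path: analogous to a real cudaMalloc call.
        self.raw_allocs += 1
        return bytearray(size)

    def free(self, buf):
        # Keep the buffer for reuse instead of returning it to the system.
        self.free_blocks.setdefault(len(buf), []).append(buf)
```

This mirrors the benchmark pattern above: one slow first iteration (a raw allocation), then fast steady-state iterations served from the cache.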
Indeed, that was the issue. Enabling the pool allocator resolves the problem. Thank you so much. Is there a way to determine the `initial_pool_size` and `maximum_pool_size` based on the GPU memory? How do I know which values to use?
I don't think there are any standard rules. These two parameters are optional; I would just start with the defaults and go from there. https://docs.rapids.ai/api/rmm/stable/api.html#rmm.mr.PoolMemoryResource
Hello, just importing cudf makes the creation and operation of CuPy arrays a lot slower. For example, I used the following code:
And the output I am getting:
I am converting an entire CPU-based code to GPU code using cudf, but it was making the code even slower; while debugging it, we found that this is the issue.