This seems to be an issue in general; here is one more repro, possibly a similar scenario in cudf Python as #4826. @shwina, have you come across this issue before?
import rmm
import cudf

# Reinitialize RMM with a managed-memory pool allocator
rmm.reinitialize(
    managed_memory=True,
    pool_allocator=True,
    initial_pool_size=2 << 27
)
assert rmm.is_initialized()

# Any cudf allocation made after the reinit triggers the crash at shutdown
df = cudf.DataFrame({"a": [1]})
(cudf_dev2) rgsl888@onepiece:~/Projects/backup/cudf$ cuda-memcheck python sam.py
========= CUDA-MEMCHECK
========= Program hit cudaErrorCudartUnloading (error 4) due to "driver shutting down" on CUDA API call to cudaGetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so [0x3ac5a3]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../libcudart.so.10.1 (cudaGetDevice + 0x186) [0x4b416]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN5cnmem7ContextD1Ev + 0x2b) [0x568fb]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN5cnmem7Context7releaseEv + 0x56) [0x56b56]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN3rmm2mr29cnmem_managed_memory_resourceD0Ev + 0x20) [0x37e90]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN3rmm7Manager8finalizeEv + 0x20f) [0x2ce9f]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_Z11rmmFinalizev + 0x24) [0x27f54]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/lib.cpython-37m-x86_64-linux-gnu.so [0x7b96]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x43041]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x4313a]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xee) [0x21b9e]
========= Host Frame:python [0x1dcbd0]
=========
========= Program hit cudaErrorCudartUnloading (error 4) due to "driver shutting down" on CUDA API call to cudaSetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so [0x3ac5a3]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../libcudart.so.10.1 (cudaSetDevice + 0x180) [0x4b5b0]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN5cnmem7ContextD1Ev + 0x4d) [0x5691d]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN5cnmem7Context7releaseEv + 0x56) [0x56b56]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN3rmm2mr29cnmem_managed_memory_resourceD0Ev + 0x20) [0x37e90]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_ZN3rmm7Manager8finalizeEv + 0x20f) [0x2ce9f]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/../../../../librmm.so (_Z11rmmFinalizev + 0x24) [0x27f54]
========= Host Frame:/home/rgsl888/anaconda3/envs/cudf_dev2/lib/python3.7/site-packages/rmm/_lib/lib.cpython-37m-x86_64-linux-gnu.so [0x7b96]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x43041]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x4313a]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xee) [0x21b9e]
========= Host Frame:python [0x1dcbd0]
@rlratzel The issue in the example is that rmm is being reinitialized without informing cudf that its memory has been released/reacquired, so cudf still holds the unique_ptr it was given, which normally gets flushed out later. After the rmm re-initialization, cudf and rmm are out of sync, which causes this crash. It would be better to allocate the pool as a module-level fixture rather than reinitializing it every time.
For example, a modified test case:
import rmm
import cudf

def add_edge_list_to_adj_list(graph_file):
    print('Reading ' + str(graph_file) + '...')
    return cudf.read_csv(graph_file, delimiter=' ',
                         dtype=['int32', 'int32', 'float32'], header=None)

if __name__ == "__main__":
    # ds = "./datasets/netscience.csv"
    rmm.reinitialize(
        managed_memory=True,
        pool_allocator=True,
        initial_pool_size=2 << 27
    )
    assert rmm.is_initialized()
    ds = "./dolphins.csv"
    cu_M = add_edge_list_to_adj_list(ds)
    cu_M = add_edge_list_to_adj_list(ds)
    cu_M = add_edge_list_to_adj_list(ds)
    cu_M = add_edge_list_to_adj_list(ds)
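For pytest-based suites, a minimal sketch of what such a module-scoped fixture might look like; the fixture name and autouse behavior here are illustrative assumptions, not the project's actual test setup:

import pytest
import rmm

@pytest.fixture(scope="module", autouse=True)
def rmm_pool():
    # Hypothetical fixture: set up the RMM pool once per test module
    # instead of reinitializing inside every test.
    rmm.reinitialize(
        managed_memory=True,
        pool_allocator=True,
        initial_pool_size=2 << 27
    )
    assert rmm.is_initialized()
    yield
    # No explicit teardown: tests should drop their cudf objects
    # before the module finishes so nothing outlives the pool.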
Closing as not an issue.
Thanks for the explanation @rgsl888prabhu.
Our tests normally pass parameters to the re-init call so we can test cugraph under different RMM configurations (managed and pool True/False combinations), which is why it's there. For my own knowledge, is there a way we can manually inform cudf in a teardown that the unique_ptr has been invalidated (or just a safe way to reinit with different params)? Or is this simply unsupported? (I'm fine with that answer too.) Otherwise, we're considering removing the parameterization of the RMM config and testing with just one configuration.
You can explicitly del the residual elements before moving on to the next test. I don't know of any other way to trigger garbage collection.
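For instance, a sketch of explicit cleanup between test configurations; the gc.collect() call is an assumption added here, as the reply above only mentions del:

import gc
import rmm
import cudf

rmm.reinitialize(managed_memory=True, pool_allocator=True)
df = cudf.DataFrame({"a": [1]})
# ... run the test with this configuration ...

# Drop every cudf object holding an RMM allocation before
# reinitializing, so no stale unique_ptr outlives the old pool.
del df
gc.collect()  # assumption: force collection of lingering references

# Only then reinitialize with different parameters.
rmm.reinitialize(managed_memory=False, pool_allocator=False)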
Describe the bug
cugraph has an issue where a failing pytest prevents proper cleanup from happening, resulting in a series of calls I believe I recreated in the Python script below. When that series of calls is made, Python crashes with a seg fault.
Steps/Code to reproduce bug
Run this script:
with this CSV: https://github.com/rapidsai/cugraph/blob/branch-0.14/datasets/dolphins.csv
Expected behavior
No seg fault
Environment overview (please complete the following information)
0.14 nightly devel Docker image from the night of May 2 (dated early May 3).