Open xk-huang opened 1 year ago
I tried building from source with all the `cuda_check()` calls deleted, but I still encountered the issue "RuntimeError: CUDA error: an illegal memory access was encountered".
My build command is `CC=gcc-8 CXX=g++-8 pip install .`, since building directly with `pip install .` failed.
There is also a leak reported by `nanobind`:
nanobind: leaked 4 instances!
nanobind: leaked 2 types!
- leaked type "CholeskySolverF"
- leaked type "MatrixType"
nanobind: leaked 2 functions!
- leaked function "solve"
- leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.
Hi,
Thanks for the report. I haven't tested `cholespy` on multi-GPU setups, so it's possible that memory allocation is broken there.
Deleting the `cuda_check` calls is absolutely not going to solve your issue, as this is a wrapper that analyses the return code of CUDA API calls and generates those error messages.
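To illustrate why removing such checks only hides the failure, here is a minimal Python analogue of a status-checking wrapper. It is not cholespy's actual code: `check_status`, `fake_cuda_call`, and the status values are hypothetical names chosen for this sketch, assuming only that a zero status means success.

```python
SUCCESS = 0  # assumption: a zero status code means the call succeeded


def check_status(status: int, call_name: str) -> None:
    """Raise an error if a (simulated) API call reported a failure."""
    if status != SUCCESS:
        raise RuntimeError(f"CUDA error: {call_name} returned status {status}")


def fake_cuda_call(should_fail: bool) -> int:
    """Hypothetical stand-in for a CUDA API call that returns a status code."""
    return 700 if should_fail else SUCCESS


def run(checked: bool) -> str:
    status = fake_cuda_call(should_fail=True)
    if checked:
        # With the check, the failure is reported right away
        check_status(status, "fake_cuda_call")
    # Without the check, the failure is still there; it is just ignored
    return "continued despite failure"
```

In this sketch, `run(checked=False)` keeps going after the failed call, which mirrors what happens when the checks are deleted: the underlying problem remains and surfaces later.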
From what you described, it sounds like `cholespy` only uploads data to GPU 0, so the other ones can't access it. That makes sense since the module initializes the CUDA context on device 0: https://github.com/rgl-epfl/cholespy/blob/485f82c385c188f8f0e87757580fe43e952c4d2d/src/cuda_driver.cpp#L124
As a sanity check, if you can control the number of GPUs on which you run your code, could you try setting it to 1 and see if it works then?
It would also be helpful to have a minimal reproducer (if possible) to try to reproduce the issue on my end.
Resolving the multiple GPU case will require a few API changes to allow the user to explicitly specify a device. You would then be able to specify it for each thread in your parallel job.
In the meantime, you should be able to work around this issue by masking out available devices in each worker process via the `CUDA_VISIBLE_DEVICES` environment variable:
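import os
# Set this before cholespy (or any other CUDA library) initializes CUDA in the process
os.environ['CUDA_VISIBLE_DEVICES'] = str(device_id)  # Values must be strings; only the desired device stays visible
# ...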
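As a rough sketch of how this workaround could be wired into a parallel job, the helper below assigns each job a device in round-robin order and sets the variable before any CUDA-dependent import. The names `pin_process_to_gpu`, `worker`, `job_index`, and `num_gpus` are hypothetical and not part of cholespy or joblib.

```python
import os


def pin_process_to_gpu(device_id: int) -> str:
    """Restrict the current process to a single GPU.

    Must run before any CUDA library is initialized in this process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def worker(job_index: int, num_gpus: int) -> str:
    """Hypothetical job body: choose a GPU round-robin, then do the work."""
    device_id = job_index % num_gpus
    visible = pin_process_to_gpu(device_id)
    # Import CUDA-dependent modules (e.g. cholespy) here, after the variable is set
    return visible
```

With joblib, something like `Parallel(n_jobs=num_gpus)(delayed(worker)(i, num_gpus) for i in range(n))` would call this in each worker. Note that the `loky` backend reuses worker processes, so the variable only has an effect if it is set before CUDA is first initialized in that process.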
When I used `joblib.Parallel` with the `loky` backend to launch multiple jobs in parallel, the following error occurred:

Also, the GPU memory allocation was strange: multiple processes allocated memory on GPU 0.

![image](https://user-images.githubusercontent.com/33593707/187580934-e82224ff-0da1-4193-a78d-a6f5a82b88d5.png)
I tried to delete the corresponding code, but it did not work :cry:. Would you mind giving any suggestions? Thanks in advance!