rgl-epfl / cholespy

An easily integrable Cholesky solver on CPU and GPU
BSD 3-Clause "New" or "Revised" License

[BUG REPORT] "an illegal memory access was encountered" and "nanobind leak" #10

Open xk-huang opened 1 year ago

xk-huang commented 1 year ago

When I used joblib.Parallel with the loky backend to launch multiple jobs in parallel, the following error occurred:

cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:473.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:474.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:475.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:476.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:477.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:478.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:479.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:480.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:481.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:482.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:483.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:484.

Also, the GPU memory allocation was strange: multiple processes allocated memory on GPU 0. [screenshot of GPU memory usage attached]

I tried deleting the corresponding code, but it did not work :cry:. Would you mind giving me any suggestions? Thanks in advance!

xk-huang commented 1 year ago

I tried building from source with all the cuda_check() calls deleted, but I still encountered the same issue: "RuntimeError: CUDA error: an illegal memory access was encountered".

My build command was CC=gcc-8 CXX=g++-8 pip install ., since building directly with pip install . failed.

xk-huang commented 1 year ago

There is also a leak issue reported by nanobind:

nanobind: leaked 4 instances!
nanobind: leaked 2 types!
 - leaked type "CholeskySolverF"
 - leaked type "MatrixType"
nanobind: leaked 2 functions!
 - leaked function "solve"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

bathal1 commented 1 year ago

Hi,

Thanks for the report. I haven't tested cholespy on multi-GPU setups, so it's possible that memory allocation is broken there.

Deleting the cuda_check calls is absolutely not going to solve your issue: cuda_check is just a wrapper that checks the return code of CUDA API calls and generates those error messages.

From what you describe, it sounds like cholespy only uploads data to GPU 0, so the other devices can't access it. That makes sense, since the module initializes the CUDA context on device 0: https://github.com/rgl-epfl/cholespy/blob/485f82c385c188f8f0e87757580fe43e952c4d2d/src/cuda_driver.cpp#L124

As a sanity check, if you can control the number of GPUs your code runs on, could you try setting it to 1 and seeing if it works then?

It would also be helpful to have a minimal reproducer (if possible), so I can try to reproduce the issue on my end.
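
Even something as simple as the identity-matrix example from the README, run once per joblib worker, would already be very useful. A rough, untested sketch of what I have in mind (adapted from the README usage of CholeskySolverF and MatrixType, wrapped in loky workers as in your report):

import torch
from joblib import Parallel, delayed
from cholespy import CholeskySolverF, MatrixType

def run_once(job_idx):
    # Identity matrix in COO format, as in the README example
    n_rows = 20
    rows = torch.arange(n_rows, device="cuda")
    cols = torch.arange(n_rows, device="cuda")
    data = torch.ones(n_rows, device="cuda")

    solver = CholeskySolverF(n_rows, rows, cols, data, MatrixType.COO)

    # Solve against a batch of right-hand sides
    b = torch.ones(n_rows, 32, device="cuda")
    x = torch.zeros_like(b)
    solver.solve(b, x)
    return job_idx

# Several workers in parallel with the loky backend, as in the original report
Parallel(n_jobs=4, backend="loky")(delayed(run_once)(i) for i in range(8))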

bathal1 commented 1 year ago

Properly resolving the multi-GPU case will require a few API changes to let the user explicitly specify a device; you would then be able to set it for each worker in your parallel job.

In the meantime, you should be able to work around the issue by restricting the devices visible to each worker via the CUDA_VISIBLE_DEVICES environment variable:

import os

# Mark only the desired device as visible to this process. This must run before
# any CUDA context is created, i.e. before the solver is constructed; str() is
# there in case device_id is an integer.
os.environ['CUDA_VISIBLE_DEVICES'] = str(device_id)

# ...
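
For reference, a rough (untested) sketch of how that could look with joblib's loky backend; the important part is that the variable is set inside each worker, before the solver is constructed in that process:

import os
from joblib import Parallel, delayed

def run_job(job_idx, device_id):
    # Must happen before any CUDA context is created in this worker process,
    # i.e. before constructing the CholeskySolver.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)

    # ... build the system and call the solver here, e.g. as in the sketch above ...
    return job_idx

n_gpus = 4  # adjust to the number of GPUs on your machine
Parallel(n_jobs=n_gpus, backend="loky")(
    delayed(run_job)(i, i % n_gpus) for i in range(8)
)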