omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

GPU memory management issue when running multi-GPU code #146

Status: Closed (AlexandreChern closed this issue 3 months ago)

AlexandreChern commented 4 months ago

This is a great package for multi-GPU code in Julia. I've been experimenting with it on several examples to learn how things work. I was able to run most tests successfully, but I hit the error below when running this multi-GPU example, and I am trying to understand why. Is it related to CUDA.jl, or is it because I cannot manage the device buffer without being a system administrator?

alexandrechen@saturn:~/code-samples/ParallelStencil.jl/examples$ julia --project=../ diffusion3D_multigpucpu_novis_noperf.jl
Global grid: 256x256x256 (nprocs: 1, dims: 1x1x1; device support: CUDA)
ERROR: LoadError: ArgumentError: Cannot free an unmanaged buffer.
Stacktrace:
  [1] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.HostBuffer}, stream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/BbliS/src/array.jl:73
  [2] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.HostBuffer})
    @ CUDA ~/.julia/packages/CUDA/BbliS/src/array.jl:70
  [3] (::ImplicitGlobalGrid_CUDAExt.var"#free_cubufs#1")(bufs::Vector{Vector{Any}})
    @ ImplicitGlobalGrid_CUDAExt ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/CUDAExt/update_halo.jl:39
  [4] free_update_halo_cubuffers()
    @ ImplicitGlobalGrid_CUDAExt ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/CUDAExt/update_halo.jl:28
  [5] free_update_halo_cubuffers
    @ ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/CUDAExt/update_halo.jl:6 [inlined]
  [6] free_update_halo_buffers()
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/update_halo.jl:104
  [7] finalize_global_grid(; finalize_MPI::Bool)
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/finalize_global_grid.jl:17
  [8] finalize_global_grid
    @ ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/finalize_global_grid.jl:15 [inlined]
  [9] diffusion3D()
    @ Main /storage/users/alexandrechen/code-samples/ParallelStencil.jl/examples/diffusion3D_multigpucpu_novis_noperf.jl:51
 [10] top-level scope
    @ /storage/users/alexandrechen/code-samples/ParallelStencil.jl/examples/diffusion3D_multigpucpu_novis_noperf.jl:54
in expression starting at /storage/users/alexandrechen/code-samples/ParallelStencil.jl/examples/diffusion3D_multigpucpu_novis_noperf.jl:54

luraess commented 4 months ago

Hi, thanks for reporting this issue. To help, could you please provide the output of versioninfo() within Julia, and the output of CUDA.versioninfo()? Also, which version of ParallelStencil are you using? Thanks

AlexandreChern commented 4 months ago

Hi, thanks for your prompt response! I am using the latest ParallelStencil from the main branch (git pulled today). Here is the output of versioninfo() and CUDA.versioninfo():

julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 208 × Intel(R) Xeon(R) Platinum 8367HC CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cooperlake)
  Threads: 1 on 208 virtual cores
Environment:
  LD_LIBRARY_PATH_modshare = /packages/cuda/12.3/nsight-compute-2023.3.1/target/linux-desktop-glibc_2_11_3-x64:1::1:/packages/cuda/12.3/lib64:1:/home/users/alexandrechen/code-samples/amgx/include:1:/packages/cuda/12.3/nsight-systems-2023.3.3/target-linux-x64:1:/packages/cuda/12.3/extras/CUPTI/lib64:1:/opt/openmpi-3.0.0/lib:1
  LD_RUN_PATH_modshare = /packages/cuda/12.3/bin:1
  LD_RUN_PATH = /packages/cuda/12.3/bin
  LD_LIBRARY_PATH = /packages/cuda/12.3/nsight-systems-2023.3.3/target-linux-x64:/packages/cuda/12.3/nsight-compute-2023.3.1/target/linux-desktop-glibc_2_11_3-x64:/packages/cuda/12.3/extras/CUPTI/lib64:/packages/cuda/12.3/lib64:/home/users/alexandrechen/code-samples/amgx/include:/opt/openmpi-3.0.0/lib:

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 525.85.12, for CUDA 12.0
CUDA driver 12.0

Libraries:

Toolchain:

3 devices:
  0: NVIDIA A100 80GB PCIe (sm_80, 73.863 GiB / 80.000 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.410 GiB / 40.000 GiB available)
  2: NVIDIA A100-PCIE-40GB (sm_80, 39.293 GiB / 40.000 GiB available)

omlins commented 3 months ago

Thanks @AlexandreChern for opening this issue. Could you please open it in the ImplicitGlobalGrid repository instead, and also include the CUDA.jl version there (from your project folder, --project=../)? With CUDA.jl v5.1.2, I could not reproduce the issue...

AlexandreChern commented 3 months ago

Thanks @omlins! In this repo's project folder, the installed CUDA.jl version appears to be outdated relative to the project:

(ParallelStencil) pkg> status --outdated
Project ParallelStencil v0.6.0
Status /storage/users/alexandrechen/code-samples/ParallelStencil.jl/Project.toml
⌅ [052768ef] CUDA v3.13.1 (<v5.2.0) [compat]

I will update everything and see if I still have the same issues. If so, I will open the issue in the ImplicitGlobalGrid repo and provide CUDA.jl version information.
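As a rough sketch of the check described above, the resolved CUDA.jl version in the active environment can be read with Pkg.dependencies() and compared against a reference version. The helper name installed_version and the constant REFERENCE_CUDA are made up for illustration. v5.1.2 is only the version with which the issue could not be reproduced in this thread, not a documented minimum.

```julia
using Pkg

# Hypothetical helper: return the resolved version of a package in the
# active environment, or `nothing` if it is not part of the environment.
function installed_version(name::AbstractString)
    for (_, info) in Pkg.dependencies()
        info.name == name && return info.version
    end
    return nothing
end

# Assumption: reference point taken from this thread, not an official requirement.
const REFERENCE_CUDA = v"5.1.2"

cuda_version = installed_version("CUDA")
if cuda_version === nothing
    println("No CUDA.jl version found in the active environment")
elseif cuda_version < REFERENCE_CUDA
    println("CUDA.jl $cuda_version is older than $REFERENCE_CUDA; Pkg.update() may help")
else
    println("CUDA.jl $cuda_version")
end
```

Run it with the same --project flag as the example so that it inspects the same environment.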

AlexandreChern commented 3 months ago

The problem is likely due to old dependencies. After I cleaned up the repo and upgraded every dependency to the latest version, multiple issues that I experienced previously were gone. Thanks for the help @omlins @luraess!

omlins commented 3 months ago

@AlexandreChern , could you tell me where you got the outdated Project from? I did not quite get that...

AlexandreChern commented 3 months ago

I cloned this repo (ParallelStencil.jl) 2 years ago. Even though I did a git pull recently, the dependencies in the project environment were not updated, especially CUDA.jl. That's probably why I have these issues.

omlins commented 3 months ago

@AlexandreChern : you should install Julia packages using the package manager, see here: https://pkgdocs.julialang.org/v1/managing-packages/#Adding-registered-packages Then you won't run into these and many other issues...
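A minimal sketch of that workflow, assuming a fresh project directory (the name my_stencil_project and the helper setup_environment are hypothetical) and that the packages named in this thread are the ones needed:

```julia
using Pkg

# Packages mentioned in this thread; the example script may need others as well.
const DEFAULT_PACKAGES = ["ParallelStencil", "ImplicitGlobalGrid", "CUDA"]

# Hypothetical helper: activate (or create) a project environment in `dir`
# and let the package manager resolve mutually compatible versions.
function setup_environment(dir::AbstractString; packages = DEFAULT_PACKAGES)
    Pkg.activate(dir)   # creates a new environment if `dir` has no Project.toml
    Pkg.add(packages)   # needs network access; installs from the registry
    Pkg.status()        # print what was actually resolved
end

# Usage (not executed here because it downloads packages):
# setup_environment("my_stencil_project")
```

The example scripts can then be run with julia --project=my_stencil_project so that they use the freshly resolved dependencies.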

AlexandreChern commented 3 months ago

Sure, I will do that in the future when I use Julia packages. I cloned the git repo because I thought it would make the sample code and source code easier to access, but I guess that is not the intended workflow and it can mess up dependencies and package environments.

luraess commented 3 months ago

You could dev the package instead; it would then be installed into .julia/dev.
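A sketch of what that could look like, assuming the default depot layout. The helpers dev_root and dev_parallelstencil are made up for illustration; Pkg.develop is the function form of the dev command in the Pkg REPL.

```julia
using Pkg

# Where dev'ed packages are placed by default: $JULIA_PKG_DEVDIR if set,
# otherwise the "dev" folder of the first depot (usually ~/.julia/dev).
dev_root() = get(ENV, "JULIA_PKG_DEVDIR", joinpath(first(DEPOT_PATH), "dev"))

# Hypothetical helper: check out ParallelStencil for development and
# return the directory where the source (including examples) should be.
function dev_parallelstencil()
    Pkg.develop("ParallelStencil")   # clones the repo and tracks it by path
    return joinpath(dev_root(), "ParallelStencil")
end

# Usage (not executed here because it clones a repository):
# src_dir = dev_parallelstencil()
```

This keeps the source and examples editable while dependency resolution stays with the package manager.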