Closed AlexandreChern closed 3 months ago
Hi, thanks for reporting this issue. To help, could you please provide the output of versioninfo() within Julia, and the output of CUDA.versioninfo()? Also, which version of ParallelStencil are you using? Thanks
Hi, thanks for your prompt response! I am using the latest ParallelStencil from the main branch (git pulled today). I got the following output for versioninfo() and CUDA.versioninfo():
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 208 × Intel(R) Xeon(R) Platinum 8367HC CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cooperlake)
  Threads: 1 on 208 virtual cores
Environment:
  LD_LIBRARY_PATH_modshare = /packages/cuda/12.3/nsight-compute-2023.3.1/target/linux-desktop-glibc_2_11_3-x64:1::1:/packages/cuda/12.3/lib64:1:/home/users/alexandrechen/code-samples/amgx/include:1:/packages/cuda/12.3/nsight-systems-2023.3.3/target-linux-x64:1:/packages/cuda/12.3/extras/CUPTI/lib64:1:/opt/openmpi-3.0.0/lib:1
  LD_RUN_PATH_modshare = /packages/cuda/12.3/bin:1
  LD_RUN_PATH = /packages/cuda/12.3/bin
  LD_LIBRARY_PATH = /packages/cuda/12.3/nsight-systems-2023.3.3/target-linux-x64:/packages/cuda/12.3/nsight-compute-2023.3.1/target/linux-desktop-glibc_2_11_3-x64:/packages/cuda/12.3/extras/CUPTI/lib64:/packages/cuda/12.3/lib64:/home/users/alexandrechen/code-samples/amgx/include:/opt/openmpi-3.0.0/lib:
julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 525.85.12, for CUDA 12.0
CUDA driver 12.0
Libraries:
Toolchain:
3 devices:
  0: NVIDIA A100 80GB PCIe (sm_80, 73.863 GiB / 80.000 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.410 GiB / 40.000 GiB available)
  2: NVIDIA A100-PCIE-40GB (sm_80, 39.293 GiB / 40.000 GiB available)
Thanks @AlexandreChern for opening this issue. Could you please open this issue in the ImplicitGlobalGrid repository instead? And could you please also print the CUDA.jl version there (from your project folder, --project=../)? With CUDA.jl v5.1.2, I could not reproduce the issue...
Thanks @omlins! For this repo, the CUDA.jl version in the project folder seems to be outdated:
(ParallelStencil) pkg> status --outdated
Project ParallelStencil v0.6.0
Status /storage/users/alexandrechen/code-samples/ParallelStencil.jl/Project.toml
⌅ [052768ef] CUDA v3.13.1 (<v5.2.0) [compat]
I will update everything and see if I still have the same issues. If so, I will open the issue in the ImplicitGlobalGrid repo and provide CUDA.jl version information.
The problem is likely due to old dependencies. After I cleaned up the repo and upgraded every dependency to the latest version, multiple issues that I experienced previously were gone. Thanks for the help @omlins @luraess!
Thanks @omlins! For this repo, I got the CUDA.jl version in this project folder that seems to be outdated with the Project version.
@AlexandreChern, could you tell me where you got the outdated Project from? I did not quite get that...
I cloned this repo (ParallelStencil.jl) 2 years ago. Even though I did a git pull recently, the dependencies in the project environment were not updated, especially CUDA.jl. That's probably why I have these issues.
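For readers in the same situation, refreshing a stale project environment might look like the sketch below. It assumes the clone contains a Project.toml (and possibly an old, untracked Manifest.toml) and that a registry is reachable; refresh_environment is a hypothetical helper name, not part of ParallelStencil.

```julia
using Pkg

# Hypothetical helper: re-resolve and upgrade the dependencies of an
# existing project directory, then list anything still held back.
function refresh_environment(project_dir::AbstractString)
    Pkg.activate(project_dir)      # use the environment in project_dir
    Pkg.update()                   # upgrade dependencies within [compat] bounds
    Pkg.status(outdated = true)    # same report as `pkg> status --outdated` (Julia 1.8+)
end
```

From the examples folder in this thread, that would be called as refresh_environment("..").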
@AlexandreChern : you should install Julia packages using the package manager, see here: https://pkgdocs.julialang.org/v1/managing-packages/#Adding-registered-packages Then you won't run into these and many other issues...
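A minimal sketch of the package-manager route, assuming the packages are installed into a separate project directory rather than into a git clone. The helper name setup_project and the default package list are illustrative assumptions; only the Pkg calls themselves are standard.

```julia
using Pkg

# Hypothetical helper: install registered packages into a project
# directory so the resolver picks mutually compatible versions.
function setup_project(dir::AbstractString;
                       pkgs = ["ParallelStencil", "ImplicitGlobalGrid", "CUDA", "MPI"])
    Pkg.activate(dir)   # Project.toml is created on the first add
    Pkg.add(pkgs)       # resolve and install from the registry
    Pkg.status()        # show the versions that were selected
end
```

A script would then be run with julia --project=<dir> so that it picks up this environment.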
Sure, I will do that in the future when I use Julia packages. I cloned the git repo because I thought it would be easier to access the sample code and source code, but I guess this is not intended and it would mess up dependencies and package environments.
You could dev the package then and it would install into .julia/dev.
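As a sketch of the dev suggestion: Pkg.develop with a package name clones the source into the dev directory of the first depot (typically ~/.julia/dev) unless the JULIA_PKG_DEVDIR environment variable points elsewhere. The helpers dev_path and dev_package are hypothetical names used only for illustration.

```julia
using Pkg

# Hypothetical helper: default location where `Pkg.develop(name)` puts the source.
dev_path(name::AbstractString) =
    joinpath(get(ENV, "JULIA_PKG_DEVDIR", joinpath(first(DEPOT_PATH), "dev")), name)

# Hypothetical helper: track an editable checkout of the package in the
# active environment and return where its source (and examples) live.
function dev_package(name::AbstractString = "ParallelStencil")
    Pkg.develop(name)
    return dev_path(name)
end
```

This keeps the examples and source code easy to reach while the package manager still resolves the dependencies.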
This is a great package for multi-GPU code in Julia. I've been experimenting with it on several examples to learn how things work. I was able to run most tests successfully, but I encountered this issue when running this multi-GPU example, and I am trying to understand why. Is this related to CUDA.jl, or is it because I cannot manage the device buffer without being a system administrator?
alexandrechen@saturn:~/code-samples/ParallelStencil.jl/examples$ julia --project=../ diffusion3D_multigpucpu_novis_noperf.jl
Global grid: 256x256x256 (nprocs: 1, dims: 1x1x1; device support: CUDA)
ERROR: LoadError: ArgumentError: Cannot free an unmanaged buffer.
Stacktrace:
  [1] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.HostBuffer}, stream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/BbliS/src/array.jl:73
  [2] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.HostBuffer})
    @ CUDA ~/.julia/packages/CUDA/BbliS/src/array.jl:70
  [3] (::ImplicitGlobalGrid_CUDAExt.var"#free_cubufs#1")(bufs::Vector{Vector{Any}})
    @ ImplicitGlobalGrid_CUDAExt ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/CUDAExt/update_halo.jl:39
  [4] free_update_halo_cubuffers()
    @ ImplicitGlobalGrid_CUDAExt ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/CUDAExt/update_halo.jl:28
  [5] free_update_halo_cubuffers
    @ ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/CUDAExt/update_halo.jl:6 [inlined]
  [6] free_update_halo_buffers()
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/update_halo.jl:104
  [7] finalize_global_grid(; finalize_MPI::Bool)
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/finalize_global_grid.jl:17
  [8] finalize_global_grid
    @ ~/.julia/packages/ImplicitGlobalGrid/WHNmB/src/finalize_global_grid.jl:15 [inlined]
  [9] diffusion3D()
    @ Main /storage/users/alexandrechen/code-samples/ParallelStencil.jl/examples/diffusion3D_multigpucpu_novis_noperf.jl:51
 [10] top-level scope
    @ /storage/users/alexandrechen/code-samples/ParallelStencil.jl/examples/diffusion3D_multigpucpu_novis_noperf.jl:54
in expression starting at /storage/users/alexandrechen/code-samples/ParallelStencil.jl/examples/diffusion3D_multigpucpu_novis_noperf.jl:54