intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[CUDA][HIP] too many process spawned on multiple GPU systems #15251

Open tdavidcl opened 2 months ago

tdavidcl commented 2 months ago

Describe the bug

On multi-GPU systems, using HIP or CUDA, a process is spawned on all GPUs instead of being spawned on only one of them (see the To reproduce section).

This results in memory leaks when SYCL is used with both mpich and openmpi, as both GPUs end up receiving data even though the program (in the following example, a private HPC application) only uses one of them per MPI rank. This produces a graph like the following (memory usage per process over time) for mpirun -n 2 <...> Screenshot_2024-09-01_21-14-17 where the blue and red curves are the working GPU processes, and the two other, growing curves are the threads on the wrong GPUs.

CUDA_VISIBLE_DEVICES can be used to circumvent the issue

mpirun \                                                                           
    -n 1 -x CUDA_VISIBLE_DEVICES=0 <...> : \
    -n 1 -x CUDA_VISIBLE_DEVICES=1 <...>

Screenshot_2024-09-01_21-45-25

To reproduce

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

// Returns a vector containing only the first device found: the early return is
// intentional, so only one device is ever touched.
std::vector<sycl::device> get_sycl_device_list() {
    std::vector<sycl::device> devs;
    const auto &Platforms = sycl::platform::get_platforms();
    for (const auto &Platform : Platforms) {
        const auto &Devices = Platform.get_devices();
        for (const auto &Device : Devices) {
            devs.push_back(Device);
            return devs;
        }
    }
    return devs;
}

int main(void) {
    // Only the name of that first device is queried; no queue is ever created.
    for (auto d : get_sycl_device_list()) {
        auto DeviceName = d.get_info<sycl::info::device::name>();
        std::cout << DeviceName << std::endl;
    }
    // Keep the process alive so nvidia-smi can be inspected.
    std::cin.ignore();
}
intel-llvm-installdir/bin/clang++ -fsycl -fsycl-targets=nvidia_gpu_sm_80 test.cpp
./a.out

On a multi-GPU system, this code snippet results in the process being spawned on both GPUs, even though only one GPU should be initialized.

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |

Environment

Platforms: 1
  Platform [#1]:
    Version : CUDA 12.6
    Name    : NVIDIA CUDA BACKEND
    Vendor  : NVIDIA Corporation
    Devices : 2
      Device [#0]:
        Type    : gpu
        Version : 8.6
        Name    : NVIDIA RTX A5000
        Vendor  : NVIDIA Corporation
        Driver  : CUDA 12.6
        UUID    : 1524713610692242731361804768205105967369
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_cuda_async_barrier ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_unique_addressing_per_dim ext_oneapi_bindless_images_sample_1d_usm ext_oneapi_bindless_images_sample_2d_usm
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_86
      Device [#1]:
        Type    : gpu
        Version : 8.6
        Name    : NVIDIA RTX A5000
        Vendor  : NVIDIA Corporation
        Driver  : CUDA 12.6
        UUID    : 132261661332412015314176217761172047020650
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_cuda_async_barrier ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_unique_addressing_per_dim ext_oneapi_bindless_images_sample_1d_usm ext_oneapi_bindless_images_sample_2d_usm
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_86

(For both devices the runtime also printed: "Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.")

default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
accelerator_selector()  : No device of requested type available.
cpu_selector()          : No device of requested type available.
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
custom_selector(cpu)    : No device of requested type available.
custom_selector(acc)    : No device of requested type available.



Additional context

No response
JackAKirk commented 2 months ago

Forgive me if I've misunderstood the problem since I'm not sure there is enough information or maybe I misread, but it looks to me like this is expected behaviour. As you point out, you can resolve this problem by using CUDA_VISIBLE_DEVICES, as we documented at the bottom of:

https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices

Now, I think what you are asking for is a way to map CPU ranks to GPUs without using CUDA_VISIBLE_DEVICES. We have documented how to do this in the link above. Essentially, you query the rank from your chosen MPI implementation and map it to a chosen device.
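
For illustration, a minimal sketch of that rank-to-device mapping, assuming one MPI rank per visible GPU and using the world rank directly as the device index (a production code would normally use the node-local rank); this is my own sketch, not the exact code from the guide:

#include <iostream>
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Enumerate all GPUs visible to the SYCL runtime.
    std::vector<sycl::device> gpus =
        sycl::device::get_devices(sycl::info::device_type::gpu);

    // Map this rank to "its" GPU (sketch: world rank modulo device count).
    sycl::queue q{gpus[static_cast<std::size_t>(rank) % gpus.size()]};

    std::cout << "rank " << rank << " -> "
              << q.get_device().get_info<sycl::info::device::name>() << std::endl;

    MPI_Finalize();
}

This is the same pattern as in the guide, just expressed with sycl::device::get_devices instead of looping over platforms.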

This is identical to how you would do it with MPI and native CUDA, and that is generally the case; we have tried to emphasize this in

https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide and https://developer.codeplay.com/products/oneapi/amd/2024.2.1/guides/MPI-guide

If I am wrong and there is a problem with using CUDA-aware MPI in SYCL that is not documented in https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide, then it would probably help me understand what is happening if you posted a more complete code example.

tdavidcl commented 2 months ago

Forgive me if I've misunderstood the problem since I'm not sure there is enough information or maybe I misread, but it looks to me like this is expected behaviour.

Indeed, the situation described in (https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices) is really close to what I'm doing internally.

Now, I think what you are asking for is a way to map CPU ranks to GPUs without using CUDA_VISIBLE_DEVICES. We have documented how to do this in the link above.

As described in the same guide, I am doing something which looks like this:

std::vector<sycl::device> Devs;
for (const auto &plt : sycl::platform::get_platforms()) {
  if (plt.get_backend() == sycl::backend::cuda)
    Devs.push_back(plt.get_devices()[0]);
}
// rank is the MPI rank of this process
sycl::queue q{Devs[rank]};

However, correct me if I am wrong, but the expected behavior when running mpirun -n 2 ./a.out on a dual-GPU system would be to see one process on GPU 0 and the other on GPU 1 in nvidia-smi:

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723340      C   ./a.out                                       202MiB |

Currently, doing so instead gives:

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723340      C   ./a.out                                       202MiB |
|    0   N/A  N/A   2723340      C   ./a.out                                       202MiB |

i.e. every rank spawns a context on every GPU, even though each process only uses one of them.

The issue is that there is no way to disable streams on the unused devices. This confuses MPI which, in turn, I suspect creates the memory leak.

Maybe I was unclear in the initial post, but to reproduce the issue you can simply start a SYCL program without MPI and observe that both GPUs show up in nvidia-smi.

Even if this can be worked around with a proper binding script, I suspect that this is not the expected behavior of DPC++?

JackAKirk commented 2 months ago

Forgive me if I've misunderstood the problem since I'm not sure there is enough information or maybe I misread, but it looks to me like this is expected behaviour.

Indeed, the situation described in (https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices) is really close to what I'm doing internally.

Now, I think what you are asking for is a way to map CPU ranks to GPUs without using CUDA_VISIBLE_DEVICES. We have documented how to do this in the link above.

As described in the same guide, I am doing something which looks like this:

However, correct me if I am wrong, but the expected behavior when running `mpirun -n 2 ./a.out` on a dual-GPU system would be to see one process on GPU 0 and the other on GPU 1 in nvidia-smi:

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723340      C   ./a.out                                       202MiB |


Currently, doing so instead gives:

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723340      C   ./a.out                                       202MiB |
|    0   N/A  N/A   2723340      C   ./a.out                                       202MiB |


Even if this can be worked around with a proper binding script, I suspect that this is not the expected behavior of DPC++?

Yes, that should be correct. I see what you mean. I have not seen such behaviour, but I can try to reproduce it. I wonder whether it is an artifact of some part of your program. First of all, have you tried the samples that we linked in the documentation? e.g. https://github.com/codeplaysoftware/SYCL-samples/blob/main/src/MPI_with_SYCL/send_recv_usm.cpp

As I understand it, you would expect to see the same behaviour for that sample, but I don't remember ever seeing duplicate processes.

If you do see the same issue with that sample, I suspect it might be an artifact of your cluster setup. You might also want to confirm that you don't see the same behaviour with a simple CUDA MPI program, e.g. https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/
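
As a point of comparison, a minimal sketch of such a plain CUDA MPI check (my own sketch, not the code from that blog post): each rank calls cudaSetDevice before touching anything else, so nvidia-smi should list exactly one GPU per PID.

#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   // bind this rank to a single GPU

    void *buf = nullptr;
    cudaMalloc(&buf, 1 << 20);    // force context creation on that GPU only

    // Pause so nvidia-smi can be inspected (only rank 0 reads stdin; the
    // others wait at the barrier).
    if (rank == 0) std::getchar();
    MPI_Barrier(MPI_COMM_WORLD);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}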

I would be surprised if this were a DPC++-specific issue. Once the program is compiled, as far as MPI is concerned there is no distinction between it being compiled with DPC++ or nvcc.

tdavidcl commented 2 months ago

I will try, but the simplest example already triggers the issue with DPC++. I think that just looping over the list of devices results in CUDA initialization on each GPU.

This simple code on a dual GPU system shows the issue already without MPI:

#include <sycl/sycl.hpp>
#include <iostream>

std::vector<sycl::device> get_sycl_device_list() {
    std::vector<sycl::device> devs;
    const auto &Platforms = sycl::platform::get_platforms();
    for (const auto &Platform : Platforms) {
        const auto &Devices = Platform.get_devices();
        for (const auto &Device : Devices) {
            devs.push_back(Device);
            return devs;
        }
    }
    return devs;
}

int main(void){

    for (auto d : get_sycl_device_list()){
        auto DeviceName   = d.get_info<sycl::info::device::name>();
        std::cout <<DeviceName << std::endl;
    }
    std::cin.ignore();
}
❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |

Here the process is initialised on both GPUs even though no queues have been created, and only the first device has been used (only to query its name).

Including MPI would do pretty much the same, times two. Send/receive works fine with that setup, except for the weird memory leak (I've checked the allocations and it is not on my side).

JackAKirk commented 2 months ago

I will try, but the simplest example already triggers the issue with DPC++. I think that just looping over the list of devices results in CUDA initialization on each GPU.

This simple code on a dual GPU system shows the issue already without MPI:

#include <sycl/sycl.hpp>
#include <iostream>

std::vector<sycl::device> get_sycl_device_list() {
  std::vector<sycl::device> devs;
  const auto &Platforms = sycl::platform::get_platforms();
  for (const auto &Platform : Platforms) {
      const auto &Devices = Platform.get_devices();
      for (const auto &Device : Devices) {
          devs.push_back(Device);
          return devs;
      }
  }
  return devs;
}

int main(void){

  for (auto d : get_sycl_device_list()){
      auto DeviceName   = d.get_info<sycl::info::device::name>();
      std::cout <<DeviceName << std::endl;
  }
  std::cin.ignore();
}
❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |

Here the process is initialised on both GPUs even though no queues have been created, and only the first device has been used (only to query its name).

Including MPI would do pretty much the same, times two. Send/receive works fine with that setup, except for the weird memory leak (I've checked the allocations and it is not on my side).

This definitely isn't happening on my system (I just sanity-checked it again using your code quoted above on a multi-GPU system). The most important point is that this code shouldn't be running on the GPU at all, and therefore you should not be getting any output from nvidia-smi; that is what I see. Are you sure you don't have pre-existing processes running on your GPUs?

al42and commented 2 months ago

For the record, I tried with oneAPI 2024.2.0 (and a matching Codeplay plugin) on a dual-GPU machine, and have the same output as @tdavidcl:

$ sycl-ls 
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
[cuda:gpu][cuda:1] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
$ /opt/tcbsys/intel-oneapi/2024.2.0/compiler/2024.2/bin/compiler/clang++ -fsycl test.cpp
$ ./a.out &
[1] 17000
$ Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
Press any key...
[1]+  Stopped                 ./a.out
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
17000, ./a.out, 00000000:17:00.0, 154 MiB
17000, ./a.out, 00000000:65:00.0, 154 MiB
JackAKirk commented 2 months ago

For the record, I tried with oneAPI 2024.2.0 (and a matching Codeplay plugin) on a dual-GPU machine, and have the same output as @tdavidcl:

$ sycl-ls 
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
[cuda:gpu][cuda:1] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
$ /opt/tcbsys/intel-oneapi/2024.2.0/compiler/2024.2/bin/compiler/clang++ -fsycl test.cpp
$ ./a.out &
[1] 17000
$ Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
Press any key...
[1]+  Stopped                 ./a.out
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
17000, ./a.out, 00000000:17:00.0, 154 MiB
17000, ./a.out, 00000000:65:00.0, 154 MiB

Thanks, I've now reproduced the issue. We think we understand the root cause, and someone on the team has a patch on the way. It isn't an MPI-specific issue, but a problem with the usage of cuContext that affects all codes.
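
For readers following along, a purely hypothetical illustration of how a cuContext issue of this kind can manifest (an assumption about the general mechanism, not the actual plugin code): merely retaining the primary context on every device is enough to make a process appear on every GPU in nvidia-smi.

#include <cstdio>
#include <cuda.h>

int main() {
    cuInit(0);
    int ndev = 0;
    cuDeviceGetCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        CUcontext ctx;
        // Each retained primary context allocates per-GPU resources for this
        // process, so the PID shows up on that GPU in nvidia-smi.
        cuDevicePrimaryCtxRetain(&ctx, dev);
    }
    // Pause for inspection: the process is now listed on every GPU even
    // though no kernel has run and no explicit allocation was made.
    std::getchar();
    return 0;
}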

JackAKirk commented 2 months ago

Hi @tdavidcl @al42and

I opened a proposed fix for this here https://github.com/oneapi-src/unified-runtime/pull/2077 along with a code example for how this would change developer code here: https://github.com/codeplaysoftware/SYCL-samples/pull/33

If you have any feedback on this then feel free to post. Thanks

JackAKirk commented 3 weeks ago

Hi @tdavidcl @al42and

We have updated our MPI documentation to reflect this issue. See https://developer.codeplay.com/products/oneapi/nvidia/latest/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices

Apologies that there is not a current solution other than to rely on environment variables as you have already done. Thank you very much for pointing this issue out to us.

I have opened an issue on Open MPI to try to understand this better: https://github.com/open-mpi/ompi/issues/12848