llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[Offload][OpenMP] Record-Replay not functioning - failure to allocate memory #90761

Open nmustakin opened 7 months ago

nmustakin commented 7 months ago

OpenMP offload recording is failing to allocate memory. It keeps requesting 0 bytes instead of the size set via `LIBOMPTARGET_RR_DEVMEM_SIZE`.

For example, when running `LIBOMPTARGET_DEBUG=1 LIBOMPTARGET_RR_DEVMEM_SIZE=4 LIBOMPTARGET_RR_SAVE_OUTPUT=1 OMP_TARGET_OFFLOAD=mandatory LIBOMPTARGET_NEXTGEN_PLUGINS=1 LIBOMPTARGET_RECORD=1 nvprof ./lulesh`, the output shows:

TARGET CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
PluginInterface --> Request 0 bytes allocated at (nil)
PluginInterface --> WARNING VA mapping failed, fallback to heuristic: (Error: Memory Map Size must be larger than 0)
TARGET CUDA RTL --> Failure to alloc memory: Error in cuMemAlloc[Host|Managed]: out of memory
PluginInterface --> Allocated 14581039104 bytes at 0x7f33ce000000 for replay.
PluginInterface --> Record Replay Initialized (0x7f33ce000000) as starting address, 14581039104 Memory Size and set on status Recording
TARGET CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
PluginInterface --> Request 0 bytes allocated at (nil)
PluginInterface --> WARNING VA mapping failed, fallback to heuristic: (Error: Memory Map Size must be larger than 0)
TARGET CUDA RTL --> Failure to alloc memory: Error in cuMemAlloc[Host|Managed]: out of memory
PluginInterface --> Allocated 14581039104 bytes at 0x7f305c000000 for replay.
PluginInterface --> Record Replay Initialized (0x7f305c000000) as starting address, 14581039104 Memory Size and set on status Recording
TARGET CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
PluginInterface --> Request 0 bytes allocated at (nil)
PluginInterface --> WARNING VA mapping failed, fallback to heuristic: (Error: Memory Map Size must be larger than 0)
TARGET CUDA RTL --> Failure to alloc memory: Error in cuMemAlloc[Host|Managed]: out of memory
PluginInterface --> Allocated 14581039104 bytes at 0x7f2cea000000 for replay.
PluginInterface --> Record Replay Initialized (0x7f2cea000000) as starting address, 14581039104 Memory Size and set on status Recording
TARGET CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
PluginInterface --> Request 0 bytes allocated at (nil)
PluginInterface --> WARNING VA mapping failed, fallback to heuristic: (Error: Memory Map Size must be larger than 0)
TARGET CUDA RTL --> Failure to alloc memory: Error in cuMemAlloc[Host|Managed]: out of memory
PluginInterface --> Allocated 14581039104 bytes at 0x7f2978000000 for replay.
PluginInterface --> Record Replay Initialized (0x7f2978000000) as starting address, 14581039104 Memory Size and set on status Recording
TARGET CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
PluginInterface --> Request 0 bytes allocated at (nil)
PluginInterface --> WARNING VA mapping failed, fallback to heuristic: (Error: Memory Map Size must be larger than 0)
TARGET CUDA RTL --> Failure to alloc memory: Error in cuMemAlloc[Host|Managed]: out of memory
PluginInterface --> Allocated 14581039104 bytes at 0x7f2606000000 for replay.
PluginInterface --> Record Replay Initialized (0x7f2606000000) as starting address, 14581039104 Memory Size and set on status Recording

as well as -

omptarget --> Launching target execution __omp_offloading_821_1dc1092__ZL17CalcForceForNodesR6Domain_l1235 with pointer 0x0000555c717629f0 (index=1).
PluginInterface --> Launching kernel __omp_offloading_821_1dc1092__ZL17CalcForceForNodesR6Domain_l1235 with 931 blocks and 32 threads in SPMD mode
LLVM ERROR: Error retrieving data for target pointer

The run ends with only 1 of 17 kernels recorded.

llvmbot commented 7 months ago

@llvm/issue-subscribers-openmp

tgymnich commented 7 months ago

https://github.com/llvm/llvm-project/blob/fe6f137e48ceee094d0fa42ca54c7e1226b45fde/openmp/libomptarget/src/device.cpp#L102

The second argument of this call should probably not be zero, since we cannot allocate 0 bytes.
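To illustrate the point, here is a minimal sketch of a guard that avoids forwarding a zero-byte request. It is not the libomptarget code: `parseSize` and `resolveReplaySize` are hypothetical helpers, and treating the environment value as a plain byte count is an assumption made only for this example.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <optional>

// Hypothetical helper (not part of libomptarget): parse a non-negative
// decimal integer. Returns nullopt for null, empty, or malformed input.
std::optional<uint64_t> parseSize(const char *Text) {
  if (!Text || *Text == '\0')
    return std::nullopt;
  // Accept digits only, so signs, whitespace, and suffixes are rejected.
  for (const char *P = Text; *P; ++P)
    if (*P < '0' || *P > '9')
      return std::nullopt;
  errno = 0;
  unsigned long long Value = std::strtoull(Text, nullptr, 10);
  if (errno == ERANGE)
    return std::nullopt;
  return static_cast<uint64_t>(Value);
}

// Hypothetical call-site logic: never pass a zero size on to the allocator.
// Requested is what the caller passed in; EnvText is the raw text of the
// size variable (ASSUMPTION: interpreted here as a byte count).
std::optional<uint64_t> resolveReplaySize(uint64_t Requested,
                                          const char *EnvText) {
  if (Requested > 0)
    return Requested;
  std::optional<uint64_t> FromEnv = parseSize(EnvText);
  if (FromEnv && *FromEnv > 0)
    return FromEnv;
  // No usable size: the caller should report an error rather than map 0 bytes.
  return std::nullopt;
}
```

With this shape, a caller that receives `std::nullopt` can fail early with a clear message instead of reaching the plugin with a size of 0.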

jhuber6 commented 7 months ago

The R&R functionality isn't really tested. It's a global object that's only initialized on a single device so it's probably broken as well if you try to use more than one device.
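The multi-device concern can be made concrete with a small sketch. The types below are hypothetical and only model the shape of the problem, not the actual runtime: one shared state bound to whichever device initializes it first, contrasted with state keyed by device ID.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for record-replay bookkeeping; not the libomptarget type.
struct ReplayState {
  int32_t DeviceId = -1; // -1 means "not initialized".
  uint64_t PoolBytes = 0;

  bool isInitialized() const { return DeviceId >= 0; }
};

// Shape 1: a single global object. The first device to call init() owns it;
// a later call for a different device cannot set up its own pool.
struct GlobalReplay {
  ReplayState State;

  bool init(int32_t DeviceId, uint64_t PoolBytes) {
    if (State.isInitialized())
      return State.DeviceId == DeviceId;
    State.DeviceId = DeviceId;
    State.PoolBytes = PoolBytes;
    return true;
  }
};

// Shape 2: one state per device, so each device gets its own pool.
struct PerDeviceReplay {
  std::map<int32_t, ReplayState> States;

  bool init(int32_t DeviceId, uint64_t PoolBytes) {
    if (DeviceId < 0 || PoolBytes == 0)
      return false;
    ReplayState &S = States[DeviceId];
    S.DeviceId = DeviceId;
    S.PoolBytes = PoolBytes;
    return true;
  }

  bool isRecording(int32_t DeviceId) const {
    return States.count(DeviceId) > 0;
  }
};
```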

tgymnich commented 7 months ago

This allocation error is caused directly by requesting 0 bytes of memory in the above-mentioned call:

PluginInterface --> WARNING VA mapping failed, fallback to heuristic: (Error: Memory Map Size must be larger than 0)

The zero-size request is then rejected by the CUDA device plugin here: https://github.com/llvm/llvm-project/blob/1e82d506b0b2b4b8501bb1cae13d2e2f3405922d/offload/plugins-nextgen/cuda/src/rtl.cpp#L670
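The sequence in the debug log (a rejected zero-size mapping, a failed allocation, then a successful one) is consistent with a try-then-shrink fallback. The sketch below illustrates that pattern only and is not taken from the plugin source: `mapFixedRange`, `allocateWithFallback`, the starting size, and the step are all hypothetical, and the allocator is injected as a callback so the example runs without a GPU.

```cpp
#include <cstdint>
#include <functional>
#include <string>

struct MapResult {
  bool Ok;
  std::string Error;
};

// Hypothetical check that mirrors the message seen in the log: a zero-size
// mapping is rejected before any driver call is made.
MapResult mapFixedRange(uint64_t Size) {
  if (Size == 0)
    return {false, "Memory Map Size must be larger than 0"};
  return {true, ""};
}

// Hypothetical fallback: try the requested size first; if that fails, step
// down from MaxBytes until TryAlloc succeeds. Returns the number of bytes
// obtained, or 0 if nothing could be allocated.
uint64_t allocateWithFallback(uint64_t Size, uint64_t MaxBytes,
                              uint64_t StepBytes,
                              const std::function<bool(uint64_t)> &TryAlloc) {
  if (mapFixedRange(Size).Ok && TryAlloc(Size))
    return Size;
  if (StepBytes == 0)
    return 0;
  uint64_t Candidate = MaxBytes;
  while (Candidate > 0) {
    if (TryAlloc(Candidate))
      return Candidate;
    // Shrink the request, stopping at zero instead of wrapping around.
    Candidate = Candidate > StepBytes ? Candidate - StepBytes : 0;
  }
  return 0;
}
```

A fallback like this keeps the run going, but the size it ends up with is unrelated to the size the user configured, so fixing the zero-size request at the call site is still needed.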

llvmbot commented 7 months ago

@llvm/issue-subscribers-offload
