intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

`llvm-foreach` takes 100% cpu usage #15177

Open fwyzard opened 3 weeks ago

fwyzard commented 3 weeks ago

### Describe the bug

While building SYCL code with Intel oneAPI, I noticed that `llvm-foreach` is almost always sitting at 100% CPU usage.

top:

%Cpu(s):  8.5 us,  5.0 sy,  0.0 ni, 85.8 id,  0.1 wa,  0.0 hi,  0.6 si,  0.0 st
MiB Mem :  64023.7 total,  27107.4 free,   6165.8 used,  30750.5 buff/cache
MiB Swap:  32958.0 total,  32958.0 free,      0.0 used.  53965.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                                           
  99440 fwyzard   20   0    4540   2176   2048 R  99.7   0.0   5:42.07 llvm-foreach                                                                                                                                                                                                      
 100326 fwyzard   20   0  309368 274324  48756 R  99.3   0.4   0:05.92 ocloc                                                                                                                                                                                                             

ps -xf:

  98325 pts/2    S+     0:00  |   |   |                               \_ /opt/intel/oneapi/compiler/2024.1/bin/compiler/clang++ @/tmp/icpx0294253703WMgiHH/icpxargD9hFos
  99440 pts/2    R+     5:42  |   |   |                                   \_ /opt/intel/oneapi/compiler/2024.1/bin/compiler/llvm-foreach --out-ext=out --in-file-list=/tmp/icpx-ff969312fd/Activemask-tgllp-63b648.txt --in-replace=/tmp/icpx-ff969312fd/Activemask-tgllp-63b648.txt --ou
 100326 pts/2    R+     0:06  |   |   |                                       \_ /usr/bin/ocloc -output /tmp/Activemask-tgllp-e57dbd-65fea9.out -file /tmp/icpx-ff969312fd/Activemask-tgllp-63b648-0e09e1.spv -output_no_suffix -spirv_input -device tgllp -options -g -cl-opt-disable

This seems to happen for any backend. I've observed this consistently with oneAPI 2024.0 (based on LLVM 17) and 2024.2 (based on LLVM 19), running on Ubuntu Linux 22.04.

### To reproduce

Build any complex program with ahead-of-time compilation for multiple backends, e.g. multiple Intel GPUs.

### Environment

Platforms: 4
Platform [#1]:
    Version : OpenCL 3.0 LINUX
    Name    : Intel(R) OpenCL
    Vendor  : Intel(R) Corporation
    Devices : 1
    Device [#0]:
        Type    : cpu
        Version : OpenCL 3.0 (Build 0)
        Name    : 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
        Vendor  : Intel(R) Corporation
        Driver  : 2024.18.7.0.11_160000
        Aspects : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group
        info::device::sub_group_sizes: 4 8 16 32 64
Platform [#2]:
    Version : OpenCL 3.0
    Name    : Intel(R) OpenCL Graphics
    Vendor  : Intel(R) Corporation
    Devices : 1
    Device [#1]:
        Type    : gpu
        Version : OpenCL 3.0 NEO
        Name    : Intel(R) UHD Graphics
        Vendor  : Intel(R) Corporation
        Driver  : 24.22.29735.27
        Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations atomic64 ext_oneapi_srgb ext_intel_device_id ext_intel_legacy_image ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group
        info::device::sub_group_sizes: 8 16 32
Platform [#3]:
    Version : 1.3
    Name    : Intel(R) Level-Zero
    Vendor  : Intel(R) Corporation
    Devices : 1
    Device [#0]:
        Type    : gpu
        Version : 1.3
        Name    : Intel(R) UHD Graphics
        Vendor  : Intel(R) Corporation
        Driver  : 1.3.29735
        Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address ext_intel_gpu_eu_count ext_intel_gpu_eu_simd_width ext_intel_gpu_slices ext_intel_gpu_subslices_per_slice ext_intel_gpu_eu_count_per_subslice atomic64 ext_intel_device_info_uuid ext_intel_gpu_hw_threads_per_eu ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_graph
        info::device::sub_group_sizes: 8 16 32
Platform [#4]:
    Version : CUDA 12.6
    Name    : NVIDIA CUDA BACKEND
    Vendor  : NVIDIA Corporation
    Devices : 1
    Device [#0]:
        Type    : gpu
        Version : 8.6
        Name    : NVIDIA GeForce RTX 3050 Ti Laptop GPU
        Vendor  : NVIDIA Corporation
        Driver  : CUDA 12.6
        Aspects : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width
ur_print: Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
                  ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering
        info::device::sub_group_sizes: 32
default_selector()     : gpu, Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
accelerator_selector() : No device of requested type available. Please chec...
cpu_selector()         : cpu, Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
gpu_selector()         : gpu, Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
custom_selector(gpu)   : gpu, Intel(R) Level-Zero, Intel(R) UHD Graphics 1.3 [1.3.29735]
custom_selector(cpu)   : cpu, Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
custom_selector(acc)   : No device of requested type available. Please chec...



### Additional context

_No response_
bader commented 3 weeks ago

@fwyzard, the problem is the ocloc tool. llvm-foreach is just a simple launcher that runs commands from a file and waits for them to complete. You can check the logic here - it's ~200 lines of code.
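For reference, a minimal hypothetical sketch (not the actual llvm-foreach source) of what such a launcher boils down to: read the input-file list, substitute each file into the command template, and run the resulting command to completion. Note that a launcher whose wait is a blocking call, as below, sits near 0% CPU while the child runs:

```cpp
// Hypothetical sketch of an llvm-foreach-style launcher (illustrative only,
// not the actual intel/llvm source): run one command per input file.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
  if (argc != 4) {
    std::cerr << "usage: foreach <list-file> <placeholder> <command-template>\n";
    return 1;
  }
  std::ifstream List(argv[1]);             // cf. --in-file-list
  const std::string Placeholder = argv[2]; // cf. --in-replace
  const std::string Template = argv[3];
  for (std::string File; std::getline(List, File);) {
    std::string Cmd = Template;
    if (auto Pos = Cmd.find(Placeholder); Pos != std::string::npos)
      Cmd.replace(Pos, Placeholder.size(), File);
    // std::system blocks until the child exits, so a launcher written this
    // way would stay near 0% CPU while the child (e.g. ocloc) runs.
    if (int RC = std::system(Cmd.c_str()); RC != 0)
      return RC;
  }
  return 0;
}
```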

NOTE: the ocloc tool is being developed in https://github.com/intel/intel-graphics-compiler/, so I would transfer this issue there.

fwyzard commented 3 weeks ago

@bader while it would definitely be nice if ocloc were faster, the issue is that llvm-foreach itself takes 100% CPU, in addition to ocloc taking 100% CPU (on another core).

Instead of tightly looping, would it be possible to make llvm-foreach sleep until a subprocess completes? Or, at least, sleep for something like 100 ms between checks?
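For illustration, a minimal POSIX sketch of that suggestion (hypothetical code, not taken from llvm-foreach): poll the child with waitpid(WNOHANG), but sleep between checks so the parent no longer spins. A fully blocking waitpid would use no CPU at all while the child runs:

```cpp
// Hypothetical sketch (not the actual llvm-foreach source) of a polling
// wait with a short back-off, as suggested above.
#include <chrono>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Wait for `pid` to exit, checking roughly every 100 ms. Without the sleep,
// the loop would re-enter waitpid() immediately and the parent would show
// up at ~100% CPU in top, exactly as reported in this issue.
static int wait_with_backoff(pid_t pid) {
  int status = 0;
  for (;;) {
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == pid)
      break;        // child has exited
    if (r == -1)
      return -1;    // waitpid failed
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main() {
  pid_t pid = fork();
  if (pid == 0) {   // child: stand-in for an ocloc invocation
    execlp("sleep", "sleep", "2", (char *)nullptr);
    _exit(127);     // only reached if exec fails
  }
  return wait_with_backoff(pid);
}
```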

bader commented 3 weeks ago

I think we are going to remove this tool soon. We are refactoring the compilation process for offload code, and the new approach won't use this tool or a similar mechanism to detect task completion. @asudarsa, @maksimsab, @sarnex, FYI.

fwyzard commented 3 weeks ago

@ivorobts FYI