[vulkan] Reduce the number of descriptor sets, remove cleanup in destructor

derek-gerstmann commented 3 weeks ago

In order to avoid using up all descriptor sets for complicated pipelines, this PR changes the Vulkan CodeGen such that we encode each Kernel entry-point into it's own SPIR-V module and bind them as separate shaders to avoid running out of resources on constrained devices.

Remove module attribute destructor (since we can't guarantee the driver doesn't register an atexit call which may get invoked before).

Fixes #8297 , #8296 , #8295 , and #8294. Re-enable performance wrap test #7559 .

derek-gerstmann commented 2 weeks ago

@steven-johnson Okay, testing this PR on the configs I have locally has everything passing. Can we re-enable testing on the Buildbots to see if there's any device or driver differences I need to catch?

steven-johnson commented 2 weeks ago

@steven-johnson Okay, testing this PR on the configs I have locally has everything passing. Can we re-enable testing on the Buildbots to see if there's any device or driver differences I need to catch?

https://github.com/halide/build_bot/pull/294

steven-johnson commented 2 weeks ago

@steven-johnson Okay, testing this PR on the configs I have locally has everything passing. Can we re-enable testing on the Buildbots to see if there's any device or driver differences I need to catch?

halide/build_bot#294

Done, please start testing

derek-gerstmann commented 2 weeks ago

I’ll investigate the three failures:

The following tests FAILED:
correctness_bounds 
correctness_debug_to_file
correctness_device_buffer_copies_with_profile

derek-gerstmann commented 2 weeks ago

Fixes added for #8466

derek-gerstmann commented 2 weeks ago

@steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled.

steven-johnson commented 2 weeks ago

@steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled.

Done, change made and buildbot master restarted -- please try again :-)

derek-gerstmann commented 2 weeks ago

So, other than the build failures from LLVM20 interface changes, all the Vulkan tests and apps are actually passing now and reporting "SUCCESS". However, there's exception's being thrown at shutdown that I can't reproduce.

derek-gerstmann commented 1 week ago

Ah, wait. I was able to get a build that seems to reproduce the shutdown issues. Investigating now.

steven-johnson commented 3 days ago

What's the status here, still underway?

derek-gerstmann commented 3 days ago

I'm trying to reproduce the latest shutdown failures after my fixes, but so far haven't succeeded. I'm currently trying to setup a build using clang as the system compiler with -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_EXTENSIVE to see if I can find any source of badness.

derek-gerstmann commented 2 days ago

Well, running the tests with address sanitizer enabled is showing the latest Ubuntu 22.04 Vulkan loader v1.3.204.1-2 on Linux appears to have some issues. I'm not sure if this is contributing to the shutdown issues ...

> VK_INSTANCE_LAYERS=VK_LAYER_NV_optimus HL_JIT_TARGET=host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13-asan ./test/correctness/correctness_hello_gpu 
Defining function...
Realizing function...
=================================================================
==2242256==ERROR: AddressSanitizer: heap-use-after-free on address 0x502000032350 at pc 0x55f2fb5c79d8 bp 0x7ffe616039d0 sp 0x7ffe61603160
READ of size 4 at 0x502000032350 thread T0
    #0 0x55f2fb5c79d7 in printf_common(void*, char const*, __va_list_tag*) /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_format.inc:563:9
    #1 0x55f2fb5c8174 in __vsnprintf_chk /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1657:1
    #2 0x7f422bc9db0b  (/lib/x86_64-linux-gnu/libvulkan.so.1+0x42b0b) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
    #3 0x7f422bc96151  (/lib/x86_64-linux-gnu/libvulkan.so.1+0x3b151) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
    #4 0x7f422bc99ab8  (/lib/x86_64-linux-gnu/libvulkan.so.1+0x3eab8) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
    #5 0x7f422bc8a425 in vkEnumerateInstanceExtensionProperties (/lib/x86_64-linux-gnu/libvulkan.so.1+0x2f425) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
    #6 0x7f422bc0e284  (<unknown module>)

0x502000032350 is located 0 bytes inside of 8-byte region [0x502000032350,0x502000032358)
freed by thread T0 here:
    #0 0x55f2fb63ba56 in free /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
    #1 0x7f422bc960c2  (/lib/x86_64-linux-gnu/libvulkan.so.1+0x3b0c2) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)

previously allocated by thread T0 here:
    #0 0x55f2fb63bcef in malloc /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
    #1 0x7f422bc9d377  (/lib/x86_64-linux-gnu/libvulkan.so.1+0x42377) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)

SUMMARY: AddressSanitizer: heap-use-after-free (/lib/x86_64-linux-gnu/libvulkan.so.1+0x42b0b) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296) 
Shadow bytes around the buggy address:
  0x502000032080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000032100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000032180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000032200: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000032280: fa fa fa fa fa fa fa fa fa fa fd fd fa fa fd fa
=>0x502000032300: fa fa fa fa fa fa fd fd fa fa[fd]fa fa fa 00 fa
  0x502000032380: fa fa fd fa fa fa fa fa fa fa fa fa fa fa fd fd
  0x502000032400: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x502000032480: fa fa fd fa fa fa fa fa fa fa fa fa fa fa fd fa
  0x502000032500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000032580: fa fa fa fa fa fa fa fa fa fa fd fa fa fa fd fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==2242256==ABORTING

derek-gerstmann commented 2 days ago

Newer versions of the Vulkan loader appear to fix this. I'll keep digging.

derek-gerstmann commented 1 day ago

Okay, I was able to diagnose the segfaults on the buildbot and they're caused by the Vulkan runtime module destructor being invoked, and for whatever reason, it seems like the API function pointers are invalid (even though the loader lib was loaded, and the func table initialized). Any call to a Vulkan API method in the destructor will immediately segfault. This doesn't happen on my local machine, so there may be some atexit ordering issues at play here.

derek-gerstmann commented 1 day ago

This seems bad. Running with LD_DEBUG=files, the link map for the NVIDIA driver gets destroyed before the vulkan runtime AOT destructor gets called. =/

> LD_DEBUG=files ../halide-build-dg/test/generator/generator_aotcpp_cleanup_on_error 
    197096: 
    197096: file=libVkLayer_MESA_device_select.so [0];  dynamically loaded by /lib/x86_64-linux-gnu/libvulkan.so.1 [0]
    197096: file=libVkLayer_MESA_device_select.so [0];  generating link map
    197096:   dynamic: 0x00007c2a32675da0  base: 0x00007c2a32667000   size: 0x00000000000114d8
    197096:     entry: 0x00007c2a32667000  phdr: 0x00007c2a32667040  phnum:                 11
    197096: calling init: /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
    197096: opening file=/lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so [0]; direct_opencount=1
    197096: opening file=/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.216.01 [0]; direct_opencount=3
    197096: 
    vk_load_vulkan_functions (user_context: 0x0)

...

halide_vulkan_finalize_kernels (user_context: 0x0, state_ptr: 0x2
    Time: 2.800000e-04 ms
Success!
    197096: 
    197096: calling fini: /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1 [0]
    197096: file=/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1 [0];  destroying link map
    197096: closing file=/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.216.01; direct_opencount=1
    197096: closing file=/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.216.01; direct_opencount=2
    197096: closing file=/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.216.01; direct_opencount=1
    197096: calling fini: /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.216.01 [0]
    197096: 
    197096: file=/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.216.01 [0];  destroying link map
    197096: calling fini: ../halide-build-dg/test/generator/generator_aotcpp_cleanup_on_error [0]
    197096: 
halide_vulkan_device_release (user_context: 0x0)
vk_destroy_context (user_context: 0x0)
Segmentation fault (core dumped)

derek-gerstmann commented 1 day ago

Any thoughts on how to handle this? @steven-johnson @zvookin

abadams commented 16 hours ago

There's a school of thought that says you should just leak everything at process exit, because the OS has to clean it up anyway, and the OS is probably better at it, so let the OS do it. If we're facing these kinds of problems it might be more robust to just adopt that mentality, and remove the destructor attribute from the clean-up function.

derek-gerstmann commented 12 hours ago

After some internal discussion, we agree with @abadams ... in this situation, since we don't have control over the driver itself, and only hold onto the Vulkan ICD Loader lib, there's not much we can do if the driver happens to register an atexit() handler which gets invoked before the module attribute destructor gets called since the pointers will be invalidated.

Latest commits remove the cleanup calls inside the module destructor, and update the object lifetime and leak tests to manually call the device_release() to validate we still cleanup correctly.

halide / Halide

[vulkan] Reduce the number of descriptor sets, remove cleanup in destructor #8452