derek-gerstmann opened 3 weeks ago
@steven-johnson Okay, testing this PR on the configs I have locally, everything is passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?
Done, please start testing
I’ll investigate the three failures:
The following tests FAILED:
correctness_bounds
correctness_debug_to_file
correctness_device_buffer_copies_with_profile
Fixes added for #8466
@steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when the tests are run with ctest --parallel. It doesn't seem very robust. It's useful for diagnosing issues when they come up, but it doesn't seem production-worthy. I'd suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config and keep the Vulkan tests enabled.
Done, change made and buildbot master restarted -- please try again :-)
So, other than the build failures from the LLVM 20 interface changes, all the Vulkan tests and apps are actually passing now and reporting "SUCCESS". However, exceptions are being thrown at shutdown that I can't reproduce.
Ah, wait. I was able to get a build that seems to reproduce the shutdown issues. Investigating now.
What's the status here, still underway?
I'm trying to reproduce the latest shutdown failures after my fixes, but so far I haven't succeeded. I'm currently setting up a build that uses clang as the system compiler with -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_EXTENSIVE to see if I can find any source of badness.
Running the tests with AddressSanitizer enabled shows that the Vulkan loader shipped with the latest Ubuntu 22.04 (v1.3.204.1-2) appears to have some issues of its own on Linux. I'm not sure if this is contributing to the shutdown issues ...
> VK_INSTANCE_LAYERS=VK_LAYER_NV_optimus HL_JIT_TARGET=host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13-asan ./test/correctness/correctness_hello_gpu
Defining function...
Realizing function...
=================================================================
==2242256==ERROR: AddressSanitizer: heap-use-after-free on address 0x502000032350 at pc 0x55f2fb5c79d8 bp 0x7ffe616039d0 sp 0x7ffe61603160
READ of size 4 at 0x502000032350 thread T0
#0 0x55f2fb5c79d7 in printf_common(void*, char const*, __va_list_tag*) /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_format.inc:563:9
#1 0x55f2fb5c8174 in __vsnprintf_chk /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1657:1
#2 0x7f422bc9db0b (/lib/x86_64-linux-gnu/libvulkan.so.1+0x42b0b) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
#3 0x7f422bc96151 (/lib/x86_64-linux-gnu/libvulkan.so.1+0x3b151) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
#4 0x7f422bc99ab8 (/lib/x86_64-linux-gnu/libvulkan.so.1+0x3eab8) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
#5 0x7f422bc8a425 in vkEnumerateInstanceExtensionProperties (/lib/x86_64-linux-gnu/libvulkan.so.1+0x2f425) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
#6 0x7f422bc0e284 (<unknown module>)
0x502000032350 is located 0 bytes inside of 8-byte region [0x502000032350,0x502000032358)
freed by thread T0 here:
#0 0x55f2fb63ba56 in free /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
#1 0x7f422bc960c2 (/lib/x86_64-linux-gnu/libvulkan.so.1+0x3b0c2) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
previously allocated by thread T0 here:
#0 0x55f2fb63bcef in malloc /home/dg/Workspace/OpenSource/LLVM/Repos/llvm-v18.1.8/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
#1 0x7f422bc9d377 (/lib/x86_64-linux-gnu/libvulkan.so.1+0x42377) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
SUMMARY: AddressSanitizer: heap-use-after-free (/lib/x86_64-linux-gnu/libvulkan.so.1+0x42b0b) (BuildId: f0c68e18c65acc33ade4f150c1f99e53fc3eb296)
Shadow bytes around the buggy address:
0x502000032080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x502000032100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x502000032180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x502000032200: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x502000032280: fa fa fa fa fa fa fa fa fa fa fd fd fa fa fd fa
=>0x502000032300: fa fa fa fa fa fa fd fd fa fa[fd]fa fa fa 00 fa
0x502000032380: fa fa fd fa fa fa fa fa fa fa fa fa fa fa fd fd
0x502000032400: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
0x502000032480: fa fa fd fa fa fa fa fa fa fa fa fa fa fa fd fa
0x502000032500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x502000032580: fa fa fa fa fa fa fa fa fa fa fd fa fa fa fd fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==2242256==ABORTING
Newer versions of the Vulkan loader appear to fix this. I'll keep digging.
Okay, I was able to diagnose the segfaults on the buildbot: they're caused by the Vulkan runtime module destructor being invoked after the API function pointers have, for whatever reason, become invalid (even though the loader lib was loaded and the function table initialized). Any call to a Vulkan API method in the destructor immediately segfaults. This doesn't happen on my local machine, so there may be some atexit ordering issues at play here.
This seems bad. Running with LD_DEBUG=files, the link map for the NVIDIA driver gets destroyed before the Vulkan runtime AOT destructor gets called. =/
> LD_DEBUG=files ../halide-build-dg/test/generator/generator_aotcpp_cleanup_on_error
197096:
197096: file=libVkLayer_MESA_device_select.so [0]; dynamically loaded by /lib/x86_64-linux-gnu/libvulkan.so.1 [0]
197096: file=libVkLayer_MESA_device_select.so [0]; generating link map
197096: dynamic: 0x00007c2a32675da0 base: 0x00007c2a32667000 size: 0x00000000000114d8
197096: entry: 0x00007c2a32667000 phdr: 0x00007c2a32667040 phnum: 11
197096: calling init: /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
197096: opening file=/lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so [0]; direct_opencount=1
197096: opening file=/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.216.01 [0]; direct_opencount=3
197096:
vk_load_vulkan_functions (user_context: 0x0)
...
halide_vulkan_finalize_kernels (user_context: 0x0, state_ptr: 0x2
Time: 2.800000e-04 ms
Success!
197096:
197096: calling fini: /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1 [0]
197096: file=/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1 [0]; destroying link map
197096: closing file=/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.216.01; direct_opencount=1
197096: closing file=/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.216.01; direct_opencount=2
197096: closing file=/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.216.01; direct_opencount=1
197096: calling fini: /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.216.01 [0]
197096:
197096: file=/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.216.01 [0]; destroying link map
197096: calling fini: ../halide-build-dg/test/generator/generator_aotcpp_cleanup_on_error [0]
197096:
halide_vulkan_device_release (user_context: 0x0)
vk_destroy_context (user_context: 0x0)
Segmentation fault (core dumped)
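To make the failure mode concrete, here is a minimal, hypothetical sketch of the fragile pattern (assumed names, not the actual Halide Vulkan runtime code): the runtime caches function pointers resolved through the loader and then calls them from a destructor-attribute function, which only works if the loader and driver are still mapped when that destructor runs.

```cpp
// Minimal sketch of the fragile pattern (hypothetical, not the actual
// Halide Vulkan runtime). The cached function pointer is only valid
// while the loader/driver shared objects are still mapped.
#include <vulkan/vulkan.h>

static VkInstance cached_instance = VK_NULL_HANDLE;
static PFN_vkDestroyInstance cached_destroy = nullptr;

void remember_instance(VkInstance instance) {
    cached_instance = instance;
    cached_destroy = reinterpret_cast<PFN_vkDestroyInstance>(
        vkGetInstanceProcAddr(instance, "vkDestroyInstance"));
}

// Runs during process teardown. If the driver's own fini/atexit handler
// has already run and its libraries are unmapped (as in the trace above),
// cached_destroy points into unmapped memory and this call segfaults.
__attribute__((destructor)) static void module_cleanup() {
    if (cached_destroy && cached_instance != VK_NULL_HANDLE) {
        cached_destroy(cached_instance, nullptr);
    }
}
```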
Any thoughts on how to handle this? @steven-johnson @zvookin
There's a school of thought that says you should just leak everything at process exit, because the OS has to clean it up anyway, and the OS is probably better at it, so let the OS do it. If we're facing these kinds of problems it might be more robust to just adopt that mentality, and remove the destructor attribute from the clean-up function.
After some internal discussion, we agree with @abadams. In this situation, since we don't control the driver itself and only hold onto the Vulkan ICD Loader lib, there's not much we can do if the driver happens to register an atexit() handler that gets invoked before the module attribute destructor: by that point the function pointers have already been invalidated.
The latest commits remove the cleanup calls inside the module destructor and update the object lifetime and leak tests to call device_release() manually, to validate that we still clean up correctly.
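For reference, here is a hedged sketch of what the explicit release looks like from a test's point of view, assuming the standard runtime entry points (halide_device_release and halide_vulkan_device_interface); this is illustrative, not the exact test code in this PR:

```cpp
#include "HalideBuffer.h"
#include "HalideRuntimeVulkan.h"

int main() {
    // Run whatever pipeline the test exercises on the Vulkan device...
    Halide::Runtime::Buffer<float> buf(256, 256);
    // ... realize an AOT pipeline into buf here ...

    // Drop any device allocations the test still holds.
    buf.device_free();

    // Explicitly tear down the Vulkan context while the driver is still
    // loaded, instead of relying on a module destructor at process exit.
    halide_device_release(nullptr, halide_vulkan_device_interface());
    return 0;
}
```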
In order to avoid using up all descriptor sets for complicated pipelines, this PR changes the Vulkan CodeGen so that each kernel entry point is encoded into its own SPIR-V module and bound as a separate shader, which avoids running out of resources on constrained devices (see the sketch below).
Remove the module attribute destructor (since we can't guarantee the driver doesn't register an atexit() handler that gets invoked first).
Fixes #8297, #8296, #8295, and #8294. Re-enable performance wrap test #7559.
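To illustrate the per-kernel shader module idea mentioned above (plain Vulkan API with a hypothetical helper, not the actual CodeGen_Vulkan_Dev changes):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Hypothetical helper: create one VkShaderModule per kernel entry point
// rather than a single module containing every entry point. Each module
// can then get its own pipeline and descriptor set layout, keeping the
// number of descriptor bindings per shader small on constrained devices.
std::vector<VkShaderModule> create_per_kernel_modules(
    VkDevice device,
    const std::vector<std::vector<uint32_t>> &spirv_per_kernel) {
    std::vector<VkShaderModule> modules;
    modules.reserve(spirv_per_kernel.size());
    for (const auto &spirv : spirv_per_kernel) {
        VkShaderModuleCreateInfo info = {};
        info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
        info.codeSize = spirv.size() * sizeof(uint32_t);
        info.pCode = spirv.data();

        VkShaderModule module = VK_NULL_HANDLE;
        if (vkCreateShaderModule(device, &info, nullptr, &module) == VK_SUCCESS) {
            modules.push_back(module);
        }
    }
    return modules;
}
```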