halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.86k stars 1.07k forks source link

[vulkan] Fix Vulkan SIMT mappings for GPU loop vars. #8259

Closed derek-gerstmann closed 3 months ago

derek-gerstmann commented 4 months ago

Previous refactoring accidentally used the fully qualified var name rather than the categorized vulkan intrinsic name.

See also #8258.

derek-gerstmann commented 4 months ago

Sadly, we missed this regression since we disabled the Vulkan backend from running on the build bots due to spurious errors when we were getting the last release out the door. With all the recent Vulkan fixes, perhaps we should reconsider turning them back on? @steven-johnson

steven-johnson commented 4 months ago

Looks like we are getting a segfault in generator_aot_cleanup_on_error for vulkan

derek-gerstmann commented 4 months ago

Looks like we are getting a segfault in generator_aot_cleanup_on_error for vulkan

Tried to reproduce on a Linux VM, and OSX but neither seem to trigger this. I'll try native Linux ...

Okay, was able to reproduce the error for generator_aot_cleanup_on_error. This seems like memory corruption due to the custom halide alloc routine, which then clobbers the shader code somehow. Investigating some more ...

derek-gerstmann commented 3 months ago

Confirmed that the SPIRV binary that's created during CodeGen when the generator is run, does not match the SPIRV binary that's passed to the runtime during the execution of the test. There's a semi regular pattern of bytes that have been overwritten (all binary values of 0D are replaced with OA). Strange!

steven-johnson commented 3 months ago

There's a semi regular pattern of bytes that have been overwritten (all binary values of 0D are replaced with OA). Strange!

something is treating it as a text file/stream and trying to convert LF <-> CR... verify that all files are opened in binary mode?

derek-gerstmann commented 3 months ago

There's a semi regular pattern of bytes that have been overwritten (all binary values of 0D are replaced with OA). Strange!

something is treating it as a text file/stream and trying to convert LF <-> CR... verify that all files are opened in binary mode?

Ah, that seems likely. But I'm not sure where this is coming from ... the binary gets injected into a buffer that's exposed through the _gpu_source_kernels. So there shouldn't be any text translation occuring. Still digging.

derek-gerstmann commented 3 months ago

Aha! Found it ... the host-side CodeGen that invokes the GPU runtime uses CodeGen_C to encode the GPU kernel source into a formatted string. I'll disable this formatting for Vulkan to avoid the text translation.

steven-johnson commented 3 months ago

Should we backport this to the 17.x release?

derek-gerstmann commented 3 months ago

Yes … I think both the GPU Vars and CodeGen_C string formatting are in v17.x