philipturner / metal-flash-attention

FlashAttention (Metal Port)
MIT License

simdgroup_async issues - Xcode Version 15.0.1 (15A507) / M3 Max 14.1.2 (23B2091) #14

Closed bpkeene closed 10 months ago

bpkeene commented 10 months ago

metal_config in the toolchain doesn't mention HAVE_SIMDGROUP_FUTURE. It seems the headers referred to in https://github.com/dougallj/applegpu/issues/28 were removed altogether from newer versions of Xcode (I'm using 15.0.1 / 15A507).
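For anyone who wants to repeat that check on their own install, here is a minimal sketch that walks a directory tree and lists the files mentioning a given token. The function name `find_token_in_tree` is hypothetical, and the directory to scan is an assumption: point it at wherever your Xcode keeps the Metal toolchain.

```python
import os


def find_token_in_tree(root, token):
    """Return the sorted paths of files under `root` whose bytes contain `token`."""
    needle = token.encode("utf-8")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if needle in handle.read():
                        hits.append(path)
            except OSError:
                # Skip files that cannot be read (permissions, broken symlinks).
                continue
    return sorted(hits)


# Example call (the root path is an assumption; adjust it for your machine):
# find_token_in_tree(
#     "/Applications/Xcode.app/Contents/Developer/Toolchains/"
#     "XcodeDefault.xctoolchain/usr/metal",
#     "HAVE_SIMDGROUP_FUTURE",
# )
```

An empty result only says the token is absent from the files scanned; it does not by itself prove the feature was removed.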

I was able to find strings matching those from the above GitHub issue inside libapplegpu-nt.dylib in the 15.0.1 toolchain:

objdump --disassemble --demangle /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/metal/macos/lib/libapplegpu-nt.dylib > /tmp/symbols.txt

127b704: 08 91 16 91 add x8, x8, #1444 ; literal pool for: "air.simdgroup_async_copy_1d"
127b708: 60 f6 04 f9 str x0, [x19, #2536]
127b70c: 68 e2 04 f9 str x8, [x19, #2496]
127b710: 28 08 00 b0 adrp x8, 261 ; 0x1380000
127b714: 08 01 17 91 add x8, x8, #1472 ; literal pool for: "air.simdgroup_async_copy_2d"
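A lighter-weight way to look for the same names, without a full disassembly, is to scan the library's raw bytes for the string prefix. This is a rough sketch rather than a replacement for the objdump output above; `find_symbol_strings` is a hypothetical helper, and the assumption is that the names are stored as plain ASCII in the binary.

```python
import re


def find_symbol_strings(data, prefix):
    """Return unique ASCII strings in `data` that start with `prefix`, in order found."""
    # Match the prefix followed by any run of identifier-like characters.
    pattern = re.compile(re.escape(prefix.encode("ascii")) + rb"[A-Za-z0-9_.]*")
    found = []
    for match in pattern.finditer(data):
        text = match.group(0).decode("ascii")
        if text not in found:
            found.append(text)
    return found


# Example call (path taken from the objdump command above):
# with open("/Applications/Xcode.app/Contents/Developer/Toolchains/"
#           "XcodeDefault.xctoolchain/usr/metal/macos/lib/libapplegpu-nt.dylib",
#           "rb") as handle:
#     print(find_symbol_strings(handle.read(), "air.simdgroup_async"))
```

A hit only shows that the name is still present in the library. It says nothing about whether the intrinsic can still be reached from shader source.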

I didn't pursue this further to see whether things can still be patched up, or whether the functions are still usable and correct.

Is there interest / would there be positive reception to a PR using alternative read & write mechanisms in lieu of simdgroup_async?

Thanks!

philipturner commented 10 months ago

SIMD async copies are a hidden feature that provides large performance benefits on Apple GPUs. They were the advantage MPS had over open-source GEMM libraries, until MFA closed the gap.

The entire design of this library exists so that I can pre-compile with Xcode 14.2 and you don't have to go to the trouble of getting Xcode 14. Binaries are hosted on GitHub releases, but you can also follow the instructions in the README to compile it yourself.

philipturner commented 10 months ago

Closing this as a duplicate issue.