Open oscarbg opened 1 week ago
Unfortunately I don't have an M4 Mac to test.
Running the code on the Mac would require a bit of refactoring. I'd remove all the UI and make the app command-line only. Then I'd tweak the writeReport() function to save reports in a local directory. Finally, the current makefile is written for running on an external device, so it would need to be adjusted. Probably best to have two separate makefiles and use the setup script to control the execution.
Pull requests are welcome!
Thanks for the details on porting. Unfortunately I'm busy with other things right now, but I will revisit this and try your suggestions when possible. In any case, I can test on an M4 Mac immediately if you or someone else can provide a macOS testing branch.
Just tested it with an M4 Max. Since there is apparently one SME(2) engine per P-core cluster, you get double the performance with an M4 Pro/Max (1 TFLOP DP FMOPA, 4 TFLOP SP MOPA, 8 TOPS small integer). I'm a little underwhelmed by its performance, since the M1 Max already achieved about 800 GFLOP DP with Apple's AMX extension. In my real-world large DGEMM application, the M4 Max with Accelerate is just half as fast as the 9950X with AOCL, where you have full AVX-512 support on each of the 16 cores. On the M4 Max, the OpenBLAS NEON DGEMM kernel is only about 1/3 slower than SME. The only major advantage: the power draw of the SME units is nearly constant for all operations, and the system peaks somewhere below 30W total power (lid closed). With the OpenBLAS kernel running on all P-cores, power consumption peaks at over 90W until thermal throttling kicks in after a few seconds...
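To put the power remark into numbers, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in the comment above and rests on two assumptions that are not stated there: that "about 1/3 slower" means the NEON kernel delivers roughly 2/3 of the SME throughput, and that the two power figures (which are bounds on total system power, not measurements of the units themselves) can be compared directly. The function name `efficiency_ratio` is made up for this illustration.

```python
# Figures quoted in the comment above; both power numbers are bounds.
SME_POWER_W = 30.0   # "somewhere below 30W"  -> upper bound for the SME run
NEON_POWER_W = 90.0  # "peaks at over 90W"    -> lower bound for the NEON run

# Assumption: "about 1/3 slower" is read as NEON reaching ~2/3 of SME throughput.
NEON_RELATIVE_THROUGHPUT = 2.0 / 3.0


def efficiency_ratio(sme_power_w, neon_power_w, neon_rel_throughput):
    """Return SME throughput-per-watt divided by NEON throughput-per-watt.

    SME throughput is normalised to 1.0, so only relative numbers are needed.
    """
    sme_eff = 1.0 / sme_power_w
    neon_eff = neon_rel_throughput / neon_power_w
    return sme_eff / neon_eff


ratio = efficiency_ratio(SME_POWER_W, NEON_POWER_W, NEON_RELATIVE_THROUGHPUT)
print(f"SME throughput per watt is roughly {ratio:.1f}x that of NEON "
      "under these assumptions")
```

Because the SME figure is an upper bound and the NEON figure a lower bound, the result is better read as a floor for the short window before throttling than as a measured efficiency gap.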
@zinphi can you share instructions/code modifications needed to run? would be appreciated! thanks..
I just did the quickest possible hack to get it running under macOS. Follow all instructions from https://github.com/scalable-analyses/sme/tree/main/MicrobenchmarkApp to set up the Xcode project, but add the files and folders from /src of this project instead. In the file app.swift, comment out or delete lines 2 and 23. That is basically it. Build and run it. I then just inspected the command-line output...
thanks! will try and compare vs my M4 (basic) chip..
@zinphi Thank you for sharing these! There is plenty of interesting stuff going on there. First, I'd expect the M4 Max to run at a higher clock than the iPad, but the peak single-core SME FP32 outer product is still the same 2TFLOPS. Second, the system appears to automatically distribute the threads across clusters: you only need two high-priority threads to achieve 4TFLOPS. But using more threads seems to hurt the scheduling again. We don't see much contribution from the E-cluster here either, the peak is only 4150
FMOPA (FP32) | ILP=4 | VLx64 | threads 1H+0L | 2007.8 GOP/s (2.04 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 2H+0L | 3973.61 GOP/s (2.06 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 3H+0L | 3249.13 GOP/s (3.78 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 4H+0L | 3954.98 GOP/s (4.14 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 5H+0L | 3698.18 GOP/s (5.54 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 6H+0L | 3219.74 GOP/s (7.63 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 7H+0L | 3895.84 GOP/s (7.36 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 8H+0L | 3501.01 GOP/s (9.36 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 9H+0L | 3807.67 GOP/s (9.68 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 10H+0L | 4189.56 GOP/s (9.78 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 11H+0L | 3971.5 GOP/s (11.34 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 12H+0L | 4179.68 GOP/s (11.76 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 13H+0L | 4006.19 GOP/s (13.29 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 14H+0L | 4159.22 GOP/s (13.79 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 15H+0L | 4035.8 GOP/s (15.22 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 16H+0L | 4121.76 GOP/s (15.9 ms)
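For anyone who wants to check the scaling claim against the numbers, here is a small Python sketch that parses the lines above and prints the throughput relative to the single-thread run. The helper names (`parse_results`, `scaling`) are made up for this illustration and are not part of the benchmark code; the regular expressions assume the line layout shown above and tolerate leftover terminal colour codes.

```python
import re

# Benchmark lines copied from the output above.
RESULTS = """\
FMOPA (FP32) | ILP=4 | VLx64 | threads 1H+0L | 2007.8 GOP/s (2.04 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 2H+0L | 3973.61 GOP/s (2.06 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 3H+0L | 3249.13 GOP/s (3.78 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 4H+0L | 3954.98 GOP/s (4.14 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 5H+0L | 3698.18 GOP/s (5.54 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 6H+0L | 3219.74 GOP/s (7.63 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 7H+0L | 3895.84 GOP/s (7.36 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 8H+0L | 3501.01 GOP/s (9.36 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 9H+0L | 3807.67 GOP/s (9.68 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 10H+0L | 4189.56 GOP/s (9.78 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 11H+0L | 3971.5 GOP/s (11.34 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 12H+0L | 4179.68 GOP/s (11.76 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 13H+0L | 4006.19 GOP/s (13.29 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 14H+0L | 4159.22 GOP/s (13.79 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 15H+0L | 4035.8 GOP/s (15.22 ms)
FMOPA (FP32) | ILP=4 | VLx64 | threads 16H+0L | 4121.76 GOP/s (15.9 ms)
"""

# Matches terminal colour codes such as "[0;32m", with or without the ESC byte.
ANSI_RE = re.compile(r"\x1b?\[[0-9;]*m")
# Captures high-priority threads, low-priority threads, and the GOP/s figure.
LINE_RE = re.compile(r"threads\s+(\d+)H\+(\d+)L\s*\|\s*([\d.]+)\s+GOP/s")


def parse_results(text):
    """Return {high_priority_thread_count: GOP/s} for every matching line."""
    rates = {}
    for line in text.splitlines():
        match = LINE_RE.search(ANSI_RE.sub("", line))
        if match:
            rates[int(match.group(1))] = float(match.group(3))
    return rates


def scaling(rates):
    """Return each rate divided by the single-thread rate."""
    base = rates[1]
    return {n: rate / base for n, rate in sorted(rates.items())}


rates = parse_results(RESULTS)
for n, factor in scaling(rates).items():
    print(f"{n:2d}H: {rates[n]:8.2f} GOP/s  ({factor:.2f}x of 1 thread)")
```

On these figures, two threads reach about 1.98x the single-thread rate, and no thread count gets past roughly 2.09x.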
The low-priority thread results are weird as well. They achieve 2TFLOPS on a single thread, leading me to believe that these threads do not run on E-cores at all.
All of this again highlights how difficult it is to combine SME and multithreading and achieve optimal results.
I could imagine that there is no SME co-processor on the E-core clusters at all, and that the iPad simply clocks the SME co-processor much lower when an SME instruction is dispatched from a low-priority thread/core, to simulate E-core power draw. It could be that Apple does not consider this approach necessary for larger devices. On the M4 Max it seems that low-priority threads can access/share one (P-core) SME co-processor (without down-clocking) and the other one is exclusive to high-priority threads. This also implies that low-priority threads can exhibit a much higher power draw on the M4 Max/macOS than a programmer would expect when using SME instructions. But this is all pure speculation...
@zinphi I do believe that M4 (iPad version) has an SME block on the E-cluster. The behavior is fundamentally different from the P-cluster SME, and maxing out the hardware threads results in higher performance than what is achievable on P-cores alone.
One could bring more clarity by sampling the performance counters that provide information about the core utilization and whether the thread runs on P- or E-core.
Adding M4 macOS benchmarks in case they are useful, obtained using @zinphi's instructions! benchm4full.zip
Hi, the M4 is now in macOS devices. How can I test it with this code?