tzakharko / m4-sme-exploration

Exploring the scalable matrix extension of the Apple M4 processor
MIT License
134 stars · 8 forks

MacOS support? #4

Open oscarbg opened 1 week ago

oscarbg commented 1 week ago

Hi, the M4 is now in macOS devices.. how can I test it with this code?

tzakharko commented 1 week ago

Unfortunately I don't have an M4 Mac to test.

Running the code on the Mac would require a bit of refactoring. I'd remove all the UI and make the app command-line only. Then I'd tweak the writeReport() function to save reports in a local directory. Finally, the current makefile is written for running on an external device, so it would need to be adjusted. Probably best to have two separate makefiles and use the setup script to control the execution.

Pull requests are welcome!

oscarbg commented 1 week ago

Thanks for the details on porting.. unfortunately I'm busy with other things right now, but I'll revisit and try your suggestions when possible.. in any case, I can test on an M4 Mac immediately if you or anyone else can provide a testing macOS branch..

zinphi commented 4 days ago

Just tested it with an M4 Max. Since there is evidently one SME(2) unit per P-core cluster, you get double the performance with an M4 Pro/Max (1 TFLOPS DP FMOPA, 4 TFLOPS SP FMOPA, 8 TOPS small integer). I'm a little underwhelmed by its performance, since the M1 Max already achieved about 800 GFLOPS DP with Apple's AMX extension. In my real-world large DGEMM application, the M4 Max with Accelerate is just half as fast as the 9950X with AOCL, where you have full AVX-512 support on each of the 16 cores. On the M4 Max, the OpenBLAS NEON DGEMM kernel is only about 1/3 slower than SME. The one major advantage: the power draw of the SME units is nearly constant across operations; the system peaks somewhere below 30 W total (lid closed). With the OpenBLAS kernel firing on all P-cores, power consumption peaks at over 90 W until thermal throttling kicks in after a few seconds...
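For what it's worth, the reported SP/DP ratio falls out of simple outer-product arithmetic. A back-of-envelope sketch (assuming a 512-bit streaming vector length and one SME unit per P-core cluster, as discussed in this thread — not figures from Apple documentation):

```python
# Back-of-envelope check of the reported M4 Max SME figures.
# Assumptions (from this thread, not from Apple docs): 512-bit streaming
# vector length (SVL), one SME unit per P-core cluster, two P-core
# clusters on the M4 Max, ~2 TFLOPS FP32 per unit.

SVL_BITS = 512

def fmopa_flops_per_instr(elem_bits: int) -> int:
    """An FMOPA computes an n x n outer product of two n-element vectors
    and accumulates it into a ZA tile: one multiply and one add per tile
    element, i.e. 2 * n^2 FLOPs per instruction."""
    n = SVL_BITS // elem_bits
    return 2 * n * n

fp32 = fmopa_flops_per_instr(32)  # 16x16 tile -> 512 FLOPs/instr
fp64 = fmopa_flops_per_instr(64)  #  8x8  tile -> 128 FLOPs/instr

# At an equal issue rate, FP32 throughput should be 4x FP64 -- matching
# the reported 4 TFLOPS SP vs 1 TFLOPS DP (both with two SME units).
print(fp32 // fp64)  # -> 4

# And ~2 TFLOPS SP per unit (the iPad M4 single-cluster peak) times two
# clusters gives the reported 4 TFLOPS SP.
print(2.0 * 2)  # -> 4.0
```

This doesn't explain the small-integer number, which depends on how many widening multiply-accumulates the integer MOPA variants perform per tile element.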

SME_EXP_M4MAX12p4e128GB.out.zip

oscarbg commented 4 days ago

@zinphi can you share the instructions/code modifications needed to run it? Would be appreciated! Thanks..

zinphi commented 4 days ago

I just did the quickest possible hack to get it running under macOS. Follow all the instructions from https://github.com/scalable-analyses/sme/tree/main/MicrobenchmarkApp to set up the Xcode project, but add the files and folders from /src of this project instead. In the file app.swift, comment out or delete lines 2 and 23. That's basically it. Build and run it; I just inspected the command-line output.

oscarbg commented 4 days ago

Thanks! Will try it and compare vs. my M4 (base) chip..

tzakharko commented 4 days ago

@zinphi Thank you for sharing these! There is plenty of interesting stuff going on here. First, I'd expect the M4 Max to run at a higher clock than the iPad, but the peak single-core SME FP32 outer-product rate is still the same ~2 TFLOPS. Second, the system appears to automatically distribute the threads across clusters: you only need two high-priority threads to reach 4 TFLOPS, but using more threads seems to hurt the scheduling again. We don't see much contribution from the E-cluster here either; the peak is only about 4150 GOP/s:

FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 1H+0L  |       2007.8 GOP/s (2.04 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 2H+0L  |      3973.61 GOP/s (2.06 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 3H+0L  |      3249.13 GOP/s (3.78 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 4H+0L  |      3954.98 GOP/s (4.14 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 5H+0L  |      3698.18 GOP/s (5.54 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 6H+0L  |      3219.74 GOP/s (7.63 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 7H+0L  |      3895.84 GOP/s (7.36 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 8H+0L  |      3501.01 GOP/s (9.36 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 9H+0L  |      3807.67 GOP/s (9.68 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 10H+0L |      4189.56 GOP/s (9.78 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 11H+0L |       3971.5 GOP/s (11.34 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 12H+0L |      4179.68 GOP/s (11.76 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 13H+0L |      4006.19 GOP/s (13.29 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 14H+0L |      4159.22 GOP/s (13.79 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 15H+0L |       4035.8 GOP/s (15.22 ms)
FMOPA (FP32)                                       | ILP=4  | VLx64 | threads 16H+0L |      4121.76 GOP/s (15.9 ms)

The low-priority thread results are weird as well. They achieve 2 TFLOPS on a single thread, which leads me to believe that these threads do not run on E-cores at all.

All of this again highlights how difficult it is to combine SME and multithreading and achieve optimal results.
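As a sanity check on the single-thread figure above: if the SME unit retires one FP32 FMOPA per cycle at a 512-bit SVL (an assumption, not a documented issue rate), the measured ~2008 GOP/s implies a unit clock just under 4 GHz:

```python
# Rough inference of the SME unit clock from the measured single-thread
# throughput in the table above. Assumptions: one FP32 FMOPA retired per
# cycle, 512-bit SVL, so 2 * (512/32)^2 = 512 FLOPs per instruction.

measured_gops = 2007.8                    # threads 1H+0L row above
flops_per_instr = 2 * (512 // 32) ** 2    # 512 FLOPs per FMOPA

implied_clock_ghz = measured_gops / flops_per_instr
print(round(implied_clock_ghz, 2))  # -> 3.92
```

~3.9 GHz is in the right range for an M4 P-cluster clock, which would suggest the SME unit runs at (or near) the cluster clock rather than significantly faster on the Max.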

zinphi commented 3 days ago

I could imagine that there is no SME co-processor on the E-core clusters at all, and that the iPad simply clocks the SME co-processor much lower when an SME instruction is dispatched from a low-priority thread/core, in order to simulate E-core power draw. Perhaps Apple considers that approach unnecessary for larger devices. On the M4 Max it seems that low-priority threads can access/share one (P-core) SME co-processor without down-clocking, while the other is exclusive to high-priority threads. This also implies that low-priority threads using SME instructions can exhibit a much higher power draw on the M4 Max/macOS than a programmer would expect. But this is all pure speculation...

tzakharko commented 2 days ago

@zinphi I do believe that M4 (iPad version) has an SME block on the E-cluster. The behavior is fundamentally different from the P-cluster SME, and maxing out the hardware threads results in higher performance than what is achievable on P-cores alone.

One could bring more clarity by sampling the performance counters, which provide information about core utilization and whether a thread runs on a P- or E-core.

oscarbg commented 2 days ago

Adding M4 macOS benchmarks in case they're useful.. obtained using @zinphi's instructions! benchm4full.zip