Very slow speed on Mac M1

Urrshak commented 3 years ago

Newest build for M1 is getting extremely slow compared to Intel for NFT plotting.

The Intel version completed Phase 1 in 3200s (Phases 2, 3, and 4 together in ~2000s), for a total plot time of 5300s. The M1 (Apple) version completes Phase 1 in 7400s (Phases 2, 3, and 4 together in ~2000s), for a total plot time of 9800s. The speed seems to drop massively after Table 1 of Phase 1, slowing the whole phase by ~4000s (Table 1 takes the same 47s on Intel and M1; the next six tables take far longer).

And that is only about getting the M1 on par with Intel, never mind that the M1 is supposed to be faster. Help me understand, because this doesn't make much sense.
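
To put numbers on the gap, here is a minimal shell sketch that computes the slowdown ratios from the timings quoted above. The helper name ratio is made up for illustration, and only awk is assumed to be available.

```shell
#!/bin/sh
# Ratio of two durations in seconds, rounded to two decimals.
# "ratio" is a hypothetical helper name used only for this example.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Figures quoted in the report (seconds): M1 first, Intel second.
echo "Phase 1 slowdown: $(ratio 7400 3200)x"
echo "Total slowdown:   $(ratio 9800 5300)x"
```

On these figures Phase 1 is about 2.3x slower and the whole plot about 1.85x slower, so nearly all of the gap comes from Phase 1.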

stanchiam commented 3 years ago

You are going to destroy your M1's SSD by plotting on it. Better check how much disk life is left.

SebMoore commented 3 years ago

yeah I would never plot on anything that doesn't have a user-replaceable ssd

obeykarma commented 3 years ago

I have this problem too, and I found that the chia_plot process's RAM use climbs above 16GB.

Urrshak commented 3 years ago

I never mentioned plotting on internal ssd😀 I use external NVMe.

SebMoore commented 3 years ago

> I never mentioned plotting on internal ssd😀 I use NVMe.

Phew! Had me worried.

stanchiam commented 3 years ago

external nvme through usb? that will explain why it is slow

Urrshak commented 3 years ago

> external nvme through usb? that will explain why it is slow

Not really. I'm using a TB3 enclosure on a TB3 port, exactly the same setup I used to plot OG plots with the Intel build of madMAx. The M1 build (same setup) is 2x slower.

AndyRPH commented 3 years ago

I know it's less than helpful, but I've got an M1 as well, and early madMAx builds were substantially faster for me than later builds on the same hardware. I couldn't quite pin down the specific commits that caused it.

dmk42 commented 3 years ago

It definitely isn't the 40Gbps Thunderbolt 3 enclosure. That doesn't hinder highly parallel plotting with chia plots create.

I don't know why the current version might be slower than previous versions, but I thought it was annoying and awfully suspicious that the M1 was taking so close to double the time of x86_64, and only in phase 1, so I took a look at the source code.

Phase 1 uses the BLAKE3 library, which takes advantage of ARM's NEON SIMD unit. That unit is 128 bits wide. BLAKE3 also takes advantage of x86_64's AVX2 and AVX512 units, though, and those are 256 and 512 bits wide, respectively. Therefore, it makes a lot of sense that phase 1 on an M1 would be about twice as slow as an AVX2 x86_64. It's getting half the parallelism per instruction.
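
As a rough illustration of that reasoning (a sketch, not a measurement): BLAKE3 works on 32-bit words, so the number of words one SIMD instruction can touch is the register width divided by 32. The helper name lanes is hypothetical.

```shell
#!/bin/sh
# Number of 32-bit words that fit in a SIMD register of the given bit width.
# "lanes" is a hypothetical helper name used only for this example.
lanes() {
    echo $(( $1 / 32 ))
}

echo "NEON    (128-bit): $(lanes 128) words per instruction"
echo "AVX2    (256-bit): $(lanes 256) words per instruction"
echo "AVX-512 (512-bit): $(lanes 512) words per instruction"
```

This only counts lanes per instruction. Real throughput also depends on how many SIMD units a core has and on how well the code is tuned for them.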

MaD088 commented 3 years ago

I am getting the same issue on a Mac mini M1 with 16GB. Table 1 of Phase 1 takes 28 seconds, then Table 2 takes very long. I was able to use madMAx for OG plots, and it took 1h20 to 1h30 per plot. Is there a way to get similar results for NFT plots?

AndyRPH commented 3 years ago

I don't think it's related to OG or NFT plots at all. I think some optimization of madMAx a month or two ago just dropped performance on M1 Macs. I don't really have the motivation to compile a bunch of older versions and isolate the change that impacted M1 plotting, but if you've got the time...

MaD088 commented 3 years ago

I don't really know how to do that. I am going to try building it on an old Mac, then copying it to the M1 to see if it's better. That's how I was using it before, so maybe I will get better results.

dmk42 commented 3 years ago

> I took a look at the source code.
>
> Phase 1 uses the BLAKE3 library, which takes advantage of ARM's NEON SIMD unit. That unit is 128 bits wide. BLAKE3 also takes advantage of x86_64's AVX2 and AVX512 units, though, and those are 256 and 512 bits wide, respectively. Therefore, it makes a lot of sense that phase 1 on an M1 would be about twice as slow as an AVX2 x86_64. It's getting half the parallelism per instruction.

It is worse than I thought. Digging deeper, I found that the NEON acceleration is not being used at all. I edited the CMakeLists.txt manually to enable the NEON code, and found that it did not help, so I profiled phase 1 of chia_plot using the Mac's sample profiling tool.

It turns out that phase 1 is spending about 90% of its run time in a function called blake3_compress_xof_portable. There are several versions of blake3_compress_xof to take advantage of various hardware accelerators, and the portable version used on the Mac is what we get when none of the accelerated versions work on our machine. The hardware-specific versions process 128-bit chunks at a time, and the portable version only works on 32 bits at a time.

There is a blake3_compress_xof_avx512, for example. There ought to be a blake3_compress_xof_neon, but it would not help because NEON 128-bit loads and stores depend on the data being 128-bit aligned, which apparently is not guaranteed to be the case in the BLAKE3 library. (See the fake 128-bit loads and stores in lib/BLAKE3/c/blake3_neon.c, which are actually calls to memcpy.) Therefore, they did not bother to write a blake3_compress_xof_neon.

What this means is that due to architectural limitations of the NEON SIMD unit of the M1 processor, which require SIMD data to be 128-bit aligned, the M1 is only performing its most expensive operation 32 bits at a time, compared to 128 bits for most x86_64 processors.

To eliminate the speed penalty, we would need to either 128-bit-align the data, or work around the misalignment, to write our own blake3_compress_xof_neon function.

AndyRPH commented 3 years ago

Hmm. Going to run a plot with it compiled for Intel, then, and see what the difference is when Rosetta works its magic.

Edit: Phase 1 more than twice as fast! Edit: Entire table took 120 mins exactly. Despite Rosetta!

AndyRPH commented 3 years ago

Is it possible to see what Rosetta 2 is doing and do that natively instead?

dmk42 commented 3 years ago

It turns out that the comments in the BLAKE3 code are, at best, outdated. I was relying on them for my information about how to implement BLAKE3 in NEON. There may have been a time when NEON on 32-bit ARM required alignment, but in the current 32-bit version, there is a choice between aligned and unaligned instructions.

In 64-bit ARM, like the M1, there is no distinction between aligned and unaligned instructions. If the data are aligned, the operations are automatically faster, but alignment is not necessary.

So, the comments on NEON in the BLAKE3 code must have been written based on old information. It is possible to write a NEON version of the offending function.

My guess is Rosetta 2 is translating the AVX-style instructions into NEON instructions.

This is great news. Unfortunately, I might not have a chance to work on a NEON implementation for some time. If someone else wants to jump in, feel free.

In the meantime, it sounds like running the Intel code on the M1 is a decent improvement on the current situation. Thanks for investigating that, @AndyRPH .

AndyRPH commented 3 years ago

Reformatted my SSD (via TB3 enclosure) to APFS instead of exFAT, now routinely getting plots done in 80-90 minutes on the M1 using the intel compiled version of madmax.

Urrshak commented 3 years ago

> Reformatted my SSD (via TB3 enclosure) to APFS instead of exFAT, now routinely getting plots done in 80-90 minutes on the M1 using the intel compiled version of madmax.

Care to share the build? I tried to use an Intel build compiled on an Intel Mac and just launch it on the M1 Mac. That worked before for OG plots, but not with pool plotting. Getting this error:

    dyld: Library not loaded: /usr/local/opt/libsodium/lib/libsodium.23.dylib
      Referenced from: /Users/Urrshak/chia-plotter/./build/chia_plot
      Reason: image not found
    zsh: abort ./build/chia_plot --help

AndyRPH commented 3 years ago

Same here... also copy the /usr/local/opt/libsodium folder from the Intel Mac to the M1 Mac.

Urrshak commented 3 years ago

> Same... copy the /usr/local/opt/libsodium folder from the Intel Mac to the M1 Mac also.

Ehm... I copied all files from /usr/local/opt/libsodium on the Intel Mac to the M1 Mac. Getting this error now:

    dyld: Library not loaded: /opt/homebrew/opt/libsodium/lib/libsodium.23.dylib
      Referenced from: /Users/Urrshak/chia-plotter/./build/chia_plot
      Reason: no suitable image found. Did find:
        /opt/homebrew/opt/libsodium/lib/libsodium.23.dylib: mach-o, but wrong architecture
        /opt/homebrew/opt/libsodium/lib/libsodium.23.dylib: stat() failed with errno=1
        /opt/homebrew/Cellar/libsodium/1.0.18_1/lib/libsodium.23.dylib: mach-o, but wrong architecture
    zsh: abort ./build/chia_plot -n 1 -r 8 -u 256 -t /Volumes/Samsung970/2/ -2 -d -c -f

Any advice?

AndyRPH commented 3 years ago

Sorry, I'm not familiar enough with the nuances of cross-compiling.

dmk42 commented 3 years ago

I tried two shortcuts to getting native code, but they did not pan out.

First, I tried the sse2neon.h header file designed to turn Intel intrinsics into ARM NEON code. I made a copy of blake3_sse41.c and modified it to include sse2neon.h instead of immintrin.h. I let the result run for a few hours, but it never finished the second subphase of phase 1. It seemed to be caught in an infinite loop, or else it was extremely slow.

Then I converted blake3_sse41.c to use gcc/clang portable __builtin intrinsics for SIMD instead of Intel-specific instructions. My thinking was that this would make it possible to generate SIMD code not only for the Apple M1 processor, but also for future SIMD units such as the upcoming ones for RISC-V. The code was also much easier to read once it was no longer processor-specific. The result did indeed generate ARM NEON code that worked, and I used it to generate a valid plot. I hit it with 3000 challenges instead of the usual 30, and it passed with flying colors.

Unfortunately, the generated code was not sufficiently tuned to the specific processor, and it ran slower than the non-SIMD code. The generic SIMD version took 9.9k seconds for phase 1 and 12.5k seconds overall, compared to 7.8k seconds for phase 1 and 10.3k seconds overall for the non-SIMD code, both native on an M1 MacBook Pro with an external 40Gbps Thunderbolt 3 NVMe enclosure.

If anyone wants to play around with the portable SIMD code, it's here, but I'm abandoning it.

It appears that the performance of SIMD code is fragile enough that nothing is going to compete with Rosetta 2's performance short of just writing straight NEON assembly language by hand. This is one of the very few areas where a compiler cannot do as well as a human, since SIMD instructions are so odd and non-orthogonal.

Since Rosetta 2 already performs so well, I'll probably lose interest before banging out the roughly 2000 lines of crypto-oriented assembly code required (see blake3_sse41_x86-64_unix.S for an example). Most of my space has been plotted now anyway. It was fun to try, though.

AndyRPH commented 3 years ago

Is there an easy way to adjust this so that madmax compiles the x86 code if you've got an M1 computer? Or do I just need to compile it on my Intel MacBook and copy the binary over each time?

dmk42 commented 3 years ago

[Edited to reflect new information after I finally succeeded in building an x86_64 version]

1. Follow the directions here to create a Rosetta Terminal.
2. Install Homebrew again in your Rosetta Terminal. You will now have two copies, the x86_64 version in /usr/local and the arm64 version in /opt/homebrew.
3. Edit your rc files (.bash_profile or whatever you use) to put /opt/homebrew/bin first in your path when you are in arm64 mode, and put /usr/local/bin first in your path when you are in x86_64 mode. If you use bash, this is easy to detect with the HOSTTYPE environment variable.
4. In your Rosetta Terminal, run brew install libsodium cmake libtool pkg-config.
5. Follow the chia-plotter instructions for x86_64 Macs.
6. chia_plot will now build as an x86_64 application when you build it in your Rosetta Terminal.
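
The rc-file step above could look something like the sketch below. It uses uname -m rather than bash's HOSTTYPE so that it also works in zsh, the macOS default shell, which reads ~/.zshrc. The helper name brew_prefix_for is hypothetical, and the two prefixes are the ones named above. Treat it as a starting point under those assumptions, and back up your rc file before editing it.

```shell
# Sketch for ~/.bash_profile or ~/.zshrc: put the Homebrew that matches
# the architecture of the current shell first in PATH.
# brew_prefix_for is a hypothetical helper name, not part of Homebrew.
brew_prefix_for() {
    case "$1" in
        arm64|aarch64) echo "/opt/homebrew" ;;  # native Apple Silicon Homebrew
        x86_64)        echo "/usr/local" ;;     # Intel Homebrew, used under Rosetta
        *)             return 1 ;;              # unknown architecture: change nothing
    esac
}

# uname -m reports x86_64 inside a Rosetta Terminal and arm64 in a native one.
if prefix="$(brew_prefix_for "$(uname -m)")"; then
    PATH="$prefix/bin:$PATH"
    export PATH
fi
```

After opening a new Rosetta Terminal, running which brew should then print a path under /usr/local/bin, and in a native Terminal a path under /opt/homebrew/bin.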

dmk42 commented 3 years ago

@AndyRPH , I edited my previous response because I finally succeeded in building an x86_64 version. Just doing this so it will alert you.

AndyRPH commented 3 years ago

Thanks!

fabiobi82 commented 3 years ago

> [Edited to reflect new information after I finally succeeded in building an x86_64 version]
>
> 1. Follow the directions here to create a Rosetta Terminal.
> 2. Install Homebrew again in your Rosetta Terminal. You will now have two copies, the x86_64 version in /usr/local and the arm64 version in /opt/homebrew.
> 3. Edit your rc files (.bash_profile or whatever you use) to put /opt/homebrew/bin first in your path when you are in arm64 mode, and put /usr/local/bin first in your path when you are in x86_64 mode. If you use bash, this is easy to detect with the HOSTTYPE environment variable.
> 4. In your Rosetta Terminal, run brew install libsodium cmake libtool pkg-config.
> 5. Follow the chia-plotter instructions for x86_64 Macs.
> 6. chia_plot will now build as an x86_64 application when you build it in your Rosetta Terminal.

Hi, would you mind explaining point 3? How do I edit rc files? Where can I find them? Thank you.

dmk42 commented 3 years ago

I do not recommend the procedure unless you are familiar with Unix and its command-line interface. That is the audience for the instructions.

The rc files got that name because they came from "run command" files that were executed at startup. Whatever shell you are using executes the commands from a certain file when you log in or launch a new Terminal. If you are using bash, you will find the startup commands in .bash_profile in your home directory. (You could also use .bashrc but that gets executed with every new shell rather than just at startup.) If you are using csh, the file is .cshrc. I do not know what other shells use.

If you are not already familiar with your shell's command language and how to edit it in an rc file, you will likely do damage and it could prevent you from logging in to your own computer, so please don't do it if you are not familiar with Unix. In that case, just build the chia plotter on an Intel Mac and copy it over.

[Edited to add libsodium information] If you copy the chia plotter over from an Intel Mac, which is really the easiest way, you will also need to copy the /usr/local/lib/libsodium.* files from the Intel Mac to the M1 Mac.
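
The "wrong architecture" error earlier in the thread is what dyld reports when the binary and the library it loads were built for different CPUs. Below is a small sketch for checking that before copying files around. It assumes the lipo tool from the Xcode command-line tools is installed. has_arch is a hypothetical helper name, and the library path is the one from the error message above.

```shell
# has_arch is a hypothetical helper: it succeeds if the space-separated
# architecture list in $1 contains the architecture named in $2.
has_arch() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example use on macOS, where lipo -archs prints the slices in a file,
# for instance "x86_64 arm64" for a universal library:
#   has_arch "$(lipo -archs /usr/local/opt/libsodium/lib/libsodium.23.dylib)" x86_64 \
#       || echo "this libsodium has no x86_64 slice"
```

The same check can be run against ./build/chia_plot itself. An x86_64 chia_plot needs an x86_64 libsodium, and an arm64 one needs an arm64 libsodium.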

madMAx43v3r / chia-plotter

Very slow speed on Mac M1 #795