Open nazar-ch opened 3 years ago
M1 is an 8-core processor, yet you are assigning only 3 threads, which essentially uses 2 threads, the same as the official plotter, plus the overhead of pipelining. As Steve Jobs would say, "You're holding it wrong." :)
@fiveangle I'm using it in the best way I could, based on experiments with different settings. It has 4 high-performance + 4 low-performance cores (the low-performance ones deliver roughly 15% of a high-performance core, so they make a negligible difference for plotting).
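To put the core-mix argument in numbers, here is a rough sketch. The 15% figure is the estimate quoted in the comment, not a measurement, and `effective_cores` is a hypothetical helper name used only for illustration.

```python
def effective_cores(high: int, low: int, low_ratio: float) -> float:
    """Return total capacity in units of one high-performance core."""
    return high + low * low_ratio


# 4 high-performance + 4 low-performance cores.
# ASSUMPTION: a low-performance core delivers ~15% of a high-performance one
# (the rough estimate from the comment, not a benchmark result).
HIGH, LOW, LOW_RATIO = 4, 4, 0.15

m1 = effective_cores(HIGH, LOW, LOW_RATIO)
low_share = (LOW * LOW_RATIO) / m1

print(f"Effective capacity: {m1:.1f} high-performance core equivalents")
print(f"Share contributed by the low-performance cores: {low_share:.0%}")
```

Under that assumption the four low-performance cores together add less than one high-performance core's worth of work, which is why they barely matter here.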
But that's irrelevant here. The point is that with the same settings this plotter is slower than the official one in P1 on M1, yet much faster on Intel/AMD.
Fair enough... If there is any condition where the gm plotter is faster than the new mm algo, the mm plotter would ideally substitute the gm algo for its built-in one. But in the end, does it really matter? The goal of mm is to maximize performance, and assigning just 3 of 8 cores is not the typical use case, unless you're saying that assigning 4 cores has the same problem?
btw, I suspect the reason the 686 results are far better is not that there is a bottleneck on M1, but that the threads parameter appears to be ignored in some way on 686: CPU usage is > "200%" for a process set to 2 threads on 686. See 2 jobs (1x mm, 1x gm) started at the same time:
Table 1 took 196.312 sec on M1 4 cores MadMax (256 buckets)
Table 1 took 137.027 sec on M1 2 cores standard plotter (128 buckets)
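A quick way to compare the two Table 1 timings above, both by wall-clock time and by core-seconds. This is only a sketch: `Table1Run` is a hypothetical name, and the core-seconds figure assumes every assigned core was fully busy, which the logs do not confirm. The two runs also used different bucket counts, so this is not a like-for-like comparison.

```python
from dataclasses import dataclass


@dataclass
class Table1Run:
    label: str
    seconds: float  # wall-clock time reported in the log line
    cores: int      # cores/threads assigned to the run

    @property
    def core_seconds(self) -> float:
        # ASSUMPTION: all assigned cores were busy for the whole run.
        return self.seconds * self.cores


madmax = Table1Run("MadMax (256 buckets)", 196.312, 4)
standard = Table1Run("standard plotter (128 buckets)", 137.027, 2)

wall_ratio = madmax.seconds / standard.seconds
core_ratio = madmax.core_seconds / standard.core_seconds

print(f"Wall-clock: MadMax takes {wall_ratio:.2f}x as long")
print(f"Core-seconds: MadMax uses {core_ratio:.2f}x as much")
```

Since MadMax was given twice as many cores, the core-seconds ratio is exactly twice the wall-clock ratio.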
Have you tried the mm plotter with 128 buckets? It will probably use more RAM, but with the RAM integrated in the M1 there could be benefits. I will at least try it in my next attempt.
Are you guys sure you've taken care of all the necessary requirements?
Yes
It utilises the cores almost to max:
So the performance is still suboptimal with 4-8 threads? That is actually real news!
I have a dual-core system kicking around somewhere in storage that I might try a test with gm vs mm to see if in fact the mm plotter is far faster on 686 when the plotter doesn't have >2 cores to potentially grab.
Phase 1 with 3 threads is very slow: it takes nearly 12000 sec on this plotter versus 8000 sec on the official one.
Phases 2-3 are a lot faster.
I'm running this plotter and the official one in parallel on a Mac Mini M1 8 GB, and the times (see the logs) are consistent (±5%) across nearly 10 plots.
Increasing or decreasing the number of threads changes P1 time proportionally and CPU load matches the number of threads. Looks like some part of the code in P1 is very slow on M1.
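For scale, a back-of-the-envelope sketch using the approximate Phase 1 figures from this thread (about 12000 sec on this plotter with 3 threads, about 8000 sec on the official one). It assumes P1 time is exactly inversely proportional to thread count, which is an idealisation of the behaviour described above; `scaled_time` is a hypothetical helper.

```python
def scaled_time(base_seconds: float, base_threads: int, threads: float) -> float:
    """Idealised model: P1 time is inversely proportional to thread count."""
    return base_seconds * base_threads / threads


# Approximate figures quoted in this thread, not exact log values.
THIS_PLOTTER_P1 = 12000.0  # sec, 3 threads
OFFICIAL_P1 = 8000.0       # sec, official plotter
BASE_THREADS = 3

slowdown = THIS_PLOTTER_P1 / OFFICIAL_P1
# Threads this plotter would need to match the official P1 time,
# if the ideal scaling assumption held.
threads_to_match = BASE_THREADS * slowdown

print(f"P1 slowdown vs official: {slowdown:.2f}x")
print(f"Threads needed to match (ideal scaling): {threads_to_match:.1f}")
for n in (2, 3, 4):
    print(f"{n} threads -> ~{scaled_time(THIS_PLOTTER_P1, BASE_THREADS, n):.0f} sec")
```

Under this model, matching the official P1 time would take more threads than the M1 has high-performance cores, so adding threads alone would not close the gap.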
I tried to compile blake3 with NEON optimizations, but there was no noticeable difference (though I'm not sure I did it correctly).
Official plotter: