P1 is slower than with the official plotter on Apple M1

nazar-ch commented 3 years ago

Phase 1 with 3 threads is very slow: it takes nearly 12000 sec on this plotter versus about 8000 sec on the official one.

Phases 2-3 are a lot faster.

I'm running this plotter and the official one in parallel on a Mac Mini M1 with 8 GB, and the times (see the logs) have been consistent (±5%) across about 10 plots.

Increasing or decreasing the number of threads changes P1 time proportionally, and CPU load matches the number of threads. It looks like some part of the P1 code is very slow on M1.

I tried to compile blake3 with Neon optimization but there was no noticeable difference (but I'm not sure I did it correctly).

Number of Threads: 3
Number of Buckets: 2^8 (256)
Working Directory:   /Volumes/Ssd/
Working Directory 2: /Volumes/Ssd/
[P1] Table 1 took 72.4602 sec
[P1] Table 2 took 2076.52 sec, found 4294984960 matches
[P1] Table 3 took 1913.38 sec, found 4294983971 matches
[P1] Table 4 took 1994.51 sec, found 4294897596 matches
[P1] Table 5 took 1977.62 sec, found 4294694313 matches
[P1] Table 6 took 1927.94 sec, found 4294420559 matches
[P1] Table 7 took 1854.06 sec, found 4293831663 matches
Phase 1 took 11817.1 sec
[P2] max_table_size = 4294984960
[P2] Table 7 scan took 67.0795 sec
[P2] Table 7 rewrite took 167.657 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 49.6439 sec
[P2] Table 6 rewrite took 161.086 sec, dropped 581335439 entries (13.537 %)
[P2] Table 5 scan took 48.5578 sec
[P2] Table 5 rewrite took 165.725 sec, dropped 762006235 entries (17.743 %)
[P2] Table 4 scan took 77.3157 sec
[P2] Table 4 rewrite took 172.827 sec, dropped 828937658 entries (19.3005 %)
[P2] Table 3 scan took 63.3097 sec
[P2] Table 3 rewrite took 173.473 sec, dropped 855163362 entries (19.9107 %)
[P2] Table 2 scan took 63.343 sec
[P2] Table 2 rewrite took 173.166 sec, dropped 865626392 entries (20.1544 %)
Phase 2 took 1384.03 sec
Wrote plot header with 268 bytes
[P3-1] Table 2 took 178.291 sec, wrote 3429358568 right entries
[P3-2] Table 2 took 140.773 sec, wrote 3429358568 left entries, 3429358568 final
[P3-1] Table 3 took 204.599 sec, wrote 3439820609 right entries
[P3-2] Table 3 took 129.617 sec, wrote 3439820609 left entries, 3439820609 final
[P3-1] Table 4 took 205.747 sec, wrote 3465959938 right entries
[P3-2] Table 4 took 133.933 sec, wrote 3465959938 left entries, 3465959938 final
[P3-1] Table 5 took 210.537 sec, wrote 3532688078 right entries
[P3-2] Table 5 took 136.023 sec, wrote 3532688078 left entries, 3532688078 final
[P3-1] Table 6 took 245.516 sec, wrote 3713085120 right entries
[P3-2] Table 6 took 148.065 sec, wrote 3713085120 left entries, 3713085120 final
[P3-1] Table 7 took 202.205 sec, wrote 4293831663 right entries
[P3-2] Table 7 took 162.538 sec, wrote 4293831663 left entries, 4293831663 final
Phase 3 took 2098.65 sec, wrote 21874743976 entries to final plot
[P4] Starting to write C1 and C3 tables
[P4] Finished writing C1 and C3 tables
[P4] Writing C2 table
[P4] Finished writing C2 table
Phase 4 took 173.8 sec, final plot size is 108820706842 bytes
Total plot creation time was 15473.7 sec
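
To see where the Phase 1 time goes, the per-table timings in the log above can be summed and compared. The following is only a sketch: the regular expression assumes the `[P1] Table N took X sec` line format shown in this log, and the helper name `p1_table_times` is made up for illustration.

```python
import re

# P1 lines copied verbatim from the log above.
LOG = """\
[P1] Table 1 took 72.4602 sec
[P1] Table 2 took 2076.52 sec, found 4294984960 matches
[P1] Table 3 took 1913.38 sec, found 4294983971 matches
[P1] Table 4 took 1994.51 sec, found 4294897596 matches
[P1] Table 5 took 1977.62 sec, found 4294694313 matches
[P1] Table 6 took 1927.94 sec, found 4294420559 matches
[P1] Table 7 took 1854.06 sec, found 4293831663 matches
Phase 1 took 11817.1 sec
"""

# Assumed line format: "[P1] Table <n> took <seconds> sec ..."
P1_LINE = re.compile(r"^\[P1\] Table (\d+) took ([\d.]+) sec", re.MULTILINE)


def p1_table_times(log):
    """Return {table_number: seconds} for every [P1] table line in the log."""
    return {int(table): float(secs) for table, secs in P1_LINE.findall(log)}


times = p1_table_times(LOG)
total = sum(times.values())

for table, secs in sorted(times.items()):
    print(f"Table {table}: {secs:9.2f} sec ({secs / total:5.1%} of P1)")
print(f"Sum of tables: {total:.1f} sec")
```

Tables 2 through 7 each take roughly the same time here, so the slowdown does not appear to come from a single table.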

Official plotter:

Plot size is: 32
Buffer size is: 2100MiB
Using 128 buckets
Using 3 threads of stripe size 65536

Starting phase 1/4: Forward Propagation into tmp files... Tue Jun 15 11:28:38 2021
Computing table 1
F1 complete, time: 151.115 seconds. CPU (121.940%) Tue Jun 15 11:31:09 2021
Computing table 2
Forward propagation table time: 676.289 seconds. CPU (256.780%) Tue Jun 15 11:42:25 2021
Computing table 3
Forward propagation table time: 1083.887 seconds. CPU (195.780%) Tue Jun 15 12:00:29 2021
Computing table 4
Forward propagation table time: 1559.755 seconds. CPU (161.350%) Tue Jun 15 12:26:29 2021
Computing table 5
Forward propagation table time: 1844.393 seconds. CPU (129.990%) Tue Jun 15 12:57:13 2021
Computing table 6
Forward propagation table time: 1494.853 seconds. CPU (163.980%) Tue Jun 15 13:22:08 2021
Computing table 7
Forward propagation table time: 1238.328 seconds. CPU (175.620%) Tue Jun 15 13:42:47 2021
Time for phase 1 = 8048.673 seconds. CPU (168.760%) Tue Jun 15 13:42:47 2021
Time for phase 2 = 2379.965 seconds. CPU (89.520%) Tue Jun 15 14:22:27 2021
Time for phase 3 = 7283.112 seconds. CPU (88.040%) Tue Jun 15 16:23:50 2021
Time for phase 4 = 496.174 seconds. CPU (75.050%) Tue Jun 15 16:32:06 2021
Total time = 18207.924 seconds. CPU (123.560%) Tue Jun 15 16:32:06 2021
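
For a phase-by-phase comparison, the phase totals from the two logs above can be put side by side. This is just a sketch of the arithmetic: the numbers are copied from the logs, and the dictionary and function names are made up for illustration.

```python
# Phase times in seconds, copied from the two logs above.
MM = {"P1": 11817.1, "P2": 1384.03, "P3": 2098.65, "P4": 173.8}
OFFICIAL = {"P1": 8048.673, "P2": 2379.965, "P3": 7283.112, "P4": 496.174}


def phase_ratios(a, b):
    """Return a[phase] / b[phase] for each phase; > 1 means `a` is slower."""
    return {phase: a[phase] / b[phase] for phase in a}


ratios = phase_ratios(MM, OFFICIAL)

for phase, ratio in ratios.items():
    verdict = "slower" if ratio > 1 else "faster"
    print(f"{phase}: this plotter takes {ratio:.2f}x the official time ({verdict})")

mm_total = sum(MM.values())
official_total = sum(OFFICIAL.values())
print(f"Sum of phases: {mm_total:.1f} sec vs {official_total:.1f} sec")
```

Only P1 comes out above 1; the other phases and the overall total still favor this plotter in these two runs.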

fiveangle commented 3 years ago

M1 is an 8-core processor, yet you are assigning only 3 threads, which essentially works out to 2 threads, the same as the official plotter, plus the overhead of pipelining. As Steve Jobs would say, "You're holding it wrong." :)

nazar-ch commented 3 years ago

@fiveangle I'm using it in the best way I could find, based on experiments with different settings. The M1 has 4 high-performance + 4 low-performance cores (the low-performance ones deliver something like 15% of a high-performance core, so they make a negligible difference for plotting).

But that's irrelevant here. The point is that with the same settings this plotter is slower than the official one for P1 on M1, while it's much faster on Intel/AMD.

fiveangle commented 3 years ago

Fair enough... If there is any condition where the gm plotter is faster than the new mm algo, the mm plotter would ideally sub in the gm algo for its built-in one. But in the end, does it really matter? The goal of mm is to maximize performance, and assigning just 3 of 8 cores is not the typical use case. Unless you're saying that assigning 4 cores also has the same problem?

Btw, I suspect the 686 results are far better not because there is a bottleneck on M1, but because the threads parameter appears to be ignored in some way on 686: CPU usage is > "200%" for the process set to 2 threads on 686. See 2 jobs (1x mm, 1x gm) started at the same time:

ghost commented 3 years ago

Table 1 took 196.312 sec on M1 4 cores MadMax (256 buckets)
Table 1 took 137.027 sec on M1 2 cores standard plotter (128 buckets)

Have you tried mm-plotter with 128 buckets? It will probably use more RAM, but with the RAM integrated in the M1 there could be benefits. I will at least try it in my next attempt.

ghost commented 3 years ago

are you guys sure you've taken care of all the necessary requirements?

Yes

It utilises the cores almost to max:

fiveangle commented 3 years ago

So if the performance is still suboptimal with 4-8 threads, that is actually real news!

I have a dual-core system kicking around somewhere in storage that I might use to test gm vs mm, to see whether the mm plotter is in fact far faster on 686 when it doesn't have >2 cores to potentially grab.

madMAx43v3r / chia-plotter

P1 is slower than with the official plotter on Apple M1 #486