madscientist159 / xmrig

Monero (XMR) CPU miner
GNU General Public License v3.0

Tuning for Power8 #1

Open Balzhur opened 5 years ago

Balzhur commented 5 years ago

Continuing from xmr-stak #1924.

So: an S824 Power8 machine with 24 cores, 22 of them available to Ubuntu, SMT=2. With a thread config like this:

        { "low_power_mode": 2, "affine_to_cpu": 0 },
        { "low_power_mode": 2, "affine_to_cpu": 1 },
        { "low_power_mode": 2, "affine_to_cpu": 8 },
        { "low_power_mode": 2, "affine_to_cpu": 9 },
        { "low_power_mode": 2, "affine_to_cpu": 16 },
        { "low_power_mode": 2, "affine_to_cpu": 17 },
        { "low_power_mode": 2, "affine_to_cpu": 24 },
        { "low_power_mode": 2, "affine_to_cpu": 25 },
        { "low_power_mode": 2, "affine_to_cpu": 32 },
        { "low_power_mode": 2, "affine_to_cpu": 33 },
        { "low_power_mode": 2, "affine_to_cpu": 40 },
        { "low_power_mode": 2, "affine_to_cpu": 41 },
        { "low_power_mode": 2, "affine_to_cpu": 48 },
        { "low_power_mode": 2, "affine_to_cpu": 49 },
        { "low_power_mode": 2, "affine_to_cpu": 56 },
        { "low_power_mode": 2, "affine_to_cpu": 57 },
        { "low_power_mode": 2, "affine_to_cpu": 64 },
        { "low_power_mode": 2, "affine_to_cpu": 65 },
        { "low_power_mode": 2, "affine_to_cpu": 72 },
        { "low_power_mode": 2, "affine_to_cpu": 73 },
        { "low_power_mode": 2, "affine_to_cpu": 80 },
        { "low_power_mode": 2, "affine_to_cpu": 81 },
        { "low_power_mode": 2, "affine_to_cpu": 88 },
        { "low_power_mode": 2, "affine_to_cpu": 89 },
        { "low_power_mode": 2, "affine_to_cpu": 96 },
        { "low_power_mode": 2, "affine_to_cpu": 97 },
        { "low_power_mode": 2, "affine_to_cpu": 104 },
        { "low_power_mode": 2, "affine_to_cpu": 105 },
        { "low_power_mode": 2, "affine_to_cpu": 112 },
        { "low_power_mode": 2, "affine_to_cpu": 113 },
        { "low_power_mode": 2, "affine_to_cpu": 120 },
        { "low_power_mode": 2, "affine_to_cpu": 121 },
        { "low_power_mode": 2, "affine_to_cpu": 128 },
        { "low_power_mode": 2, "affine_to_cpu": 129 },
        { "low_power_mode": 2, "affine_to_cpu": 136 },
        { "low_power_mode": 2, "affine_to_cpu": 137 },
        { "low_power_mode": 2, "affine_to_cpu": 144 },
        { "low_power_mode": 2, "affine_to_cpu": 145 },
        { "low_power_mode": 2, "affine_to_cpu": 152 },
        { "low_power_mode": 2, "affine_to_cpu": 153 },
        { "low_power_mode": 2, "affine_to_cpu": 160 },
        { "low_power_mode": 2, "affine_to_cpu": 161 },
        { "low_power_mode": 2, "affine_to_cpu": 168 },
        { "low_power_mode": 2, "affine_to_cpu": 169 }
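A long affinity list like the one above is easier to generate than to type by hand. Below is a minimal Python sketch, assuming logical CPU IDs advance by 8 per physical core (which is what the 0, 1, 8, 9, ... pattern in the config suggests for this POWER8 box). The `thread_config` helper and its parameter names are hypothetical and are not part of xmrig.

```python
import json


def thread_config(cores, threads_per_core, cpu_stride, low_power_mode):
    """Build xmrig "threads" entries pinned to the first hardware threads of each core.

    cpu_stride is the assumed gap between the first logical CPU of one core
    and the first logical CPU of the next (8 in the config above).
    """
    entries = []
    for core in range(cores):
        base = core * cpu_stride  # first logical CPU of this core
        for t in range(threads_per_core):
            entries.append(
                {"low_power_mode": low_power_mode, "affine_to_cpu": base + t}
            )
    return entries


if __name__ == "__main__":
    # 22 cores, 2 threads per core, stride 8: reproduces the 0,1,8,9,...,168,169 list.
    cfg = thread_config(cores=22, threads_per_core=2, cpu_stride=8, low_power_mode=2)
    print(",\n".join(json.dumps(entry) for entry in cfg))
```

Changing `threads_per_core` and `cpu_stride` gives other layouts, so it is worth checking the real CPU numbering with `lscpu` before trusting the stride.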

I'm getting around 1750 H/s, which is slightly better than the 1587 H/s I got with xmr-stak, but still far less than the 4300ish H/s I got on MoneroV7 with xmr-stak.

Do you think I need to tune something like 'av' or 'hw-aes'?

madscientist159 commented 5 years ago

@Balzhur To be honest I haven't even tried tuning for POWER8. POWER9 is proving to be a real nuisance in and of itself; the Monero devs chose something that Ryzen / Xeon (ME / PSP issues) are good at and not much else is. It feels like we've just exchanged one ASIC (mining ASIC) for another (locked x86 ASIC). :neutral_face:

Balzhur commented 5 years ago

BTW, I'm getting a fair amount of "Share above target" rejections, which is not the case with xmr-stak... Need to experiment further.

[2018-10-18 12:09:13] rejected (15/5) diff 100001 "Share above target." (97 ms)
[2018-10-18 12:09:57] rejected (15/6) diff 100001 "Share above target." (89 ms)
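
To put a number on "a fair amount", the share counters in those log lines can be tallied. This is a rough Python sketch that assumes the two numbers in parentheses are the running accepted and rejected counts, in that order; the `reject_ratio` helper is hypothetical.

```python
import re

# Assumed layout: "] rejected (<accepted>/<rejected>) diff ..." (same for "accepted").
LINE_RE = re.compile(r"\] (accepted|rejected) \((\d+)/(\d+)\)")


def reject_ratio(log_lines):
    """Return rejected / (accepted + rejected) from the last share line, or None."""
    last = None
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            last = (int(match.group(2)), int(match.group(3)))
    if last is None or sum(last) == 0:
        return None
    accepted, rejected = last
    return rejected / (accepted + rejected)


sample = [
    '[2018-10-18 12:09:13] rejected (15/5) diff 100001 "Share above target." (97 ms)',
    '[2018-10-18 12:09:57] rejected (15/6) diff 100001 "Share above target." (89 ms)',
]
print(f"{reject_ratio(sample):.1%}")
```

Under that reading, the two lines above work out to 6 rejected shares out of 21 submitted.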

madscientist159 commented 5 years ago

@Balzhur Good to know. I'm also seeing some amount of downclocking on the larger chips for power reasons (makes some sense since the VSX pipelines are now being used quite heavily), which isn't helping anything. I had something that seemed good to extrapolate to the larger chips, but a bottleneck has shown up and I need time to debug. Worst case the Monero developers have locked CPU mining to non-owner-controllable devices and there's not much we can do about that other than call attention to it.

madscientist159 commented 5 years ago

@Balzhur Since you're on POWER8 it might be worth checking to see if perf spits out any obvious hotspots. My gut says the limited VSX engines in POWER8 aren't going to allow CNv8 to work well though; POWER9 saw throughput improvements to VSX and it's still lagging behind.

madscientist159 commented 5 years ago

For reference, on a 36 core POWER9 box, I've currently got 80% of the hash I had on CNv7 using the current code, powersave==1, and thread affinity of 0,1,4,5,8,9,etc. The CPU core clock is slightly lower than with CNv7 as well. While the results are sort of decent, and better than xmr-stak, I'm still trying to figure out where the bottleneck is on these larger POWER9 devices. It's definitely a quirky core...

Balzhur commented 5 years ago

@madscientist159, by powersave you mean what, tuning the server's power saving? Could you please provide a command?

madscientist159 commented 5 years ago

@Balzhur I am referring to the powersave variable in config.json for xmrig :smile: What the variable actually does is try to run some operations in a roughly parallel fashion; I had been recommending powersave == 2 but to get 80% of CNv7 on POWER9 I had to change that to powersave == 1.

Even with that change, I'm seeing nasty cache thrashing that I will need to debug tomorrow; it's super late here.

EDIT: I can't do math -- it's too late I guess. It's not 80%, it's far, far worse. POWER does not mine CNv8 well and I don't know that it can be fixed -- I guess it makes sense since the Monero devs only use AMD and Intel systems to design their algorithms.

Balzhur commented 5 years ago

@madscientist159, erm... I don't see "powersave" in my fresh config, only "low_power_mode", guess you're referring to this?

So far my best result is 2080 H/s with SMT=4, av=1, so 4 threads per physical core.

madscientist159 commented 5 years ago

@Balzhur Yeah that's it. Sorry, it's been a long day and this was a bit of bad news I didn't need. For some reason the smaller chips were working better, these larger ones are showing some kind of internal design weakness I don't fully understand yet. It could be the cache bandwidth, in which case it's time to mine on something else.

madscientist159 commented 5 years ago

OK, so on the higher core count chips I'm seeing caching anomalies related to the shared L3 per chiplet. Skipping every second chiplet restores the 100H/s/core but at the expense of losing half the cores. It's very possible we're exceeding chiplet bandwidth at this point, but I still have a couple of ideas left to try.

Balzhur commented 5 years ago

@madscientist159, thanks. I'm satisfied (kind of) with the current results:

[2018-10-18 13:32:49] speed 10s/60s/15m 2157.7 2156.9 n/a H/s max 2174.7 H/s

this is with SMT=4, affinity, low_power_mode=1 and cpu-priority=5. Way better than xmr-stak! (and almost half of what it was with CNv7)...

Take your time and have a good sleep, maybe you'll be able to improve it in the future.

madscientist159 commented 5 years ago

Well, half the hashrate for the same electrical power means it's not exactly economical to mine on POWER, but at least it's better than no mining at all I guess. There is something in the POWER cache design that is puzzling me to no end, I've hit it before on other projects -- it just seems to saturate far too early and shows a very "peaky" pattern depending on the exact type and quantity of accesses being requested.

Anyway I'll keep looking into it, the 4 and 8 core chips reached ~60% of the CNv7 hashrate, but they have a higher L3 cache bandwidth for all intents and purposes.

Balzhur commented 5 years ago

@madscientist159, if this helps - I can send you a deep-dive P8 presentation by IBM's lead processor dev. It's not confidential, but I don't think you can get hold of it on the internet... There are some slides about the L3 cache architecture.

madscientist159 commented 5 years ago

@Balzhur Does it have more information than in https://wiki.raptorcs.com/wiki/File:POWER9_um_OpenPOWER_v20GA_09APR2018_pub.pdf ? If so then yes, please send it along...we're at the level where every cycle in and out of that cache counts.

madscientist159 commented 5 years ago

@Balzhur I've got P9 to 68% of CNv7 now for the 18 core devices, still tracking down a few additional oddities that are slowing things down. The magic tuning so far is low power mode alternating between 2 and 3 on each SMT1 core.
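
One way to read "alternating between 2 and 3 on each SMT1 core" is one pinned thread per core, with low_power_mode flipping from core to core. The Python sketch below builds such a list under that reading; the stride of 4 logical CPUs per POWER9 core and the `alternating_config` helper are assumptions made for illustration, not something confirmed in this thread.

```python
import json
from itertools import cycle


def alternating_config(cores, cpu_stride, modes=(2, 3)):
    """One pinned thread per core, cycling low_power_mode through `modes`."""
    mode_cycle = cycle(modes)
    return [
        {"low_power_mode": next(mode_cycle), "affine_to_cpu": core * cpu_stride}
        for core in range(cores)
    ]


# 18 cores, assumed stride of 4 logical CPUs per core.
cfg = alternating_config(cores=18, cpu_stride=4)
print(",\n".join(json.dumps(entry) for entry in cfg[:4]))
```

If the intended layout differs (for example, alternating per thread rather than per core), only the list comprehension needs to change.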

Balzhur commented 5 years ago

@madscientist159, I looked briefly at the doc you provided, well... I'm more worried about P8 since I do not have a P9 :) If you can introduce some compile-time logic that differs between P8 and P9, that would be nice. Please note that I'm not a programmer at all, so some of the things you're saying are rocket science to me :)

I'll email you those slides I've mentioned.

cryptoeight commented 5 years ago

I am running P9 24 Cores and get the best performance on SMT2 with low_power=2 (2931H/s 61 per Core). I tried alternating between low_power 2 and 3 which was also the best performance on CNv7 (4700H/s) for me, but actually went down to 2700H/s on CNv8.

madscientist159 commented 5 years ago

@cryptoeight Yeah, that matches what I've been able to squeeze out of the version posted here. I've got another internal version that does slightly better (~75H/s/core), but I'm still tuning it. OCC is the main enemy; CNv8 uses a lot more electrical power and basically it's tripping the chip's internal downclocking (IMO prematurely).

Balzhur commented 5 years ago

Power8 and Power9 processors differ greatly in SMT processing logic and architecture, P9 should handle SMT better and provide more performance per thread.

I've tested more variations of SMT and low_power_mode and the results so far are (only listing best result for each SMT mode, 22 Power8 cores):

| SMT | low_power_mode | H/s |
|-----|----------------|-----|
| 1 | 3 | 1436.5 |
| 2 | alternating between 1 and 2 | 1784.9 |
| 4 | 1 | 2156.0 |
| 8 | 1 | 1449.2 |
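
For comparison with the per-core figures quoted elsewhere in this thread, the totals in the table above can be divided by the 22 available cores. A small Python sketch follows; the variable and function names are made up for illustration.

```python
CORES = 22  # POWER8 cores available to the OS, as stated earlier in the thread

# (SMT mode, low_power_mode setting, total H/s), copied from the table above
results = [
    (1, "3", 1436.5),
    (2, "alternating 1 and 2", 1784.9),
    (4, "1", 2156.0),
    (8, "1", 1449.2),
]


def per_core(rows, cores):
    """Return (SMT mode, H/s per physical core) pairs, rounded to one decimal."""
    return [(smt, round(total / cores, 1)) for smt, _, total in rows]


for smt, rate in per_core(results, CORES):
    print(f"SMT{smt}: {rate} H/s per core")
```

SMT=4 comes out at about 98 H/s per core, close to the 100H/s/core figure mentioned above for POWER9 when every second chiplet is skipped.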

Threads config is:

                { "low_power_mode": 1, "affine_to_cpu": 0 },
                { "low_power_mode": 1, "affine_to_cpu": 1 },
                { "low_power_mode": 1, "affine_to_cpu": 2 },
                { "low_power_mode": 1, "affine_to_cpu": 3 },
                { "low_power_mode": 1, "affine_to_cpu": 8 },
                { "low_power_mode": 1, "affine_to_cpu": 9 },
                { "low_power_mode": 1, "affine_to_cpu": 10 },
                { "low_power_mode": 1, "affine_to_cpu": 11 },
[...]
                { "low_power_mode": 1, "affine_to_cpu": 160 },
                { "low_power_mode": 1, "affine_to_cpu": 161 },
                { "low_power_mode": 1, "affine_to_cpu": 162 },
                { "low_power_mode": 1, "affine_to_cpu": 163 },
                { "low_power_mode": 1, "affine_to_cpu": 168 },
                { "low_power_mode": 1, "affine_to_cpu": 169 },
                { "low_power_mode": 1, "affine_to_cpu": 170 },
                { "low_power_mode": 1, "affine_to_cpu": 171 }

Balzhur commented 5 years ago

By contrast, my home Intel i7-8700K is not impacted at all; it still provides around 440 H/s with CNv8, same as for CNv7. (It is a weird processor when it comes to Monero mining, because when you start xmrig it sometimes gives 160ish H/s, sometimes 250ish or 440ish, but mostly 350ish H/s. You have to restart xmrig several times to get the best result.)

madscientist159 commented 5 years ago

@Balzhur That's because the Monero developers specifically designed their new algorithm around the Intel core design. That's why I've called it switching one ASIC for another; I don't own Intel devices due to the ME issues, and certainly won't go buy one to mine Monero as I'd have no other use for it (like an ASIC).

madscientist159 commented 5 years ago

I'm splitting POWER8 and POWER9 off to reduce confusion, as they tune very differently from one another and I've had decent success getting POWER9 back on its feet, so to speak, with a tuned miner as described in #2. This bug report should focus on POWER8 only from this point forward.

tayore commented 5 years ago

How do you even compile for Power8? There is no specific documentation for compiling on Power. I assume I should disable ASM and use Advance Toolchain 11 (gcc version 7.3.1 20180207)? I failed multiple times.

Balzhur commented 5 years ago

I use AT10 (gcc 6.3), but AT11 should work as well. You also need to install libssl-dev, libuv1 and libuv1-dev, then:

export LD_LIBRARY_PATH=/usr/lib/powerpc64le-linux-gnu/
export PATH=/opt/at10.0/bin:$PATH

cd xmrig
mkdir build && cd build
cmake .. -DUV_LIBRARY=/usr/lib/powerpc64le-linux-gnu/libuv.a -DWITH_HTTPD=OFF
make

madscientist159 commented 5 years ago

@tayore On my side, on a POWER9 box running Debian, it 'just works' via:

mkdir build
cd build
cmake ..
make

That's with the stock compiler and the system development packages for libuv, libmicrohttpd, etc. already installed via apt-get.

tayore commented 5 years ago

I was able to compile with both the Advance Toolchain and the default gcc. libuv was the issue. Let me run a quick performance test.

tayore commented 5 years ago

Around 1500 H/s for 16 P8 cores, about half of what I got on v7.

Balzhur commented 5 years ago

@tayore, I'd say that's a normal result for V8 with xmrig; at least it's proportional to mine (2150 H/s for 22 P8 cores). Maybe @madscientist159 will be able to tune xmrig for P8, let's hope.

Balzhur commented 5 years ago

Update: upgraded Ubuntu to 18.04, recompiled with default gcc/7.3.0 (no IBM AT). Result +200H/s

thsitthisak commented 5 years ago

I ran into an error

please help

sith@ubuntuSrv:~/temp/xmrig/build$ sudo make
[ 2%] Building CXX object CMakeFiles/xmrig.dir/src/api/NetworkState.cpp.o
cc1plus: error: '-mfloat128-hardware' requires full ISA 3.0 support
CMakeFiles/xmrig.dir/build.make:62: recipe for target 'CMakeFiles/xmrig.dir/src/api/NetworkState.cpp.o' failed
make[2]: [CMakeFiles/xmrig.dir/src/api/NetworkState.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/xmrig.dir/all' failed
make[1]: [CMakeFiles/xmrig.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

Thanks a lot