nioroso-x3 / xmr-stak-power

Port of the xmr-stak-cpu monero miner to ppc64le
GNU General Public License v3.0

Optimizing Monero for POWER9 #5

Open agangidi53 opened 6 years ago

agangidi53 commented 6 years ago

@nioroso-x3 I want to contribute to porting and optimizing this on POWER9 to increase the reported hash rate. I don't think a hash rate of 3KH/s is the best POWER9 can do. I have access to POWER8 and POWER9 systems for development purposes. Can you provide some pointers on what needs to change for POWER9 to improve the hash rate? Do you have any idea whether using the on-chip AES / SHA accelerators will help increase the hash rate?

madscientist159 commented 6 years ago

Hey there, I'm the one that tuned the POWER9 hash rate to where it is now. Glad someone else is interested!

The current code is already using the on-chip AES accelerators. Bear in mind that the 12-core and higher parts share the L2/L3 cache between a pair of cores, and as such the best I've been able to push the preproduction CPUs to is ~93H/s/core when run in SMT4 with 3 threads per core. I fully expect these numbers to increase on the production parts, but for now that's the best I've been able to attain. Perf indicates that most of the time the cores are waiting for data to be written back to memory, so I suspect there's an internal bottleneck on this particular CPU revision.

agangidi53 commented 6 years ago

Hi @madscientist159, that doesn't make sense. POWER9, with a shorter pipeline and a built-in AES accelerator, should do better than POWER8, which only had in-core enhancements.

Your number (93H/s/core) is with little endian set to true, right? Which parts do you have access to, if I may ask: DD1.x, DD2.0.1 or DD2.1? I have pretty recent parts (DD2.0.1) and I'm still only able to run at 93H/s/core, as you indicate, which points to us missing something. In other CPU benchmarks, POWER9 is at least 50% better than POWER8 overall, if not more.

What tracer are you using to observe the cores waiting on data? I want to replicate that on a LaGrange CPU (I'm guessing you replicated it on Sforza) and then put some of my performance engineers on this.

madscientist159 commented 6 years ago

@agangidi53 You are correct, DD2.01 Sforza (dual 16 core CPUs in one of our Talos prototypes). Yes, LE mode -- switching to BE caused a drop in hash rate as the cores started to stall on internal data manipulation instructions.

We are using the "perf" tool to dig into the processor hardware events and "watch" the running miner process. Run "perf record -p <pid>" against the miner's process ID for a few seconds; "perf annotate" will then show you the hot loop and the instructions inside it that are causing the most delay.

One thing to consider is that the DD2.01 parts are only running at 2.9GHz, and WOF is not functional on our sample parts. What sustained core frequencies are you using to generate your POWER8 results? Hash rate responds strongly and linearly as the core clocks are raised: we pushed one of our parts up to 3.3GHz base and were able to get hash rates around 105H/s/core. For reference, the Talos systems will ship with 4/8 core CPUs that are clocked near that range for base clocks, with ~3.8GHz WOF.
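As a rough check on the linear-scaling observation, the arithmetic can be sketched in a few lines of C++. The helper name `scaled_hash_rate` is hypothetical and not part of xmr-stak-power, and the model assumes hash rate is directly proportional to core clock, which the thread only reports for the 2.9GHz to 3.3GHz range.

```cpp
#include <cassert>

// Hypothetical helper (not from xmr-stak-power): projects a per-core hash rate
// to a different core clock, ASSUMING hash rate is proportional to clock speed.
double scaled_hash_rate(double rate_hs, double from_ghz, double to_ghz) {
    assert(from_ghz > 0.0);  // a zero or negative base clock makes no sense here
    return rate_hs * to_ghz / from_ghz;
}
```

Under that assumption, `scaled_hash_rate(93.0, 2.9, 3.3)` comes out at roughly 105.8, which is in line with the ~105H/s/core measured at 3.3GHz. Extrapolating further (for example to WOF clocks) is only a guess, since memory or L3 limits may break the linear trend.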

Overall, it would be great to get your performance engineers to look at this. From what I'm seeing out of perf there may be some way to indicate to the processor that it needs to prefetch / pre-clear part of the cache to be able to store new results without a stall, but every attempt I've made to do this has made the loop run slower, not faster.

One other factor to look into is L3 clock speed. I have no hard numbers on that, and don't know offhand if it can be adjusted independently of the cores. It's in a different pervasive segment, but that doesn't say much about clocking unfortunately.

nioroso-x3 commented 6 years ago

I think the best enhancement, since you guys have contacts at IBM, would be to tell them to give the crypto and AltiVec units a real little-endian mode. We should not have to flip bytes around; big endian is dead.

madscientist159 commented 6 years ago

@nioroso-x3 Looking at the performance data from POWER9, less than 0.5% of the time is spent flipping bytes. Whatever IBM did is working nearly perfectly on that front, to the point of the crypto units effectively having an LE mode. In fact, you could probably abstract away the BE hardware via a few wrapper functions on POWER9 and not notice the performance difference at all.

On POWER8 this is obviously a different story.
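The wrapper-function idea mentioned above could look something like the following. This is only a sketch in portable C++: `reverse16` and `le_wrap` are hypothetical names, the operation passed in stands in for a big-endian-ordered crypto step, and the exact byte permutation the real vector code needs is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Reverse the byte order of one 16-byte (128-bit) block in place.
inline void reverse16(std::uint8_t* block) {
    assert(block != nullptr);
    std::reverse(block, block + 16);
}

// Hypothetical wrapper: lets little-endian callers use an operation `op`
// that expects its 16-byte block in the opposite byte order.
// The block is flipped, handed to `op`, then flipped back.
template <typename Op>
void le_wrap(std::uint8_t* block, Op op) {
    reverse16(block);
    op(block);
    reverse16(block);
}
```

A caller would pass the block and a callable, for example `le_wrap(buf, some_be_round)`, where `some_be_round` is whatever routine touches the hardware. On a real POWER9 build the two flips would presumably be vector permutes rather than `std::reverse`.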

nioroso-x3 commented 6 years ago

Hmm, so POWER9 is slower due to a hardware bug or a low-clocked L3, then? About the prefetch: x86 SIMD has instructions for that, but I couldn't find anything equivalent for POWER without using inline asm, so I just removed it in the end (https://www.gnu.org/software/gcc/projects/prefetch.html#altivec). Have you tried adding those instructions inside the implode and explode functions, like in the x86 version?
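One option that avoids inline asm is GCC's `__builtin_prefetch`, which is a documented compiler builtin rather than anything POWER-specific. The sketch below is not taken from the miner: `xor_fold_with_prefetch` is a hypothetical stand-in for a loop that walks a buffer, and the prefetch distance is a tuning guess. Whether this helps on POWER9 is an open question, since earlier prefetch attempts in this thread made the loop slower.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example loop (not from xmr-stak-power): XOR-folds a buffer of
// 64-bit words while hinting upcoming data into cache.
std::uint64_t xor_fold_with_prefetch(const std::uint64_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    // Prefetch distance in words: 16 * 8 bytes = 128 bytes ahead. A guess to tune.
    constexpr std::size_t kAhead = 16;
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
            // GCC builtin: second argument 0 = prefetch for read,
            // third argument 3 = high temporal locality.
            __builtin_prefetch(data + i + kAhead, 0, 3);
        }
        acc ^= data[i];
    }
    return acc;
}
```

The prefetch is only a hint, so the result is the same with or without it. It is worth checking the generated assembly to see which cache-touch instruction, if any, the compiler actually emits for the target.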

Also, how big is the L3 cache on POWER9? Wikipedia says it has 120MB total, so is that 5MB for SMT4 and 10MB for SMT8?

Have you tried mixing single and double threads (low_power_mode false and true)? For example, the old AMD FX CPUs reach their optimal speed with 5 threads, one double and four single, all pinned of course.

madscientist159 commented 6 years ago

@nioroso-x3 No idea at this point; we'll need DD2.1 hardware to know for sure, and that looks like sometime in December right now. Would definitely appreciate a double-check from @agangidi53 or his performance engineers as it's still possible I'm missing something.

For the 4/8 core SMT4 CPUs shipped with Talos, AFAIK the L3 caches will be 10MB/core. For the 12+ core SMT4 parts that drops to 5MB/core. All current testing is done on the 5MB/core early sample parts.
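To see why the per-core L3 size matters for thread counts, the cache arithmetic can be sketched as below. Two things here are assumptions not stated in the thread: that each CryptoNight hash in flight uses a 2 MiB scratchpad, and that a "double" thread (low_power_mode true) keeps two hashes in flight. The helper names are hypothetical, and the model ignores any sharing of L3 between cores.

```cpp
#include <cassert>

// ASSUMPTION: one CryptoNight hash in flight needs a 2 MiB scratchpad.
constexpr unsigned kScratchpadKiB = 2048;

// Hypothetical helper: total scratchpad demand for one core, in KiB.
// double_hash = true models a thread that keeps two hashes in flight.
unsigned scratchpad_demand_kib(unsigned threads_per_core, bool double_hash) {
    assert(threads_per_core > 0);
    return threads_per_core * (double_hash ? 2u : 1u) * kScratchpadKiB;
}

// True if that demand fits in the given per-core L3 size (in KiB).
bool fits_in_l3(unsigned threads_per_core, bool double_hash,
                unsigned l3_kib_per_core) {
    return scratchpad_demand_kib(threads_per_core, double_hash) <= l3_kib_per_core;
}
```

Under these assumptions, 3 single threads per core need 6144 KiB, which does not fit in 5MB/core (5120 KiB) but does fit in 10MB/core (10240 KiB). That would be one possible reason to expect better numbers from the 10MB/core parts, though it is only a back-of-the-envelope model.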

agangidi53 commented 6 years ago

@madscientist159 @nioroso-x3

Here is the cache on POWER9 I'm testing with:

POWER9 (12 Core)
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K

3.2GHz with WOF functional.

@madscientist159 Can you tell me the build you are using and the gcc version? If it's different from the default AT10 gcc, I think we should highlight that in the repo. I'm having my perf engineers look at this and will update with high-level findings soon.