nioroso-x3 / xmr-stak-power

Port of the xmr-stak-cpu monero miner to ppc64le
GNU General Public License v3.0

Best Config for POWER8 #6

Open agangidi53 opened 6 years ago

agangidi53 commented 6 years ago

One thing that's not entirely clear to me from this repo is: the best config for any random POWER8 CPU.

So I did some experiments with 1 core on POWER8. What are your thoughts? Is your "best hash-rate" config also the "False-False" config for all 4 cores in SMT-4 config?

Frequency: 3.5 GHz

Cache:
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K

Per-core Hash-rate with 3.5 GHz frequency

| Low Power Mode | Little Endian Mode | Threads | H/s |
|----------------|--------------------|---------|-----|
| True           | True               | 4       | 132 |
| True           | False              | 4       | 135 |
| False          | False              | 4       | 191 |
| False          | True               | 4       | 153 |

| Low Power Mode | Little Endian Mode | Threads | H/s |
|----------------|--------------------|---------|-----|
| True           | True               | 8       | 147 |
| True           | False              | 8       | 148 |
| False          | False              | 8       | 152 |
| False          | True               | 8       | 150 |

nioroso-x3 commented 6 years ago

Yeah, that's the optimal config I found for POWER8, and the one set up for 20-core machines in the config file. I don't set it in SMT4 mode though, since the servers are used for running other jobs, so they are in SMT8 mode.

nioroso-x3 commented 6 years ago

I just tested low power mode: you can achieve the same hashrate with half the threads on POWER8. There is a fork of xmr-stak-power that does 3, 4 and 5 hashes per thread; maybe I'll backport them and see what happens.

agangidi53 commented 6 years ago

@nioroso-x3 Which fork is that ?

nioroso-x3 commented 6 years ago

This one https://github.com/fireice-uk/xmr-stak/pull/168 Oh, I meant xmr-stak-cpu. Sorry.

agangidi53 commented 6 years ago

@nioroso-x3 Wow! that would put POWER8 with Large Centaur buffers at a super high speed!!!! Any luck ? I'm trying as well.

nioroso-x3 commented 6 years ago

Are the Centaur buffers a true L4 cache? Or can they only store copies of the data that resides in the RAM that they control? That Intel CPU has a unified 128MB L4, so it doesn't matter where the data is.

Balzhur commented 6 years ago

S824 machine with 2x12 core 3.52GHz Power8 Processors. LPAR with Ubuntu 16.04 LE, 20 cores, SMT=2.

Best result is SMT=2, two threads per physical core, e.g:

{ "low_power_mode" : true, "little_endian_mode" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : true, "little_endian_mode" : false, "affine_to_cpu" : 1 },

Got a bit lower results with SMT=4 or 8.
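For anyone who wants to generate the thread entries above instead of typing them by hand, here is a minimal Python sketch. The three config keys are the ones shown in this thread; the function name `thread_entries` is made up for illustration, and the numbering rule (logical CPU id = core index x 8 + thread index) is an assumption based on the `ppc64_cpu --info` output posted further down, so verify it on your own machine first.

```python
import json

# Assumption: each POWER8 core owns a block of 8 logical CPU ids,
# even when SMT is set lower than 8 (check with `ppc64_cpu --info`).
CPU_IDS_PER_CORE = 8


def thread_entries(n_cores, smt, low_power_mode=True, little_endian_mode=False):
    """Build one config entry per mining thread, `smt` threads on each core."""
    entries = []
    for core in range(n_cores):
        for thread in range(smt):
            entries.append({
                "low_power_mode": low_power_mode,
                "little_endian_mode": little_endian_mode,
                "affine_to_cpu": core * CPU_IDS_PER_CORE + thread,
            })
    return entries


if __name__ == "__main__":
    # Two physical cores with two threads each, printed one entry per line.
    for entry in thread_entries(n_cores=2, smt=2):
        print(json.dumps(entry) + ",")
```

The printed lines can be pasted into the thread list of the config file; adjust `n_cores` and `smt` to match the LPAR.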

results, H/s:

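| low_power_mode | little_endian_mode | 1 thread/1 core | 2 threads/1 core | 20 threads/10 cores |
|----------------|--------------------|-----------------|------------------|---------------------|
| FALSE          | FALSE              | 63.0            | 116.6            |                     |
| TRUE           | FALSE              | 109.3           | 197.8            | 1932.7              |
| TRUE           | TRUE               | 100.7           | 174.3            | 1719.2              |
| FALSE          | TRUE               | 58.7            | 109.1            |                     |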

I'm running two instances of xmr-stak-power, each configured for 20 threads bound to 10 physical cores. One instance's hashrate is around 1947 H/s, so all 20 cores (40 threads) provide around 3900 H/s.

HASHRATE REPORT
| ID | 2.5s |  60s |  15m | ID | 2.5s |  60s |  15m |
|  0 | 95.2 | 96.1 | 96.2 |  1 | 95.7 | 96.5 | 96.5 |
|  2 | 97.1 | 96.6 | 96.6 |  3 | 96.7 | 96.3 | 96.2 |
|  4 | 97.0 | 96.5 | 96.5 |  5 | 96.6 | 96.2 | 96.1 |
|  6 | 96.7 | 96.2 | 96.2 |  7 | 97.1 | 96.5 | 96.6 |
|  8 | 95.9 | 96.5 | 96.5 |  9 | 95.5 | 96.0 | 96.1 |
| 10 | 98.0 | 97.5 | 97.5 | 11 | 97.7 | 97.2 | 97.1 |
| 12 | 98.0 | 97.5 | 97.5 | 13 | 97.7 | 97.1 | 97.1 |
| 14 | 97.7 | 97.2 | 97.1 | 15 | 98.0 | 97.6 | 97.5 |
| 16 | 96.6 | 97.5 | 97.5 | 17 | 96.3 | 97.1 | 97.1 |
| 18 | 98.0 | 97.5 | 97.5 | 19 | 97.6 | 97.2 | 97.1 |
-----------------------------------------------------
Totals:   1939.1 1936.9 1936.8 H/s
Highest:  1947.0 H/s
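
To compare runs without eyeballing the table, a report like the one above can be parsed and summed. This is only a sketch: it assumes the two-blocks-per-row layout shown here (ID, 2.5s, 60s, 15m repeated twice per line), and the function names are hypothetical, not part of the miner.

```python
def parse_hashrate_rows(report):
    """Return {thread_id: (h_2_5s, h_60s, h_15m)}; "(na)" becomes None."""
    rates = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows hold groups of four cells: ID, 2.5s, 60s, 15m.
        # Header, separator and totals lines fail this check and are skipped.
        if len(cells) % 4 != 0 or not cells[0].isdigit():
            continue
        for i in range(0, len(cells), 4):
            thread_id = int(cells[i])
            rates[thread_id] = tuple(
                None if value == "(na)" else float(value)
                for value in cells[i + 1:i + 4]
            )
    return rates


def total(rates, column):
    """Sum one column (0 = 2.5s, 1 = 60s, 2 = 15m), ignoring missing values."""
    return sum(r[column] for r in rates.values() if r[column] is not None)


if __name__ == "__main__":
    sample = """
| ID | 2.5s |  60s |  15m | ID | 2.5s |  60s |  15m |
|  0 | 95.2 | 96.1 | 96.2 |  1 | 95.7 | 96.5 | 96.5 |
|  2 | 97.1 | 96.6 | 96.6 |  3 | 96.7 | 96.3 | 96.2 |
"""
    rates = parse_hashrate_rows(sample)
    print(f"{len(rates)} threads, 60s total: {total(rates, 1):.1f} H/s")
```

Dividing the 60s total by the number of physical cores in use gives a per-core figure that can be compared across SMT settings.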

agangidi53 commented 6 years ago

Interesting, @Balzhur. The same config with OPAL firmware instead of PowerVM firmware would yield you roughly twice the result.

Balzhur commented 6 years ago

@agangidi53, hrm... Same processor? PowerVM cannot influence the result that much. My LPAR is with dedicated cores... I don't have LC models to check, unfortunately.

Arukadox commented 6 years ago

OK, I'm a bit new to IBM machines, but I got one which I can use strictly for fun and testing. It's an S814 with this kind of CPU:

lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 2
Model: 2.1 (pvr 004b 0201)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: horizontal
Virtualization type: full
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31

cat /proc/cpuinfo
processor : 0
cpu : POWER8 (architected), altivec supported
clock : 3026.000000MHz
revision : 2.1 (pvr 004b 0201)

processor : 1
cpu : POWER8 (architected), altivec supported
clock : 3026.000000MHz
revision : 2.1 (pvr 004b 0201)
......
processor : 27
cpu : POWER8 (architected), altivec supported
clock : 3026.000000MHz
revision : 2.1 (pvr 004b 0201)

timebase : 512000000
platform : pSeries
model : IBM,8286-41A

It's an LPAR running Ubuntu 16.04.3 LTS, and I get these results with SMT=4:

| ID | 2.5s |  60s |  15m | ID | 2.5s |  60s |  15m |
|  0 | 14.5 | 14.4 | (na) |  1 | 13.3 | 13.4 | (na) |
|  2 | 11.5 | 11.6 | (na) |  3 | 12.0 | 11.9 | (na) |
|  4 | 13.0 | 12.9 | (na) |  5 | 14.9 | 14.9 | (na) |
|  6 | 12.5 | 12.0 | (na) |  7 | 12.2 | 12.2 | (na) |
|  8 | 10.9 | 11.1 | (na) |  9 | 13.1 | 12.8 | (na) |
| 10 | 14.2 | 14.2 | (na) | 11 | 15.1 | 14.9 | (na) |
| 12 | 11.3 | 11.2 | (na) | 13 | 10.7 | 10.6 | (na) |
| 14 | (na) |  8.5 | (na) | 15 | 12.3 | 12.6 | (na) |
| 16 | 12.9 | 12.9 | (na) | 17 | (na) | 11.1 | (na) |
| 18 | (na) |  8.6 | (na) | 19 | 12.5 | 12.4 | (na) |
| 20 | 14.0 | 13.6 | (na) | 21 | 13.1 | 12.8 | (na) |
| 22 | 13.2 | 13.2 | (na) | 23 | 13.8 | 14.0 | (na) |
| 24 | 11.9 | 12.2 | (na) | 25 | 13.8 | 13.5 | (na) |
| 26 | 13.3 | 13.4 | (na) | 27 | 10.3 | 10.3 | (na) |
| 28 | 11.7 | 11.6 | (na) | 29 | 12.8 | 12.4 | (na) |
| 30 | 12.3 | 12.4 | (na) | 31 | (na) | 10.4 | (na) |

Totals: (na) 393.9 (na) H/s

Isn't it horribly low?

madscientist159 commented 6 years ago

Yes, that's really low. Here's what you should be seeing (ballpark, taken from a 16 core machine in SMT1 mode):

HASHRATE REPORT
| ID | 2.5s |  60s |  15m | ID | 2.5s |  60s |  15m |
|  0 | 79.1 | 80.4 | 79.8 |  1 | 60.4 | 77.7 | 77.6 |
|  2 | 82.6 | 83.1 | 80.6 |  3 | 84.3 | 80.8 | 78.1 |
|  4 | 84.3 | 78.0 | 77.7 |  5 | 80.7 | 74.2 | 77.2 |
|  6 | 78.5 | 76.2 | 79.7 |  7 | 75.0 | 82.6 | 79.9 |
|  8 | 79.5 | 61.2 | 70.9 |  9 | 66.1 | 76.2 | 78.2 |
| 10 | 84.2 | 78.0 | 76.5 | 11 | 63.9 | 75.4 | 76.7 |
| 12 | 77.3 | 77.7 | 80.1 | 13 | 84.4 | 80.9 | 78.0 |
| 14 | 79.8 | 82.3 | 77.7 | 15 | 84.4 | 82.5 | 79.5 |
-----------------------------------------------------
Totals:   1244.5 1247.2 1248.1 H/s
Highest:  1345.2 H/s

Look into ppc64_cpu --smt and follow the optimization advice.

Arukadox commented 6 years ago

I've set SMT to 4 since this gives me the best results so far. If I set it to 8, which is the default setting, my hashrate drops by half. Changing it to 2 gives me results around ~300. I've tried CentOS 7 and Ubuntu so far and lost over 3 hours trying different SMT settings and different settings in the config.txt file, but 400 is the max I got.

Is it possible that I f....ed up something in HMC or some other configuration? Or maybe during xmr-stak-power compilation? SSE or something.

Balzhur commented 6 years ago

@Arukadox, according to the lscpu output you've provided, your machine has only 4 physical cores. To get the best performance use either SMT=2 or SMT=4, and also play with different false/true settings for low_power_mode and little_endian_mode. You should not use more than 4 threads per physical core for mining. To see which thread belongs to which physical core, use ppc64_cpu --info; you should add only the threads marked with an asterisk to the xmr-stak-power config.

Anyway, 4 physical cores at 3 GHz mean 150ish H/s per core, i.e. 600ish per machine.

Arukadox commented 6 years ago

@Balzhur, let me understand this architecture correctly, because I'm a little confused right now: HMC claims that I have 4 processors and all of them are allocated to this LPAR as dedicated ones. In the LPAR (Ubuntu), lscpu claims that I have 32 CPUs (I understand a CPU as a core, as in an i7 for example), but cpuinfo shows only 0-27, which gives me 28 CPUs.

So how should I understand this? Do I have 4 CPUs, each with one core that is somehow capable of running 8 threads, which gives me 32 CPUs in lscpu, so that for lscpu one thread is basically one CPU? Or do I have 1 processor with 4 cores, where each of these 4 cores can handle 8 threads, and then 4x8=32?

Anyway, trying to do some comparison, I read your previous post in which you wrote: S824 machine with 2x12 core 3.52GHz Power8 Processors. LPAR with Ubuntu 16.04 LE, 20 cores, SMT=2.

So you have 2 processors, each with 12 cores, and your LPAR uses 20 physical cores, right?

What I don't understand is why Ubuntu and CentOS claim that I have 32 CPUs. Can you show me your lscpu output?

ppc64_cpu --info shows

Core 0: 0 1 2 3 4 5 6 7
Core 1: 8 9 10 11 12 13 14 15
Core 2: 16 17 18 19 20 21 22 23
Core 3: 24 25 26 27 28 29 30 31

So I ran 16 threads (16 lines in config.txt) and attached them to the threads marked with an asterisk, and it boosted my hashrate to ~450 at peak.
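
Picking the asterisk-marked threads can also be scripted. The sketch below assumes that `ppc64_cpu --info` marks online threads with a trailing `*` (as described earlier in this thread; the output pasted above appears to have lost the asterisks), and the sample text inside it is made up for illustration, not captured from a real machine.

```python
import re

# Matches lines such as "Core   0:    0*    1*    2     3"
CORE_LINE = re.compile(r"^Core\s+(\d+):\s*(.*)$")


def online_threads(info_text):
    """Map core index -> list of logical CPU ids marked online with '*'."""
    cores = {}
    for line in info_text.splitlines():
        match = CORE_LINE.match(line.strip())
        if not match:
            continue
        tokens = match.group(2).split()
        cores[int(match.group(1))] = [
            int(token.rstrip("*")) for token in tokens if token.endswith("*")
        ]
    return cores


if __name__ == "__main__":
    # Illustrative SMT=4 layout on a two-core system (hypothetical sample).
    sample = (
        "Core   0:    0*    1*    2*    3*    4     5     6     7\n"
        "Core   1:    8*    9*   10*   11*   12    13    14    15\n"
    )
    for core, cpus in online_threads(sample).items():
        print(f"core {core}: affine_to_cpu candidates {cpus}")
```

The resulting CPU ids are the values to put in `affine_to_cpu`, one config line per id.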

I see that one of your cores at 3.5 GHz mines ~195 H/s, so this confirms your words about 150ish per core.

What still bothers me is the number of cores, but I think the output of your lscpu and ppc64_cpu --info should clarify that.

Balzhur commented 6 years ago

@Arukadox, it's offtopic here, but here goes: S814 machine - stands for 'S' - scale out (small IBM servers), '8' - Power8 processor, '1' - 1 socket (processor), '4' - 4 unit server height.

Forget x86 architecture, it's Power. You have 1 processor that has 4 cores. Each core is capable of running 8 threads and this is regulated by SMT.

Try running SMT2 with little_endian_mode=true and SMT4 with little_endian_mode=false. For me the result is almost the same, so I chose SMT2 (fewer threads, fewer xmr-stak instances for nicehash).

And yes, my S824 machine has 2 processors with 12 cores each. I gave 20 cores to LPAR cause I need the rest to run VIOS and other LPARs.

Linuxes just report how many "CPUs" you have and by CPU they mean "thread".
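
To see the thread-versus-core distinction from inside Linux, one option is to group logical CPUs by their sibling lists. This is a rough sketch: it relies on the standard sysfs file `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, the helper names are made up, and what the file reports inside a PowerVM LPAR at reduced SMT has not been checked here.

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def count_physical_cores(siblings_by_cpu):
    """siblings_by_cpu maps logical CPU id -> its thread_siblings_list text."""
    cores = {tuple(parse_cpu_list(text)) for text in siblings_by_cpu.values()}
    return len(cores)


def read_siblings():
    """Read the sibling list of every logical CPU that exposes one in sysfs."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu_id = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as handle:
            result[cpu_id] = handle.read()
    return result


if __name__ == "__main__":
    siblings = read_siblings()
    print(f"{len(siblings)} logical CPUs on {count_physical_cores(siblings)} cores")
```

On a 4-core machine at SMT8 this would be expected to report 32 logical CPUs on 4 cores, matching the 4 x 8 = 32 arithmetic above.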

Arukadox commented 6 years ago

@Balzhur thanks man now I know everything!

xmrcrypt commented 6 years ago

Has anyone here used perf to analyze L3 cache usage? I've been playing with the newer Intel Xeons, and they've got a great tool for looking at LLC metrics.

CORE IPC MISSES LLC[KB] MBL[MB/s] MBR[MB/s]

Perf gave some decent metrics, but the two tools (intel vs. perf) were way off in terms of cache misses. One thing to note, most of the miners we've been testing have failed to utilize the cache efficiently although they are following correct coding practice...

It would be interesting to see the cache hit rate ratio, especially with optimized miner code. I'd like to discuss further with anyone who's interested in low level cache optimizations with our miner.

thsitthisak commented 6 years ago

Please help to explain, @agangidi53 @Balzhur

| 48 | 28.8 | 29.0 | (na) | 49 | 25.3 | 25.4 | (na) |
| 50 | 28.0 | 29.8 | (na) | 51 | 29.0 | 28.9 | (na) |
| 52 | 28.3 | 28.3 | (na) | 53 | 29.3 | 29.1 | (na) |
| 54 | 28.1 | 28.3 | (na) | 55 | 28.4 | 28.4 | (na) |
| 56 | 29.2 | 29.1 | (na) | 57 | 24.8 | 24.6 | (na) |
| 58 | 29.4 | 29.4 | (na) | 59 | 24.7 | 25.0 | (na) |
| 60 | 28.9 | 28.9 | (na) | 61 | 29.5 | 29.6 | (na) |
| 62 | 27.6 | 27.8 | (na) | 63 | 28.7 | 28.5 | (na) |
| 64 | 29.6 | 29.1 | (na) | 65 | 28.6 | 28.4 | (na) |
| 66 | 28.4 | 28.3 | (na) | 67 | 37.7 | 33.7 | (na) |
| 68 | 27.6 | 27.8 | (na) | 69 | 27.6 | 27.8 | (na) |
| 70 | 26.2 | 26.0 | (na) | 71 | 28.3 | 28.3 | (na) |
| 72 | 28.5 | 28.5 | (na) | 73 | 25.3 | 25.0 | (na) |
| 74 | 28.6 | 28.3 | (na) | 75 | 25.9 | 23.7 | (na) |
| 76 | 25.9 | 26.1 | (na) | 77 | 20.7 | 20.8 | (na) |
| 78 | 27.4 | 27.8 | (na) | 79 | 28.7 | 28.5 | (na) |

Totals: 2244.1 2249.1 (na) H/s
Highest: 2271.3 H/s

This is the result after running the miner, but the hashrate shown on supportXMR.com is zero:

Network: 829.85 MH/s
Pool: 40.54 MH/s
You: 0 H/s