taking advantage of modern CPU stuff

Gingeropolous commented 5 years ago

So in the effort to reduce the gap between a consumer CPU and a custom CPU, one approach is exploiting some of the stuff baked in to modern CPUs. This probably just buys us time, similar to the AES-NI of cryptonight.

For instance, does randomX do things that take advantage of

CLMUL instruction set

or any of the things on the extension list?

https://en.wikipedia.org/wiki/Kaby_Lake

tevador commented 5 years ago

All 16 SSE registers are currently occupied for floating point math, so there is no room to add CLMUL support without reducing floating point.

The main program loop of RandomX uses only basic operations that are common in all 64 bit CPUs including ARMv8.

From the list of extensions, only SSE2 and AES-NI are used at the moment (AES-NI only during scratchpad init/hash steps).

tevador commented 5 years ago

This is the list of operations I'm working with at the moment. _R suffix refers to a register operand, _M to a memory operand and _C means a constant operand.

    namespace InstructionType {
        constexpr int IADD_R = 0;
        constexpr int IADD_M = 1;
        constexpr int IADD_RC = 2;
        constexpr int ISUB_R = 3;
        constexpr int ISUB_M = 4;
        constexpr int IMUL_9C = 5;
        constexpr int IMUL_R = 6;
        constexpr int IMUL_M = 7;
        constexpr int IMULH_R = 8;
        constexpr int IMULH_M = 9;
        constexpr int ISMULH_R = 10;
        constexpr int ISMULH_M = 11;
        constexpr int IDIV_C = 12;
        constexpr int ISDIV_C = 13;
        constexpr int INEG_R = 14;
        constexpr int IXOR_R = 15;
        constexpr int IXOR_M = 16;
        constexpr int IROR_R = 17;
        constexpr int IROL_R = 18;
        constexpr int ISWAP_R = 19;
        constexpr int FPSWAP_R = 20;
        constexpr int FPADD_R = 21;
        constexpr int FPADD_M = 22;
        constexpr int FPSUB_R = 23;
        constexpr int FPSUB_M = 24;
        constexpr int FPNEG_R = 25;
        constexpr int FPMUL_R = 26;
        constexpr int FPMUL_M = 27;
        constexpr int FPDIV_R = 28;
        constexpr int FPDIV_M = 29;
        constexpr int FPSQRT_R = 30;
        constexpr int COND_R = 31;
        constexpr int COND_M = 32;
        constexpr int CFROUND = 33;
        constexpr int ISTORE = 34;
        constexpr int FSTORE = 35;
        constexpr int NOP = 36;
    }

The following x86 instructions can be generated with the above list of RandomX instructions:

x86

mov, add, sub, lea, mul, imul, and, shr, sar, sbb, neg,
or, xor, ror, rol, xchg, cmp, setbe, seta, sets, setns, 
seto, setno, setl, setge

sse2

shufpd, xorpd, andpd, addpd, 
subpd, mulpd, divpd, sqrtpd, 
maxpd, movapd, cvtdq2pd, ldmxcsr

Gingeropolous commented 5 years ago

@tevador , i apologize if this is naive, but is randomx design currently taking advantage of out of order execution machinery that seems to be built into all modern CPUs? From what I gather of branch prediction, branch prediction is sort of part of OOOE.

I guess in general the question is : what else is on a CPU that we haven't utilized that would make building an ASIC effectively the same as building a modern CPU? The critique is always "an asic designer can just strip away all the extra stuff we don't use." - well, what can be done with randomx so it uses everything?

tevador commented 5 years ago

Out of order execution is absolutely used by RandomX. The main body of the program runs with 2.1+ instructions per clock (IPC). An in-order machine would achieve only around 0.5 IPC due to frequent memory and dependency stalls.

Branch prediction was investigated early in RandomX development and we didn't find a way how to take advantage of it, so it's currently not explicitly used. See https://github.com/tevador/RandomX/issues/2

an asic designer can just strip away all the extra stuff we don't use

Let's take an 8 core Ryzen CPU as an example.

The whole CPU die looks like this:

An ASIC can remove: PCI-E controller, USB, South Bridge, SATA controller and most of the Infinity Fabric circuits (interchip communication). There is no way to use these in RandomX.

If we look at the CCX:

Large part of the CCX is the L3 cache, which is fully used by RandomX.

Zen core:

Here: L1D and L2 are fully used, L1I cache is not used 100% - instead the uOP cache is used to save power. Whole ALU is used, load-store unit, scheduler and a large part of the FPU.

What can be removed: branch prediction unit (BPU) and a small part of the FPU that is not used (like CLMUL and other extensions). The decoder can be smaller than x86, but since we are using the uOP cache, the decoders are powered down most of the time anyways.

Overall, I think RandomX uses a lot of the CPU circuitry. Even if we used some exotic instruction sets, the wasted die area would not decrease much as it's dominated by uncore parts (SATA, PCI-E controllers etc.).

Gingeropolous commented 5 years ago

thanks for the thorough response! I apologize for missing that testing round for branch prediction. I may go ahead and do it just to see what the numbers are for different processors.

tevador / RandomX

taking advantage of modern CPU stuff #16