tevador / RandomX

Proof of work algorithm based on random code execution
BSD 3-Clause "New" or "Revised" License

Specification #1

Closed: tevador closed this issue 5 years ago

tevador commented 5 years ago

https://github.com/tevador/RandomX

@timolson @hyc Can you review/comment the initial draft?

There is still a lot of work to do, but it should be a lot simpler than RandomJS to implement and audit while being at least equally ASIC resistant.

hyc commented 5 years ago

Looks pretty good. I wonder if an 8KB program is large enough. This is only 2^18 total permutations. Perhaps you should allow STOre instructions into the program space, and thus self-modifying code.

The spec seems to assume that only valid floating point numbers will ever appear, doesn't seem to account for NaNs or divide-by-zero.

Hashing all registers (essentially all of the VM state) is good. Perhaps we should do this every X instructions, so that intermediate states are required to be executed faithfully.

tevador commented 5 years ago

Looks pretty good. I wonder if an 8KB program is large enough. This is only 2^18 total permutations.

There should be at least ~10^1782 unique opcode sequences (55^1024: with 55 opcodes and 1024 instructions per program, that's 10^(1024 * log10 55) ≈ 10^1782), and the number of unique programs is even larger due to additional instruction parameters.

The intention is for the program to fit into the L1 cache, so there are no stalls during execution. CPUs have ~32 KiB of L1 cache, so there is not much space to make the program larger due to additional overhead.

Perhaps you should allow STOre instructions into the program space, and thus self-modifying code.

I don't think it would make the PoW more CPU-friendly. I think we should allow compilation of the program directly into machine code for maximum performance, and self-modifying code would rule that out.

The spec seems to assume that only valid floating point numbers will ever appear, doesn't seem to account for NaNs or divide-by-zero.

Good point. However, since all floating point numbers are produced only by conversion from uint64_t, only the following operations will result in NaN: 0/0, ∞/∞, 0*∞ or ∞-∞ (link).

I think it's beneficial to also allow for NaN values, so that an ASIC has to implement as much of the IEEE-754 specs as possible.

Integer division by zero has to be dealt with (probably the way RISC-V defines it).
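
For illustration, here is a minimal C sketch of the RISC-V convention (helper names are mine, not from the spec): the quotient of a division by zero is all ones, the remainder is the dividend, and signed overflow wraps instead of trapping.

#include <stdint.h>

/* RISC-V-style division: no traps, fully defined results. */
static uint64_t divu64(uint64_t a, uint64_t b) {
    return b == 0 ? UINT64_MAX : a / b;      /* x / 0 = all ones */
}

static int64_t divs64(int64_t a, int64_t b) {
    if (b == 0) return -1;                   /* x / 0 = -1 */
    if (a == INT64_MIN && b == -1) return a; /* overflow wraps, avoids C UB */
    return a / b;
}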

Hashing all registers (essentially all of the VM state) is good. Perhaps we should do this every X instructions, so that intermediate states are required to be executed faithfully.

I think this should be already achieved by XORing the output registers instead of overwriting them.

SChernykh commented 5 years ago

It looks good overall. My comments from IRC:

the only thing missing there is any mention of L1/L2 caches - only the read-only 4 GB RAM block is mentioned
this virtual machine must also read/write to a 16 KB block of data, and less frequently to another 240 KB block of data, to make use of current generation CPU caches
these numbers (16 KB / 240 KB / 4 GB) can be changed every fork

SChernykh commented 5 years ago

CPUs have ~32 KiB of L1 cache, so there is not much space to make the program larger due to additional overhead.

There is also uOP cache on Intel CPUs starting with Sandy Bridge, and on AMD CPUs starting with Ryzen. It can hold 1536 uOPs (around 1000 average instructions) max. You should account for this. Code execution from uOP cache is faster, and it's slowed down when code doesn't fit in uOP cache.

Reference: https://www.agner.org/optimize/microarchitecture.pdf page 123:

Code that runs out of the µop cache is not subject to the limitations of the fetch and decode units. It can deliver a throughput of 4 (possibly fused) µops or the equivalent of 32 bytes of code per clock cycle.

SChernykh commented 5 years ago

Note on FPU: there is no instruction to set FPU rounding mode. It's easy to add and will force any implementation to support all IEEE-754 rounding modes.

hyc commented 5 years ago

Is it worthwhile to consider vector instruction sets, or is that needless complexity at this point?

SChernykh commented 5 years ago

Vector instructions are not needed at this point. Another note: consider adding AES and SHA instructions. They're both supported on hardware level in current x86/ARM processors.

hyc commented 5 years ago

While we're tossing out suggestions - we could also toggle big/little endianness on different invocations. Though perhaps this would only favor ARM CPUs, and might be too easy for ASICs.

SChernykh commented 5 years ago

@tevador

After C is calculated, every ALU instruction performs r(p3) ^= C ... ... After C is calculated, every FPU instruction performs f(p3) += C ... ... then pops a return value retval from the stack and performs r(p3) ^= retval

What? Why is it needed? I think it's unnecessary complexity. When I execute C=A+B, I don't expect XOR to happen in the end.

hyc commented 5 years ago

That prevents faking execution and skipping intermediate states

SChernykh commented 5 years ago

But how does it prevent it? You basically have 2 instructions by doing this, second instruction is XOR.

SChernykh commented 5 years ago

Adding 16 KB and 240 KB read/write buffers accessible to all instructions will solve "fake execution" problem.

tevador commented 5 years ago

But how does it prevent it? You basically have 2 instructions by doing this, second instruction is XOR.

Normally you would overwrite the target register with the new value. This means that the result of the previous instruction might have been unused. To prevent this, registers are not overwritten, but combined using XOR or addition.
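
A minimal sketch of the difference (illustrative C, not the actual generator output):

uint64_t C = A + B; /* the instruction's primitive operation */

/* overwrite: the previous contents of r[p3] may never be used */
r[p3] = C;

/* combine: skipping any earlier write to r[p3] changes the result */
r[p3] ^= C;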

this virtual machine must also read/write to 16 KB block of data

There is the stack, which could grow to at least several KiB, and the registers together take up 512 bytes.

SChernykh commented 5 years ago

There is the stack, which could grow to at least several KiB, and the registers together take up 512 bytes.

Good, but it'll be better to explicitly use bigger buffers that fit in L1/L2 cache. Instruction results can be stored there.

This means that the result of the previous instruction might have been unused.

Store it in this big buffer instead of doing XOR. It's better because more intermediate data is saved and can later be used by different instructions.

tevador commented 5 years ago

Note on FPU: there is no instruction to set FPU rounding mode. It's easy to add and will force any implementation to support all IEEE-754 rounding modes.

Good suggestion. What are the corresponding x86/ARM instructions?

There is also uOP cache on Intel CPUs starting with Sandy Bridge, and on AMD CPUs starting with Ryzen. It can hold 1536 uOPs (around 1000 average instructions) max. You should account for this. Code execution from uOP cache is faster, and it's slowed down when code doesn't fit in uOP cache.

I think 1000 uOPs should be enough, since branch mispredictions will happen before this can be exhausted.

Good, but it'll be better to explicitly use bigger buffers that fit in L1/L2 cache. Instruction results can be stored there.

If we use a 16 KiB "register file", operand addresses need to be at least 11 bits (16 KiB / 8 bytes per register = 2048 = 2^11 slots). I will try to think how we could encode it.

SChernykh commented 5 years ago

Good suggestion. What are the corresponding x86/ARM instructions?

For x86 it's stmxcsr/ldmxcsr, I don't know how it's controlled on ARM. But ARM must also have corresponding instructions.
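
A minimal portable sketch of what such an instruction could do, assuming the mode is picked from two pseudorandom bits (fesetround compiles down to ldmxcsr/stmxcsr or fldcw on x86; on ARMv8 the rounding mode lives in FPCR):

#include <fenv.h>

/* Map 2 random bits to the four IEEE-754 rounding modes. */
static void setRoundingMode(unsigned bits) {
    static const int modes[4] = {
        FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO
    };
    fesetround(modes[bits & 3]);
}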

I think 1000 uOPs should be enough since branch mispredictions will happen before than this can be exhausted.

The uOP cache handles branches too. As long as all the code fits in the uOP cache, it executes without fetch/decode overhead, which basically turns an x86 CPU into a pure RISC processor. So it's important.

If we use a 16 KiB "register file", operand addresses need to be at least 11 bits. I will try to think how we could encode it.

No, it's better to have explicit memory, like this 4 GB block. We can still have the 4 GB block, just allow writes to the first 256 KB. Implementations will just allocate a separate 256 KB buffer for this.

tevador commented 5 years ago

I have updated the specs:

At the moment, all results written into the integer registers are encrypted using 1 round of AES. This is needed to get pseudorandom read addresses (maybe there is a better way to do it).
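
Roughly, the idea could look like this with AES-NI intrinsics (a sketch under my own assumptions, not the exact construction in the spec):

#include <stdint.h>
#include <wmmintrin.h> /* AES-NI, compile with -maes */

/* One AES round diffuses the instruction result so that its
   low bits can serve as a pseudorandom read address. */
static uint64_t diffuse(uint64_t result, __m128i key) {
    __m128i x = _mm_cvtsi64_si128((int64_t)result);
    x = _mm_aesenc_si128(x, key); /* SubBytes+ShiftRows+MixColumns+AddRoundKey */
    return (uint64_t)_mm_cvtsi128_si64(x);
}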

tevador commented 5 years ago

I have modified the specs based on yesterday's IRC discussion.

I see the following issues that should be resolved at the moment:

  1. The DRAM generation algorithm should be designed so that a light client doesn't need 4 GiB of RAM to verify a hash. I'm not sure if the current idea of AES initialization satisfies this.
  2. The DRAM generation algorithm should be designed so that there is no feasible memory-time tradeoff, for example by calculating the DRAM values on the fly. I'm not sure if the current idea of AES initialization satisfies this.
  3. Probably something like the Cryptonight implode should be adopted for hashing the scratchpad to avoid high result calculation overhead.

The remaining undecided parameters (instruction weights, number of instructions etc.) will require testing.

hyc commented 5 years ago
  1. I'm not sure how this can be achieved, unless you do segmenting as I suggested before, and thus only access a portion of this 4 GiB space at a time. Perhaps using the instruction counter as a segment register. This also assumes that the 4 GiB space is generated in a single pass, with no cross-segment dependencies. That effectively means the entire 4 GiB space only needs to exist in its entirety on a multithreaded miner.
SChernykh commented 5 years ago

That effectively means the entire 4 GiB space only needs to exist in its entirety on a multithreaded miner.

Any miner would need the full 4 GiB space if the used segment changes every nonce. A light client can just generate a segment on the fly to check the hash, but generation needs to be slow enough to make it unfeasible for actual mining.

P.S. On the other hand, segmenting 4 GiB block kind of destroys its purpose. ASIC can just have on-chip block of memory for each worker to store segment.

hyc commented 5 years ago

Actually I'm now questioning what kind of light client we're talking about. What kind of client ever needs to verify a hash?

tevador commented 5 years ago

@hyc For example if you wanted to run a full node on a Raspberry Pi, which has only 1 GB of RAM.

Ethereum generates its "DAG" from a 16 MB chunk and light clients can use just 16 MB to verify a hash at the expense of computation time.

I think if we use Cryptonight initialization algorithm on the whole 4 GiB, verifying a hash without the whole buffer would be too expensive since every 16 byte block depends on 1/8 of all previous blocks.

hyc commented 5 years ago

At this point I think we should be ignoring boxes with such low specs as Raspberry Pi. They all top out at 20MB/sec max I/O speed (on any/all peripherals) which already makes them horrible for syncing/storing/serving the blockchain. Their CPUs lack hardware AES, which makes them even more horrible for running Monero nodes.

There are plenty of cheap single board computers out there that can be used as full nodes, but no version of Raspberry Pi is worthwhile for this application. Lots of products from Pine64.org would do fine, for example https://www.pine64.org/?product=rock64-media-board-computer

tevador commented 5 years ago

Now that I think about it, there is actually a pretty linear memory-time tradeoff. If you store just the first 128 bytes of every 256 KiB chunk, you need just 2 MiB, and recreating a memory block requires encrypting 128 KiB on average. If an average program accesses 32 blocks, you need to AES-encrypt about 4 MiB to verify a hash. That's comparable to Cryptonight.
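
A quick back-of-the-envelope check of those numbers (my arithmetic, following the checkpoint scheme described above):

#include <stdio.h>

int main(void) {
    const double KiB = 1024.0, MiB = 1024 * KiB, GiB = 1024 * MiB;
    double chunks   = 4 * GiB / (256 * KiB); /* 16384 chunks of 256 KiB */
    double stored   = chunks * 128 / MiB;    /* 128 B each -> 2 MiB kept */
    double perBlock = 256 * KiB / 2;         /* average re-encryption: 128 KiB */
    double perHash  = 32 * perBlock / MiB;   /* 32 block accesses -> 4 MiB */
    printf("%.0f chunks, %.0f MiB stored, %.0f MiB of AES per hash\n",
           chunks, stored, perHash);
    return 0;
}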

The question is if this can be abused by ASICs to mine without DRAM.

SChernykh commented 5 years ago

The question is if this can be abused by ASICs to mine without DRAM.

Of course. If you only do the Cryptonight-explode to initialize the 4 GiB block, it's a straightforward time-memory tradeoff. You also need to do a random shuffle for each segment to disable this optimization.

tevador commented 5 years ago

I have pushed an updated specification:

  1. Removed 16-bit operations. 64-bit and 32-bit are supported natively both by x86-64 and ARMv8.
  2. New variants of integer multiplication.
  3. New DRAM reading pattern: 256 bytes sequentially at a time and the next read address is calculated 8 reads ahead of time.
  4. Removed the DCALL instruction. Dynamic jump offsets require an 8 KiB jump table, which would put CPU at a disadvantage compared to an ASIC.
  5. Preliminary instruction weights (~62% ALU, ~25% FPU, ~13% jump).
  6. Scratchpad initialization routine.
tevador commented 5 years ago

The latest commit contains a Python generator, which generates C code for a random program according to the specification (except DRAM, which is not implemented yet).

Run:

python rx2c.py > rx-sample.c
gcc -O2 -maes rx-sample.c -o rx-sample

With 65536 executed VM instructions, the runtime is around 400 µs and varies very little between different programs (Intel Xeon E3-1245). That's the good news.

The bad news is that the compiled program is around 150 KiB, which doesn't fit into the L1I cache as was intended. The code inlines everything except DRAM read, which is a separate call. Still, address generation is a significant overhead compared to just the primitive operation of each VM instruction.

The code size could be decreased by generating optimized assembly code, but I'm not sure if it's possible to fit into 32 KiB with current program size (1024 VM instructions).

tevador commented 5 years ago

Example of the generated code:

RandomX: ADD_32 r(a) = 7, loc(a) = 4, gen = 2, loc(b) = 6, imm1 = 1027856444

C:

r7 ^= (uint32_t)_mm_cvtsi128_si32(G);
addr_t addr = r7;
r7 = __rolq(r7, 32);
G = _mm_shuffle_epi32(G, _MM_SHUFFLE(1, 2, 3, 0));
uint32_t A = SCRATCHPAD_256K(addr).u32;
uint32_t B = 1027856444L;
SCRATCHPAD_16K(addr).u32 = A + B;

x86:

     ba3:       66 0f 7e c0             movd   %xmm0,%eax
     ba7:       66 0f 70 c0 6c          pshufd $0x6c,%xmm0,%xmm0
     bac:       48 31 d0                xor    %rdx,%rax
     baf:       49 89 c5                mov    %rax,%r13
     bb2:       c1 e8 03                shr    $0x3,%eax
     bb5:       89 c2                   mov    %eax,%edx
     bb7:       25 ff 7f 00 00          and    $0x7fff,%eax
     bbc:       49 c1 c5 20             rol    $0x20,%r13
     bc0:       8b 84 c4 c0 00 00 00    mov    0xc0(%rsp,%rax,8),%eax
     bc7:       81 e2 ff 07 00 00       and    $0x7ff,%edx
     bcd:       05 3c d8 43 3d          add    $0x3d43d83c,%eax
     bd2:       49 83 f8 01             cmp    $0x1,%r8
     bd6:       89 84 d4 c0 00 00 00    mov    %eax,0xc0(%rsp,%rdx,8)
SChernykh commented 5 years ago

This is the part where ASIC will be an order of magnitude faster:

r7 ^= (uint32_t)_mm_cvtsi128_si32(G);
addr_t addr = r7;
r7 = __rolq(r7, 32);
G = _mm_shuffle_epi32(G, _MM_SHUFFLE(1, 2, 3, 0));

This is the part where ASIC and CPU will be 1:1

uint32_t A = SCRATCHPAD_256K(addr).u32;
uint32_t B = 1027856444L;
SCRATCHPAD_16K(addr).u32 = A + B;

Address generation - all these rotations and XORs are what slows down a real CPU. It looks very artificial. Maybe just store a random address in the instruction encoding and leave only the XOR? Instructions don't have to be limited to 8 bytes. It's actually better to have each instruction field occupy 8*N bits for easier processing on a real CPU.
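
What the suggested simplification might look like, sketched against the ADD_32 example above (the imm_addr field is hypothetical):

/* The address material comes straight from the instruction encoding;
   per-access work drops from movd/pshufd/rol/xor to a single XOR. */
addr_t addr = (uint32_t)r7 ^ insn.imm_addr;
uint32_t A = SCRATCHPAD_256K(addr).u32;
SCRATCHPAD_16K(addr).u32 = A + 1027856444L;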

tevador commented 5 years ago

Maybe just store random address in instruction encoding and leave only XOR?

That might help to bring the code size down at the cost of reduced memory access randomness. I will test it.

tevador commented 5 years ago

I managed to decrease the main program size to ~35 KiB of x86 code (when compiled with -Os in GCC). With manually optimized assembly, the whole program should easily fit into the L1 cache.

Changes:

  1. Removed dynamic address generation. Static address masks are encoded within the instruction.
  2. Instruction size increased to 16 bytes due to the above change.
  3. Thanks to more available instruction space, all flags are aligned to an 8-bit boundary.
  4. Added separate location flag for storing the result.
  5. Decreased program size to 512 VM instructions.
  6. Maximum jump distance decreased to 128.
  7. Increased program runtime to 1048576 instructions.
  8. Simplified scratchpad address calculation.
  9. Updated initialization and result calculation algorithms.

With the above changes, the runtime is about 3 ms per program on an Intel Xeon E3-1245; ~100 µs of that is scratchpad initialization.

An average program reads about 4 MiB of DRAM, so the current DRAM read speed is about 1.3 GiB/s (4 MiB per ~3 ms run, assuming the runtime stays the same with actual DRAM reading). This may still increase as the machine code is optimized and instruction weights are tweaked.

Example of the new code:

//MULH_64
r6 ^= 71330693;
addr_t addr = r6;
uint64_t A = readDram(&mmu, addr).u64;
uint64_t B = r2;
r1 = ((uint128_t)A * B) >> 64;

x86

     700:   48 81 74 24 10 85 6b    xorq   $0x4406b85,0x10(%rsp)
     707:   40 04 
     709:   4c 8d 8c 24 a0 00 00    lea    0xa0(%rsp),%r9
     710:   00 
     711:   8b 74 24 10             mov    0x10(%rsp),%esi
     715:   4c 89 cf                mov    %r9,%rdi
     718:   e8 e3 87 00 00          callq  8f00 <readDram>
     71d:   49 f7 e6                mul    %r14
     720:   49 83 f8 01             cmp    $0x1,%r8
     724:   49 89 d5                mov    %rdx,%r13
tevador commented 5 years ago

I have added actual DRAM buffer into the generator. Can someone help with testing?

Generate a random program: python rx2c.py > rx-sample.c.

Test with fake DRAM reads:

gcc -O2 -maes rx-sample.c -o rx-sample
./rx-sample

Test with real DRAM reads (requires more than 4 GiB of RAM):

gcc -O2 -DRAM -maes rx-sample.c -o rx-sample
./rx-sample

Please post your hardware configuration with the results. Hardware AES is required.

My tests seem to show that prefetching is very efficient because the DRAM version runs faster than the version without DRAM, which instead uses one multiplication, one addition and one rotation to calculate a pseudorandom value.
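
The look-ahead from item 3 of the earlier change list maps naturally onto software prefetch; a hedged sketch with illustrative names (dram and nextAddress stand in for the generator's MMU state):

#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch */

#define LOOKAHEAD 8 /* addresses are computed 8 reads ahead */

extern const uint8_t *dram;        /* the 4 GiB buffer */
extern uint64_t nextAddress(void); /* pseudorandom address generation */

static uint64_t pending[LOOKAHEAD];

/* Return the i-th 256-byte block; its address was issued 8 calls
   ago, so the cache lines should already be in flight. */
static const uint8_t *dramRead(uint64_t i) {
    uint64_t addr = pending[i % LOOKAHEAD];
    pending[i % LOOKAHEAD] = nextAddress();
    _mm_prefetch((const char *)&dram[pending[i % LOOKAHEAD]], _MM_HINT_NTA);
    return &dram[addr];
}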

SChernykh commented 5 years ago

I get this error:

Traceback (most recent call last):
  File "rx2c.py", line 796, in <module>
    writeCode(file, i, CodeSymbol(random.getrandbits(128)))
  File "rx2c.py", line 624, in writeCode
    opcodeMap.get(symbol.opcode)(file, i, symbol)
  File "rx2c.py", line 345, in write_CALL
    file.write("\t\t\tgoto i_{0};\n".format((i + 1 + (symbol.imm0 & (PROGRAM_SIZE/4 - 1))) & (PROGRAM_SIZE - 1)));
TypeError: unsupported operand type(s) for &: 'int' and 'float'

Edit: fixed it by writing (PROGRAM_SIZE>>2)-1 on line 345

SChernykh commented 5 years ago

I compiled it on Windows (msys64 + gcc 8.2.0). Version without DRAM:

r0 = 544554143                            f0 = 3.56861e+037
r1 = 4895808875850857830                  f1 = -6.20976e+035
r2 = 4057236428693614864                  f2 = -7.45159e+018
r3 = 8143837159020859850                  f3 = 6.20976e+035
r4 = 12328198475199347488                 f4 = 4.8873e+018
r5 = 5766562821171912795                  f5 = 1.06391e+019
r6 = 13055555769842653631                 f6 = 6.20976e+035
r7 = 8351295290705984606                  f7 = -6.20976e+035
scratchpad sum = 8113792448439975883
runtime: 0.000000

Runtime 0.000000?

tevador commented 5 years ago

@SChernykh My guess is that clock() doesn't have sufficient resolution on Windows. On Linux, it gives microsecond precision.
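
On Windows, a minimal replacement sketch (roughly what the next comment did; error handling omitted):

#ifdef _WIN32
#include <windows.h>

/* QueryPerformanceCounter resolves well below a microsecond, while
   clock() on Windows ticks far too coarsely for a millisecond-scale run. */
static double seconds(void) {
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart / (double)freq.QuadPart;
}
#endif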

SChernykh commented 5 years ago

Replaced clock() with QueryPerformanceCounter(). Intel Core i5-3210M, version without DRAM:

r0 = 544554143                            f0 = 3.56861e+037
r1 = 4895808875850857830                  f1 = -6.20976e+035
r2 = 4057236428693614864                  f2 = -7.45159e+018
r3 = 8143837159020859850                  f3 = 6.20976e+035
r4 = 12328198475199347488                 f4 = 4.8873e+018
r5 = 5766562821171912795                  f5 = 1.06391e+019
r6 = 13055555769842653631                 f6 = 6.20976e+035
r7 = 8351295290705984606                  f7 = -6.20976e+035
scratchpad sum = 8113792448439975883
runtime: 0.003240
SChernykh commented 5 years ago

Version with DRAM = segmentation fault. I have 6GB RAM on this notebook.

Makkari-V commented 5 years ago

Version without DRAM: Segmentation fault (core dumped)
Version with DRAM: DRAM buffer allocation failed (system has less than 4 GB memory)

System specification: Memory: 3.8 GB, Intel Core i3-4005U CPU @ 1.70GHz x 4 (I'm running Ubuntu 18.04.1 on a partition alongside Windows)

tevador commented 5 years ago

TypeError: unsupported operand type(s) for &: 'int' and 'float'

Fixed. I guess it depends on the Python version.

Version with DRAM = segmentation fault.

Sometimes segfault happens due to stack overflow because the VM stack is only 256 KiB. I build with GCC 6.3.0 and it runs without issues. There might be some UB in the C code, though. Can you try gdb?

Makkari-V commented 5 years ago

okay. how?

SChernykh commented 5 years ago

Same program without DRAM, Ryzen 5 2600:

r0 = 544554143                            f0 = 3.56861e+037
r1 = 4895808875850857830                  f1 = -6.20976e+035
r2 = 4057236428693614864                  f2 = -7.45159e+018
r3 = 8143837159020859850                  f3 = 6.20976e+035
r4 = 12328198475199347488                 f4 = 4.8873e+018
r5 = 5766562821171912795                  f5 = 1.06391e+019
r6 = 13055555769842653631                 f6 = 6.20976e+035
r7 = 8351295290705984606                  f7 = -6.20976e+035
scratchpad sum = 8113792448439975883
runtime: 0.002521
tevador commented 5 years ago

okay. how?

Recompile: gcc -g -maes rx-sample.c -o rx-sample
Run: gdb rx-sample
Then in the console, type run
When it crashes, do backtrace

SChernykh commented 5 years ago

Compiled it in Ubuntu, this time even no-DRAM version crashed:

#0  0x000055555555fc8e in _mm_store_si128 (__B=..., __P=0x7ffffff7de58) at /usr/lib/gcc/x86_64-linux-gnu/7/include/emmintrin.h:714
#1  aesInitialize (key=0x55555555ff20 <aesKey>, seed=0x55555555ff40 <aesSeed>, output=0x7ffffff7de58, count=262144) at rx-sample.c:141
#2  0x0000555555554732 in main () at rx-sample.c:187
tevador commented 5 years ago

@Makkari-V @SChernykh It's been fixed in the latest commit.

I forgot that _mm_store_si128 requires a 16-byte aligned address, so it had a 50% chance of crashing depending on how gcc aligned the scratchpad or DRAM buffer.
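
One general way to fix this (a sketch of the technique, not necessarily the actual commit; SCRATCHPAD_SIZE is assumed from the generated code):

#include <stdio.h>
#include <stdlib.h>

/* On POSIX systems, request 16-byte alignment explicitly instead of
   relying on how gcc happens to place the buffer. */
void *scratchpad = NULL;
if (posix_memalign(&scratchpad, 16, SCRATCHPAD_SIZE) != 0) {
    fputs("scratchpad allocation failed\n", stderr);
    exit(1);
}
/* Alternatively, _mm_storeu_si128 tolerates unaligned addresses
   at a small performance cost. */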

SChernykh commented 5 years ago

@tevador The no-RAM version still crashes, and the RAM version prints "DRAM buffer allocation failed". I fixed it by correcting the DRAM_SIZE definition: #define DRAM_SIZE (1ULL << 32) - you have to use 1ULL to make sure it's 64-bit on all platforms.

Edit: and RAM version also crashes when initializing buffer.

Makkari-V commented 5 years ago

[screenshot: 2018-11-11 18-44-20]

SChernykh commented 5 years ago

This time it crashed on PUSH_ADDRESS(&&i_50):

    i_49: { //CALL
        if(0 == ic--) goto end;
        r3 ^= 3280375020;
        addr_t addr = r3;
        uint64_t A = readDram(&mmu, addr).u64;
            PUSH_VALUE(A);
            PUSH_ADDRESS(&&i_50);
            goto i_121;
        }
tevador commented 5 years ago

@SChernykh Can you do print sp in gdb? Maybe the VM stack ran out. @Makkari-V Your scratchpad is still not 16-byte aligned (I can see address 7ffffff7dd78). Are you using the updated generator?

Makkari-V commented 5 years ago

Ran it again:

r0 = 3768568599862903236                  f0 = 1.03431e+37
r1 = 18031090911360467476                 f1 = -9.75396e+36
r2 = 5144317967385062238                  f2 = -4.86703e+37
r3 = 1145352110                           f3 = 2.47697e+18
r4 = 18195049297340700480                 f4 = 7.02899e+36
r5 = 18446744070852387881                 f5 = 5.32254e+18
r6 = 16098064435052689540                 f6 = 2.16523e+09
r7 = 17840958837419859279                 f7 = -1.91341e+28
scratchpad sum = 3247213043754887765
runtime: 0.016748

Makkari-V commented 5 years ago

With DRAM: it still gives "DRAM buffer allocation failed".