Looks pretty good. I wonder if an 8KB program is large enough. This is only 2^18 total permutations. Perhaps you should allow STOre instructions into the program space, and thus self-modifying code.
The spec seems to assume that only valid floating point numbers will ever appear, doesn't seem to account for NaNs or divide-by-zero.
Hashing all registers (essentially all of the VM state) is good. Perhaps we should do this every X instructions, so that intermediate states are required to be executed faithfully.
Looks pretty good. I wonder if an 8KB program is large enough. This is only 2^18 total permutations.
There should be at least ~10^1782 unique opcode sequences (55^1024) and the number of unique programs is even larger due to additional instruction parameters.
The intention is for the program to fit into the L1 cache, so there are no stalls during execution. CPUs have ~32 KiB of L1 cache, so there is not much space to make the program larger due to additional overhead.
Perhaps you should allow STOre instructions into the program space, and thus self-modifying code.
I don't think it would make the PoW more CPU-friendly. I think we should allow compilation of the program directly into machine code for maximum performance.
The spec seems to assume that only valid floating point numbers will ever appear, doesn't seem to account for NaNs or divide-by-zero.
Good point. However, since all floating point numbers are produced only by conversion from uint64_t, only the following operations will result in NaN: 0/0, ∞/∞, 0*∞ or ∞-∞ (link).
I think it's beneficial to also allow for NaN values, so that an ASIC has to implement as much of the IEEE-754 specs as possible.
Integer division by zero has to be dealt with (probably the way RISC-V defines it).
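For illustration, a minimal sketch of RISC-V-style semantics for the integer corner cases (this is only an assumption of how the spec could define them, not the spec itself):

#include <stdint.h>

/* Hypothetical helpers: RISC-V defines x / 0 = -1 (all bits set),
   x % 0 = x, and INT64_MIN / -1 = INT64_MIN, so no operand combination
   can ever trap. */
static inline int64_t vm_div64(int64_t a, int64_t b) {
    if (b == 0) return -1;
    if (a == INT64_MIN && b == -1) return INT64_MIN; /* overflow case */
    return a / b;
}

static inline uint64_t vm_divu64(uint64_t a, uint64_t b) {
    return b == 0 ? UINT64_MAX : a / b;
}

With definitions like these, no division needs a trap handler, so the hot path stays branch-predictable.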
Hashing all registers (essentially all of the VM state) is good. Perhaps we should do this every X instructions, so that intermediate states are required to be executed faithfully.
I think this should be already achieved by XORing the output registers instead of overwriting them.
It looks good overall. My comments from IRC:
the only thing missing there is any mentions of L1/L2 caches
Only read-only 4 GB RAM block
this virtual machine must also read/write to 16 KB block of data and less frequently to another 240 KB block of data to make use of current generation CPU caches
these numbers (16 KB/240 KB/4 GB) can be changed every fork
CPUs have ~32 KiB of L1 cache, so there is not much space to make the program larger due to additional overhead.
There is also uOP cache on Intel CPUs starting with Sandy Bridge, and on AMD CPUs starting with Ryzen. It can hold 1536 uOPs (around 1000 average instructions) max. You should account for this. Code execution from uOP cache is faster, and it's slowed down when code doesn't fit in uOP cache.
Reference: https://www.agner.org/optimize/microarchitecture.pdf page 123:
Code that runs out of the µop cache is not subject to the limitations of the fetch and decode units. It can deliver a throughput of 4 (possibly fused) µops or the equivalent of 32 bytes of code per clock cycle.
Note on FPU: there is no instruction to set FPU rounding mode. It's easy to add and will force any implementation to support all IEEE-754 rounding modes.
Is it worthwhile to consider vector instruction sets, or is that needless complexity at this point?
Vector instructions are not needed at this point. Another note: consider adding AES and SHA instructions. They're both supported on hardware level in current x86/ARM processors.
While we're tossing out suggestions - we could also toggle big/little endianness on different invocations. Though perhaps this would only favor ARM CPUs, and might be too easy for ASICs.
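For what it's worth, toggling endianness would amount to little more than a byte swap on loads/stores for a little-endian host (a sketch; the helper name is hypothetical):

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a "big-endian invocation" on a little-endian host is
   just a byte swap on every load/store, which an ASIC gets essentially for
   free. */
static inline uint64_t load64(const uint8_t *p, int big_endian) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return big_endian ? __builtin_bswap64(v) : v;
}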
@tevador
After C is calculated, every ALU instruction performs r(p3) ^= C ...
After C is calculated, every FPU instruction performs f(p3) += C ...
... then pops a return value retval from the stack and performs r(p3) ^= retval
What? Why is it needed? I think it's unnecessary complexity. When I execute C=A+B, I don't expect XOR to happen in the end.
That prevents faking execution and skipping intermediate states
But how does it prevent it? You basically have 2 instructions by doing this, second instruction is XOR.
Adding 16 KB and 240 KB read/write buffers accessible to all instructions will solve the "fake execution" problem.
But how does it prevent it? You basically have 2 instructions by doing this, second instruction is XOR.
Normally you would overwrite the target register with the new value. This means that the result of the previous instruction might have been unused. To prevent this, registers are not overwritten, but combined using XOR or addition.
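A minimal sketch of the difference (illustrative only; the helper names are made up):

#include <stdint.h>

/* Overwrite: the previous contents of *reg are dead, so whatever produced
   them could have been skipped by a cheating implementation. */
static inline void write_overwrite(uint64_t *reg, uint64_t result) {
    *reg = result;
}

/* XOR-combine: the previous contents stay on the dependency chain, so every
   intermediate state has to be computed faithfully. */
static inline void write_combine(uint64_t *reg, uint64_t result) {
    *reg ^= result;
}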
this virtual machine must also read/write to 16 KB block of data
There is the stack, which could grow to at least several KiB and the registers are together 512 bytes.
There is the stack, which could grow to at least several KiB and the registers are together 512 bytes.
Good, but it'll be better to explicitly use bigger buffers that fit in L1/L2 cache. Instruction results can be stored there.
This means that the result of the previous instruction might have been unused.
Store it in this big buffer instead of doing XOR. It's better because more intermediate data is saved and can later be used by different instructions.
Note on FPU: there is no instruction to set FPU rounding mode. It's easy to add and will force any implementation to support all IEEE-754 rounding modes.
Good suggestion. What are the corresponding x86/ARM instructions?
There is also uOP cache on Intel CPUs starting with Sandy Bridge, and on AMD CPUs starting with Ryzen. It can hold 1536 uOPs (around 1000 average instructions) max. You should account for this. Code execution from uOP cache is faster, and it's slowed down when code doesn't fit in uOP cache.
I think 1000 uOPs should be enough since branch mispredictions will happen before this can be exhausted.
Good, but it'll be better to explicitly use bigger buffers that fit in L1/L2 cache. Instruction results can be stored there.
If we use a 16 KiB "register file", operand addresses need to be at least 11 bits. I will try to think how we could encode it.
Good suggestion. What are the corresponding x86/ARM instructions?
For x86 it's stmxcsr/ldmxcsr, I don't know how it's controlled on ARM. But ARM must also have corresponding instructions.
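For reference, a hedged sketch of what the proposed rounding-mode instruction would boil down to in portable C; on x86, fesetround ends up touching MXCSR (ldmxcsr/stmxcsr), on ARM the rounding field of FPCR/FPSCR:

#include <fenv.h>
#include <stdio.h>

int main(void) {
    volatile double a = 1.0, b = 3.0; /* volatile prevents constant folding */
    const int modes[] = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]); /* what the proposed VM instruction would map to */
        printf("mode %d: 1/3 = %.20f\n", modes[i], a / b);
    }
    return 0;
}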
I think 1000 uOPs should be enough since branch mispredictions will happen before this can be exhausted.
uOP cache handles branches too. As long as all code fits in the uOP cache, it'll execute without fetch/decode overhead, which makes an x86 CPU basically a pure RISC processor. So it's important.
If we use a 16 KiB "register file", operand addresses need to be at least 11 bits. I will try to think how we could encode it.
No, it's better to have explicit memory. Like this 4 GB block of memory. We can still have this 4 GB block, just allow writes to first 256 KB. Implementations will just allocate separate 256 KB buffer for this.
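A possible way an implementation could expose a writable 256 KB prefix while keeping the rest of the 4 GB block read-only (a sketch; the names and exact sizes are illustrative, not from the spec):

#include <stdint.h>

#define WRITABLE_SIZE (256ULL * 1024)               /* hypothetical writable prefix */
#define DRAM_SIZE     (4ULL * 1024 * 1024 * 1024)

typedef struct {
    const uint8_t *dram;       /* read-only 4 GiB block */
    uint8_t       *scratchpad; /* separate 256 KB buffer backing the writable prefix */
} vm_memory;

/* Reads below WRITABLE_SIZE hit the scratchpad copy, everything else the
   read-only block; writes are only allowed below WRITABLE_SIZE. */
static inline uint8_t vm_read8(const vm_memory *m, uint64_t addr) {
    addr &= DRAM_SIZE - 1;
    return addr < WRITABLE_SIZE ? m->scratchpad[addr] : m->dram[addr];
}

static inline void vm_write8(vm_memory *m, uint64_t addr, uint8_t value) {
    m->scratchpad[addr & (WRITABLE_SIZE - 1)] = value;
}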
I have updated the specs:
At the moment, all results written into the integer registers are encrypted using 1 round of AES. This is needed to get pseudorandom read addresses (maybe there is a better way to do it).
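For illustration, one round of AES over a register value could look roughly like this with AES-NI (a sketch only; the actual round key and encoding are whatever the spec defines):

#include <stdint.h>
#include <emmintrin.h>
#include <wmmintrin.h> /* AES-NI, compile with -maes */

/* Illustrative only: mix an instruction result with one AES round so the
   low bits can serve as a pseudorandom read address. The round key here is
   arbitrary, not the one from the spec. */
static inline uint64_t aes_mix(uint64_t result, __m128i round_key) {
    __m128i x = _mm_cvtsi64_si128((long long)result);
    x = _mm_aesenc_si128(x, round_key);
    return (uint64_t)_mm_cvtsi128_si64(x);
}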
I have modified the specs based on yesterday's IRC discussion.
g0-g3. I see the following issues that should be resolved at the moment:
The remaining undecided parameters (instruction weights, number of instructions etc.) will require testing.
That effectively means the 4 GiB space only needs to exist in its entirety on a multithreaded miner.
Any miner would need the full 4 GiB space if the used segment changes every nonce. A light client can just generate a segment on the fly to check the hash, but generation needs to be slow enough to make it unfeasible for actual mining.
P.S. On the other hand, segmenting the 4 GiB block kind of destroys its purpose. An ASIC can just have an on-chip block of memory for each worker to store a segment.
Actually I'm now questioning what kind of light client we're talking about. What kind of client ever needs to verify a hash?
@hyc For example if you wanted to run a full node on a Raspberry Pi, which has only 1 GB of RAM.
Ethereum generates its "DAG" from a 16 MB chunk and light clients can use just 16 MB to verify a hash at the expense of computation time.
I think if we use Cryptonight initialization algorithm on the whole 4 GiB, verifying a hash without the whole buffer would be too expensive since every 16 byte block depends on 1/8 of all previous blocks.
At this point I think we should be ignoring boxes with such low specs as Raspberry Pi. They all top out at 20MB/sec max I/O speed (on any/all peripherals) which already makes them horrible for syncing/storing/serving the blockchain. Their CPUs lack hardware AES, which makes them even more horrible for running Monero nodes.
There are plenty of cheap single board computers out there that can be used as full nodes, but no version of Raspberry Pi is worthwhile for this application. Lots of products from Pine64.org would do fine, for example https://www.pine64.org/?product=rock64-media-board-computer
When I'm thinking about it, there is actually a pretty linear memory-time tradeoff. If you store just the first 128 bytes of every 256 KiB chunk, you need just 2 MiB and recreating a memory block requires encrypting 128 KiB on average. If an average program accesses 32 blocks, you need to AES-encrypt about 4 MiB to verify a hash. That's comparable to Cryptonight.
The question is if this can be abused by ASICs to mine without DRAM.
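A quick sanity check of that arithmetic (the constants are just the ones mentioned above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t dram_size  = 4ULL << 30;  /* 4 GiB */
    const uint64_t chunk_size = 256 << 10;   /* 256 KiB chunks */
    const uint64_t kept       = 128;         /* bytes stored per chunk */
    uint64_t chunks   = dram_size / chunk_size;  /* 16384 */
    uint64_t storage  = chunks * kept;           /* 2 MiB total */
    uint64_t regen    = chunk_size / 2;          /* 128 KiB encrypted on average */
    uint64_t per_hash = 32 * regen;              /* ~4 MiB for 32 accessed blocks */
    printf("stored: %llu KiB, regen per block: %llu KiB, per hash: %llu MiB\n",
           (unsigned long long)(storage >> 10),
           (unsigned long long)(regen >> 10),
           (unsigned long long)(per_hash >> 20));
    return 0;
}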
The question is if this can be abused by ASICs to mine without DRAM.
Of course. If you only do Cryptonight-explode to initialize the 4 GiB block, it's a straightforward time-memory tradeoff. You also need to do a random shuffle per segment to disable this optimization.
I have pushed an updated specification:
The latest commit contains a Python generator, which generates C code for a random program according to the specification (except DRAM, which is not implemented yet).
Run:
python rx2c.py > rx-sample.c
gcc -O2 -maes rx-sample.c -o rx-sample
With 65536 executed VM instructions, the runtime is around 400 µs and varies very little between different programs (Intel Xeon E3-1245). That's the good news.
The bad news is that the compiled program is around 150 KiB, which doesn't fit into the L1I cache as was intended. The code inlines everything except DRAM read, which is a separate call. Still, address generation is a significant overhead compared to just the primitive operation of each VM instruction.
The code size could be decreased by generating optimized assembly code, but I'm not sure if it's possible to fit into 32 KiB with current program size (1024 VM instructions).
Example of the generated code:
RandomX:
ADD_32 r(a) = 7, loc(a) = 4, gen = 2, loc(b) = 6, imm1 = 1027856444
C:
r7 ^= (uint32_t)_mm_cvtsi128_si32(G);
addr_t addr = r7;
r7 = __rolq(r7, 32);
G = _mm_shuffle_epi32(G, _MM_SHUFFLE(1, 2, 3, 0));
uint32_t A = SCRATCHPAD_256K(addr).u32;
uint32_t B = 1027856444L;
SCRATCHPAD_16K(addr).u32 = A + B;
x86:
ba3: 66 0f 7e c0 movd %xmm0,%eax
ba7: 66 0f 70 c0 6c pshufd $0x6c,%xmm0,%xmm0
bac: 48 31 d0 xor %rdx,%rax
baf: 49 89 c5 mov %rax,%r13
bb2: c1 e8 03 shr $0x3,%eax
bb5: 89 c2 mov %eax,%edx
bb7: 25 ff 7f 00 00 and $0x7fff,%eax
bbc: 49 c1 c5 20 rol $0x20,%r13
bc0: 8b 84 c4 c0 00 00 00 mov 0xc0(%rsp,%rax,8),%eax
bc7: 81 e2 ff 07 00 00 and $0x7ff,%edx
bcd: 05 3c d8 43 3d add $0x3d43d83c,%eax
bd2: 49 83 f8 01 cmp $0x1,%r8
bd6: 89 84 d4 c0 00 00 00 mov %eax,0xc0(%rsp,%rdx,8)
This is the part where ASIC will be an order of magnitude faster:
r7 ^= (uint32_t)_mm_cvtsi128_si32(G);
addr_t addr = r7;
r7 = __rolq(r7, 32);
G = _mm_shuffle_epi32(G, _MM_SHUFFLE(1, 2, 3, 0));
This is the part where ASIC and CPU will be 1:1
uint32_t A = SCRATCHPAD_256K(addr).u32;
uint32_t B = 1027856444L;
SCRATCHPAD_16K(addr).u32 = A + B;
Address generation - all these rotations and XORs are what slows down a real CPU. It looks very artificial. Maybe just store random address in instruction encoding and leave only XOR? Instructions don't have to be limited to 8 bytes. It's actually better to have each instruction field occupy 8*N bits for easier processing on a real CPU.
Maybe just store random address in instruction encoding and leave only XOR?
That might help to bring the code size down at the cost of reduced memory access randomness. I will test it.
I managed to decrease the main program size to ~35 KiB of x86 code (when compiled with -Os in GCC). With manually optimized assembly, the whole program should easily fit into the L1 cache.
Changes:
With the above changes, the runtime is about 3 ms per program on Intel Xeon E3-1245. ~100 µs of that is scratchpad initialization.
An average program reads about 4 MiB of DRAM, so current DRAM read speed is about 1.3 GiB/s (assuming runtime will stay the same with actual DRAM reading). This may still increase as the machine code is optimized and instruction weights are tweaked.
Example of the new code:
//MULH_64
r6 ^= 71330693;
addr_t addr = r6;
uint64_t A = readDram(&mmu, addr).u64;
uint64_t B = r2;
r1 = ((uint128_t)A * B) >> 64;
x86
700: 48 81 74 24 10 85 6b xorq $0x4406b85,0x10(%rsp)
707: 40 04
709: 4c 8d 8c 24 a0 00 00 lea 0xa0(%rsp),%r9
710: 00
711: 8b 74 24 10 mov 0x10(%rsp),%esi
715: 4c 89 cf mov %r9,%rdi
718: e8 e3 87 00 00 callq 8f00 <readDram>
71d: 49 f7 e6 mul %r14
720: 49 83 f8 01 cmp $0x1,%r8
724: 49 89 d5 mov %rdx,%r13
I have added an actual DRAM buffer to the generator. Can someone help with testing?
Generate a random program: python rx2c.py > rx-sample.c
Test with fake DRAM reads:
gcc -O2 -maes rx-sample.c -o rx-sample
./rx-sample
Test with real DRAM reads (requires more than 4 GiB of RAM):
gcc -O2 -DRAM -maes rx-sample.c -o rx-sample
./rx-sample
Please post your hardware configuration with the results. Hardware AES is required.
My tests seem to show that prefetching is very efficient because the DRAM version runs faster than the version without DRAM, which instead uses one multiplication, one addition and one rotation to calculate a pseudorandom value.
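If the speedup indeed comes from prefetching, the generated code presumably issues the DRAM load (or an explicit prefetch) well before the value is consumed; a hedged sketch of an explicit software prefetch:

#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch */

/* Hypothetical helper: issue a prefetch for the next DRAM address as soon as
   it is known, so the cache line is already present when the consuming
   instruction executes. */
static inline void prefetch_dram(const uint8_t *dram, uint64_t addr) {
    _mm_prefetch((const char *)(dram + addr), _MM_HINT_T0);
}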
I get this error:
Traceback (most recent call last):
File "rx2c.py", line 796, in <module>
writeCode(file, i, CodeSymbol(random.getrandbits(128)))
File "rx2c.py", line 624, in writeCode
opcodeMap.get(symbol.opcode)(file, i, symbol)
File "rx2c.py", line 345, in write_CALL
file.write("\t\t\tgoto i_{0};\n".format((i + 1 + (symbol.imm0 & (PROGRAM_SIZE/4 - 1))) & (PROGRAM_SIZE - 1)));
TypeError: unsupported operand type(s) for &: 'int' and 'float'
Edit: fixed it by writing (PROGRAM_SIZE>>2)-1 on line 345.
I compiled it on Windows (msys64 + gcc 8.2.0). Version without DRAM:
r0 = 544554143 f0 = 3.56861e+037
r1 = 4895808875850857830 f1 = -6.20976e+035
r2 = 4057236428693614864 f2 = -7.45159e+018
r3 = 8143837159020859850 f3 = 6.20976e+035
r4 = 12328198475199347488 f4 = 4.8873e+018
r5 = 5766562821171912795 f5 = 1.06391e+019
r6 = 13055555769842653631 f6 = 6.20976e+035
r7 = 8351295290705984606 f7 = -6.20976e+035
scratchpad sum = 8113792448439975883
runtime: 0.000000
Runtime 0.000000?
@SChernykh My guess is that clock() doesn't have sufficient resolution on Windows. On Linux, it gives microsecond precision.
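A possible high-resolution replacement on Windows (a sketch using the standard Windows API; the placement of the timing calls is illustrative):

#include <stdio.h>
#include <windows.h>

int main(void) {
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    /* ... run the generated program here ... */
    QueryPerformanceCounter(&end);
    printf("runtime: %f\n",
           (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart);
    return 0;
}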
Replaced clock() with QueryPerformanceCounter(). Intel Core i5-3210M, version without DRAM:
r0 = 544554143 f0 = 3.56861e+037
r1 = 4895808875850857830 f1 = -6.20976e+035
r2 = 4057236428693614864 f2 = -7.45159e+018
r3 = 8143837159020859850 f3 = 6.20976e+035
r4 = 12328198475199347488 f4 = 4.8873e+018
r5 = 5766562821171912795 f5 = 1.06391e+019
r6 = 13055555769842653631 f6 = 6.20976e+035
r7 = 8351295290705984606 f7 = -6.20976e+035
scratchpad sum = 8113792448439975883
runtime: 0.003240
Version with DRAM = segmentation fault. I have 6GB RAM on this notebook.
Version without DRAM: Segmentation fault (core dumped)
Version with DRAM: DRAM buffer allocation failed (-System has less than 4GB Memory)
System specification: Memory: 3.8 GB, Intel Core i5-4005U CPU @ 1.70GHz x 4 (I'm running Ubuntu 18.04.1 on a partition with Windows)
TypeError: unsupported operand type(s) for &: 'int' and 'float'
Fixed. I guess it depends on python version.
Version with DRAM = segmentation fault.
Sometimes segfault happens due to stack overflow because the VM stack is only 256 KiB. I build with GCC 6.3.0 and it runs without issues. There might be some UB in the C code, though. Can you try gdb?
okay. how?
Same program without DRAM, Ryzen 5 2600:
r0 = 544554143 f0 = 3.56861e+037
r1 = 4895808875850857830 f1 = -6.20976e+035
r2 = 4057236428693614864 f2 = -7.45159e+018
r3 = 8143837159020859850 f3 = 6.20976e+035
r4 = 12328198475199347488 f4 = 4.8873e+018
r5 = 5766562821171912795 f5 = 1.06391e+019
r6 = 13055555769842653631 f6 = 6.20976e+035
r7 = 8351295290705984606 f7 = -6.20976e+035
scratchpad sum = 8113792448439975883
runtime: 0.002521
okay. how?
Recompile: gcc -g -maes rx-sample.c -o rx-sample
Run: gdb rx-sample
Then in the console, type run
When it crashes, do backtrace
Compiled it in Ubuntu, this time even no-DRAM version crashed:
#0 0x000055555555fc8e in _mm_store_si128 (__B=..., __P=0x7ffffff7de58) at /usr/lib/gcc/x86_64-linux-gnu/7/include/emmintrin.h:714
#1 aesInitialize (key=0x55555555ff20 <aesKey>, seed=0x55555555ff40 <aesSeed>, output=0x7ffffff7de58, count=262144) at rx-sample.c:141
#2 0x0000555555554732 in main () at rx-sample.c:187
@Makkari-A @SChernykh It's been fixed in the latest commit.
I forgot that _mm_store_si128 requires a 16-byte aligned address, so it had a 50% chance of crashing depending on how gcc aligned the scratchpad or DRAM buffer.
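One way to make the alignment explicit regardless of compiler layout (a sketch; not necessarily the fix used in the commit):

#include <stdlib.h>
#include <xmmintrin.h> /* _mm_malloc / _mm_free */

/* Allocate the scratchpad / DRAM buffer with explicit 16-byte alignment so
   _mm_store_si128 can never fault, regardless of how the compiler lays out
   the stack or globals. */
static void *alloc_aligned16(size_t size) {
    return _mm_malloc(size, 16);
}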
@tevador no-RAM version still crashes, RAM version prints "DRAM buffer allocation failed". I fixed it by fixing DRAM_SIZE definition: #define DRAM_SIZE (1ULL << 32) - you have to use 1ULL to make sure it's 64 bit on all platforms.
Edit: and RAM version also crashes when initializing buffer.
This time it crashed on PUSH_ADDRESS(&&i_50):
i_49: { //CALL
if(0 == ic--) goto end;
r3 ^= 3280375020;
addr_t addr = r3;
uint64_t A = readDram(&mmu, addr).u64;
PUSH_VALUE(A);
PUSH_ADDRESS(&&i_50);
goto i_121;
}
@SChernykh Can you do print sp in gdb? Maybe the VM stack ran out.
@Makkari-A Your scratchpad is still not 16-byte aligned (I can see address 7ffffff7dd78). Are you using the updated generator?
Ran it again:
r0 = 3768568599862903236 f0 = 1.03431e+37
r1 = 18031090911360467476 f1 = -9.75396e+36
r2 = 5144317967385062238 f2 = -4.86703e+37
r3 = 1145352110 f3 = 2.47697e+18
r4 = 18195049297340700480 f4 = 7.02899e+36
r5 = 18446744070852387881 f5 = 5.32254e+18
r6 = 16098064435052689540 f6 = 2.16523e+09
r7 = 17840958837419859279 f7 = -1.91341e+28
scratchpad sum = 3247213043754887765
runtime: 0.016748
with DRAM: it still gives "DRAM buffer allocation failed".
https://github.com/tevador/RandomX
@timolson @hyc Can you review/comment the initial draft?
There is still a lot of work to do, but it should be a lot simpler than RandomJS to implement and audit while being at least equally ASIC resistant.