Possible ASIC design - Githubissues

tevador commented 5 years ago

EDIT: This design is outdated and no longer applies to the current RandomX version.

Similar idea was originally proposed by @cjdelisle for a GPU miner, but I think it's more applicable for an ASIC design.

RandomX ASIC miner:

~66 MiB of SRAM on-chip
4 GiB of HBM memory
1 dedicated core for dataset expansion
256 small decoder/scheduler cores
28 worker cores

During dataset expansion, the SRAM is used to store the 64 MiB cache, while dataset blocks are loaded into HBM memory.

When mining, the ASIC runs 256 programs in parallel. Memory allocation:

256 * 256 KiB scratchpad = 64 MiB of SRAM for scratchpads
256 * 8 KiB = 2 MiB of SRAM for program buffers
32 KiB for register files
~256 KiB for program stack (1 KiB per program) = maximum recursion depth of 64

Instructions are loaded into the decoder/scheduler core (each core has its own program buffer, program counter and register file). The scheduler cores handle only CALL and RET instructions and pass the rest to one of the 28 worker cores.

Each of the 28 worker cores implements exactly one RandomX instruction pipelined for throughput which matches the instruction weight (for example MUL_64 worker can handle 21 times more instructions per clock than DIV_64 worker). Each worker has an instruction queue fed by the scheduler cores.

The speed of the decoder/scheduler cores would be designed to keep every worker core 100% utilized.

Some complications of the design:

CALL/RET will stall the program if there is dependency on an instruction which is still in worker queue
FPROUND will stall all subsequent floating point instructions
instructions which use the same address register will create a dependency chain

There could be also some dedicated load/store ports for loading instruction operands.

~If limited only by HBM bandwidth, this ASIC could do around 120 000 programs per second (480 GiB/s memory read rate), or roughly the same as 30 Ryzen 1700 CPUs. This assumes that sufficient compute power can fit on a single chip. If we estimate power consumption at 300 W (= Vega 64), this ASIC would be around 9 times more power efficient than a CPU.~

Improved estimate: https://github.com/tevador/RandomX/issues/11#issuecomment-450039447

Dislaimer: I'm not an ASIC designer.

cjdelisle commented 5 years ago

The 66MiB of SRAM is probably a killer, but I suspect it can be avoided by doing async instructions and thousands or hundreds of thousands of threads.

BTW credit for this idea should go to: https://github.com/aggregate/MOG/

tevador commented 5 years ago

@cjdelisle 100 000 parallel threads would require ~24 GiB of memory just for scratchpads. Also if you store the scratchpad in HBM, your memory bandwidth usage will increase ~~by 50%~~ 3 times.

You could also compromise by storing only the first 16 KiB of scratchpad in SRAM, which would decrease the required amount of on-chip memory to ~6 MiB, which is probably more realistic.

cjdelisle commented 5 years ago

The idea is to use threads to replace the latency bottleneck with a memory bandwidth bottleneck.. This is basically the ring processor idea but I think the von neumann bottleneck eventually degrades to a sorting problem where you need to sort the opcodes and operands to be next to each other and then gather the completed operand/opcode bundles into a queue which is then fed to the processor, creating more opcodes with need to sort more operands...

SChernykh commented 5 years ago

66 MiB of SRAM is nothing unusual for ASIC, the speedup from SRAM will outweigh bigger chip area by large margin. Scrypt ASICs had 144 MiB SRAM per core, IIRC.

SChernykh commented 5 years ago

Anyway, HBM memory + memory controller + 256 CPU-like cores with 66 MiB SRAM on chip sound very similar to a GPU. It'll be still limited by computing power, not memory bandwidth. Power efficiency (H/s/W) will be maybe 2-3 times better than a CPU.

tevador commented 5 years ago

@SChernykh A GPU would be compute-bound because it cannot run RandomX efficiently. Most 64-bit instructions have to be emulated on GPUs and double precision runs 16 times slower than single precision. ASIC, on the other hand, can execute 1 instruction per cycle in ideal case.

@cjdelisle RandomX was not designed to be latency-bound. That's why the dataset is accessed sequentially. Only scratchpad access is latency-bound, but it can fit into SRAM.

Also the instructions executed in RandomX are not independent. There will be random dependency chains, typically due to instructions using the same register (very rarely using the same scratchpad word). This would slightly complicate the design.

I have contacted @timolson to help us assess the viability and possible performance of a RandomX ASIC.

timolson commented 5 years ago

On first glance it looks much improved from RandomJS, being a much closer model of the underlying hardware.

However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space. This is similar to CryptoNight and favors ASIC development. The limiting factor will probably be this scratchpad area, not the logic, and as you pointed out, 66MiB for a bunch of cores is no problem, being only about 41 mm2 in a 16nm process. If we conservatively assume logic doubles the area then an 80 mm2 chip might cost around $10-15 packaged. You're gonna crush CPUs with this.

One way to prevent this parallelism is to make the large DRAM table read/write. Then you need a memcontroller and DRAM set for every core, which is closer to the setup of consumer hardware. Being able to isolate the running program to just the chip die makes for a nice, efficient ASIC. Where ASIC's fall down is when they have to wait for external IO. Once you hit the logic board, it's like a Ferrari in rush hour traffic.

Another option is to somehow force nonces to be tried in serial instead of parallel. Then an ASIC can't beat CPU's by merely adding cores. An ASIC design could still be cost-efficient by eliminating the excess cores on a CPU and all the gates for cache control. Or maybe there's a way to limit parallelism to exactly 8 or some chosen number of cores. This would make CPU's closer to the optimal configuration for the problem.

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM. Also, parallelism may be discovered and exploited in the programs if there's not enough register overlap. You might consider using fewer registers that are updated more frequently to address this. Again, I'm not sure if it's an actual problem with your design, because I didn't look closely enough, but it should be mentioned.

A couple other miscellaneous comments:

I wouldn't waste die area or ASIC development time on the large table expansion. It can be done offline with a CPU... A miner will have an SoC controller anyway which would handle this.
HBM is a specialized process that's rather expensive. GDDR5 is probably a more efficient choice. GPU manufacturers live by "better performance" not "hashes-per-dollar," so they have moved to HBM in search of peak performance.
GPUs will be limited by their SRAM, which is 2MB per Nvidia SM, with only 64kB accessible to any one thread block. AMD's are less limiting but still problematic. Both CPUs and ASICs can outperform GPU's in this area. The scratchpad sizes you're using effectively prevent GPUs from being efficient.
Double precision float math on GPUs is 2-32x slower than single precision, but CPU's handle 64-bit math very well. Most GPUs also suck at all integer multiplication (addition is fine), although the GPGPU movement is changing this somewhat. The new 2000 series somewhat better, but they still emphasize float performance (deep learning training is primarily floating point multiplication.) So integer multiplication is another way CPU's and ASIC's can outperform GPU's. Here you can see instruction throughput numbers for Nvidia Notice how integer multiplication is so slow they don't even give a number! They just say "multiple instructions." Bit operations are also slow on GPU's. GPU's are really focused on float performance, and many games use half-precision 16-bit floats. It looks "good enough."

I'm currently writing a GPU miner for Grin, and since the genesis block is January 15th, I don't have much time to look deeper until later in January or February, sorry. I can quickly address specific concerns if you want to point something out, or if I overlooked something critical in my very brief review.

timolson commented 5 years ago

One more comment: An ASIC can have the "correct ratio" of logic handlers based on the frequency and latency of various instructions used in RandomX, which may be different from CPU's. As a simple example, let's assume only two instructions, int multiply and int add, randomly selected 50/50. If multiply takes 4 cycles and add takes 1, then an ASIC will have 4 mul units for every 1 add unit, whereas a CPU gets one of each. That may not be strictly true, but you should tune your probabilities such that the probability_of_instruction / latency_in_cpu is the same for all instructions. In the above case, you want adds to be 4x more frequent than multiplies (assuming the CPU has one of each)

timolson commented 5 years ago

Aaaaand one more thing... Although the 66 MiB chip you proposed would be $10-15, it's only gonna clock around 1GHz. Intel and AMD definitely do have lots of IP and scale to do efficient layouts and squeeze the maximum speeds out of their designs. If you can fix the multi-core problem running lots of programs in parallel, then a startup ASIC maker, even Bitmain, will not get close to the CPU performance of the incumbents. But you probably need to get within a factor of 3 or so.

tevador commented 5 years ago

@timolson Thanks for the review.

However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space.

Yes and assuming you can read from the dataset quickly enough. The dataset is too big to be stored on-chip, so external memory is unavoidable.

One way to prevent this parallelism is to make the large DRAM table read/write.

That is difficult to do while also allowing hash verification for light clients who might not have enough RAM. But perhaps a 1 GB read/write buffer would be possible. Even phones have at least 2 GB nowadays.

Another option is to somehow force nonces to be tried in serial instead of parallel.

I'm not aware of any way how to achieve this. I don't think it's possible without some central authority handing out nonces.

Currently, parallelism is limited only by DRAM bandwidth.

If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM.

I don't think this is possible. Scratchpad read/write addresses are calculated from register values, which depend on previous results. Dataset is already being read sequentially and the reads cannot be reordered.

There are only 8 registers for address generation and register values change every instruction, so the sequence of independent instructions will be very short, perhaps 2-3 instructions.

Regarding GPU performance, I already suspected most of what you wrote.

SChernykh commented 5 years ago

But perhaps a 1 GB read/write buffer would be possible.

Per thread? Or one buffer for all threads? How are you going to synchronize them?

timolson commented 5 years ago

Currently, parallelism is limited only by DRAM bandwidth.

Make sure it's tuned such that typical DDR4 SODIMM speeds align with 8-core parallelism and I'd say you're getting close. However, GDDR5 and HBM crush DDR4 for bandwidth-per-dollar, so if you pin the PoW to memory bandwidth, CPUs will lose out. One way to address that dilemma may be to use random DRAM access instead of sequential. Some of GDDR's improved speed comes from using wider rows and some comes from a wider bus, but DDR4 is competitive for random access patterns. If you're only reading a single word at a time, it doesn't matter that GDDR grabs 32 words while DDR only gets 8, or whatever. They both have random access latencies of 40-45 ns.

SChernykh commented 5 years ago

The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size.

timolson commented 5 years ago

The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size.

Longer reads will favor GDDR & HBM because they have wider busses and also can push more bits-per-pin-per-cycle. I would suggest something smaller than 64 bytes, which would need 512 pin-cycles in DDR4 and only 128 pin-cycles in GDDR5. This is wider than consumer SODIMMs. 16 bytes is probably safe.

hyc commented 5 years ago

A 64byte burst is optimal for a typical 64bit memory channel, DDR4 is designed for 8-bit bursts per pin. While a GPU can grab that in a single cycle with a 512bit bus, it won't be able to burst any of its accesses.

SChernykh commented 5 years ago

I think the ideal case would be 64-byte random accesses that also saturate dual-channel DDR4 bandwidth on 8-core CPU.

tevador commented 5 years ago

@timolson If we make random memory accesses (latency-bound), the CPU cores will be basically idle and you can make an ASIC with 1% of power consumption of a CPU.

tevador commented 5 years ago

@SChernykh

Per thread? Or one buffer for all threads? How are you going to synchronize them?

One buffer per thread. So 8 GiB of memory for 8 parallel threads.

timolson commented 5 years ago

It depends on the frequency of reads also, right? If you're reading often, then yes. But what about semi-infrequent random DRAM reads? You can tune the number of computations-per-DRAM read to match what a CPU core can do. In this way, it is not DRAM-bound by either latency or bandwidth. The idea is similar to ProgPoW in this regard, where they tuned the number of random math ops per memory access to match GPU capabilities.

tevador commented 5 years ago

Fair enough.

Currently, it takes ~90 W of compute power (14 nm Ryzen CPU with 16 threads) to achieve ~16 GiB/s of DRAM read speed, which is about half of what dual channel DDR4 can do.

If you use GDDR5/HBM, you can easily read 20x faster, but how are you going to match the compute speed? Even if you improve efficiency by a factor of 2 over a CPU, that's 900 W of compute power.

RandomX uses primitive operations (add, sub, mul, div,, floating point), so I don't think you can make an ASIC much more power efficient than that. At most you can cut out some of the CPU parts like TLB, L3 cache, memory controller and IO.

cjdelisle commented 5 years ago

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM.

This could be a significant risk if it is feasible to run hundreds or thousands of threads with DRAM because one need not do any dependency analysis, just schedule each thread for one instruction only.

Creating large datasets is a good solution but it is not really possible to require the dataset to be used by one thread only because in order for a large datasets to be worth creating and storing in the first place, it needs to be reusable. Serial re-usability is going to require that your verifier performs the whole series of operations which is probably a non-starter so you end up having to allow at least a few hundred parallel executions to use the same buffer...

tevador commented 5 years ago

BTW, random reads already happen in RandomX. There is (on average) one random read per 2¹³ (8192) sequential reads.

cjdelisle commented 5 years ago

I made this little drawing of what I think a high latency high parallelism processor could look like: https://pixelfed.social/p/cjd/24845 Perhaps I'm wrong and smarter people will point out the error, but my intuition is that there's no way to win against this type of design. Even if your mining algorithm was "compile linux" (my informal benchmark for general purpose compute performance), I see no reason that one couldn't run 16k or more GCC processes with this type of architecture, given enough DRAM (especially with hardware scatter/gather support)

AFAICT there is no way to de-parallelize the mining beyond requiring memory for the verification process. If you require a different 50MB dataset per-nonce then the verifier needs 50MB and the solver using this architecture can run (AVAILABLE_DRAM / 50MB) parallel threads.

The method of requiring a precomputed dataset which is reusable for more than one nonce falls down because either the solver parallelizes all allowable permutations for one precomputed dataset or (if the number of allowed permutations is too low) he doesn't bother to precompute it at all and simply tests the same way as the verifier.

tevador commented 5 years ago

I improved the ASIC design estimate based on comments from @timolson.

Let's start with memory. We need 4 GiB of GDDR5 for maximum bandwidth. At least 4 memory chips are required since the capacity is 8 Gb per chip. Each chip has a 32-bit interface, so our maximum memory bandwidth will be 4 32 8 Gb/s = 128 GiB/s, assuming 2000 MHz memory.

The memory can support up to 128 GiB / 4 MiB = 32 768 programs per second. Now let's try to make a chip that has enough compute capability to actually push out 32 thousand programs per second.

I started with the AMD Zen die, which has an area of 7 mm². If we remove all cache, we have around 4 mm². Let's say we can optimize the design down to 2 mm² per core.

We know that the Ryzen core can do ~500 programs per second at 3350 MHz. Since our ASIC will run only at 1 GHz, our optimized core can do only ~150 programs per second.

We need ~218 such cores to saturate the memory bus. This amounts to about ~436 mm².

Additionally, we will need ~40 mm² of SRAM and a GDDR5 memory controller. The DDR4 memory controller in Ryzen is ~15 mm², so let's say we can make a minimal controller with just 5 mm².

In total, we have a huge ~480 mm² die, which is about the same size as a Vega 64 GPU.

Price estimate: ~$150 for the die ~$75 for memory (based on prices from digikey.com) ~$30 PCB + power delivery ~$25 cooling

Total per mining board: ~$280. This doesn't include any R&D or IP licensing costs.

We can safely assume a power consumption of around 300 W per board, same as a Vega 64 at full load.

Hashes per Joule:

Ryzen 1700: ~40
ASIC board: ~100

So about 2.5 times more efficient. And this is the best case scenario.

hyc commented 5 years ago

Zen+ should be about 10% more efficient than Zen. Has anyone tested a Ryzen 2700 yet?

Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses. The throughput will be much less with random accesses. I think your area estimates for the ASIC are overly optimistic, as there'll be quite a complicated interconnect network to attach all those cores to the memory etc.

tevador commented 5 years ago

Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses

Actually, there are only 218 cores. The design above assumes scratchpads are stored in SRAM, so a maximum of 256 programs can be run in parallel. The GDDR5 memory is just for the dataset, which is read mostly sequentially.

If you wanted to run thousands of programs in parallel, you'd have to store scratchpads in GDDR5 and use the design by @cjdelisle to hide random access latencies. However, in this case you would need 12 GDDR5 chips per board to get enough capacity (12 GiB) and bandwidth (384 GiB/s). The cost of the memory chips alone would be over $200 per board. Power consumption would probably also increase because GDDR5 is power hungry.

I think your area estimates for the ASIC are overly optimistic,

Yes, it's maybe too optimistic. The 2.5x efficiency figure is an upper estimate for an ASIC.

I still think a bandwidth-limited design is the way to go. If the design was purely latency-bound, an ASIC could use much cheaper DDR3 memory. This can be seen in Antminer E3.

cjdelisle commented 5 years ago

This seems like a reasonable design for the tech, consider that you can eliminate caches, registers and even split the components of the ALU into circuits (direct add insns to the adders, mul insns to the multipliers, etc). You need SRAM mostly for router-like buffers because the chip is basically a network.

Generally speaking, I think your approach of focusing on power consumption is a good heuristic to fit the problem to the hardware you have (though it might be worth also watching int ops and float ops to make sure there are no shortcuts). I'm hoping to fit the problem to the hardware I want to have so my design will be slightly different, focusing more on branching / prediction and use of lots of instructions with somewhat less power consumption.

That said, my whole design falls down if it turns out that the high bandwidth wiring is prohibitively expensive.

tevador commented 5 years ago

I'm experimenting with doubling the memory bandwidth requirements by increasing dataset read size from 8 to 16 bytes.

The performance drop for CPUs depends on the available memory bandwidth. On Ryzen 1700, it seemed to hit a bandwidth bottleneck with dual channel DDR4-2400, so I upgraded to DDR4-2933.

Here are the performance numbers:

dataset read	threads	memory	programs/s	read rate
8 bytes	8	DDR4-2400	3200	13 GB/s
8 bytes	16	DDR4-2400	4450	18 GB/s
16 bytes	16	DDR4-2400	3400	28 GB/s
16 bytes	8	DDR4-2933	3200	27 GB/s
16 bytes	16	DDR4-2933	3900	33 GB/s

With 16 threads, it's still slightly bandwidth-limited even with 2933 MHz memory. It seems that 3200 or 3466 MHz memory might be needed.

For the ASIC design, this would mean either halving the performance to ~16K programs per second per board (with corresponding halving of die area) or a forced upgrade to 256-bit GDDR5 interface with 8 memory chips, which would double the memory cost to ~$150 per board and put more strain on inter-core bandwidth.

One drawback of this change is that the execution units of CPUs would be slightly underutilized, which would make it easier for an ASIC to match the compute requirements. This could be solved by adding more compute per VM instruction.

What do you think about this change?

SChernykh commented 5 years ago

3200-3466 MHz memory is becoming more common now. We should aim for maxing out available DDR4 dual channel bandwidth and compensate with more computing to load CPU if needed.

timolson commented 5 years ago

For random access, the relevant number is transactions per second rather than bandwidth, because of random access latencies of about 40-45 ns.

On Sun, Dec 30, 2018 at 12:33 AM SChernykh notifications@github.com wrote:

3200-3466 MHz memory is becoming more common now. We should aim for maxing out available DDR4 dual channel bandwidth and compensate with more computing to load CPU if needed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tevador/RandomX/issues/11#issuecomment-450546459, or mute the thread https://github.com/notifications/unsubscribe-auth/ABnlW19Z2IroKoi8XZ1d5b2tF-Fl5Nihks5u-HpigaJpZM4ZgZlF .

tevador commented 5 years ago

Doubling the memory access width also increases the amount of data that a light client needs to generate on the fly when verifying. This would increase verification time.

Even with the current 8-byte reads, light client verification takes ~50 ms, which is a bit too slow for practical use. I think we need to aim for light client verification under 20 ms, otherwise RandomX is not a good replacement for Cryptonight.

With random accesses, the amount of data to be generated during verification could be lower. However, random accesses cannot saturate RAM bandwidth. CPUs simply don't have enough cores for that. This would increase the advantage of ASICs, because they would run more programs in parallel than a CPU for a given RAM budget.

timolson commented 5 years ago

However, random accesses cannot saturate RAM bandwidth. CPUs simply don't have enough cores for that. This would increase the advantage of ASICs, because they would run more programs in parallel than a CPU for a given RAM budget.

It would let ASIC’s use DDR3 like you said, but extra programs in parallel is not possible. ASIC’s would still be bound by the transactions-per-second of DRAM random access. Whether you write 1 byte or 8, it takes DRAM the same time to address the row. ASIC’s could use lots of small DRAM chips, sharding memory more than SODIMMs do to increase TPS, but this also costs wires/pins and there’s a practical limit on pins-per-die-area. On a per-transaction basis, DDR3 is about 75% the cost of DDR4 for the same amount of memory.

From your performance table, it looks like mostly sequential access, yes? Your point about light client validation is a good one... Why not try short, fully-random reads? You lose a little capex to DDR3 but you prevent the high-compute-parallelization problem.

** ASIC’s can have custom DRAMs on- or off-chip which are designed to reduce random access latency, but they would sacrifice bandwidth, density, and/or power. Custom DRAMs are also expensive compared to the highly commoditized, standardized DDR4 chips.

tevador commented 5 years ago

From your performance table, it looks like mostly sequential access, yes?

Yes, on average, 64 KiB are read sequentially before jumping to another dataset block.

Let's assume we want to max out the available DRAM transaction rate. I will use 40 ns per read transaction per memory channel as you wrote above.

That means dual channel memory can do 50 million random reads per second.

Now let's pair the memory with an 8 core CPU running at 4 GHz. We have 32G core-cycles per second, so we have to do 1 random read per 640 cycles.

Since we are doing completely random reads from a 4 GiB table, we can assume every read will be a cache miss. According to this, L3 cache miss penalty for Ryzen is 40 cycles + 90 ns. That means each core will be sitting idle for 400 cycles with each read. This gives us only 640 - 400 = 240 cycles when something can be done. CPU usage will be less than 40%.

This might be slighly offset by SMT or by making the accesses prefetchable, but you can see it will be much harder to keep the execution units busy with random reads.

SChernykh commented 5 years ago

According to this, L3 cache miss penalty for Ryzen is 40 cycles + 90 ns. That means each core will be sitting idle for 400 cycles with each read. This gives us only 640 - 400 = 240 cycles when something can be done. CPU usage will be less than 40%.

If you fetch data, do something with this data for 640 cycles and fetch next data - then yes. But if you you fetch data 400 cycles before you start processing it, there will be no idle time.

tevador commented 5 years ago

if you you fetch data 400 cycles before you start processing it, there will be no idle time

400 cycles is more than 50 RandomX instructions. Even if you somehow avoid data dependency for the next 50 instructions, I'm not sure if current CPUs can keep instruction backlog for that many cycles without stalling. The out-of-order buffers have limited capacity.

SChernykh commented 5 years ago

What dependency, just run prefetch instruction and then access the same memory 400 cycles later.

tevador commented 5 years ago

What dependency, just run prefetch instruction and then access the same memory 400 cycles later.

The 400 cycles figure will be different for each CPU. It can be anywhere from 100 to 500 cycles depending on RAM type and CPU clock. RandomX VM doesn't know about clock cycles, so the delay would have to be specified in terms of VM instructions. Each CPU will run a different number of VM instructions per cycle depending on the microarchitecture.

If you prefetch too early, the data might be already evicted from the cache by the time you want to use it. If you prefetch too late, the CPU will stall.

SChernykh commented 5 years ago

It can be anywhere from 100 to 500 cycles depending on RAM type and CPU clock.

So prefetch it at least 500 cycles before. L1 cache is big enough, fresh data simply can't be evicted from L1 after only 500 cycles.

tevador commented 5 years ago

I have pushed some test code with random accesses.

The memory access routine reads one random dataset cacheline (64 bytes) per 64 VM instructions. Reading happens 50 VM instructions after prefetching. With this delay, the read is esentially "free" because the data seems to be always ready in L1 cache.

However, the random access version is still ~25% slower than the sequential access version.

Additional ~15% slowdown comes from TLB thrashing. This can be fixed by using large pages (--largePages command line option).

Tested on Intel Core i7-8550U with 4 threads:

programs/s
sequential	1750
random - 2M pages	1310
random - 4K pages	1100

Most of the slowdown comes from this routine, which needs to be run every VM instruction. Particularly lines 229-236:

    mov eax, ebp
    and al, 63
    jz short L_prefetch     ;# "ic" divisible by 64 -> prefetch
    xor edx, edx
    cmp al, 14
    je short L_read         ;# "ic" = 14 (mod 64) -> random read
    cmovb edx, ecx          ;# "ic" < 14 (mod 64) -> modify random read address
    xor edi, edx

These 8 instructions account for most of the slowdown. I haven't found a way to make it run faster. Is there some optimization I'm missing?

tevador commented 5 years ago

Increased to ~1340 programs/s by changing and al, 63 to and eax, 63 and cmp al, 14 to cmp eax, 14.

SChernykh commented 5 years ago

@tevador Maybe just combine prefetch and read at the same step, i.e. read from current address and prefetch from the next address as soon as next address is available.

tevador commented 5 years ago

@SChernykh I have combined prefetch and read into one step. Performance increases to ~1440 programs/s with 4 threads. This can be increased up to ~1540 using 8 threads, which shows that a lot of the CPU resources are idle.

Compare that to the sequential version, which actually slows down from 1750 (4 threads) to 1710 (8 threads) This is because L1 and L2 cache sizes permit only 1 scratchpad per core on most Intel CPUs.

Overall, the sequential version is still 10-15% faster. Any other ideas?

If we cannot improve performance any further, does the benefit of random accesses outweigh the fact that CPUs will be underutilized? This would increase potential advantage of ASICs in terms of performance per watt or per die area.

tevador commented 5 years ago

I managed to increase the performance to ~1900 programs/s (8 threads) by partially inlining the calls to the memory read routines (now there is only 1 call per 64 VM instructions). The code size has slightly exceeded 32 KiB, though. If I can reproduce the performance numbers on AMD, I think there are no arguments left against random accesses.

1900 programs per second corresponds to roughly 6.3 cycles per VM instruction on this CPU (3.1-3.2 GHz).

BTW, without the memory read call, the performance is up to ~2700 programs/s.

tevador commented 5 years ago

I tested the random accesses on Ryzen 1700 with dual channel DDR4-2933.

1 random read per N VM instructions (64 bytes per read).

N	programs/s	MT/s	MB/s	power [W]	IPC	comment
8	1318	173	11000	63	0.46	memory-bound
16	2400	157	10000	78	0.72	memory-bound
32	3910	128	8200	91	1.05
64	4550	74	4770	92	1.15
128	4820	39	2530	93	1.19	compute bound?
256	4900	20	1280	92	1.20	compute-bound

MT/s = millions of memory transactions per second MB/s = memory read rate power = system power consumption over idle IPC = x86 instructions per clock (measured with perf)

Dual channel DDR4 can actually handle a lot more than 100 million transactions per second. Wikipedia says DDR4 has 2-4 memory bank groups, which might increase the parallelism that can be achieved over what we calculated before.

The current choice of N = 64 seems to be a good compromise, allowing faster CPUs to not hit memory bottleneck, while still noticeably limiting the possible parallelism per memory channel.

tevador commented 5 years ago

To allow a light client implementation, it must be possible to generate dataset items on the fly with 64-byte granularity for random accesses.

To keep roughly the same dataset initialization time, I can AES encrypt 64 bytes a maximum of 64 times for a total of 4 KiB of AES to generate one cache line on the fly. For one program, this equals 80 MiB of AES (including scratchpad initialization).

The question is if an ASIC could avoid having 4 GiB of RAM by simply having enough AES cores to generate dataset items on the fly?

To match a Ryzen 1700, the ASIC would need 64 MiB of SRAM and about 360 GiB/s of just AES ecryption capability. Is this feasible?

SChernykh commented 5 years ago

@tevador Developers of ProgPoW expressed their opinion on RandomX: https://imgur.com/a/WUe87rQ

timolson commented 5 years ago

lol I don't disagree with them, but it's the pot calling the kettle black. I'm not convinced by ProgPoW either.

hyc commented 5 years ago

One problem more naive people seem to have in discussing these things is treating bandwidth and latency as independent variables, when they are in fact inseparable. I suppose you have to have studied network engineering to really appreciate that fact, but it plays a huge factor in parallel computing.

I personally have already had the experience of programming for armies of "simple" cores in parallel, from 1024 CPU Thinking Machine Connection Machines in the 1990s, to 2048 CPU Oracle M8 just a couple years ago and every massively parallel implementation in between. They all always fall short of their theoretical throughput because control networks don't scale so well, and Amdahl's Law always bites hard. The Ultrasparc T1000 also tried a similar strategy and also failed to deliver. In the real world, you need individual cores to be fast, you can't just get away with thousands of slow cores. Aside from the actual computation, the control of sequencing is extremely demanding; the communication overhead starts to overshadow the computation cost.

timolson commented 5 years ago

What's the quote? Something like: "A supercomputer turns a compute-bound problem into an IO bound one."

SChernykh commented 5 years ago

@tevador Why don't we have 2 MB scratchpad (3 levels of cache: 16 KB -> 256 KB -> 2048 KB)? ARM cores can't be configured with more than 1 MB cache IIRC, and even specialized ASIC won't be very small anymore. Also, 2 MB scratchpad makes use of all CPU cache just like Cryptonight.

tevador / RandomX

Possible ASIC design #11