tevador / RandomX

Proof of work algorithm based on random code execution
BSD 3-Clause "New" or "Revised" License

Single-chip ASIC design #31

Closed · tevador closed this 5 years ago

tevador commented 5 years ago

I came across this discussion on the Grin forum: https://www.grin-forum.org/t/obelisk-grn1-chip-details/4571

The conversation contains a lot of interesting information. The main takeaway is that ASIC manufacturers prefer single-chip designs even if that may not seem like the best approach.

For RandomX, I see two possible options for what an efficient ASIC might look like:

  1. A single-chip design with lots of SRAM, similar to the GRIN ASIC. It would run in the "light" mode with the 256 MiB cache entirely on-die.
  2. A multi-chip design with HBM2 memory. It would run in the "full" mode with the 2 GiB dataset and run thousands of concurrent threads.

I did a ballpark estimate of the performance and power consumption of the first option (single-die RandomX ASIC). These are my assumptions, mainly based on the information from the Grin discussion:

  1. Largest feasible die size is about 800 mm2.
  2. The ASIC would use the TSMC 16 nm process to maximize the yield.
  3. SRAM requires about 1 mm2 of die area per MiB.
  4. Power consumption is about 0.1 W/mm2 for SRAM and 0.4 W/mm2 for cores.
  5. 0.6 mm2 per CPU-like in-order core (based on ARM A53).
  6. All RandomX instructions have a 1-cycle latency.
  7. AES round has a 1-cycle latency.
  8. Scratchpad access has a 1-cycle latency.
  9. Cache access has a 4-cycle latency.
  10. One SquareHash round (multiplication + subtraction) has a 2-cycle latency.
  11. ASIC runs at 1 GHz.

I think some of the above assumptions are rather optimistic, but I'm interested in the best possible scenario here.

First, let's estimate the time to calculate one hash.

One program iteration consists of program execution plus a dataset item calculation: 311 cycles for the program execution and 704 cycles to calculate the dataset item, i.e. 1015 cycles per iteration, which is about 1 μs at 1 GHz.

Total time per hash (8 programs × 2048 iterations) is about 16.7 ms, which means about 60 H/s per scratchpad. We can see that roughly 2/3 of the time (704 of 1015 cycles), the core is waiting for the dataset item to be calculated, so we can have 1 core per 3 scratchpads to keep all cores busy 100% of the time.

To max out our 800 mm2 die, we will have:

  1. 256 MiB of on-die cache (256 mm2 of SRAM)
  2. 249 scratchpads × 2 MiB = 498 MiB (498 mm2 of SRAM)
  3. 83 cores (~50 mm2)

Final performance of this chip is 249*60 ≈ 15 kH/s.

Power consumption estimate is 754*0.1 + 50*0.4 = 95 W.

Power efficiency is about 150-160 H/J. While this is about 3x better than a CPU, I would consider it the upper bound of ASIC efficiency, given that the above assumptions may not be feasible in practice.
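For anyone who wants to check the arithmetic, here is a minimal sketch reproducing the estimate (the 8 × 2048 iterations per hash and the area split above are inferred from the 16.7 ms, 754 mm2 and 50 mm2 figures rather than stated in the original assumptions):

```cpp
#include <cstdio>

// Ballpark model of the single-chip "light mode" ASIC, using only the
// assumptions listed above. Expected output: ~804 mm2, ~15 kH/s, ~95 W.
int main() {
    // Die budget (assumptions 1-5)
    const int    scratchpads = 249;                      // 2 MiB each
    const double sramMiB     = 256 + 2.0 * scratchpads;  // cache + scratchpads
    const double sramArea    = sramMiB * 1.0;            // ~1 mm2 per MiB
    const int    cores       = scratchpads / 3;          // 1 core per 3 scratchpads
    const double coreArea    = cores * 0.6;              // A53-like in-order core

    // Timing (assumptions 6-11, 1 GHz clock)
    const double cyclesPerIter = 311 + 704;              // execution + dataset item
    const double itersPerHash  = 8 * 2048;               // inferred program layout
    const double secPerHash    = cyclesPerIter * itersPerHash / 1e9;
    const double hashrate      = scratchpads / secPerHash;

    // Power (assumption 4)
    const double watts = 0.1 * sramArea + 0.4 * coreArea;

    std::printf("die area:   %.0f mm2\n", sramArea + coreArea);
    std::printf("hash rate:  %.1f kH/s\n", hashrate / 1000.0);
    std::printf("power:      %.0f W\n", watts);
    std::printf("efficiency: %.0f H/J\n", hashrate / watts);
    return 0;
}
```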

tevador commented 5 years ago

I invite anyone with a hardware design background to correct me if any of my assumptions are grossly off-base. @timolson @AirSquirrels

It would also be interesting to have an estimate for the second, HBM-based design.

Sonia-Chen commented 5 years ago

It seems you are using an instruction-based model for the calculation. All miner manufacturers we know use a pipeline-based model to calculate AES or hashes. The latency calculation looks largely correct, but the throughput calculation is totally off.

This is only a rough model of a single-die ASIC; an actual ASIC manufacturer would not be limited by it. The calculation is only one approach; there are many other ways. Random instructions add some complexity but very little difficulty to making an ASIC. Cost to design a RandomX ASIC is still way less than designing a good Bitcoin miner.

  1. The largest single-die size is limited by the photomask size: 26 × 33 mm = 858 mm2. But there are a lot of ways to make chips bigger than that.

  2. I am not sure why TSMC 16nm was chosen but we can calculate based on it.

  3. SRAM density can reach 9 Mb/mm2 with a memory compiler; 10+ Mb/mm2 may be reachable with a customized SRAM macro.

  4. Power can't simply be calculated by area; different designs can easily make a >100x difference. For example:

    • Working at 0.3 V versus 1 V makes a 10x difference.
    • Different circuit activity makes a >100x difference (e.g. the ALU/FPU not running every cycle).
    • Different frequency makes a >10x difference.
    • Different transistor types (HVT/SVT/LVT/ULVT) make a 10x difference.
    • Different circuit topology makes a >5x difference.
    • Different process variants make a 3x difference.
    • Different working temperature makes a 3x difference.

    Big companies have large teams to analyze power on a single chip, and it's still difficult to predict power accurately. The 0.1 W/mm2 and 0.4 W/mm2 numbers can easily be off by >10x.

  5. It is not logical to use an ARM core. An ARM core has way too many features for this application.

    • You only need an FSM-based instruction decoder + ALU + FPU.
    • Only ~1/5 of the area.
    • Many parts, such as the FPU, can be shared among multiple cores.
    • Several big cores with a lot of threads are way smaller than hundreds of small cores.
  6. Can reach 2 instructions per cycle (average) without a problem. Random instructions are not strictly sequential, so more is also possible.

  7. AES: one can easily design a 4-rounds-per-cycle accelerator (a 4x-unrolled pipeline); more rounds are also possible, but that needs more power.

  8. Scratchpad access: 0 cycles (pre-fetch in advance)

  9. Cache access: 0 cycles (pre-fetch in advance)

  10. mult-sub latency 1 cycle

  11. An ASIC has no problem running up to 3 GHz, or running multiple operations within one cycle.

    • Don't compare with CPU frequency; this is a different story.
    • We don't need complex cache coherence; the ALU has no problem running at a higher frequency.
    • See Wave Computing's 10 GHz ALU.

Hope this helps. We are a fairly open Shenzhen-based ASIC team; feel free to ask more questions in our Telegram: t.me/LinzhiCorp

SChernykh commented 5 years ago

It seems you are using an instruction-based model for the calculation. All miner manufacturers we know use a pipeline-based model to calculate AES or hashes.

How can they pipeline RandomX if it has scratchpad accesses and branches in random places in the code?

Sonia-Chen commented 5 years ago

Thank you very much for reviewing and commenting upon our comment.

Maybe there is a misunderstanding of what a pipeline is. A CPU has pipeline stages during execution; ALU or cache operations only occupy certain stages, so we don't need memory access at every stage.

Please be aware that we are unable to design a chip through a series of GitHub issue comments. Tevador asked for feedback, and we provided that for part of his question (someone else can do HBM). In all things "ASIC resistance", the division of labor is such that someone claims that something cannot be done, and then someone else just does it. We don't want to switch to the claim side; we have the most experience on the "just do it" side :)

SChernykh commented 5 years ago

Oh, so you're talking about a more CPU-like pipeline. Can we assume that RandomX achieves its target here: making an ASIC look and perform like a CPU?

tevador commented 5 years ago

@Sonia-Chen Thanks for your comments.

Cost to design a RandomX ASIC is still way less than designing a good Bitcoin miner

I find that hard to believe as RandomX is a lot more complex than SHA256. How do you estimate the design cost just from reading the specification?

But there are a lot of ways to make chips bigger than that.

If you mean multi-chip modules, those are not technically single chips.

Power can't simply be calculated by area.

OK, so it's basically impossible to make a simple estimate like this.

Can reach 2 instructions per cycle (average) without a problem. Random instructions are not strictly sequential, so more is also possible.

Do you mean an out-of-order design? Because instructions have dependencies, so you cannot reach 2 IPC without reordering.

Scratchpad access: 0 cycles (pre-fetch in advance)

Not possible if the address is not known in advance. If it's known in advance, you need to reorder the load, which is what out-of-order CPUs do to hide the latency.

Cache access: 0 cycles (pre-fetch in advance)

True, CPUs also prefetch it, so it's a mistake in the calculation. Cache load is designed to be prefetched.

mult-sub latency 1 cycle

Even a 64b x 64b -> 128b multiplication? 1 cycle at what frequency?

An ASIC has no problem running up to 3 GHz

So why do most existing ASICs run at sub-1 GHz frequencies? I'm assuming it's not because they can't, but because it's more power efficient to run slowly.

we don't need complex cache coherence

You need some load/store coherency control if you have a retire stage. You need a retire stage if you do speculative execution. You need speculative execution if you want a pipelined design because there are branches in the code. If you don't use a pipelined design, you will not run at even 1 IPC.

timolson commented 5 years ago

I've reviewed the RandomX spec along with 7400, who's a top ASIC designer here in Silicon Valley.

As Sonia points out, power requirements are impossible to guess until the chip is close to finished, but from a macro point-of-view, we agree that IF you can utilize enough of the CPU, then the two big vendors, Intel and AMD, have a lot of IP, layout work, and scale that cannot be matched by other competitors. Philosophically, I think that's a bad thing, but we'll stick to technicals in this thread.

SquareHash

There seems to be a misconception that the latency of this hash matters whatsoever. Latencies can be pipelined, and an ASIC can queue up a bunch of SquareHash requests for the next nonces. We really don't care about latencies in an ASIC, as long as we're able to pipeline the computation to gain throughput. However, this doesn't mean that a single-chip ASIC would use the 256MB cache to dynamically generate the data table, but that's because of the power, not the latency. The energy required to recompute SquareHash makes it infeasible for a H/J metric. The latency of SquareHash is only a minor inconvenience. SquareHash still achieves its goal, but accidentally, not for the reason you think.
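For readers who haven't seen it: one SquareHash round squares the 64-bit state and subtracts the high half of the 128-bit product from the low half (the "multiplication + subtraction" from the assumptions above). A minimal sketch of that round structure, with a placeholder offset constant and round count rather than the actual parameters:

```cpp
#include <cstdint>

// Illustrative SquareHash-style loop: each round is one 64x64 -> 128-bit
// squaring followed by one subtraction. An ASIC can put one round in each
// pipeline stage and keep many nonces in flight, which is the pipelining
// described above.
uint64_t squareHashSketch(uint64_t x) {
    x += 0x0123456789ABCDEFULL;         // placeholder offset constant
    for (int i = 0; i < 42; ++i) {      // placeholder round count
        unsigned __int128 p = (unsigned __int128)x * x;
        x = (uint64_t)p - (uint64_t)(p >> 64);  // lo(x^2) - hi(x^2)
    }
    return x;
}
```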

Branching

This is the crux of the matter. RandomX basically represents the compute datapath of a CPU, but in modern CPUs this is the smallest part of the chip! If the goal is to utilize the CPU fully, you must find a way to exercise all the fancy instruction reordering and speculative execution on modern CPUs. Of course the paradox is that bad branches cost energy, but if you don't branch, then you're wasting silicon, a lot of it, and advanced branch prediction IP, a lot of it. The recent settings of "no branching" and "1/128 branching" are insufficient. It's relatively easy for an ASIC to always assume no-branch and then clean up the fault 1/128 of the time. You can handle it with a stall, and you can pipeline another nonce context while you're waiting. Branch prediction is (can be) hard and discourages ASIC development, while optimizing a datapath of MULs and XORs is much easier. You need to emphasize branch prediction.

We're not sure there's really a solution to the wasted-energy bad-branch problem, and it might be a critical paradox. CPUs are designed to be as fast as possible first and to save energy second. PoW cares about energy first and speed second. No way around that.

Perhaps Sonia was reviewing the version with no branching, which is when threading becomes very useful, like a GPU. Without branching, RandomX becomes a very simple compute datapath, much simpler than any CPU, which is maybe the design she's discussing.

Init / Final AES

Of course these can be crushingly faster in ASIC than CPU, but about the same energy. Just make sure the initialization and finalization are short compared to the inner loop. I don't know your current timings.

Compilation

There will be compiler tricks. The current Generator doesn't do any optimization AFAIK, but for one thing, instruction reordering and latency resolution can be calculated up-front for static accesses. This reduces the usefulness of the CPU's complex instruction reordering at runtime, and can make simple non-reordering designs competitive.

Overall

RandomX has a much better chance of success compared to RandomJS, because it more closely approximates the target hardware. However, whether you can consistently utilize enough of the CPU to prevent ASICs from being 2x more efficient than current CPU's is an open question. On first impression, 7400 thought RandomX would work, but after we did our review and thought about some possible tricks, now he's more optimistic that a RandomX ASIC could be built. Modern CPU's are huge and complex and it's difficult to claim that enough can't be removed from them to get 2x efficiency in an ASIC. There's enough encouragement here that ASIC makers will certainly take RandomX through the design stage, and Sonia's "can-do" attitude has merit.

SChernykh commented 5 years ago

but that's because of the power, not the latency. The energy required to recompute SquareHash makes it infeasible for a H/J metric. The latency of SquareHash is only a minor inconvenience. SquareHash still achieves its goal, but accidentally, not for the reason you think.

@tevador is working on a SquareHash replacement that will use even more calculations/power; we had already realized this.

The recent settings of "no branching" and "1/128 branching" are insufficient. It's relatively easy for an ASIC to always assume no-branch and then clean up the fault 1/128 of the time. You can handle it with a stall, and you can pipeline another nonce context while you're waiting. Branch prediction is (can be) hard and discourages ASIC development, while optimizing a datapath of MULs and XORs is much easier. You need to emphasize branch prediction.

Easier said than done. Predictable branch patterns need to be generated somehow, and an ASIC can simply use this generation algorithm to predict 100% of branches. Fully random branches can't be predicted, and a CPU will just waste power trying, unlike an ASIC.

timolson commented 5 years ago

Easier said than done. Predictable branch patterns need to be generated somehow, and an ASIC can simply use this generation algorithm to predict 100% of branches. Fully random branches can't be predicted, and a CPU will just waste power trying, unlike an ASIC.

The point we'd like to emphasize is that utilizing the complexity of branch predictors is more important than the small amount of energy wasted on bad branches. Choose your poison.

timolson commented 5 years ago

Good Stuff

The critique above was more about "holes" we found, but I wanted to say: wow, you've done a lot of work to match the ISA to compute capabilities. Things like the 2:1 cache r/w ratio and tuning the ratio of instructions to the available execution ports on typical CPUs... all good. But such details also highlight the challenge you face, since these proportions change across CPUs...

SChernykh commented 5 years ago

The point we'd like to emphasize is that utilizing the complexity of branch predictors is more important than the small amount of energy wasted on bad branches. Choose your poison.

We discussed this before. The best we can do is to have branches at least in some form that is the easiest for the CPU (predicted not taken), with little energy wasted on mispredictions. An ASIC will most likely use a simple in-order pipeline without the need for branch prediction at all. A branch predictor uses a lot of chip area, but not that much power compared to the rest of the CPU.

SChernykh commented 5 years ago

Init / Final AES

Of course these can be crushingly faster in ASIC than CPU, but about the same energy. Just make sure the initialization and finalization are short compared to the inner loop. I don't know your current timings.

It's a negligible amount of time: 262144 AES rounds in total for the scratchpad init + final AES (2 MiB written and 2 MiB read, 16 bytes per round). There are 4 parallel AES chains, so it can be pipelined on a CPU (1 new AES round each cycle) and fits in 262144 CPU cycles: 0.06-0.07 ms, about 2% of the total hash time.
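To illustrate why 4 parallel chains give one round per cycle: aesenc has a multi-cycle latency but single-cycle throughput, so four independent dependency chains can be interleaved to keep the AES unit saturated. A hedged sketch with AES-NI intrinsics; it mirrors only the structure of the scratchpad fill, and the real generator's keys, column layout and enc/dec mix are not reproduced here:

```cpp
#include <immintrin.h>  // AES-NI intrinsics; compile with -maes
#include <cstddef>
#include <cstdint>

// Four independent AES states are advanced in lockstep; the four aesenc
// instructions have no data dependencies on each other, so they issue
// back-to-back and hide the ~4-cycle aesenc latency.
void fillSketch(uint8_t* buf, size_t bytes,
                __m128i s0, __m128i s1, __m128i s2, __m128i s3,
                __m128i k0, __m128i k1, __m128i k2, __m128i k3) {
    for (size_t off = 0; off + 64 <= bytes; off += 64) {
        s0 = _mm_aesenc_si128(s0, k0);
        s1 = _mm_aesenc_si128(s1, k1);
        s2 = _mm_aesenc_si128(s2, k2);
        s3 = _mm_aesenc_si128(s3, k3);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf + off +  0), s0);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf + off + 16), s1);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf + off + 32), s2);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf + off + 48), s3);
    }
}
```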

Compilation

There will be compiler tricks. The current Generator doesn't do any optimization AFAIK, but for one thing, instruction reordering and latency resolution can be calculated up-front for static accesses. This reduces the usefulness of the CPU's complex instruction reordering at runtime, and can make simple non-reordering designs competitive.

Not sure how to mitigate it. @tevador - what % of hash time does code generation take currently? Maybe we can switch to 16 programs of 1024 iterations each to increase the cost of the compilation step? Another approach: add some more optimization logic to the CPU code generator if it can increase CPU performance or save power. I'm sure some single-pass optimizations are possible here, something like moving loads from scratchpad a few instructions up to hide latency.

tevador commented 5 years ago

@timolson Thanks a lot for the review!

SquareHash

We really don't care about latencies

Latencies matter especially for the single-chip design. Since the number of scratchpads you can store on-chip is limited to about 250, the pipelining options are limited.

The energy required to recompute SquareHash makes it infeasible for a H/J metric

I'm still not entirely convinced about this, so I'm designing a replacement for SquareHash which has 3x more multiplications and 4-5x more ALU operations overall (same latency as SquareHash on a superscalar CPU). This should kill any single-chip design.

Branching

you must find a way to exercise all the fancy instruction reordering and speculative execution

Those are already utilized.

The recent settings of "no branching" and "1/128 branching" are insufficient. It's relatively easy for an ASIC to always assume no-branch and then clean up the fault 1/128 of the time. You can handle it with a stall, and you can pipeline another nonce context while you're waiting.

The importance of branches was pointed out to us by reviewers on a hardware forum, and that prompted us to introduce the "1/128 branches". The difference between "no branches" and "1/128 branches" is quite significant if you think about it. There are basically 3 ways an ASIC might handle those branches:

  1. Simply stall the pipeline when a branch is encountered and work on another hash. This increases the per-hash latency and requires more memory for scratchpads. No big issue for an ASIC with external memory, but still requires non-trivial logic similar to SMT.
  2. Unroll the branches and execute both directions, discarding the wrong path once the condition is known. This is relatively simple to implement, but it doubles the number of instructions that need to be executed per hash, so H/J will suffer a lot. This is probably the worst option.
  3. Do speculative execution: predict the branch as not taken and do a rollback when a misprediction occurs. This is exactly what a CPU does. It has the advantage of being both fast and energy efficient (only ~1% of wasted instructions). However, it significantly increases the complexity of the design because you need the following additional components:
    1. Store buffers. You cannot commit stores to the scratchpad because you have to be able to do a rollback. Having store buffers requires write-read coherency logic because you might read from a location that is currently modified in the store buffer.
    2. Register allocation. You cannot commit register values to the register file until you are sure that the prediction was correct.

Out of the 3 options above, I think the most likely one to be selected for a hypothetical ASIC design is the first option, which doesn't actually need any branch prediction anyway.
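To put a rough number on the "~1% of wasted instructions" figure from option 3, a back-of-envelope model; the taken probability and flush cost below are illustrative assumptions, not RandomX parameters:

```cpp
#include <cstdio>

// Expected speculation waste = branch density x mispredict rate x flush cost.
// With all branches predicted not-taken, the mispredict rate is simply the
// probability that a branch is taken.
int main() {
    const double branchDensity = 1.0 / 128;  // one branch per 128 instructions
    const double takenProb     = 0.05;       // assumed taken-branch probability
    const double flushCost     = 25;         // assumed instructions lost per flush
    const double waste = branchDensity * takenProb * flushCost;
    std::printf("wasted instructions: %.2f%%\n", waste * 100);  // ~1%
    return 0;
}
```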

Init / Final AES

All the fixed overhead including the AES generator/finalizer, Blake2b and compilation make up roughly 5-10% of the hash time. That's the best we can do because the hash time cannot be increased any further due to verification time concerns.

Compilation

can make simple non-reordering designs competitive.

I'm aware that some optimizations might be possible, but compilation/optimization must be performed online, so you might not have the luxury of doing expensive optimizations. Branches will also complicate your options for reordering instructions. There is no easy way to prevent optimizations, except perhaps increasing the cost of compilation as suggested by @SChernykh, but this must be done carefully so as not to hurt CPUs too much.

@SChernykh

what % of hash time does code generation take currently?

I don't have the latest numbers, but it's around 1% for the x86 compiler.

I'm sure some single-pass optimizations are possible here, something like moving loads from scratchpad a few instructions up to hide latency.

Moving loads will not increase the performance of out-of-order CPUs. The only optimizations that might be possible are eliminations of some instructions, such as two XORs with the same source value or two NEGs on the same register.
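As an illustration of the kind of single-pass elimination meant here, a sketch over a hypothetical simplified instruction form (not the actual RandomX compiler's data structures):

```cpp
#include <vector>

// Hypothetical simplified instruction encoding, for illustration only.
struct Instr {
    enum Op { XOR_R, NEG_R, OTHER } op;
    int dst, src;
};

// Single pass over the program: two consecutive XORs of the same source
// register into the same destination cancel out (as long as dst != src),
// and so do two consecutive NEGs of the same register.
std::vector<Instr> peephole(const std::vector<Instr>& in) {
    std::vector<Instr> out;
    for (const Instr& i : in) {
        if (!out.empty()) {
            const Instr& p = out.back();
            const bool xorPair = i.op == Instr::XOR_R && p.op == Instr::XOR_R &&
                                 i.dst == p.dst && i.src == p.src && i.dst != i.src;
            const bool negPair = i.op == Instr::NEG_R && p.op == Instr::NEG_R &&
                                 i.dst == p.dst;
            if (xorPair || negPair) {
                out.pop_back();  // the pair is a no-op: drop both instructions
                continue;
            }
        }
        out.push_back(i);
    }
    return out;
}
```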

MoneroChan commented 5 years ago

@timolson @tevador Question: IP theft issues (e.g. AMD/Intel branch prediction)

tevador commented 5 years ago

Can AMD/Intel's IP be enforced on "Secret ASIC manufacturers"?

Short answer: No.

At the moment, RandomX doesn't need any complicated branch predictor (all branches are predicted not taken by default).

The problem is that in order to use random code for consensus, the rules must be set out in advance. That includes the rules for branches. So there are basically 2 options:

  1. Random branches that depend on pseudorandom runtime values. These fundamentally cannot be predicted, so having a branch predictor can actually be worse than not predicting at all (see Agner Fog: The microarchitecture of Intel, AMD and VIA CPUs - pages 14-15).
  2. Predefined branching patterns that use the branch predictor. This has two problems:
    1. Intel's/AMD's branch predictors, while being pretty good, are designed for all possible use cases and branching patterns, so they would not be able to match a specialized predictor that follows the exact rules of the protocol. The result would be that an ASIC could have a 100% prediction rate, while Intel and AMD CPUs would have maybe 95-98%.
    2. The values of the registers in RandomX are completely random, so they cannot be used to make predictable conditions. The only register that is predictable at the moment is the iteration counter, which could be used to make a predictable branching sequence, but that would be too simple and doesn't require a branch predictor (ASIC could just have some wires connecting the branch unit directly to the iteration register and predict 100% of branches).

In addition to this, we think that an ASIC would actually not predict branches at all and would simply stall the pipeline and work on another hash. See my previous comment for why this is the most likely option.

TL;DR: Branch prediction cannot be forced.

tevador commented 5 years ago

The updated design documentation contains new details about RandomX, including some points that are closely related to the above discussion.

shelby3 commented 2 years ago

@timolson

The point we'd like to emphasize is that utilizing the complexity of branch predictors is more important than the small amount of energy wasted on bad branches. Choose your poison.

I want to elaborate on this for readers who may be a bit less tuned in to the Specification and Design documents and/or lack the domain knowledge. Also to check if I have the correct understanding of the situation.

As you noted, and as also noted by @ifdefelse, a general-purpose CPU is designed to optimize worst-case throughput and latency, not to fully optimize the throughput-efficiency and throughput-wafer-cost products, because the main cost for a productivity device is time/responsiveness, not energy or sunk capital cost. That includes not exceeding time limits for I/O (so as not to repeat requests) and minimizing the part of the program which can't be parallelized, the limiting factor in Amdahl's law. A CPU will encounter a wide diversity of algorithms (e.g. various manifestations of nested loops, recursion, memory access patterns) and nondeterminism driven by I/O.

To the extent that mobile productivity and/or server software can be designed to utilize more parallelized computation, CPUs will presumably become optimized for RandomX, because power efficiency is a significant factor for those devices. Perhaps even more so if low-power e-ink displays become commonplace or ubiquitous.

A branch predictor (and speculative execution) can optimize throughput and latency for said numerous facets in varied real-world programs, which do not exist in RandomX.

I wrote:

In the real world they’re inseparable because of inherent serial, contention and synchronization overhead. But afaics RandomX lacks the extent of nondeterminism in the real world of I/O.

The nondeterminism that exists in RandomX after the randomization and compiler stage is the dynamic interaction of the computation (e.g. the registers, execution ports and memory). Yet to avoid the halting problem (i.e. non-termination due to infinite loops or recursion), the nondeterminism in branching due to those factors is necessarily reduced in the design, and as noted (and linked) it probably wouldn't be predictable anyway. Thus wasting energy on speculative branch misprediction would presumably be a reduction in ASIC resistance, especially so if, per my aforementioned point, mobile and/or server CPUs eventually become more oriented toward computational parallelism and power efficiency. Envision a trending computing paradigm in which computation (verifiable with SNARKs or STARKs) is offloaded from the mobile device to servers.

Branching was retained to reduce statically compiled optimizations and presumably also to force an ASIC to have the speculative branching circuitry mentioned by @tevador on Apr 2, 2019.

I wrote the above before reading @tevador’s Apr 4, 2019 post which confirms my understanding.

@tevador

Latencies matter especially for the single-chip design. Since the number of scratchpads you can store on-chip is limited to about 250, the pipelining options are limited.

Agreed that the memory-system latency bound does apply for limiting throughput. But RandomX ostensibly doesn't[1] maximize the latency of the off-chip, commodity SDRAM that an ASIC can employ. Potentially worse, an ASIC design might employ more memory channels so as to increase the number of banks (and for DDR4 also the bank groups) which form the latency bound. However, there's a limitation: 256 MiB commodity DDR4 chips are no longer available, and who knows how long 512 MiB ones will be available. Increasing beyond the needed number of chips (currently four for the 2 GiB dataset) would incur additional power consumption of ~1.5 W per 4 GiB. At high enough volumes (e.g. if RandomX was driving the world's reserve currency), presumably commodity SDRAM foundries could be incentivized to manufacture the smaller chips? I believe @timolson has mentioned a limitation on the number of pins, but I wonder if even that limit can be stretched by multiplexing the data bus. Again, I'm not a hardware design guru, so I might be missing something?

Additionally, by employing sufficient threads at very low energy cost (moving L3 off-chip with a very-low-latency external memory design, and employing an optimal mix of low-power, slightly-higher-latency L1/L2), I'm thinking it should be possible to optimize compute throughput relative to power consumption and wafer-area cost.

Reo[r]dering is needed to hide scratchpad read latency and dependency-related stalls.

[…]

Simply stall the pipeline when a branch is encountered and work on another hash

Threading to mask latencies may be a general solution to all stalls, and may also enable the said optimization of L1/L2 tradeoffs? This might also obviate some of the said benefit of including branching, since masking latency allows a more holistic optimization of the design.

You need some load/store coherency control if you have a retire stage. You need a retire stage if you do speculative execution. You need speculative execution if you want a pipelined design because there are branches in the code. If you don't use a pipelined design, you will not run at even 1 IPC.

Perhaps this point is invalidated by the aforementioned threading, and also by potential compilation to VLIW? VLIW-encoded static reordering and ILP would require multiple variants of the instructions to handle all branch targets.

My interpretation of @Sonia-Chen's post is that there are so many design variables in the ASIC design space that they permit a holistic optimization which isn't available to a general-purpose CPU.

Cost to design a RandomX ASIC is still way less than designing a good Bitcoin miner

I find that hard to believe as RandomX is a lot more complex than SHA256. How do you estimate the design cost just from reading the specification?

Clearly he/she wasn't carefully studying the design, or didn't convey their intended meaning well. If she was referring to the ease of gaining a significant first-mover advantage, then one might interpret the "good Bitcoin miner" comparison to mean there's more fertile ground and low-hanging design fruit for a RandomX ASIC as compared to extant Bitcoin ASICs?

That's the best we can do because the hash time cannot be increased any further due to verification time concerns.

For combating DDoS spam of erroneous proof-of-work solutions (i.e. every full node must verify them all to know they are invalid), I had the thought of requiring a forfeitable bond to send proof-of-work solutions on the network. The bond would be burned when a transgression is observed. Btw, this isn't the first time I have attempted to make a useful suggestion for Monero, but surely someone else has already proposed this? One disadvantage, of course, is that one needs to have some XMR before mining. Also, I'm unable to design the amount of the bond to be free-market decentralized due to the free-rider problem. [EDIT: Nodes could individually and independently set the level of the bond, in which case forwarding nodes (that have validated the hash nonce) could supplement, at no risk, the bond provided by the source of the nonce.] Bonds require a fraud proof to collect and thus must be a feature added to the consensus protocol.

[1] And probably can't, since it varies by personal computer; thus a reasonable maximum commonly-shared SDRAM memory-system latency per core must be chosen.