timolson opened this issue 6 years ago
I’m discussing this on IRC in the Monero research channel if anyone is interested.
I’ll close the ticket because it doesn’t seem like anyone wants to discuss here.
I maintain that an ASIC which is more efficient than a CPU can be developed for this PoW.
RandomJS is not ASIC-resistant.
It depends on what you mean by "ASIC-resistant". For me, it doesn't mean that custom hardware cannot be developed or cannot be faster. That would be "ASIC-immune".
The typical CPU development cycle is ~5 years. The picoJava processor took about 3 years of development before being scrapped. Javascript is an even more complex language. A lot can change in crypto in 3 years. The length of development and the required investment can mean that no ASIC will be developed simply for economic reasons.
A full JavaScript implementation is not necessary. See the "easy nonce" problem thread.
Let's even suppose that a full processor is necessary (I don't think it is, but for the purposes of discussion...). Your generator is not really that random: it tends to create programs of characteristic size, with a characteristic number of expressions, a fixed distribution of functions to variables, etc. We may simply do statistical analysis on the generated programs.

Suppose we find that, on average, programs generated by RandomJS require 3MB of memory. We may quickly build a chip which outperforms consumer CPUs by licensing an ARM core and giving it exactly 3MB of SRAM. All the Intel chips with 8MB of L3 cache per core are now inefficient compared to our chip, and we've done very little hardware design. Importantly, it is not the absolute size of the cache that matters, only the ratio of cache to core. If you make programs require 10MB, the problem is even worse: our ARM with 10MB of L3 will blow away the Intel with an 8MB cache, because the Intel is waiting on cache stalls.

So maybe then you design for a specific cache-to-core ratio. Now you have the problem that any CPUs which are not in the exact family you designed for are slow. Even people with Intels are at a disadvantage if they do not have the chip with the 8MB cache.
This idea may be extended well beyond caches to ALUs and other circuitry. For example, if you generate, on average, one integer variable for every two float variables, an ASIC design will provision its int and float units in that same ratio, whereas the Intel has one of each. ASIC wins.
ASIC exploitable:
<item type="Add" weight="4" /> <!-- OPERATOR += -->
<item type="Sub" weight="1" /> <!-- OPERATOR -= -->
<item type="Mul" weight="1" /> <!-- OPERATOR *= -->
<item type="Div" weight="1" /> <!-- OPERATOR /= -->
Also, the idea that 5 years is required is not true. A chip doing PoW does not have to be perfect. It only has to be good enough to find some solutions some of the time. A proper CPU or JavaScript chip must be 100% perfect 100% of the time. Most of the development for these projects is not design of the chip but validation. ASIC miners may get away with lots less validation. So what if the float unit gives wrong results 1% of the time if the chip is 10% faster? You just gained about 9% in a PoW context. In a general CPU context, Photoshop just crashed.
ASIC exploitable:
<item type="Add" weight="4" /> <!-- OPERATOR += --> <item type="Sub" weight="1" /> <!-- OPERATOR -= --> <item type="Mul" weight="1" /> <!-- OPERATOR *= --> <item type="Div" weight="1" /> <!-- OPERATOR /= -->
Here the situation is not as simple, because the function of the `+` operator depends on the type of the operands. It can mean integer addition for integers, floating-point addition for FP numbers, and string concatenation (with some additional function calls) for everything else.
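A minimal C sketch of what that dispatch looks like in an interpreter, assuming a simple tagged-value representation (the types and names here are illustrative, not RandomJS internals):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative tagged value; a real engine has many more types
   (objects, undefined, null, ...). */
typedef enum { T_INT, T_FLOAT, T_STRING } Tag;

typedef struct {
    Tag tag;
    union { long i; double f; const char *s; } v;
} Value;

/* '+' cannot be a single fixed adder: what it computes depends on the
   runtime types of both operands. */
Value js_add(Value a, Value b) {
    Value r;
    if (a.tag == T_INT && b.tag == T_INT) {              /* integer addition */
        r.tag = T_INT;
        r.v.i = a.v.i + b.v.i;
    } else if (a.tag != T_STRING && b.tag != T_STRING) { /* floating point addition */
        r.tag = T_FLOAT;
        r.v.f = (a.tag == T_INT ? (double)a.v.i : a.v.f)
              + (b.tag == T_INT ? (double)b.v.i : b.v.f);
    } else {                                             /* string concatenation */
        /* a real implementation would also have to convert a numeric
           operand to its string form here, which is itself non-trivial */
        size_t la = (a.tag == T_STRING) ? strlen(a.v.s) : 0;
        size_t lb = (b.tag == T_STRING) ? strlen(b.v.s) : 0;
        char *buf = malloc(la + lb + 1);
        memcpy(buf, (a.tag == T_STRING) ? a.v.s : "", la);
        memcpy(buf + la, (b.tag == T_STRING) ? b.v.s : "", lb);
        buf[la + lb] = '\0';
        r.tag = T_STRING;
        r.v.s = buf;
    }
    return r;
}
```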
Most of what you wrote applies to any PoW algorithm.
The programs have to be limited to keep the verification time under control. Nothing can be done about that.
The custom ARM chip is probably the most realistic threat (we have already confirmed that even Raspberry Pi 3 is one of the most efficient devices at running RandomJS).
I will try to do more research about the possibility of generating random x86 code. If we couple that with a memory access pattern tuned for CPUs, it could be more ASIC resistant than the current approach.
ISA licensing issues aside, I don't believe it's a good idea to generate X86 or any other existing ISA. There's no good reason to benefit X86 and penalize other architectures (ARM, MIPS, SPARC, POWER, RISC-V, etc) at this point in history. ARM servers are about to become mainstream, and they already dominate the portable/handheld market. MIPS still dominates consumer routers. SPARC and POWER both still have significant data center presence. Supporting all of these aids decentralization.
Note: we've created the #monero-pow channel on IRC to avoid these conversations getting lost amid the rest of the MRL message traffic. Feel free to join us there.
https://github.com/tevador/RandomJS/issues/3#issuecomment-429041301
The fact that ARM cores are competitive should be a major concern. It is likely that an even smaller, simpler implementation, i.e. an ASIC, will outperform them.
I must withdraw my recommendation for x86 and agree with @hyc. It is critical for mining fairness that we have many manufacturers. Using x86 basically grants an ASIC monopoly to Intel and AMD and shuts out new competitors. What happens when Intel decides to make miners and produce a RandomX86 ASIC? Binding to x86 would create a true monopoly because of the insurmountable barrier to entry for new competition. SHA-2 is way easier to implement, and we still see significant variance in the performance of different designs.
More on ARM: hardware support for SHA-3 (https://twitter.com/thecomp1ler/status/989530582513168385) and hardware support for float-to-fixed-point conversion with Javascript semantics (https://twitter.com/gparker/status/1047246359261106176).
AFAICS these just keep ARM competitive with pure ASICs.
Keep in mind that the ARM will have one SHA-3 datapath per core. That datapath is tiny, so a dedicated SHA-3 ASIC might have thousands or tens of thousands of them. Not really competitive.
100% behind Blake2b as a fast hash in CPU's.
I would like to add that the Athena project, which evaluated the hardware performance of the SHA-3 finalists, appears to have been implemented primarily by students, and for the specific hashes we've reimplemented, we do get better performance. So hashes which are close in Athena's hardware performance measurements may be out of order, including Blake2b. However, the speed of Keccak is far better than the others, and I'd expect that not to change.
Another complicating factor to keep in mind is that PoW usage of these hashes often discards features of the original design, which is especially true for Blake2b. Blake2b has a number of features which are unnecessary for most PoW's which use it, so significant logic may be trimmed from the Athena implementations that are true to the specs.
We discussed possible changes to improve ASIC resistance of RandomJS on IRC.
Ideas that were pitched:
Make part of the program fixed for the entire block. This increases resistance against "nonce skipping" and not fully compliant ASICs.
As I was alluding to in IRC earlier today, someone needs to fully grasp what goes into the PoW block header, because there may be elements that could function as a nonce to allow for nonce skipping. Here is some info on the block headers. https://monero.stackexchange.com/questions/3958/what-is-the-format-of-a-block-in-the-monero-blockchain
@Gingeropolous We can use the hash of the previous block to initialize RNG for the fixed part. Same idea was used in CNVM.
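A rough sketch of how that seeding could be split, with placeholder hash/PRNG interfaces (the function names and block layout here are assumptions, not the actual RandomJS or Monero APIs):

```c
#include <stdint.h>
#include <string.h>

/* Placeholder interfaces; the real hash and PRNG may differ. */
typedef struct { uint64_t state[4]; } Rng;
void blake2b_256(uint8_t out[32], const void *in, size_t inlen);
void rng_seed(Rng *rng, const uint8_t seed[32]);

/* Fixed part of the program: seeded only by the previous block hash,
   so it is the same for every nonce in the block and cannot be escaped
   by nonce skipping. */
void seed_fixed_part(Rng *rng, const uint8_t prev_block_hash[32]) {
    rng_seed(rng, prev_block_hash);
}

/* Nonce-dependent part: seeded by H(header || nonce), so changing the
   nonce regenerates only this part of the program. */
void seed_variable_part(Rng *rng, const uint8_t *header, size_t header_len,
                        uint64_t nonce) {
    uint8_t buf[256];                 /* assumes header_len <= 248 */
    uint8_t seed[32];
    memcpy(buf, header, header_len);
    memcpy(buf + header_len, &nonce, sizeof nonce);
    blake2b_256(seed, buf, header_len + sizeof nonce);
    rng_seed(rng, seed);
}
```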
ISA licensing issues aside, I don't believe it's a good idea to generate X86 or any other existing ISA.
How about generating random RISC-V programs?
Advantages:
Parser skipping would be fundamental for any good miner, software or hardware. Directly generating RISC-V has the benefit that a RISC-V CPU already embeds the "parser." Furthermore, generating RISC-V ties the code much more directly to the hardware. Using JavaScript introduces many high-level abstractions which may be optimized. JavaScript must always be "translated" into object code before a CPU can run it, allowing for a wide range of complex optimizations. Generating object code directly removes this tower of "babble."
Even in the RISC-V generating case, an ASIC designer will embed the code generator itself, whereas a generic RISC-V CPU must still run the code generator as "non-native" object code rather than custom circuitry.
Since the RISC-V ISA has much less complicated "syntax" than Javascript, the generator could be simpler and would have limited "knowledge" of what the generated code actually does. Therefore the decoding stage could not be easily skipped.
For example, each instruction is 32 bits wide and usually at least 20 bits can be completely random (src/dst registers, immediate constants, part of the opcode).
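For illustration, a sketch of how such an instruction could be generated, assuming the RV32I OP-IMM encoding and a hypothetical `rng_next()` PRNG (shift instructions are left out because they constrain the upper immediate bits):

```c
#include <stdint.h>

/* Hypothetical PRNG supplied by the program generator. */
uint64_t rng_next(void);

/* funct3 values for the OP-IMM instructions that accept any 12-bit
   immediate (SLLI/SRLI/SRAI are excluded: they constrain imm[11:5]). */
static const uint32_t alu_funct3[6] = {
    0x0 /* ADDI */, 0x2 /* SLTI */, 0x3 /* SLTIU */,
    0x4 /* XORI */, 0x6 /* ORI  */, 0x7 /* ANDI  */
};

/* Emit one RV32I OP-IMM instruction.  rd, rs1 and the immediate come
   straight from the RNG, so 22 of the 32 bits are completely random;
   only the opcode (0x13) and the funct3 choice are constrained. */
uint32_t random_op_imm(void) {
    uint64_t r = rng_next();
    uint32_t rd     = (r >>  0) & 0x1f;          /* bits 11:7  */
    uint32_t rs1    = (r >>  5) & 0x1f;          /* bits 19:15 */
    uint32_t imm    = (r >> 10) & 0xfff;         /* bits 31:20 */
    uint32_t funct3 = alu_funct3[(r >> 22) % 6]; /* bits 14:12 */
    return (imm << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | 0x13;
}
```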
Generating object code directly still leaves room for optimization. I can find analogous asm instructions between arbitrary CPU families/ISAs easily. I.e., the generator could generate native object code for the local machine just as easily as generating RISC-V object code. Even if we kept the "hash the generated source code" requirement it can still just be discarded and the native object code executed.
The problem with generating RISC-V without knowing what the code does is we can no longer guarantee that there are no infinite loops or recursions.
I suppose we could make the object code parser-skipping harder by interleaving all of the 1st pass generated code with a 2nd pass that does computations on bytes referenced off the program counter.
I can find analogous asm instructions between arbitrary CPU families/ISAs easily. I.e., the generator could generate native object code for the local machine just as easily as generating RISC-V object code.
Reasons why it may not be so easy:
I suppose we could make the object code parser-skipping harder by interleaving all of the 1st pass generated code with a 2nd pass that does computations on bytes referenced off the program counter.
Can you elaborate how it would work?
Again the idea is to make the code part of the execution data. We load some bytes indexed off the PC into registers and blend them into the computations.
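A minimal sketch of that blending step, assuming a word-addressed code buffer and a 64-bit register state (the names and the mixing function are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* "Code is also data": between generated instructions, the program reads
   a word at an offset from its own program counter and folds it into a
   register, so translating the code to another ISA (or skipping the
   parser) changes the bytes being read and breaks the result. */
uint64_t blend_code_bytes(const uint32_t *code, size_t code_words,
                          size_t pc, uint64_t reg) {
    /* offset derived from current register state, wrapped to the code size */
    size_t idx = (pc + (reg & 0xff)) % code_words;
    reg ^= code[idx];                 /* fold the instruction word into the state */
    reg = (reg << 13) | (reg >> 51);  /* cheap diffusion */
    return reg;
}
```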
But again, targeting a specific ISA like RISC-V gives actual RISC-V chips an unfair advantage. Why would we want to do this? As I mentioned on IRC, if we're going to target a machine language, better to target a machine no one is building any more or ever again - like DEC Alpha or PDP-11.
But again, targeting a specific ISA like RISC-V gives actual RISC-V chips an unfair advantage.
For any language/ISA we choose, an ASIC/CPU specialized for it will always have an advantage. As @timolson pointed out, even for a complex language like Javascript, an ASIC can provide a significant speed-up, but the cost of development might cause centralization of hashpower.
RISC-V is simple and royalty-free and there are open source hardware designs available. It might be a way to solve the chicken-egg problem of ASIC commoditization.
better to target a machine no one is building any more or ever again - like DEC Alpha or PDP-11
This would give a monopoly to whoever holds the rights to those instruction sets. We could also create our own instruction set. Something similar to this, which has the advantage that any random bitstring is a valid program, so the generator is trivial.
If the goal is commoditization, RISC-V has a market of existing proven designs, but the generator problem remains. The market will start with RISC-V boards until ASICs embed the generator code at a minimum. Using an older ISA doesn't seem too different from creating our own ISA, which isn't much different from our own ASIC PoW, in which case let's go simple and proven. :+1:
OK, starting to appreciate the possibilities in cnvm. Was just now thinking, the way to avoid parser-skipping is to randomly generate the source code one character at a time, so that the expression tree structure can't be known in advance. This would be easier in something like an assembly language so that the majority of character combinations are valid instructions.
Yes, I'm thinking to make a custom instruction set similar to RISC-V except with looser validation, so any instruction word is valid.
I've been also thinking how to optimize the memory accesses. For example, Cryptonight makes random accesses anywhere in the 2 MiB scratchpad. This takes advantage of the large L3 cache of modern CPUs, but L1 and L2 caches are not used to their full potential.
Most common cache sizes per core are 2 MiB for L3, 256 kiB for L2 and 32 kiB for L1D. I propose to split the 2 MiB workspace into 8x 256 kiB "L2-blocks" and each L2-block would be again split into 8x 32 kiB "L1-blocks". For each memory read/write, the same L1-block would be used with a probability of 7/8. In one case out of 8, we would switch to a random L1-block within the current L2-block with a probability of 7/8. In the remaining case, a completely random address in the workspace would be selected.
This means that the CPU would have to go to the L3 cache in only about ~1.3% of cases, L2 cache in ~10% cases and most accesses could hit the L1 cache. The probabilities could be adjusted to take full advantage of prefetch and to make sure the 2 MiB memory requirement cannot be avoided.
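A sketch of how that address selection could be implemented, assuming a hypothetical `rng_next()` PRNG; the block sizes and probabilities follow the proposal above:

```c
#include <stdint.h>

#define WORKSPACE_SIZE  (2u << 20)    /* 2 MiB scratchpad        */
#define L2_BLOCK_SIZE   (256u << 10)  /* 8 x 256 KiB "L2-blocks" */
#define L1_BLOCK_SIZE   (32u << 10)   /* 8 x 32 KiB  "L1-blocks" */

/* Hypothetical PRNG. */
uint64_t rng_next(void);

/* Returns the byte address of the next read/write, given the previous one.
   With probability 7/8 we stay inside the current 32 KiB L1-block,
   in 7/64 of cases we jump to another L1-block of the same L2-block,
   and in 1/64 of cases we pick a completely random workspace address. */
uint32_t next_address(uint32_t prev_addr) {
    uint64_t r = rng_next();
    uint32_t l2_base = prev_addr & ~(L2_BLOCK_SIZE - 1);
    uint32_t l1_base = prev_addr & ~(L1_BLOCK_SIZE - 1);

    if ((r & 7) != 0)                 /* 7/8: same L1-block */
        return l1_base + (uint32_t)((r >> 3) & (L1_BLOCK_SIZE - 1));
    if (((r >> 3) & 7) != 0)          /* 7/64: same L2-block, random L1-block */
        return l2_base + (uint32_t)((r >> 6) & (L2_BLOCK_SIZE - 1));
    return (uint32_t)((r >> 6) & (WORKSPACE_SIZE - 1));  /* 1/64: anywhere */
}
```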
Maybe @timolson can suggest a better access pattern so that an ASIC would be forced to use multiple memory levels for optimal performance.
I kinda doubt an ASIC would need caches or a complex memory hierarchy. A 2 MiB workspace is trivial, could be all SRAM. For a large number of cores and large enough memory they could just use HBM.
A custom instruction set as you describe might work, but even if all words are valid instructions, we need a way to define and limit looping, subroutine calls, etc. Unless you still want to just have straight line execution, with externally configured loop parameters. I wouldn't favor this route because it removes the need for branch prediction hardware.
I kinda doubt an ASIC would need caches or a complex memory hierarchy. A 2 MiB workspace is trivial, could be all SRAM.
AFAIK, larger SRAM must have a higher latency due to propagation delay. A 2 MiB workspace can never be as fast as a 32 kiB cache close to the core. That's the reason why CPUs have multiple cache levels and not just one big chunk.
For example, Intel Core i7-7700K:
| Cache | Latency (cycles) |
|---|---|
| L1D | 4 |
| L2 | 12 |
| L3 | 38 |
This CPU could hash ~7 times faster with the optimized memory access pattern, while an ASIC with a single SRAM block would have the same hashrate as with the random access pattern.
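For reference, a rough back-of-the-envelope for the ~7x figure, using the latencies and hit rates above: with about 88.7% of accesses hitting L1, ~10% hitting L2 and ~1.3% going to L3, the expected latency is roughly 0.887*4 + 0.10*12 + 0.013*38 ≈ 5.2 cycles per access, versus 38 cycles when every access goes to L3, i.e. about a 7x difference (assuming memory latency dominates the hash time).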
To me it looks like something that could increase ASIC resistance.
A custom instruction set as you describe might work, but even if all words are valid instructions, we need a way to define and limit looping, subroutine calls, etc.
In cnvm, this is solved by limiting the total number of executed instructions. That also has the benefit of roughly equal runtime for all programs, but maybe it can allow for some optimizations in ASIC, so I'm not sure if it's the best approach.
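A toy example of that bound, with a made-up 2-bit instruction format (nothing here is the actual cnvm encoding):

```c
#include <stdint.h>
#include <stddef.h>

#define INSN_BUDGET 65536   /* total instructions executed per program */
#define NUM_REGS    8

/* Every program runs for exactly INSN_BUDGET steps, so infinite loops are
   impossible and all programs take roughly the same time to verify. */
void execute(uint64_t regs[NUM_REGS], const uint32_t *code, size_t code_words) {
    size_t pc = 0;
    for (uint32_t executed = 0; executed < INSN_BUDGET; executed++) {
        uint32_t insn = code[pc % code_words];  /* any 32-bit word is valid */
        uint32_t op  = insn & 3;                /* 2-bit opcode             */
        uint32_t dst = (insn >> 2) & 7;
        uint32_t src = (insn >> 5) & 7;
        uint32_t imm = insn >> 8;
        switch (op) {
        case 0: regs[dst] += regs[src] + imm;         break;
        case 1: regs[dst] ^= regs[src] << (imm & 63); break;
        case 2: regs[dst] *= regs[src] | 1;           break;
        case 3: pc = imm; continue;   /* branch still costs one budget slot */
        }
        pc++;
    }
}
```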
SRAM on chip will be organized into blocks likely not bigger than 256k each. For bigger rams, switching wires route across shards. Yes, propagation delay can be a problem if the logic needs uniform access to a very large SRAM, but 2MiB (hardware normally quotes bits not bytes so 16Mib) is not that large and in 28nm our CN core could randomly access a 2MiB scratchpad in about 1.25 ns or 800MHz. Part of the reason for staged caches is to get the clock speed up, which may or may not be helpful. It looks good on the marketing box, though. If your fastest opcode (atomic unit of work in whatever algorithm) can’t run faster than 1ns for example, then a clock over 1GHz doesn’t help and would actually be detrimental.
I’m not a CPU expert but my understanding is that lots of effort goes into prefetch logic (and speculative execution, but Spectre...). Good data from hyc on fetch times... older Intels can take 50-70 cycles to hit L3! If you use a random access pattern it’s gonna hurt. If you use a sequential or strided access, then I’d guess the prefetch works great. The problem would be that if an ASIC designer knows the access patterns, they can design for it. But generalized prefetch for a variety of access patterns is a hard problem.
Prefetch and speculative execution go hand-in-hand and with Spectre exploits being basically unfixable except by disabling spec-ex, you should be careful with design and testing. You want something that does not require spec-ex to be fast. On the other hand, that’s a big part of (Intel) CPUs and you lose performance and waste significant die area on unused spec-ex logic.
One more comment about access times in relation to clock speed and power. The primary metric for PoW is power per hash, and higher clock speeds bleed more power and generate more waste heat. CPU’s and especially GPU’s are designed to run fast & hot, power be damned. This approach has changed recently, but generally ASICs can gain a power-per-hash advantage by running slower than a GPU or CPU. To tie something to CPU, that extra power has to be really worth it. Try to ensure that computation blocks have enough compute in them to hide the ram access latency, then the extra GHz will not be wasted. However, this is a catch-22 because ASICs can optimize tight logic sequences very well... Maybe you want to generate random code snippets that are like 30 opcodes long, and chain those together with ram access in-between. Or make sure the ram access can be prefetched so the core stays running 100% hot. But that kind of thing is tying to Intel, really, not just CPU’s. I don’t think ARMs have the same prefetch/specex sophistication.
So essentially CPUs are best for sequential workloads that require a high clock speed to finish the task as quickly as possible.
It doesn't sound so good for ASIC resistance because mining is inherently a parallel workload (you can always run multiple nonces at the same time).
I suspect (but do not know) that part of the excessive latency for CPUs to hit L3 is due to conflict administration, since L3 is shared across cores. I read just a little about Sandy Bridge and apparently all L2 caches are also mirrored into L3. If you're trying to tie to Intel, maybe use a 4-way or 8-way parallel compute structure with maybe a 32k scratchpad that must be occasionally swapped between cores... As you can see, it gets to be very specific to a particular CPU arch. AMD's caches are different and slower...
Intel caches are inclusive: full contents of L1 are also present in L2, full L2 present in L3. AMD caches are (generally) exclusive, contents only exist in one cache layer at any given time. (Apparently Zen architecture has some exceptions to this now.)
CPUs are best for non-sequential workloads with heavy branching. GPUs would be better aimed at sequential workloads, all of their memory architecture is designed for sequential access.
CPUs are for serial workloads (single-threaded), while GPUs and ASICs are better for parallel workloads, where you can have many small cores doing similar work at the same time.
Now if we want to design an instruction set for our virtual CPU, we should agree on a memory model. There are basically 3 options:
The third option is arguably the most ASIC resistant, especially if we tweak the access pattern so that caches are useful.
I have been doing some research and testing on a GPU implementation.
One thing is that passing parameters to a function is done by value, except for arrays. But you only know the type of a variable at runtime, and it might change during a loop. The implementation might choose the same size for all objects, but that means sizing them for the largest string/array. This reduces the tests and branching needed to copy the parameter but puts huge pressure on memory. You still have to check the variable type and pass it by reference or by value. The way RandomJS is built, you are constantly calling very small functions, so function calls are an important part of execution time. I made a quick estimate of the ratio of instruction/bytecode types on a portion of the JS generated by RandomJS: load 20%, store 20%, cmpeq 10%, jmp 10%, call 4%, ret 4%. Inlining could reduce call/ret, but it's not simple for nested calls.
The other thing about eval() is that you can make a fast parser that discards trivial error cases, then call xst to get better confidence, and if it's a valid expression, embed AST/RPN evaluation in the asm/bytecode. But you still have to handle whether or not the variables are defined at runtime.
And the number-to-string stuff... A fully accurate algorithm is about 2000 lines of C code with branching everywhere. It can be approximated using IEEE double with 10x less code but still with heavy branching. An ASIC with extended IEEE doubles might do better, but this will probably require a lot of transistors.
This PoW is not as ASIC/GPU friendly as it may seem at first look.
Discussion about RandomX here: https://github.com/tevador/RandomX/issues/1
I couldn't find a better issue to discuss this. RandomJS is not ASIC-resistant. An ASIC designer may simply design a chip to do compilation. In effect, your PoW is a compiler, which is definitely implementable in hardware.