tevador / RandomJS

Cryptocurrency proof-of-work algorithm based on random javascript execution
GNU General Public License v3.0

The easy nonce subset problem & solution #9

Closed Gingeropolous closed 5 years ago

Gingeropolous commented 5 years ago

This was originally discussed during one of the early brainstorming sessions about this idea. It was also recently brought to light during a discussion on #monero-research-lab IRC.

Basically, as I understand it, a potential ASIC / customized hardware could exploit the fact that some subset of the randomly generated programs could be easier to calculate than others. So, the mining program would just cycle through nonces looking for those nonces that create these easier programs, and then send these programs to the specialized hardware to perform the calculation.

According to the pseudocode, PoW = Blake2b(K, R). Therefore the only criterion for a valid proof of work is that the RandomJS program ran and produced some output, and that the hash of the key and that output (Blake2b(K, R)) meets the difficulty target.

What I'm wondering is whether, after step 3, we can include a check to see whether the randomly generated Javascript program contains a roughly equal distribution of all of the programming functions. For example, if a nonce produced a Javascript program that was nothing but a series of addition functions, then even if the output of the whole thing was under the difficulty target, it would still be an invalid PoW because it didn't meet the criterion of having an assortment (equal distribution, or something similar) of all of the available functions from the function space.

Basically, we just introduce a criterion for a valid PoW that requires all the different types of functions to appear, so that an ASIC wouldn't be able to exploit some easy subset.
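
As a rough sketch of what such a post-generation check could look like (the operation classes and the threshold below are made up for illustration, not part of the current spec):

```python
from collections import Counter

# Hypothetical operation classes that the generator can emit.
OP_CLASSES = ["add", "mul", "string", "object", "loop", "call", "eval"]

def passes_distribution_check(program_ops, min_share=0.05):
    """Reject programs that don't use every operation class, or that are
    dominated by a narrow subset (e.g. nothing but additions)."""
    total = len(program_ops)
    if total == 0:
        return False
    counts = Counter(program_ops)
    # Every class must appear and contribute at least min_share of all ops.
    return all(counts[cls] / total >= min_share for cls in OP_CLASSES)

# A program consisting only of additions would be rejected as an "easy" nonce:
assert not passes_distribution_check(["add"] * 100)
```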

ouillepouille commented 5 years ago

I don't grasp a lot of what's going on deep down, but looking at this from a higher-level perspective, maybe there's a way to add a step in the CPU miner's algorithm that uses a hash of the compiled bytecode ? If the JS engine outputs universal bytecode, add a step that takes into account a hash of that bytecode. Then we have "proof of compilation"... Is such a thing possible ? I'll try peeking at the code's connection to the JS engine and see...

PS : I don't know if Tevador is still interested, but it's nice to see that others are ! I'm really looking forward to the Wownero integration/testing of this !

Gingeropolous commented 5 years ago

So I realize I am proposing to effectively flip the attack and incorporate it into the PoW itself. Instead of an ASIC designer making a miner that searches for easier calculations, we just have the PoW itself only use difficult calculations. And because we don't know which will be difficult, we just need to create the PoW so that it uses an element of each different class of function in a given PoW.

One way to do this would be that instead of the randomness creating a series of random functions, it creates the order of a given fixed set of functions.

E.g., if there are functions (or function classes) 1 2 3 4 5 6 7 8 9 , then the randomness just changes the order, not necessarily the composition... i.e., a given block will produce the following series

9 4 8 6 2 5 3 1 4 7

but it will never produce

1 1 1 1 1 1 1 1 1 1

This ensures that throughout the entire nonce space, all PoW programs are effectively equal in their complexity.
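
A minimal sketch of that idea (the function-class set and seeding scheme here are hypothetical): derive a permutation of a fixed set of function classes from the block data, so only the order varies between nonces, never the composition.

```python
import hashlib
import random

FUNCTION_CLASSES = list(range(1, 10))  # the fixed set 1..9 from the example above

def ordered_classes(block_header: bytes, nonce: int):
    """Seed a PRNG from the header and nonce, then return a permutation of
    the fixed function classes; composition never changes, only the order."""
    seed = hashlib.blake2b(block_header + nonce.to_bytes(8, "little")).digest()
    rng = random.Random(seed)
    order = FUNCTION_CLASSES.copy()
    rng.shuffle(order)  # Fisher-Yates shuffle driven by the seed
    return order

print(ordered_classes(b"example header", 42))  # e.g. [9, 4, 8, 6, 2, 5, 3, 1, 7]
```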

timolson commented 5 years ago

This simply trades one problem for another. With your modified approach, an ASIC designer can use the fact that exactly one of each op will be used, and devote the correct amount of circuitry to each stage. In fact, this enhances parallelism.

All you've done is make the routing less complicated and reduced the amount of serial processing necessary.

Gingeropolous commented 5 years ago

A better solution was proposed by needmoney in the #monero-research-lab IRC: just chain the RandomJS programs.

tevador commented 5 years ago

I don't grasp a lot of what's going on deep down, but looking at this from a higher-level perspective, maybe there's a way to add a step in the CPU miner's algorithm that uses a hash of the compiled bytecode ? If the JS engine outputs universal bytecode, add a step that takes into account a hash of that bytecode. Then we have "proof of compilation"... Is such a thing possible ? I'll try peeking at the code's connection to the JS engine and see...

The XS engine is an interpreter, so there is no "bytecode" generated from the source.

It's true that an adversary could rate each expression with some "difficulty" rating and then simply sum the score of the program to see if it's "easy". The problem is that the code has dead branches and loops, so static analysis is not enough to actually estimate the runtime.

In any case, has anyone tried to estimate the possible performance benefits of nonce-skipping?

At the moment, the distribution of RandomJS program runtimes looks similar to this: [Imgur link: histogram of program runtimes]. This is on a CPU, but I'm assuming the shape would be similar for an ASIC running bytecode.

The average runtime of the above distribution is 0.00972 seconds. This is what one hash would take for a miner who doesn't skip nonces.

The first 3 steps of the mining algorithm take ~6% of the total solve time (most of this is the program generation routine). Now let's assume an adversary can somehow decrease this overhead to just 1% per nonce.

The easiest nonce-skipping strategy would be to generate a group of N programs and run the easiest one from the group (with the lowest runtime).

Using the strategy of choosing the fastest program out of N, we will get the following results:

| N | Time per hash [s] | Relative |
|---:|---:|---:|
| 1 | 0.00972 | 100% |
| 2 | 0.00756 | 77.8% |
| 3 | 0.00682 | 70.2% |
| 4 | 0.00644 | 66.2% |
| 5 | 0.00617 | 63.5% |
| 6 | 0.00607 | 62.5% |
| 7 | 0.00599 | 61.6% |
| 8 | 0.00592 | 60.9% |
| 9 | 0.00586 | 60.4% |
| 10 | 0.00588 | 60.6% |
| 11 | 0.00586 | 60.4% |
| 12 | 0.00588 | 60.6% |
| 13 | 0.00592 | 61.0% |
| 14 | 0.00593 | 61.0% |
| 15 | 0.00597 | 61.5% |
| 16 | 0.00605 | 62.2% |
| 17 | 0.00609 | 62.7% |
| 18 | 0.00614 | 63.2% |
| 19 | 0.00618 | 63.7% |
| 20 | 0.00624 | 64.3% |

So the maximum speed-up is for N = 9-12. Beyond that, the overhead will start to pile up and performance will drop.

So under these conditions, it would be possible to gain a ~66% performance advantage, assuming you can correctly find the fastest program out of ~10 generated source codes and decrease the overall overhead to 1% per nonce.

With a 6% overhead, the maximum speed-up is ~25% for N=3.

| N | Time per hash [s] | Relative |
|---:|---:|---:|
| 1 | 0.00972 | 100% |
| 2 | 0.00804 | 82.7% |
| 3 | 0.00779 | 80.2% |
| 4 | 0.00789 | 81.2% |
| 5 | 0.00810 | 83.4% |
| 6 | 0.00844 | 86.9% |
| 7 | 0.00885 | 91.1% |
| 8 | 0.00928 | 95.5% |
| 9 | 0.00973 | 100.2% |

With a slightly narrower runtime distribution or an imperfect runtime estimation, the benefits would be lower, possibly even negative.
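
The qualitative behavior in the tables above can be checked with a small Monte Carlo model like the one below. The gamma-shaped runtime distribution and its parameters are assumptions, so the exact numbers won't match, but the best-of-N trade-off (diminishing returns, then the overhead taking over) behaves the same way.

```python
import random

def best_of_n_time(n, overhead_frac, mean_runtime=0.00972, trials=100_000):
    """Expected time per hash when generating n candidate programs per hash,
    paying overhead_frac of the mean runtime for each generation, and then
    executing only the fastest candidate."""
    overhead = overhead_frac * mean_runtime
    shape = 2.0                      # assumed shape of the runtime distribution
    scale = mean_runtime / shape     # chosen so the mean matches 0.00972 s
    total = 0.0
    for _ in range(trials):
        candidates = [random.gammavariate(shape, scale) for _ in range(n)]
        total += n * overhead + min(candidates)
    return total / trials

for n in (1, 3, 9, 20):
    print(f"N={n}: {best_of_n_time(n, overhead_frac=0.01):.5f} s/hash")
```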

timolson commented 5 years ago

What does “first three steps” mean? I looked at the flow diagram, sample program, and top-level code for main and genProgram, but couldn’t find numbered steps. I would like to share some insight about the (non) ASIC resistance of this PoW, specifically relating to the easy nonce problem, but want to be precise before commenting.

tevador commented 5 years ago

@timolson

Mining algorithm

  1. Get a block header H.
  2. Calculate K = Blake2b(H).
  3. Generate a random Javascript program P using K as the seed.
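
Filling in the later steps (execute P to get its output R, then compare Blake2b(R, K) against the difficulty target, as discussed further down), the mining loop looks roughly like the sketch below; generate_program and run_program are stand-ins for the real RandomJS generator and JS engine.

```python
import hashlib

def mine(header: bytes, target: int, generate_program, run_program, max_nonce=1_000_000):
    """Rough sketch of the mining loop; not the reference implementation."""
    for nonce in range(max_nonce):
        H = header + nonce.to_bytes(8, "little")         # 1. block header with nonce
        K = hashlib.blake2b(H).digest()                  # 2. K = Blake2b(H)
        P = generate_program(K)                          # 3. random program seeded by K
        R = run_program(P)                               # execute P to get its output R
        pow_hash = hashlib.blake2b(R + K).digest()       # hash of program output and key
        if int.from_bytes(pow_hash, "little") < target:  # difficulty check
            return nonce, R
    return None
```
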
timolson commented 5 years ago

Thanks, sorry I missed that doc somewhere.

Steps 1 and 2 we can agree are extremely fast. Step 3 is not necessary for discarding nonces, as you pointed out:

It's true that an adversary could rate each expression with some "difficulty" rating and then simply sum the score of the program to see if it's "easy". The problem is that the code has dead branches and loops, so static analysis is not enough to actually estimate the runtime.

I strongly disagree with the last statement. This is a huge issue IMHO. Why couldn't a very simple analysis be very effective? For example, I might guess that loops and recursion dominate the execution time. Without generating any code, we run the "generator" whose only effect is to keep a score, not to generate code yet. Let's compute score = sum( loop_size(n) * recursion_depth ) for all loops n. Wouldn't you expect that very simple heuristic to get you pretty far? We can run billions of Blake2b's per second, and this score counter adds almost nothing to that time. The branching function in the "trimmer" is also fast, and it's not unreasonable to estimate that millions or even hundreds of millions of nonces per second could be discarded. I would contend that your estimate of 1% of the runtime is way, way too high.
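
As a sketch of what such a trimmer might look like (every probability and range below is made up for illustration; a real one would mirror the actual generator's decisions):

```python
import random

def trimmer_score(rng: random.Random, max_depth: int = 3) -> int:
    """Replay the decisions a generator would make, but only accumulate an
    estimated-cost score instead of emitting any code."""
    score = 0

    def visit(depth: int) -> None:
        nonlocal score
        if depth > max_depth:
            return
        for _ in range(rng.randint(5, 20)):      # stand-in for per-block statement count
            if rng.random() < 0.1:               # the generator would emit a loop here
                loop_size = rng.randint(1000, 2000)
                score += loop_size * depth       # score = sum(loop_size * recursion_depth)
            if rng.random() < 0.2:               # the generator would emit a nested call
                visit(depth + 1)

    visit(1)
    return score

# Nonces whose programs score above some threshold could then be skipped
# before any code is generated or executed, e.g.:
#   if trimmer_score(random.Random(seed_for_nonce)) > THRESHOLD: continue
```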

This is on a CPU, but I'm assuming the shape would be similar for an ASIC running bytecode.

Bad assumption. ASICs would not use any bytecode. They would not generate JavaScript text nor do any parsing. They would simply look at the hash and start to execute it. Think of the hash as object code. No need to expand it and then re-parse it. Your generator is effectively already an AST.

With a slightly more narrow runtime distribution or a non-perfect runtime estimation, the benefits would be lower, possibly even negative.

If you narrow the distribution, you are further enabling specialized hardware. Already for example, an ASIC designer can use the fact that your recursion depth and loop counters are taken from a uniform distribution and hard-capped. Intel CPU's cannot make this assumption and must support the general case, so CPU's have a lot more circuitry. As a simple example, the ASIC is only going to use 13 bits for loop counters. The ASIC's call stack is similarly small and limited. Also, with a narrow distribution of program sizes, the ASIC can optimize its cache sizes. If you are not using the full L3 cache of an Intel CPU during every hash, then CPUs have lots of wasted gates. An ASIC miner will eliminate all this extra stuff that Intel puts in there to support general programs. RandomJS, no matter what you do, will always be a special case of programs, which an ASIC designer can optimize for.

tevador commented 5 years ago

Without generating any code, we run the "generator" whose only effect is to keep a score not generate code yet. Let's compute score = sum( loop_size(n) * recursion_depth ) for all loops n.

The generator has to run the whole generation algorithm (bar some data-to-text conversions) to keep the RNG state in sync. Whether it actually writes code into memory doesn't have a dramatic impact on performance.

loop_size and recursion_depth are determined at runtime. I'm not sure how you would estimate them just by looking at the code (it may be possible in some cases, though). At the moment, a significant portion of the runtime is spent in eval.

I'm curious how big a chip would need to be to run the whole generator in 1 cycle at 100 MHz. It's mostly serial code.

They would simply look at the hash and start to execute it.

That might just be a fancy way of saying that some "decoding" stage would feed data to some "execution" stage, like a CPU. Or are you expecting the chip to have all the 2^256 codepaths hardwired?

I'm aware that a chip can be made to execute this PoW faster, but I'm not convinced it would be that easy to develop and run so quickly as you are claiming.

If you narrow the distribution, you are further enabling specialized hardware.

What are your suggestions to improve this PoW?

timolson commented 5 years ago

The generator has to run the whole generation algorithm (bar some data to text conversions) to keep the RNG state in sync.

Perhaps I misunderstand and will have to look more closely. Do you mean that it has to generate text, hash text, parse text, run programs? Why couldn't we write a "trimmer" which has similar logic to a generator, except it only keeps a score of estimated execution complexity? Sounds much smaller and faster than "code generation".

At the moment, significant portion of the runtime is spent in eval.

I do not understand why eval would be any slower or different than regular code generation. Again maybe I misunderstand some details, but if you have to generate valid JavaScript, you have an implicit AST structure. Another way of saying it: how is executing eval(string) any different from simply executing the string? If you can generate a valid string to pass to eval, then you can execute that string directly in the ASIC without any additional parsing.

loop_size and recursion_depth are determined at runtime.

Not really.

  <MaxCallDepthRange>3-3</MaxCallDepthRange>
  <MaxLoopCyclesRange>1000-2000</MaxLoopCyclesRange>

ceil(log2(2000)) = 11 bits for loop counters, for example. Call stacks in an ASIC may be very small: we only need to handle a recursion depth of 3.

That might just be a fancy way of saying that some "decoding" stage would feed data to some "execution" stage, like a CPU. Or are you expecting the chip to have all the 2^256 codepaths hardwired?

Yes that's all I mean

I'm aware that a chip can be made to execute this PoW faster, but I'm not convinced it would be that easy to develop and run so quickly as you are claiming.

Race you? :)

What are your suggestions to improve this PoW?

Generate x86 instead of JavaScript. No one can compete with Intel and AMD at building an optimized x86 processor. JavaScript is anyone's game. It's not simple and admits of many, many optimizations. (Although I don't think a full JavaScript processor is necessary.) On the other hand, by binding to x86 you would just be granting Intel and AMD an unfair advantage for producing RandomJS ASICs...

You could drop the idea of randomness to ensure consistency in device-binding. Create a (long-winded) PoW which exercises every feature of your target CPU in a serial, non-optimizable way. Still, this wouldn't necessarily work since an ASIC designer could do away with all the speculative execution paths and cache loading circuitry, which is a huge part of modern CPU's.

You can tune the distributions of your metaparameters to target the specific characteristics of Intel CPU's, like cache-to-core ratios, int-to-float ratios, etc., as mentioned above. Doing this well would benefit from insider knowledge about the chip design which is simply not available.

IMHO the best ASIC-resistant approach is to tie the PoW to DRAM which is by far the most commoditized chip, even more than CPU's. But still, we see from Ethash and Equihash that custom hardware will still outperform. If you want to ensure that the specialized hardware is commoditized and widely available, use a simple PoW with a proven optimal implementation. Complexity opens the door to proprietary implementations, and increases the barrier to entry for any competition. RandomJS is highly complex and no lower bounds on computation may be proved. Even something "simple" like SHA is too complex, as we saw with ASIC-Boost and continue to see with surprising variability in the performance of different designs. The only way we got 5x over BitMain with our CryptoNight design was because CN is so complex, and big optimizations were missed by the biggest miner producer in the world.

As I stated before, my opinion is that specialized hardware will always outperform general-purpose CPU's and creating an ASIC-immune (to use your term) PoW is not possible. ASIC "resistance" as a matter of degree is possible, yes, but custom hardware has still proven to be economically superior. Even for DRAM-bound PoW's we are seeing 3x and more.

Gingeropolous commented 5 years ago

So, I think it's generally understood that an ASIC-immune PoW is not possible, but ASIC resistance may be possible. Indeed, it was always assumed that a CryptoNight ASIC would show up, but that it would be within reasonable orders of magnitude. Then there's the whole commoditization problem.

With that in mind, perhaps RandomJS (or RandomX86, much cooler because it has an X) may prove a better ASIC-resistant PoW by modifying its library somehow... so instead of having to make a magical and special tweak to some tricky part of the code every 6 months, the core developers can just publish a new library of functions. This decreases the burden of the whole PoW-fork cycle, because random functions just get designed and swapped in... and if an ASIC manufacturer wanted to keep up, they would have to release new chips with the new functions.

Because ultimately, yes, it seems any hardware manufacturing effort is going to be cost-effective if the purpose of the thing being manufactured remains constant.

So perhaps it's not about how difficult it is for ASICs to perform the PoW, but about how easily the PoW can be tweaked.

tevador commented 5 years ago

Why couldn't we write a "trimmer" which has similar logic to a generator, except it only keeps a score of estimated execution complexity? Sounds much smaller and faster than "code generation"

The generator logic is what takes the most time. When the generator finds it needs to emit expression X, it simply writes a few bytes into a preallocated buffer. This is the only part that your trimmer could omit. I think it could be about twice as fast as the generator, certainly not 100 times faster.

Another way of saying it: how is executing eval(string) any different from simply executing the string? If you can generate a valid string to pass to eval, then you can execute that string directly in the ASIC without any additional parsing.

The point is that the string passed to eval is not checked syntactically by the generator. Actually, about 95% of eval invocations throw a SyntaxError. This means a (simplified) Javascript parser is actually required.

loop_size and recursion_depth are determined at runtime.

Not really.

<MaxCallDepthRange>3-3</MaxCallDepthRange>
<MaxLoopCyclesRange>1000-2000</MaxLoopCyclesRange>

ceil(log2(2000)) = 11 bits for loop counters, for example.

Yes, both loop size and recursion depth are bounded to limit the runtime of the program. It doesn't change the fact that they are determined at runtime. You cannot tell in all cases whether a loop will be executed once or 1000 times without running the code. This would have an impact on the precision of your static analysis for nonce-skipping.

Generate x86 instead of JavaScript.

Interesting idea. Would it solve the "easy nonce" problem, though? It's certain there would be licensing issues for companies interested in developing an x86 ASIC (apart from Intel, AMD and VIA).

timolson commented 5 years ago

Separately from the ASIC discussion... what about the "hard nonce" problem? Validation of a PoW must be fast to prevent DoS, but an attacker may selectively choose nonces (which are ultimately rejected) that target the very long tail of your distribution. I'm not sure what distribution would fit your graph best, but I'd guess Poisson or something else that has a substantial right-tail. Those distributions make it unreasonably easy to find nonces that take a looong time to validate, and an attacker could spam the network with those.

tevador commented 5 years ago

@timolson This shouldn't be a problem due to the proposed DoS-resistant verification algorithm (see RandomJS description). Basically, the attacker would have to first find a nonce that produces a Blake2b hash below the network difficulty before the verifying party even tries to run the program.

For example, if the network difficulty is 1 million, you need an average of 1 million Blake2b hashes to find a hash below the target, which is what would force program execution. 1M Blake2b hashes take ~100 ms on a modern CPU core, which is more than the 99.9th percentile of program runtimes.

Mining pools will accept shares of lower difficulty than that, but they can ban malicious miners.
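
A back-of-the-envelope check of those numbers (the hash rate below is an assumption consistent with the ~100 ms figure, not a benchmark):

```python
difficulty = 1_000_000         # example network difficulty from above
blake2b_rate = 10_000_000      # assumed Blake2b hashes per second on one CPU core
avg_program_runtime = 0.00972  # average program runtime from the distribution above

attacker_hash_time = difficulty / blake2b_rate   # ~0.1 s of hashing per forced execution
print(f"attacker spends ~{attacker_hash_time:.2f} s hashing to force one "
      f"~{avg_program_runtime * 1000:.1f} ms program execution")
```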

timolson commented 5 years ago

Basically, the attacker would have to first find a nonce that produces a Blake2b hash below the network difficulty before the verifying party even tries to run the program.

Do they? Couldn't an attacker lie about the final target hash? You don't know whether the final hash is valid unless you run the validation through to step 10.

Verification algorithm

Input: H, R from the received block.
...
10. If R != (A XOR B), discard the block.

H contains a malicious nonce and R is completely fabricated.

tevador commented 5 years ago

R can be fabricated, but the second step of the verification algorithm calculates Blake2b(R, K) and if the resulting hash is not below the network difficulty, no attempt to execute code is made since the block is clearly invalid regardless.
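
In sketch form (the final comparison is simplified here; in the actual verification algorithm it is the R != (A XOR B) check from step 10):

```python
import hashlib

def verify(H: bytes, R: bytes, target: int, generate_program, run_program) -> bool:
    """DoS-resistant verification order: the cheap hash check comes first, so
    an attacker must pay ~difficulty Blake2b hashes before the verifier will
    execute any code."""
    K = hashlib.blake2b(H).digest()
    pow_hash = hashlib.blake2b(R + K).digest()        # Blake2b(R, K)
    if int.from_bytes(pow_hash, "little") >= target:
        return False                                  # early exit: no code is ever run
    P = generate_program(K)                           # only now generate...
    expected = run_program(P)                         # ...and execute the program
    return expected == R                              # simplified stand-in for step 10
```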

timolson commented 5 years ago

Sorry, I missed that step (blush) Thanks for tolerating my poking.

tevador commented 5 years ago

No problem. It's important to review each part of the algorithm with scrutiny.

I've been thinking about your proposal of generating x86 code. While this would have some inherent advantages, it doesn't solve the main issue that you highlighted.

In order to generate valid x86 code (or valid Javascript), the generator has to create an abstract representation of the program and then generate the code based on the rules of the ISA (or a higher level language). At runtime, this code is then decoded back into some abstract representation or a set of microinstructions.

An attacker might elect not to generate any x86 code (avoiding the licensing issues), but instead encode the abstract syntax into an instruction set of his choosing, or execute the AST directly, to avoid the encoding and decoding steps.

The only way to avoid generating an AST is to generate a random code sequence, which might not be valid. This is the purpose of the eval expression in RandomJS. Unfortunately, this brings a lot of other issues: decreased performance and the risk of optimizations due to slow exception handling in Javascript.

Another solution would be to define an instruction set in which every random bitstring is a valid program, like CNVM, but this opens doors to custom hardware implementations.

tevador commented 5 years ago

Potential solution of the "easy nonce problem" was posted here: https://github.com/monero-project/monero/issues/3545#issuecomment-428670984

Gingeropolous commented 5 years ago

Yeah, it would be cool if it were discussed in this issue - that same solution was posted above. The issue linked to is generally quite large.

Regarding your comment there:

It might be a good solution with 2 chained code gen/exec steps. Longer chains would cause the verification time to be too high or require significant reduction of the average complexity of the program.

One thought I had was to have the complexity decay over N. E.g., N is 1000 lines long, N+1 is 500 lines long, N+2 is 100 lines long, or something. The main defense against easy-nonce searching is that the attacker would have to execute each program, which is the slow part. But perhaps just having N and N+1 is enough, because they would have to search for N+1 by creating and compiling N many times.

SChernykh commented 5 years ago

It is enough to have N=2 to make the nonce-skipping attack inefficient. Just replace "1%" with "50%" and repeat all the calculations from the post above:

The first 3 steps of the mining algorithm take around ~6% of the total solve time (most of this is the program generation routine). Now let's assume an adversary can somehow decrease this overhead time to just 1% per nonce.
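
For reference, a minimal sketch of the chaining idea (simplified; function names are stand-ins): the output of program i seeds the generation of program i+1, so estimating the runtime of the next stage requires actually executing the previous one.

```python
import hashlib

def chained_pow(K: bytes, generate_program, run_program, chain_length: int = 2) -> bytes:
    """Chain program generation and execution: each stage's seed depends on
    the previous stage's output, so there is no cheap static shortcut."""
    seed = K
    for _ in range(chain_length):
        P = generate_program(seed)                    # generate stage i from the current seed
        output = run_program(P)                       # must be executed to learn the next seed
        seed = hashlib.blake2b(output + K).digest()   # seed for stage i+1
    return hashlib.blake2b(seed).digest()             # final PoW hash (simplified)
```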

Gingeropolous commented 5 years ago

aight.... is this issue closed?

tevador commented 5 years ago

With N=2, the maximum possible speed up (while keeping ~6% generation time) is ~10%:

| X | Time per hash [s] | Relative |
|---:|---:|---:|
| 1 | 0.01944 | 100% |
| 2 | 0.01790 | 92.1% |
| 3 | 0.01760 | 90.6% |
| 4 | 0.01757 | 90.4% |
| 5 | 0.01789 | 92.1% |
| 6 | 0.01805 | 92.9% |
| 7 | 0.01852 | 95.3% |
| 8 | 0.01900 | 97.8% |
| 9 | 0.01945 | 100.1% |

With 1% generation time, the maximum gain is ~25% and with zero generation time, the maximum possible speed-up is ~100% (compared to infinity for N=1).

hyc commented 5 years ago

Sounds like yes, we have a solution to the easy nonce problem.

timolson commented 5 years ago

execute the AST directly to avoid the encoding and decoding steps.

Yes, exactly. That's what I was trying to communicate earlier about how an ASIC would operate. Moving to x86 wouldn't help. Your generator is an implicit AST already. You don't need to generate text (except to hash it in parallel). Just trigger execution based on the generator state.