monero-project / monero

Monero: the secure, private, untraceable cryptocurrency
https://getmonero.org

Idea for ASIC resistance #3545

Closed zawy12 closed 3 years ago

zawy12 commented 6 years ago

If ASICs are going to be a recurring problem, can you change the PoW to one of maybe 5 or 10 different options every block, selected by the hash value of the previous block? That way GPUs could switch algorithms in software, whereas ASICs would hopefully require different hardware for each PoW, provided the algorithms are logically and/or mathematically independent enough. Even better would be a PoW where changing a single variable requires a different ASIC; then the previous hash would be used to set that variable for the next block.
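A minimal sketch of the selection step being proposed, assuming a set of logically independent PoW functions - the names, the count, and the use of the hash's first byte are all illustrative, not an actual Monero API:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash  = std::array<uint8_t, 32>;
using PowFn = Hash (*)(const std::vector<uint8_t>& blob, uint64_t nonce);

// Stand-ins for logically independent PoW functions; in reality these would
// be e.g. different memory-hard algorithms with unrelated inner loops.
Hash pow_a(const std::vector<uint8_t>&, uint64_t) { return {}; }
Hash pow_b(const std::vector<uint8_t>&, uint64_t) { return {}; }
Hash pow_c(const std::vector<uint8_t>&, uint64_t) { return {}; }

Hash hash_block(const Hash& prev_hash,
                const std::vector<uint8_t>& blob, uint64_t nonce) {
    static constexpr PowFn pows[] = { pow_a, pow_b, pow_c };
    // The previous block's hash picks the algorithm for this block, so the
    // choice is unpredictable until the prior block is found.
    size_t idx = prev_hash[0] % (sizeof(pows) / sizeof(pows[0]));
    return pows[idx](blob, nonce);
}
```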

tevador commented 6 years ago

@zawy12 My virtual machine does mostly what you described. I will have proof of concept code ready soon.

tevador commented 6 years ago

"CNVM" algorithm - proof of concept code: https://github.com/tevador/cnvm

hyc commented 6 years ago

@tevador as I mentioned before, this seems way too simple. Randomizing the opcode map isn't accomplishing much either since a lookup table can sort that out. It would be pretty easy to implement this entire VM as a hardwired ASIC.

zawy12 commented 6 years ago

His instruction set and algorithm construction seem to follow what I'm looking for, if it can result in near-equal solvetimes (using the selection process I described). If ASICs can't be made to have more compute units than GPUs, then my idea might work.

tevador commented 6 years ago

@hyc Ultimately, all VMs can be implemented in hardware (even javascript). It's just a question of how much silicon would be required to do it.

joijuke commented 6 years ago

simd in js

hyc commented 6 years ago

@tevador Yes, absolutely. My perspective is that javascript is far more complex than what you're working with, and will take a lot more silicon to implement. Which means larger cores, which means fewer cores per chip, less computational advantage.

tevador commented 6 years ago

I started developing my own javascript generator written in C#.

Also I modified the proof of work algorithm to be asymmetrical (faster verification than solving).

Working demo here: https://github.com/tevador/RandomJS (still in development).

hyc commented 6 years ago

Could you give a more detailed description of how the asymmetric verification works?

tevador commented 6 years ago

Algorithm description added.

Feel free to review it and comment.

baryluk commented 6 years ago

Just my two cents. I like the idea, but I would also suggest including 64x64 bit multiplication and 64x64 bit division operations, which are notoriously complex to implement and almost never used in crypto ASICs or hash designs, as their circuits are either slow, very large, or poorly understood from a cryptographic research point of view. CPUs and GPUs can do them. Implementing wide multiplications and divisions on 32-bit archs is also possible, at the expense of a few more branches, which is fine. FPGAs rarely have such wide multipliers; they do have DSP blocks with FMA units, but those are usually more like 18 or 28 bits wide (which is fine for almost all real-world signal processing, since input data from an ADC only has 16 or 14 bits of resolution anyway). You can probably combine multiple blocks for something wider, but that is also fine.

I don't have much against FPGAs, as they are easier for the general public to obtain in arbitrary quantities (sure, it is hard to buy large or very high frequency FPGAs or FPGA eval boards, as they tend to be expensive and populated with a lot of hardware that is not relevant to crypto). If anything, the random algorithm produced by the generator should be small enough to fit in L1 or L2 cache or on a medium-size FPGA (<50k gates).
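For reference, a widening 64x64 multiply really is straightforward to build from 32-bit limbs; here is a minimal textbook sketch (not code from any Monero repository):

```cpp
#include <cstdint>

struct u128 { uint64_t lo, hi; };

// Schoolbook 64x64 -> 128-bit multiply using only 32x32 -> 64-bit products,
// the kind of wide multiplication a 32-bit arch would have to emulate.
u128 mul64x64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   // four 32x32 -> 64 partial products
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    // Sum the middle column with the carry out of the low word.
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    u128 r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```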

baryluk commented 6 years ago

As for RandomJS, the idea is OK as a prototype, but I am very much against adding V8 as an enormous dependency of a project like this. It creates a lot of security issues, toolchain issues, and portability problems (for example, V8 basically didn't work well on aarch64 until just last year, and it doesn't work at all on more obscure, but otherwise capable, archs and operating systems).

The nice thing about RandomJS is that it largely bypasses the JIT, or at least makes it behave the same for both the miner and the verifier, by changing the entire algorithm not just per block but for each nonce. That also makes it hard to implement on an FPGA (which I am not sure is an entirely good idea from a network security point of view, but is otherwise fine, I think).

As a more production-oriented option I would suggest Lua (the standard C implementation) or LuaJIT (an alternative C/C++ implementation for multiple platforms). They are small, well maintained, portable, and designed for embedding in other programs. You can also have multiple independent VMs in the same process or even the same thread, so parallel verification, mining, etc. is easy.

tevador commented 6 years ago

@baryluk

> Just my two cents. I like the idea, but I would also suggest including 64x64 bit multiplication and 64x64 bit division operations, which are notoriously complex to implement and almost never used in crypto ASICs or hash designs [...] If anything, the random algorithm produced by the generator should be small enough to fit in L1 or L2 cache or on a medium-size FPGA (<50k gates).

Most of your points are implemented in the CNVM proof of concept. The problem is that the VM can still be realized entirely in hardware with much higher efficiency than CPUs/GPUs can achieve.

By the way, GPUs usually don't have hardware integer dividers, so division must be done in software.

In any case, this debate needs ASIC design specialists to help us make the right choices.

> As for RandomJS, the idea is OK as a prototype, but I am very much against adding V8 as an enormous dependency of a project like this. It creates a lot of security issues, toolchain issues, and portability problems (for example, V8 basically didn't work well on aarch64 until just last year, and it doesn't work at all on more obscure, but otherwise capable, archs and operating systems).

The algorithm does not necessarily need to use V8. There are other ECMAScript engines - for example, this is one lightweight implementation. Personally, I don't see any security issues with using V8. For mining software, you want the fastest possible implementation, which at the moment is V8. There could be a more portable version using a slower Javascript engine just for blockchain verification.

> As a more production-oriented option I would suggest Lua (the standard C implementation) or LuaJIT (an alternative C/C++ implementation for multiple platforms). They are small, well maintained, portable, and designed for embedding in other programs. You can also have multiple independent VMs in the same process or even the same thread, so parallel verification, mining, etc. is easy.

I think one of the arguments for using Javascript was that even if an ASIC were developed, it would benefit everyone, because then perhaps we could have hardware-accelerated browsers one day. Also, the Javascript implementation being larger in size is actually an advantage against ASICs.

baryluk commented 6 years ago

LuaJIT is faster than V8, and it is maybe 200 KB of code - not 60 MB of hell that takes 20 hours to compile on a decent computer.

Gingeropolous commented 6 years ago

well, make it in luajit :)

hyc commented 6 years ago

You're not helping.

Javascript is a good choice because it's bulky - an ASIC developer will need more resources to implement it on a chip.

tevador commented 6 years ago

I managed to get the runtime of the random program relatively under control. See https://github.com/tevador/RandomJS/issues/1 for detailed statistics.

I found 3 major reasons for outliers with high runtime:

  1. Nested loops. I fixed this by capping the total number of loop cycles in a program (see the sketch below).
  2. Function calls inside loops. I fixed this by disabling function invocation in loop bodies.
  3. Large strings. String concatenation can slow the program down through memory allocation and garbage collection if the string length exceeds reasonable limits. I fixed this by capping the maximum length of a string.
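A hypothetical illustration of fix 1: the generator wires every loop it emits to one shared cycle budget, so nesting cannot multiply the total iteration count. The budget value and the `__cycles` variable name are made up here; RandomJS's actual mechanism may differ:

```cpp
#include <string>

// Illustrative global budget for all loops in one generated program.
constexpr int MAX_CYCLES = 10000;

// Every generated loop spends from the same budget: however deeply loops
// are nested, the whole program executes at most MAX_CYCLES loop bodies.
std::string emitGuardedLoop(const std::string& cond, const std::string& body) {
    return "while ((" + cond + ") && __cycles-- > 0) {\n" + body + "\n}\n";
}

// The generated program's prologue declares the budget exactly once:
//   let __cycles = 10000;
```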

I still have to reduce the complexity of the program somewhat, as I'm aiming for an average runtime of around 5 ms.

As for program features, I still plan to incorporate objects (currently only numbers, strings and functions are used).

zawy12 commented 6 years ago

@tevador The runtime problem is why some of us have said the complex-algorithm solutions will not work. I believe that by the time you have a stable runtime solution, you will have restricted the possible algorithms in a way that may make them amenable to ASIC implementation.

The solution I proposed may also be too amenable to ASICs, because I would use a restricted set of instructions, so an ASIC would just be 10,000 simplified "CPUs" running the algo in parallel with different nonces. So the problem with my idea is that the ASIC hardware cost would be approximately zero compared to GPUs, although ASIC electricity cost may be 2x higher. But it may be salvageable for a constant-value coin that has no mining rewards or mining fees, but runs on everyone's cell phone alongside the wallet. Then only a 51% attack would be profitable, and it would be very difficult if enough cell phones and CPUs are running it in the background.

tevador commented 6 years ago

@zawy12 I don't think the optimizations restrict the generated programs enough to enable an ASIC implementation. The program is still way too complex for that.

In any case, the generator has an XML config file, so you can disable the restrictions and test. The only restriction that must stay in place is 3), otherwise the program sometimes crashes when string concatenation happens in a loop.

By the way, today there was a guy in the chat on supportxmr.com claiming he has a new ASIC in development for cryptonight + any possible variants of it (includes an FPGA so it's partly programmable). He was quoting 100 KH/s @ 1 kW, shipping in July. I think he was legit.

zawy12 commented 6 years ago

I'm waiting to see if you guys are successful with the more complex approach.

moneromooo-monero commented 6 years ago

> I did a quick test, basically this: _mm_load_si128 -> _mm_shuffle_epi32 -> _mm_store_si128 for the other 3 16-byte chunks in the current 64-byte cache line, right after the _mm_aesenc_si128 instruction and right after the _umul128 instruction.

@SChernykh, would you mind posting some code implementing this so we're sure to use the exact same code ? I'm planning on using this as one of the v2 changes.

SChernykh commented 6 years ago

@moneromooo-monero Yes, I'll post it here later today. I deleted the original code, but it's easy to redo from scratch.

SChernykh commented 6 years ago

Here you go: https://github.com/SChernykh/xmr-stak-cpu/commit/5f8acd10ed2ae55f1ee8a02f0302e186c9f410cc

I took the old xmr-stak-cpu repo for testing purposes. It still has the original Cryptonight implementation, so you can see exactly what I did there. The performance difference with the shuffle on is negligible on CPU.
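For readers skimming the thread, here is a rough sketch of that shuffle modification as described above - the linked commit is the authoritative version; the permutation constant, function shape, and `active` parameter are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Permute the three 16-byte chunks of a 64-byte cache line that the main
// loop did not just touch; `active` is the index (0-3) of the chunk it used.
// `line` must be 16-byte aligned for _mm_load_si128 / _mm_store_si128.
static inline void shuffle_cache_line(uint8_t* line, size_t active) {
    for (size_t i = 0; i < 4; ++i) {
        if (i == active) continue;  // skip the chunk the main loop just used
        __m128i* p = reinterpret_cast<__m128i*>(line + 16 * i);
        __m128i v = _mm_load_si128(p);
        // 0x93 rotates the four 32-bit lanes; any fixed permutation works
        // for the purpose of forcing extra cache traffic.
        v = _mm_shuffle_epi32(v, 0x93);
        _mm_store_si128(p, v);
    }
}
```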

LordMajestros commented 6 years ago

Here is another approach to ASIC resistance that claims ASICs can only get a 1.1-1.2x improvement over GPUs. https://github.com/ifdefelse/ProgPOW

hyc commented 6 years ago

@LordMajestros I've been emailing them privately to discuss their scheme. One thing I don't like about it is that it's GPU-centric, and CPUs are at a large disadvantage.

zawy12 commented 6 years ago

@hyc Although it may be possible to make electricity more expensive on GPUs and ASICs compared to cell phones and CPUs, the hardware cost to get a lot of instances running should be a lot less on GPUs and ASICs. For this reason, if the only way to beat ASICs is with GPUs, then so be it. A POW for cell phones and CPUs seems to have serious potential only if there are no profits in mining and if the coin is widespread as a wallet+miner app on cell phones. Satoshi's idea was to make the network strong against attack by making it profitable, but it seems like this has the hard-to-avoid side effect of centralization. Maybe strong-by-widespread-adoption combined with diverse-by-being-unprofitable is an implementable solution, somehow.

baryluk commented 6 years ago

GPUs and CPUs are general-purpose chips and can do the same set of tasks. Any PoW is parallelizable (because multiple independent miners need to be able to mine in parallel), and no matter what the ratio of compute to memory usage and bandwidth is, a GPU can hide memory latencies much more easily due to its highly threaded nature. As such, I think fighting GPUs is not an option, and any future PoW algorithm should be reasonably compatible with GPUs. If you design a PoW to be more efficient on CPUs right now, GPUs will catch up on this specific front with time and become more efficient. Also, the issue is about ASIC resistance, not GPU resistance. And I would advise not setting the goals too high at first, before actually having something that works well on GPUs. Iterate from there.

tevador commented 6 years ago

> Here is another approach to ASIC resistance that claims ASICs can only get a 1.1-1.2x improvement over GPUs. https://github.com/ifdefelse/ProgPOW

I didn't find any details on how the 1.1-1.2x ASIC improvement was calculated.

It doesn't use any floating point operations, so I think an ASIC chip could be a lot smaller than a GPU in terms of the required silicon area.

SChernykh commented 6 years ago

@moneromooo-monero I improved my shuffling modification a bit: https://github.com/SChernykh/xmr-stak-cpu/commit/cf5175aabbf08cd25366b66a4e4b98e4e8958a48

My concern was that data never crossed a 16-byte boundary during shuffling, which could allow some ASIC/FPGA optimizations. For example, an ASIC/FPGA could just maintain a virtual ordering (1 byte per 16 bytes of data) instead of actually moving the data in memory. Now data is moved all across the 64-byte cache line.

P.S. Maybe some other simple operations like XOR could be applied in addition to shuffling without impacting performance. I'll keep experimenting.

moneromooo-monero commented 6 years ago

Is there any mileage in doing some operation on that extra memory for which there's a fast enough CPU instruction, but which would require substantial extra silicon, like a division? Division still seems to be slowish on CPUs, but it might be "hidden" by memory latency?

SChernykh commented 6 years ago

Division, unfortunately, is a showstopper for parallel execution. It occupies a lot of resources, even if its result is not needed for the next instruction. It's in my plans to try hiding the division latency - maybe by using the division result only on the next iteration. But I still think it will slow down the main loop.
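A hedged sketch of that latency-hiding idea: issue the division in one iteration and only consume its quotient in the next, so the current iteration never stalls on the divider. The address mixing and the way the quotient is folded in are placeholders, not the actual commit:

```cpp
#include <cstddef>
#include <cstdint>

void main_loop(uint64_t* scratchpad, size_t iterations) {
    uint64_t pending_div = 0;  // quotient issued on the previous iteration
    uint64_t a = 0x0123456789abcdefULL, b = 0xfedcba9876543210ULL;

    for (size_t i = 0; i < iterations; ++i) {
        size_t idx = a % 262144;   // stand-in address mix (2 MB / 8 bytes)
        uint64_t c = scratchpad[idx];

        a ^= pending_div;          // consume the quotient from last time

        // Issue this iteration's division; nothing below depends on it,
        // so the out-of-order core can overlap it with the memory work.
        pending_div = b / (c | 1); // `| 1` guards against division by zero

        scratchpad[idx] = a + c;   // the regular loop work continues
        b += c;
    }
}
```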

tevador commented 6 years ago

Division is usually not pipelined in the CPU because its latency varies widely depending on the values of the operands. This means any instruction that depends on the result of a division is stalled until the division completes, which can take up to ~100 clock cycles for 64-bit division.

One alternative would be to use the floating point unit, which is left completely idle by Cryptonight.

SChernykh commented 6 years ago

@tevador Up to 94 clock cycles on Skylake (and *-lake successors), but only up to 46 clock cycles on Ryzen: http://agner.org/optimize/instruction_tables.pdf

SChernykh commented 6 years ago

@moneromooo-monero @tevador I've added division modification to test: https://github.com/SChernykh/xmr-stak-cpu/commit/c76996617c5175ffe9f03f7ca1b3f9b3115a60a3

It looks like my idea of using the division result on the next iteration to hide latency worked: only a 3.5% slowdown in my tests. And even combined with the shuffle modification it's still only a 3.5% slowdown.

P.S. I had a stupid copy-paste error in that commit, but it didn't change the performance results.

P.P.S. Actually, it's a 6.5% slowdown with division. I rebooted my notebook and tested again with nothing else running in the background.

tevador commented 6 years ago

For a 3.5 GHz CPU @ 70 h/s, that's at most ~95 clock cycles per Cryptonight iteration, which is very close to the maximum division latency. So this might work with one division per iteration. It would be interesting to test it on a GPU as well.
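For anyone checking the arithmetic, the ~95-cycle figure follows directly from the hashrate, assuming the standard Cryptonight main loop of 2^19 = 524288 iterations per hash:

```cpp
constexpr double clock_hz        = 3.5e9;     // 3.5 GHz CPU
constexpr double hashes_per_sec  = 70.0;      // measured hashrate
constexpr double iters_per_hash  = 524288.0;  // Cryptonight main loop, 2^19
constexpr double cycles_per_iter = clock_hz / hashes_per_sec / iters_per_hash;
// cycles_per_iter ≈ 95.4 - roughly one worst-case 64-bit division per iteration
```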

SChernykh commented 6 years ago

GPUs have an abundance of computing power, they're mostly limited by memory access when running Cryptonight.

tevador commented 6 years ago

However, the shuffle can have some impact on GPU performance. For example, xmr-stak in its default configuration splits the scratchpad into 16-byte chunks interleaved with the chunks of other threads. This pattern will have to change to avoid 3 additional memory accesses.

SChernykh commented 6 years ago

Yes, just interleave 64-byte chunks instead (see the sketch below). I don't see a problem for GPUs in that. Quite the contrary: GPUs like accessing larger chunks; they're optimized for sequential access. The only problem is 4x memory bandwidth usage, but GPUs also have enough bandwidth:

- Radeon Vega 64 currently uses 15% of its bandwidth: 37 MB per hash × 2000 h/s = 74 GB/s out of 483.8 GB/s available.
- Radeon Vega 56 currently uses 16.2%: 37 MB per hash × 1800 h/s = 66.6 GB/s out of 409.6 GB/s available.
- Radeon RX 580 currently uses 12%: 37 MB per hash × 830 h/s = 30.71 GB/s out of 256 GB/s available.

And so on...
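A minimal sketch of what 64-byte interleaved addressing could look like, assuming N mining threads share one scratchpad buffer (the names and layout are illustrative; xmr-stak's actual "mem_chunk" logic may differ):

```cpp
#include <cstddef>

constexpr size_t CHUNK = 64;  // chunk size in bytes; 16 in the old default

// Map a thread's logical scratchpad offset to its physical offset when
// CHUNK-byte pieces from nthreads threads alternate in memory.
size_t interleaved(size_t thread, size_t nthreads, size_t offset) {
    size_t chunk_idx = offset / CHUNK;
    size_t within    = offset % CHUNK;
    return (chunk_idx * nthreads + thread) * CHUNK + within;
}
```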

tevador commented 6 years ago

I just tested with an RX 550 using the "mem_chunk" option in xmr-stak. You are right, there is no performance drop when switching to 64-byte chunks.

If a GPU execution unit can emulate integer division in less than ~1000 clock cycles (the approximate time of one CN iteration for my RX 550 @ 1200 MHz), then there should be no impact on GPU performance.

SChernykh commented 6 years ago

And I just tested shuffle & division on the newest Skylake-X processor. Division latency is hidden entirely here, with no additional performance drop: only a 2.5% slowdown with shuffle, and the same slowdown with shuffle and division. The funny thing is that single-thread performance is actually lower on a modern 4 GHz Skylake-X than on a 6-year-old 2.7 GHz Ivy Bridge notebook processor, even with the original Cryptonight. It proves once again that Cryptonight is memory latency bound and we can add a lot of computation without affecting performance.

Gingeropolous commented 6 years ago

@SChernykh, can you weigh in on this whole javascript proof-of-work thing? I see you are active on this thread, but you haven't directly spoken up on the javascript approach that @hyc and @tevador have proposed and prototyped. Specifically, @tevador mentioned above:

> In any case, this debate needs ASIC design specialists to help us make the right choices.

SChernykh commented 6 years ago

I'm not an ASIC specialist. But as a programmer, I see it as a high-risk change at this point; maybe even a year from now, after a lot of testing, it will still be risky. Chrome's V8 engine is huge and of course contains (yet unknown) bugs that can be exploited. The random code can't be truly random because of the so-called "halting problem" (google it), so the generated programs must be a small, limited subset of what Javascript (or any other programming language) can offer. But yes, it will make an ASIC hardware implementation look a lot more like a CPU/GPU than it does now, with almost no performance gains.

And, judging by current tendencies in general-purpose CPUs - all these new instruction set extensions - they're likely to have some JS acceleration support in the future. Which is good for this approach.

P.S. Right now I'm concentrating on modifying Cryptonight to use CPU strengths that aren't exploited yet, making it less efficient for ASICs without impacting CPUs/GPUs. Shuffling exploits unused cache bandwidth; division takes advantage of out-of-order execution. Maybe there is something else I'm missing here. But it's still a temporary solution; random programs are the way to go for the future.

P.P.S. And the random programs must also have an efficient GPU implementation, otherwise we'll have a big problem with the mining community and a lot of unneeded fork debates.

tevador commented 6 years ago

@SChernykh The halting problem is based on deciding whether an arbitrary program will halt or run forever. However, this has no impact on random program generation, because we don't generate arbitrary programs. For example, it's trivial to restrict the generation routine to exclude infinite loops and infinite recursion. The subset of programs that don't run forever is still so large that this has no impact on ASIC resistance.

As for bugs in V8, the worst thing I can think of is if someone can find blocks that generate programs that crash the VM. This is not really a security issue.

But I agree that it's a big change and needs a lot of testing before being deployed by a major cryptocurrency like Monero. I'm currently starting a collaboration with the Wownero dev team to implement RandomJS (Wownero is a fork of Monero).

SChernykh commented 6 years ago

One more concern is determinism. The same JS program / JS engine combination must produce exactly the same results on all supported platforms (x86, ARM), OSes and compilers in order to be usable for hashing. That is very hard to achieve, especially with floating point calculations enabled in the generated programs.

baryluk commented 6 years ago

Not only that: even future versions of the same engine (e.g. corrected for bugs or security issues) can change results, due to a different ordering of operations that are assumed to be commutative, or optimizations, or simply finding out that the engine does not conform to the spec (which shouldn't happen as long as we stick to basic operations and don't call external functions, e.g. from the Math module). This mostly applies to floats, i.e. when doing multiplications, additions and divisions (e.g. a/x + b/x vs. (1.0/x) * (a + b) - the second being faster, but not bit-identical). The problem is that in JS all numbers are floats! So you are screwed. This also applies to integers to some extent, where many compilers will assume overflows do not happen in a well-conforming program and use various shortcuts. The JIT might also change function inlining and loop unrolling behavior, which will have unpredictable effects on performance, memory and the Lx caches. Lua, or something with more precisely defined semantics, could be better.
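The reordering hazard is easy to demonstrate even without a JS engine in the loop: floating point addition is not associative, which is the same class of problem as the a/x + b/x rewrite above (sketched in C++ for neutrality):

```cpp
#include <cstdio>

int main() {
    // Mathematically identical, but the rounding happens in a different
    // order, so the results differ in the last bit on IEEE 754 doubles.
    double left  = (0.1 + 0.2) + 0.3;  // 0.6000000000000001
    double right = 0.1 + (0.2 + 0.3);  // 0.6
    printf("%.17g\n%.17g\nequal: %d\n", left, right, left == right);  // equal: 0
    return 0;
}
```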

zawy12 commented 6 years ago

Division should be the only problem, and it can be solved by doing something like the following after each division: if (ceil(x + 0.001) > ceil(x - 0.001)) { x = ceil(x + 0.004); }

I mean, if you aren't going to use more advanced functions. You'd have to apply the above to every function like SQRT and whatnot.

SChernykh commented 6 years ago

@zawy12 If only it was that simple... Just read this: https://gafferongames.com/post/floating_point_determinism/

zawy12 commented 6 years ago

They are only talking about being exact to the resolution of the declared float type. I don't see anything suggesting my if statement is not always deterministic. To use it, you would not let a "double" go above 1 trillion, so that compilers can be off by as much as +/- 2 digits (a +/- 100x error) in the least significant digits.

You could just restrict the code to integer division. I don't know if other complex math functions can be assured to give the same results when performed on integers, but they should if the equation used is exactly the same, with integer division at each step.

SChernykh commented 6 years ago

In theory yes, but there are countless cases that can slip through the cracks - for example, a positive feedback loop that increases rounding error exponentially, which can happen in randomly generated code. Existing PoW/hash functions don't use floating point for a reason.

One more good article on the subject: https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/

zawy12 commented 6 years ago

I just remembered that my code above only produces a deterministic integer output. Offhand, I can't think of a way to do it with floating point. The protection would go hand-in-hand with any division or other problematic function, so if the problem function is in a loop, the correction would be too. As I've said before, a lot of restrictions are needed to make their random-function idea terminate at the correct time (the function does not merely need to terminate under some limit, but must stay within +/- 10% of a protocol-determined execution time if you want difficulty to be that accurate; you could have wider limits by changing it every nonce so that the average execution time is what you plan for). So it's far from random, which might open a door for ASICs.

Maybe this will work: if (ceil(100*(x + 0.001)) > ceil(100*(x - 0.001))) { x = ceil(100*(x + 0.004))/100; }