monero-project / monero

Monero: the secure, private, untraceable cryptocurrency
https://getmonero.org

Idea for ASIC resistance #3545

Closed zawy12 closed 2 years ago

zawy12 commented 6 years ago

If ASICs are going to be a recurring problem, can you change the PoW to maybe 5 or 10 different options every block, based on the hash value of the previous block? That way the software could be changed on GPUs, whereas ASICs would hopefully require different hardware for each PoW if the algorithms are logically and/or mathematically independent enough. Even better would be if a PoW could change a single variable to require a different ASIC; then the previous hash would be used to set the variable for the next block.

moneromooo-monero commented 6 years ago

That's the idea used by Myriadcoin, IIRC. It just increases the ASIC cost incrementally, but maybe it's part of a solution. For clarity for others reading this: what we've done for v7 is about breaking ASICs through unpredictable change, rather than making ASICs harder to build.

zawy12, could you rename the issue to "ideas for ASIC resistance" or so? We'll need to collect ideas anyway for the next switch, since we're very unlikely to have a new PoW by then. So we'll use this to get people to start thinking about different ways we could go.

tevador commented 6 years ago

I started developing a similar concept of an ASIC resistant PoW.

Basically, the idea is to create a virtual instruction set with basic bitwise, arithmetic and floating point operations and some branching (this must be designed carefully to avoid infinite loops). At the start of each new block, the hash of the previous block is expanded into a few hundred random instructions and compiled into machine code (for CPU/GPU). This random program is then applied to the scratchpad (initialized from the current block header). At the end, the scratchpad is hashed and the result is compared to the target.

Therefore, the PoW would change with every block, and if the instruction set is broad enough to cover the majority of what current CPUs/GPUs implement in hardware, then any potential ASIC would basically be a CPU/GPU with similar performance.

The specific details can be tuned so that the benefits from custom hardware are negligible - for example, the PoW code should fit into the L1I cache, the scratchpad should fit into L2/L3 cache, 64 bytes of data are processed at a time (a whole cache line), etc. A different permutation of opcodes could be chosen for every block to prevent direct hardware decoding.

This approach has some disadvantages, but it's the only one guaranteed to be ASIC resistant.
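As a rough sketch of the program-generation step, assuming a hypothetical three-register opcode set and a splitmix64 expander (neither is meant as a final specification):

#include <cstdint>
#include <vector>

// Hypothetical opcode set; the real instruction set would be much broader.
enum class Op : uint8_t { ADD, XOR, MUL, ROTL, FADD, BRANCH };

struct Instr { Op op; uint8_t dst, src; };

// splitmix64: a small, well-known PRNG used here only for illustration.
static uint64_t splitmix64(uint64_t& s) {
    uint64_t z = (s += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Expand a 256-bit previous-block hash into a few hundred random instructions.
std::vector<Instr> generate_program(const uint64_t prev_hash[4], size_t count = 512) {
    uint64_t seed = prev_hash[0] ^ prev_hash[1] ^ prev_hash[2] ^ prev_hash[3];
    std::vector<Instr> prog;
    prog.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        uint64_t r = splitmix64(seed);
        prog.push_back({ static_cast<Op>(r % 6),                  // which operation
                         static_cast<uint8_t>((r >> 8) % 3),      // destination register
                         static_cast<uint8_t>((r >> 16) % 3) });  // source register
    }
    return prog;   // would then be compiled to native code once per block and applied to the scratchpad
}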

zawy12 commented 6 years ago

That's really interesting. How would you make the solve time be in the correct ballpark for a given network hashrate? That is, how do you come up with a difficulty setting for an unpredictable algorithm?

hyc commented 6 years ago

Sounds a lot like what I outlined here https://www.reddit.com/r/Monero/comments/865vbb/asic_resistance/dw2p4b7/

The basic idea is to use randomly generated code. If you simply select from a handful of known algorithms, eventually someone will build an ASIC that just implements the entire set of algorithms. Using randomly generated code, with an RNG with an intractably large set of states, will defeat that possibility.

My approach is to randomly generate javascript. Developing your own instruction set sounds great but such a development will be immature for quite a long time, leaving countless opportunities for fatal flaws and unexpected optimizations. Using something mature like javascript means most optimization opportunities have already been exploited. (And meanwhile, if anyone does manage a new optimization, it becomes useful to the entire computing community.) I will have deployable code demonstrating this approach in a few days.

zawy12 commented 6 years ago

Again, I can't imagine how you guys are going to be able to adjust the difficulty for randomly generated code that has an unknown solve time. I guess you're right that they could build an ASIC for 5 algorithms.

tevador commented 6 years ago

@zawy12 The result of the PoW is a standard 256 bit hash of the collapsed scratchpad state, so difficulty targeting is not affected. Basically, just the memory-hard loop of cryptonight would be replaced. Everything else stays the same.

Mining would go like this:

1. Select a nonce value.
2. Compute the Keccak state of the block header and initialize a 2 MiB scratchpad (same as cryptonight).
3. Apply the block-specific algorithm to the scratchpad.
4. Collapse the modified scratchpad to a 256 bit hash (same as cryptonight).
5. If the hash meets the difficulty for the current block, the block is solved. Otherwise repeat from step 1.

The random algorithm would be long enough so that the latencies of different instructions average out and step 3) should take approximately the same time for every block.

The only major difference would be the additional compilation step when a new block is found.
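To make the flow concrete, here is a toy outline of a single mining attempt; the mixing function and scratchpad below are stand-ins for illustration only, not the real Keccak/Cryptonight primitives:

#include <cstdint>
#include <array>
#include <vector>

using Hash256 = std::array<uint64_t, 4>;

static uint64_t mix(uint64_t x) {                 // toy mixing function
    x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
    return x;
}

static Hash256 toy_hash(const std::vector<uint64_t>& data) {   // stand-in for Keccak
    Hash256 h{0, 0, 0, 0};
    for (size_t i = 0; i < data.size(); ++i) h[i % 4] = mix(h[i % 4] ^ data[i]);
    return h;
}

// One mining attempt per nonce: init scratchpad, apply the block-specific
// program (here just a fixed toy transform), collapse, compare to target.
bool try_nonce(const std::vector<uint64_t>& header, uint64_t nonce, uint64_t target) {
    std::vector<uint64_t> pad(1 << 10);                       // toy "scratchpad"
    Hash256 seed = toy_hash(header);
    for (size_t i = 0; i < pad.size(); ++i) pad[i] = mix(seed[i % 4] ^ nonce ^ i);
    for (auto& w : pad) w = mix(w);                           // step 3: block-specific program
    Hash256 final_hash = toy_hash(pad);                       // step 4: collapse
    return final_hash[0] < target;                            // step 5: difficulty check is unchanged
}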

ghost commented 6 years ago

How do you determine the difficulty of the random algorithm though? How do you determine if it even terminates? There must be some rules governing how the algorithm is generated and in that case it may be vulnerable to FPGAs or CGRAs using dynamic partial reconfiguration.

gsovereignty commented 6 years ago

Using randomly generated code, with an RNG with an intractably large set of states, will defeat that possibility.

@hyc that sounds pretty cool. What's the source of randomness, though? How do you prove it is random to everyone else? Or doesn't that matter for your solution?

If verifiable randomness is needed, Micali's VRFs could maybe help but that would probably require a shift towards a PoS-ish solution.

If randomness isn't verifiably random, couldn't an ASIC just propose their own 'randomness' that they are optimized for solving?

ghost commented 6 years ago

@gazhayes I don't think the source of randomness matters, people solving based on their own randomness will not get the same result as everyone else, and thus would be rejected.

hyc commented 6 years ago

The source of randomness is the block hashing template and the nonce. These are fed as a seed into the PRNG. Everyone must use the same PRNG and same random program generator, otherwise the results are invalid.
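For illustration, the seeding step could look something like this; the mixing constants and layout are assumptions, not the actual randprog code:

#include <cstdint>
#include <vector>

// Derive a PRNG seed from the block hashing blob plus the nonce, so every
// miner generating programs for the same (blob, nonce) gets identical output.
// The mixing below is illustrative only.
uint64_t derive_seed(const std::vector<uint8_t>& hashing_blob, uint32_t nonce) {
    uint64_t seed = 0x6A09E667F3BCC908ULL;                                // arbitrary starting constant
    for (uint8_t b : hashing_blob) seed = (seed ^ b) * 0x100000001B3ULL;  // FNV-style mix of the blob
    seed ^= nonce;
    // splitmix64 finalizer to spread the nonce bits
    seed = (seed ^ (seed >> 30)) * 0xBF58476D1CE4E5B9ULL;
    seed = (seed ^ (seed >> 27)) * 0x94D049BB133111EBULL;
    return seed ^ (seed >> 31);
}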

hyc commented 6 years ago

Proof of concept is here: https://github.com/hyc/randprog. Note that the actual program generator is some pretty bad code. This is nowhere close to usable for real; it literally only illustrates the idea.

Some other differences between mine and @tevador's proposal: this approach generates an entirely new program for every nonce, while his generates a single program for a given block. In my approach, the PRNG and the program generator play a more significant role in the overall PoW. The other approach relies primarily on the difficulty of executing the generated program.

zawy12 commented 6 years ago

@hyc It seems like the miner might be able to see higher difficulty in some algorithms and just skip them (go to a new nonce). tevador said the algorithm would be long enough that its computational difficulty would average out. I'm skeptical, but maybe it's possible by bounding outliers. I can't see how you've addressed the problem, especially with all the functionality of javascript. It seems like y'all are attempting a form of genetic programming, which has rules on what can be combined so that valid algorithms are created.

How are nodes going to easily validate the work without redoing it all?

Why change the algorithm with each nonce? Why not use bits of the hash of the previous block? All miners would know the algorithm as soon as they start working on a block.

LordMajestros commented 6 years ago

On further consideration I don't think verifying this will be easy. It seems to me that the only way to verify it is to do it all over again.

hyc commented 6 years ago

Yes, the only way to verify is to perform the same steps. But that's true for Cryptonight too so this is no worse in that respect. (This is not the only approach we're exploring. There are other avenues which allow the verifier to be cheaper than the original proof but I'm not directly involved in that work.)

@zawy12 as I noted, this particular code isn't directly usable. But yes, having all of the functionality of javascript is partly the point. A real solution would ensure some uniformity of difficulty for all of the generated code. E.g., "output will have [30-40] functions, [80-100] variables; functions will have [10-20] statements" etc. - the actual range of parameters will need to be chosen to provide a large enough set of permutations.

Why change on every nonce? Because we already know that companies like Bitmain have automatic tools to turn e.g. C code into e.g. RTL - someone using FPGAs could easily reconfigure if you only use one algorithm per block. Generating a new program for every nonce eliminates any such advantage. It also eliminates a lot of compiler advantage: both AOT and JIT compilation trade off compile time for run time on code that executes more than once. Since the generated code will only ever execute once, complex AOT or JIT optimizations are pointless.
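A sketch of what those uniformity constraints might look like as generator parameters; the ranges are just the example figures above and the struct/names are hypothetical:

#include <cstdint>
#include <random>

// Constraints on the generated program, so that every program's execution
// time falls within a narrow band. Ranges follow the example figures above.
struct GenParams {
    int min_functions = 30, max_functions = 40;
    int min_variables = 80, max_variables = 100;
    int min_statements = 10, max_statements = 20;   // per function
};

struct ProgramShape { int functions, variables, statements_per_function; };

// Pick concrete counts for one generated program from a seeded engine.
ProgramShape pick_shape(uint64_t seed, const GenParams& p = {}) {
    std::mt19937_64 rng(seed);
    auto pick = [&](int lo, int hi) {
        return std::uniform_int_distribution<int>(lo, hi)(rng);
    };
    return { pick(p.min_functions, p.max_functions),
             pick(p.min_variables, p.max_variables),
             pick(p.min_statements, p.max_statements) };
}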

zawy12 commented 6 years ago

@hyc OK, so the validator only has to run the "winning" nonce etc. (algorithm) while the miner had to do it for a bunch of different algorithms.

Maybe you should use reverse-polish-notation (stack) thinking, which is probably how a lot of genetic programming is done (not to be confused with GAs). By this I mean your first 2 variables go into a single math operation, which might be limited to +, -, *, and /. Or maybe there is a set of operations that CPUs are better at. You choose variable widths optimized for CPUs, equal to something like 64 bits. The output from the first operation is one of two inputs to the next operation, and a 3rd variable is the other. So you have N operations and N+1 variables. To be Turing complete, you have to give outputs the ability to be inputs to earlier (make it recursive) or later operations, so N+1 is the max number of variables. Cellular automata can be similarly Turing complete. You don't need all of a language, which may require a lot more protective logic (or limitations) for dealing with (or preventing) malformed operations.

Some variables (and therefore operation difficulty on them) are going to blow up in size (and therefore difficulty). So to meet the "near equal" computation difficulty requirement, I would make the variable sizes equal to the max for the declaration and let truncation at the max (to prevent overflow) be the norm. Maybe you have to make sure the number of each of the possible math operations is always equal, including the recursion count of that operation. But you can't count the number of recursions until you run the simulation, and it's not Turing complete if this is not an option. So it can't be that general, unless you have a rule that the process stops after a certain number of operations, and that stopping point is the output. But then you can't predict if you'll have way too many divisions compared to additions, unless the stopping point includes an assumption about the relative difficulty of the operations, which may not be a good idea. With recursion, I'm doubtful you can depend on a law of averaging like tevador requires.

This brings me back to the idea of simpler cellular automata, where the amount of computation in each step is known, so after the algorithm is created you know exactly how many steps are required to meet the difficulty requirement. I guess this is effectively the same thing as the 4 math operations with recursion, except that the method of recursion is pre-defined in the algo at each step, instead of being a free-for-all in what you could end up creating.
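A minimal sketch of that operation chain (N operations over N+1 fixed-width variables, with 64-bit wrap-around standing in for truncation at the max width); the opcode list and chaining rule are only assumptions for illustration:

#include <cstdint>
#include <vector>

// One step: a basic operation on two 64-bit inputs; unsigned wrap-around
// plays the role of "truncate at the max width to prevent overflow".
enum class BasicOp : uint8_t { ADD, SUB, MUL, DIV };

static uint64_t apply(BasicOp op, uint64_t a, uint64_t b) {
    switch (op) {
        case BasicOp::ADD: return a + b;
        case BasicOp::SUB: return a - b;
        case BasicOp::MUL: return a * b;
        case BasicOp::DIV: return b ? a / b : a;   // guard against divide-by-zero
    }
    return a;
}

// Chain N operations: the output of each step is one input to the next,
// and the other input is the next of the N+1 seed variables.
uint64_t run_chain(const std::vector<BasicOp>& ops, const std::vector<uint64_t>& vars) {
    uint64_t acc = vars.at(0);
    for (size_t i = 0; i < ops.size(); ++i)
        acc = apply(ops[i], acc, vars.at(i + 1));   // requires vars.size() == ops.size() + 1
    return acc;
}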

So maybe there is a way to have a very unpredictable algorithm that CPUs are good at that also can have its computational difficulty fixed. I assume a hash of the output is the thing that is trying to meet the difficulty target. I don't know why it would be needed, but you might be able to increase the number of steps required via a difficulty algorithm instead of lowering the target hash, or do both.

What I've described goes in a completely different direction than, I think, equihash and cryptonight, if it is not done in a way that requires a large amount of RAM for random access. But does a GPU or ASIC have any advantage over a CPU with this? By requiring a lot of step-by-step recursion without much memory, the central processor of any of the architectures would generate a lot of heat. So it's electricity intensive instead of up-front equipment cost. The smallest-resolution ICs would have an efficiency advantage, which I think puts ASICs at a disadvantage.

zawy12 commented 6 years ago

What I've described above could be done most cost-effectively (least electricity per simple computation) on RISC architectures like cell phones, especially since they're optimized for electrical efficiency and low RAM. FPGAs and ASICs would be at a disadvantage due to lagging behind GPUs and CPUs in line resolution. An ARM expert could identify the math operations (including bitwise operations) that ARM excels at. It only needs a way to generate recursive procedures that result in pseudo-random complex patterns, like certain cellular automata. But I don't know if Wolfram found a large "generate-able" class of them that reliably gives complex patterns.

I'm saying "cellular automata" but they're just sort of discrete differential equations. So you can think in those terms: just find a class of simple non-linear DE's that has these characteristics (predicable computation difficulty per step, yet non-reducible across their class to a more efficient computation). Both CA and DE's require initial conditions, and if the class of DE's is complex enough the only way to solve is step-by-step recursive procedure that's already been optimized by mathematicians.

ipbc-dev commented 6 years ago

Hello, we thought that a recent idea we started building might be of interest to this discussion: https://github.com/ipbc-dev/TuringsNightmare (Very WIP)

The idea is similar to what tevador suggests, I think.

SChernykh commented 6 years ago

@moneromooo-monero I have a great idea which can be used as a building block for the next small tweak to Cryptonight. It's not a big change like everything discussed above.

So, current Cryptonight is memory-intensive in terms of memory latency, but not bandwidth. Modern CPUs use 64-byte wide cache lines, but Cryptonight only loads/stores 16 bytes at a time, so 75% of available CPU cache bandwidth is wasted. ASICs are optimized for these 16 byte-wide memory accesses, so they always use 100% of whatever memory they have.

The idea is to do something computationally light and safe with the other 48 bytes of the same cache line (which is loaded in L1 cache anyway) on each half-step.

I did a quick test, basically this: _mm_load_si128 -> _mm_shuffle_epi32 -> _mm_store_si128 for the other 3 16-byte chunks in the current 64-byte cache line, right after the _mm_aesenc_si128 instruction and right after the _umul128 instruction. CPU mining performance stayed the same! Well, it went down 0.5-1% in my test, but that's negligible. It's easy to explain why: modern CPUs execute code out of order and have plenty of execution resources, so they can do it in parallel, and these 3 16-byte chunks are already in L1 cache, so no additional delay happens. It's just executed in parallel with the AES instruction and the MUL instruction.

I haven't tested it on GPU yet, but my estimations show that GPUs use only 15-20% of their memory bandwidth when mining Cryptonight, so they should also handle this change with minimal impact on performance.

ASICs, on the other hand, will have to deal with 4x more bandwidth usage, so they will be 4 times less efficient compared to the original Cryptonight ASICs.

This building block (_mm_load_si128 -> _mm_shuffle_epi32 -> _mm_store_si128 for the other 3 16-byte chunks) is easy to add to the next PoW change, together with other changes. And it can't break anything, because it only shuffles DWORDs that are in the scratchpad; it can never cancel anything out to 0 like XOR (if implemented unsafely) or reduce entropy in any other way.

P.S. I just realized that this makes FPGAs 4 times less efficient as well. Not only ASICs.
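For concreteness, the extra per-cache-line work could look roughly like this; the shuffle constant and exact placement relative to the AES/MUL steps are assumptions, and it assumes the scratchpad chunks are 16-byte aligned:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Touch the other three 16-byte chunks of the 64-byte cache line that the
// main Cryptonight step already brought into L1, so custom hardware has to
// move the full cache line rather than just 16 bytes of it.
static inline void shuffle_rest_of_cache_line(uint8_t* scratchpad, size_t offset)
{
    // offset points at the 16-byte chunk the main loop is working on;
    // line_base is the start of its 64-byte cache line.
    uint8_t* line_base = scratchpad + (offset & ~(size_t)63);
    for (int i = 0; i < 4; ++i) {
        uint8_t* chunk = line_base + 16 * i;
        if (chunk == scratchpad + offset) continue;        // skip the active chunk
        __m128i v = _mm_load_si128((const __m128i*)chunk); // load 16 bytes (already in L1)
        v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1)); // cheap DWORD rotation
        _mm_store_si128((__m128i*)chunk, v);               // store back
    }
}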

hyc commented 6 years ago

@ipbc-dev All that's needed to solve this is an ASIC designed to execute your 255 instruction opcodes. 255 instructions is a trivially small number to implement.

@zawy12 Your suggestion boils down to a finite state machine. It would require only a small number of logic gates to implement each state, and some other amount of memory.

@SChernykh a 4x speedup for CPUs is nice, but we're looking at ASICs with ~300x CPU efficiency. Spitting into the wind.

https://www.reddit.com/r/Monero/comments/8bbhxq/understanding_network_hashrate_prepost_asics_and/

Let's start with a rx 580 8gb rig and compare it to an asics rig. A 6 gpu rig gets 5000 h/s. The asics rig from bitmain gets 220,000 h/s for the same amount of power usage. You would need 44 - 6 gpu rigs to get the same hashrate.

SChernykh commented 6 years ago

@hyc It's not a speedup for CPUs, it's more of a 4x slowdown for ASICs. I never said it would solve the ASIC problem entirely, but 4x more ROI time for ASICs could actually make them economically useless with a 6 month PoW update cycle. The original ASICs had 1 month ROI, and had 3 months until the PoW change. With ASICs that are 4 times less efficient, they will never reach ROI before the PoW change, or at least it will be questionable.

zawy12 commented 6 years ago

@hyc ASIC gains depend exclusively on having a particular algorithm that can be optimized so that calculations can be skipped. Very simple recursive algorithms that produce pseudo-random output can't be optimized because they already are optimized. The extent to which an algorithm produces non-random output is the extent to which it can be optimized, but optimization beyond a certain point requires human tinkering. Before that point, a CPU can figure out the optimization before anything is sent off to a fabrication lab, so pseudo-random, or even being some distance from random, may not be a problem. But the general class of on-the-fly generate-able algorithms should not have a significant, demonstrable general pattern to them that could then be encoded as an optimization in an ASIC.

What I'm saying is the answer. Everyone's been thinking more RAM is the key, and it has a side marketing benefit of potentially not requiring as much electricity. But I'm saying there is a great route here that specifically tries to waste electricity. That provable waste is the underlying value of non-fiat coins like gold (see Szabo's writings, Dai's b-money comments about what he was trying to achieve, and attempts to make currency equal to Joules). Recursive output from simple algorithms can't be optimized. I read Wolfram's "A New Kind of Science" and know the basic theoretical limits of computing, which is how I know ASICs and FPGAs are not going to help.

I'm not sure your change-the-algo-per-nonce method is better than the per-block method, and it might not be good if a miner can identify harder algorithms and just skip them (lending itself to an exploit by ASICs), but it may be needed to average out the difficulty miners face in the series of recursive algorithms. I think per block might be better, and it would force us to find a class of algorithms that have the same per-recursive-step computation time. This tightness might also force us to end up with something more similarly random, and more easily provably unfit for ASICs.

Each single bit transition in a computing element like the output of a NAND gate requires at least k*T*ln(2) joules of energy. Our devices are many orders of magnitude above this Landauer limit, largely depending on how small the gates can be. CPUs and GPUs have smaller gates than ASICs and FPGAs, or at least ASICs and FPGAs are not smaller.

Wolfram and others have proven you can't reduce certain very simple recursive algorithms in a way that eliminates the need to recursively go through all the steps. ASICs will have to burn more electricity than cell phones to do the same computations.

zawy12 commented 6 years ago

@hyc Elementary cellular automata are not necessarily finite state machines and can be Turing complete, as Wikipedia says. I was thinking of automata that potentially use more than just the 2 adjacent cells (the definition of the elementary type), so that the output is more likely to be pseudo-random. One algorithm per nonce might be required because, instead of trying to create an FPGA per block, you could create a lookup table to fill RAM and use that instead of computing. But that can't be done with a change every nonce on a large general class of algorithms.

Or maybe it could be done with 1 algorithm per block like this: make it start with 256 bits (cells). A hash of your nonce would be the random number that creates that initial condition. That way you can't build a lookup table for the algorithm. If it's random enough, you won't be able to reduce the logic for programming into a FPGA in the time span of a block.

If a very general class could be created that has the same amount of computation per step, then you would have a fixed number of steps to the answer which would be hashed to get your target. Just as an interesting point, maybe the difficulty could change that number of steps instead of making the target hash lower.
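A sketch of the kind of fixed-work step I mean: a 256-cell, rule-30-style cellular automaton whose initial state comes from a hash of the nonce; the specific rule and wrap-around boundary are just illustrative choices:

#include <cstdint>
#include <array>

// 256 cells packed into four 64-bit words. One step costs a fixed, known
// amount of work, so the total work is just (steps x cost-per-step).
using State = std::array<uint64_t, 4>;

// Rule-30-style update: new_cell = left XOR (center OR right), applied to all
// 256 cells at once with word-level shifts and a circular boundary.
State ca_step(const State& s) {
    State left, right, out;
    for (int i = 0; i < 4; ++i) {
        // neighbor bits to the left/right, carrying across word borders
        left[i]  = (s[i] << 1) | (s[(i + 3) % 4] >> 63);
        right[i] = (s[i] >> 1) | (s[(i + 1) % 4] << 63);
        out[i]   = left[i] ^ (s[i] | right[i]);
    }
    return out;
}

// Run a fixed number of steps from a hash-derived initial state.
State run_ca(State init, uint64_t steps) {
    for (uint64_t i = 0; i < steps; ++i) init = ca_step(init);
    return init;
}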

hyc commented 6 years ago

@zawy12

ASIC gains depend exclusively on a particular algorithm that can be optimized where calculations can be skipped. Very simple recursive algorithms that produce pseudo-random output can't be optimized because they already are optimized.

This is only part of the story. ASICs are also faster because their memory subsystems don't require complex caches and cache coherency protocols that general purpose CPUs require. You're only looking at the theory, but we're dealing with real world practice where other considerations apply. A state machine implemented in an ASIC can certainly be made to iterate through a set of states far faster than a CPU could be programmed to do. It's not just about a single specific algorithm, it's about all the auxiliary support that does (or doesn't) have to be put in place.

When you toss around mention of "Turing completeness" remember that a Turing machine is just a device that reads or writes one memory cell at a time on a very large ("infinite") memory tape. It is trivial to implement in hardware for a given size of memory. We already know that depending on "xxMB is too expensive to implement in ASIC" as a defense is inadequate.

Whenever you focus in on simple state machines, you're reducing to a Turing machine and the only hardness is how much memory it will need. My approach here is specifically focused on making the algorithm more complex, because (a) while all algorithms can eventually be reduced to something executable by a Turing machine, the number of steps needed to perform that reduction will drastically impact its runtime and (b) increasing the algorithm complexity increases difficulty far faster than just increasing memory requirement.

zawy12 commented 6 years ago

Again, the primary goal is to not allow more than minimal memory to be helpful. The only thing that can be fed into the processor is mainly the previous output of the processor, i.e. a heavy reliance on unpredictable recursiveness instead of memory. I mentioned Turing completeness the second time because you stated it was a state machine, which is not Turing complete. Turing completeness is not needed, just something complex enough to have output that can't be easily predicted. We're talking about changing the "state machine" every block or nonce, so the ASIC or FPGA can't be created ahead of time (to execute an optimization of the state machine). (I am not talking about a finite state machine anyway.)

To say it another way, modeling of complex differential equations has to be done one recursive step at a time like this. An ASIC or FPGA could be designed for a given model. But if the model is changing all the time, they can't help.

hyc commented 6 years ago

An ASIC or FPGA could be designed for a given model. But if the model is changing all the time, they can't help.

Yes, in this respect we're saying the same thing.

(I am not talking about a finite state machine anyway.)

Beside the point. The cellular automaton you're talking about is a trivially small set of logic states. All it does is transition from one state to another based on the current input. My point is we need a much larger set of states to transition between.

zawy12 commented 6 years ago

Why are more than 2^256 states per cycle (per recursion) needed? (maybe 1,000,000 loops per hash)

hyc commented 6 years ago

All you've described is essentially a compute engine with 4 instructions in its instruction set and a 256 bit program. It is ridiculously trivial to implement.

zawy12 commented 6 years ago

Trivial to implement is core to my goal, not a problem. Do you mean it's trivial to optimize?

myhash = ffx32                       # start with the worst possible hash
output = hash( block header )        # initial input for the first function
until ( myhash < target ) {
   nonce++
   j = hash( previous block hash + nonce )
   for ( i = 1 to 100 ) {            # 100 different generated algorithms per nonce
       j++
       function = generate_function( hash( j ) )
       for ( k = 1 to 100,000 ) {    # recursive inner loop: output feeds back in as input
           input  = output
           output = function( input )
       }
   }
   myhash = hash( output )
}
if ( myhash < target ) { claim my block }

The ending j and k need to be included in the header for validators, and j checked for being a reasonably small multiple of 100 above hash(previous block hash + nonce). Reasonably small would be < 1000x the expected hashes per block needed by the largest miner (assuming that miner has 10x the rest of the network).

Our goal is a class of functions that is reasonably unpredictable in logic, but predictably difficult to optimize and predictably nearly equal in compute time (for a given "CPU") across all the functions in the class.

hyc commented 6 years ago

How is what you describe in any way different from Cryptonight, which has already been broken?

zawy12 commented 6 years ago

You mean other than not using read-writes to random memory and changing my function every hash instead of never?

zawy12 commented 6 years ago

OK, now I see the why of your question. I just inserted a missing part in my pseudocode.

hyc commented 6 years ago

I think you're still missing the point. One of the key concepts you seem to be missing is that operations (instructions, code, whatever you want to call it) are interchangeable with data. Cryptonight was "memory hard" because it accessed RAM in random order. I.e., the sequence of steps that accessed the data was unpredictable. This unpredictability didn't prevent ASICs from being developed, because the sequence of instructions required to execute the random pattern was known in advance. (And the instructions themselves were trivial to hardwire.)

Your trivially simple cellular automaton is supposedly going to execute a random sequence of instructions, which you believe an ASIC cannot execute because the sequence can't be known in advance. This is fallacious, because the sequence of instructions you're proposing to generate is merely data. The rules that your automaton operates under are the actual instructions that matter, and again, they're fully known in advance. And the set of instructions you're proposing is tiny, which means they can be implemented trivially in hardware, and their execution efficiency will still outstrip any CPU's.

The reason I've proposed generating code in Javascript is because the set of operations is non-trivial, and executing a Javascript program will require an ASIC developer to implement an actual CPU on their chip.

LordMajestros commented 6 years ago

A CPU for every single ASIC chip. This should work if it is feasible. How far along are you on this? @hyc

hyc commented 6 years ago

@LordMajestros I already posted a proof of concept (https://github.com/monero-project/monero/issues/3545#issuecomment-379873084); you can try it out yourself.

zawy12 commented 6 years ago

@hyc

The rules that your automaton operates under are the actual instructions that matter, and again, they're fully known in advance.

They're not known in advance because I am changing the "automata" algorithm 100 times per nonce. See my pseudocode above. I thought of automata because I knew off-hand that some classes can't be solved for "N loops" into the future without having already run the recursion itself. Then I remembered that most real-world differential equations are the same. My pseudocode is more general than either. The recursiveness was needed to address the problem of unknown execution time. Otherwise, I'm not arguing for something different from your idea. I'm trying to make it work and optimize it for cell phones.

Cryptonight's algorithm is known in advance, but ours are not. My smaller instruction set is optional, but I wanted to pick instructions that are more efficient on RISC cell phones than on CPUs and GPUs. I didn't want complex instructions that are not implementable directly and completely on a RISC chip. A simpler instruction set may also enable better control of execution time and theoretical characterization of what the results will be.

To try to address your complaint about a simpler instruction set, suppose I have 5x more combinations of instructions than your scheme but use only 4 instructions. Then I would have 4^5 = 1024 advanced instructions in each group of 5 instructions. Even 1 instruction is feasible: all javascript is expressible in NAND (or NOR) instructions in a specific type of arrangement.

Let's say I use only the NAND gate (a simple binary if-then statement). My output every 1,000 loops will be hashed to get a new set of NAND gate combinations out of a possible 2^256 combinations. There could be more if needed. An FPGA, ASIC, or GPU program would have to figure out a simplified version of the NAND-gate arrangement 100 times per nonce. This is possible if the inner loop of 1,000 is too many loops, so there's an upper limit on it. Also, the algorithm-generating function loop of 100 has an upper bound: it's a slow algorithm that could be optimized by an ASIC or GPU programmer, so the time to generate the next algorithm must not be a large fraction of the time to do the inner loop.

A larger instruction set has the same problems to the exact same extent if I use log2(javascript instructions) more combinations of NAND gates. Both schemes seem to need some bound on how the instructions can be combined in order to get approximately reasonable execution, but that should not change this point.

I found the need to change your scheme by making it change the algorithm 100 times per nonce in order to get an averaging-out of solve times to be approximately the same, while preventing ASIC and GPU programmers from skipping nonces that give apparently slow algorithms. Generalized loops and skipping sections of code within our created algorithms are not feasible because they create an unknown completion time, which is a deal breaker. I have an interior loop of 1,000 to keep the created algorithm 1000x smaller, but this is just to make sure no one benefits from larger memory. This is for cell phone efficiency per instruction while negating the benefit of a large CPU cache.
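A toy version of the NAND-only idea: a 256-bit seed selects which pairs of state bits feed each NAND, and each pass's output becomes the next pass's input; the wiring scheme and sizes are my assumptions, purely for illustration:

#include <cstdint>
#include <array>

// State: 256 bits in four 64-bit words.
using Bits256 = std::array<uint64_t, 4>;

static bool get_bit(const Bits256& s, unsigned i) { return (s[i >> 6] >> (i & 63)) & 1; }
static void set_bit(Bits256& s, unsigned i, bool v) {
    if (v) s[i >> 6] |=  (1ULL << (i & 63));
    else   s[i >> 6] &= ~(1ULL << (i & 63));
}

// One pass of a NAND network whose wiring is chosen by the seed: output bit i
// is the NAND of two input bits whose indices are derived from the seed and i.
Bits256 nand_pass(const Bits256& in, const Bits256& seed) {
    Bits256 out{};
    for (unsigned i = 0; i < 256; ++i) {
        uint64_t w = seed[i & 3] ^ (0x9E3779B97F4A7C15ULL * (i + 1));  // cheap index mix
        unsigned a = (unsigned)(w & 255), b = (unsigned)((w >> 8) & 255);
        set_bit(out, i, !(get_bit(in, a) && get_bit(in, b)));          // NAND
    }
    return out;
}

// Recursive inner loop: feed each pass's output back in as the next input.
Bits256 run_nand_loops(Bits256 state, const Bits256& seed, unsigned loops) {
    for (unsigned n = 0; n < loops; ++n) state = nand_pass(state, seed);
    return state;
}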

hyc commented 6 years ago

I'm skeptical about skipping nonces - they have no way to know in advance whether any particular nonce will yield a hash that satisfies the target. Skipping a nonce may deprive them of a win.

Yes, all instructions can be boiled down to combinations of a single gate type, but it costs a larger number of gates to do so, and the number of gates a design requires directly determines its cost. The more complexity, the more gates used, the more expensive the ASIC design.

tevador commented 6 years ago

@hyc I tested your code and I think it's an interesting concept. Any idea how the ratio of CPU:GPU mining performance would be affected?

My proposal is basically a simpler version of yours, having a lower level language in which random code is generated. Yours has the advantage of a mature VM.

My concept is almost finished. The VM has 3 registers, 4 MiB of memory and can execute 128 instructions. I'm not sure if it's worth developing further (I don't have any test code yet).

hyc commented 6 years ago

@tevador I really have no clue how well a GPU will handle this. Projects like this exist (http://gpu.rocks/), but they're aimed at running highly parallelizable functions on the GPU, and none of the randomly generated code is really vector oriented.

As you'll have seen from my debate with @zawy12 I'm of the opinion that a simpler VM is a weakness because it becomes too easy to implement in hardware. But we have to balance that against @zawy12's point that it makes future optimization less likely.

tevador commented 6 years ago

The problem is that if this algorithm ends up being only CPU-mineable, monero can lose a majority of miners, who form a large part of the community. It would also incentivize malware mining.

That's why I designed my algorithm to be GPU-mineable.

zawy12 commented 6 years ago

Skipping nonces: what I'm saying is that I don't think you have bounds on the time-difficulty of the generated algorithms, but that an ASIC could be designed to quickly recognize if an algorithm is going to be difficult to solve, so they'll skip it.

I can't see if you still disagree. In reference to ASIC complexity, as I was saying, I would just use the smaller set of instructions a larger number of times to equal your larger set of instructions with fewer steps. I want to send it assembly meant for ARM processors and for it to be as difficult as your javascript. We can go so far beyond what an ASIC can predict: I'm trying to cheat by redirecting its output to its input, enabling "only" 2^256 different functions on random 32 byte seeds.

I think the general method of a small instruction set is the only one that can have predictably similarly-difficult functions, which, as I'm saying, seems absolutely necessary. With, say, 4 instructions (they might just be well-chosen bit operations), it seems possible to know what average amount of computation they have, and to prove the 4 selected will continually add complexity rather than decaying towards patterns like all 0's or all 1's.

It's not that I know a larger instruction set won't work. The problem is that I view the task of getting similar execution time to be less daunting with fewer instructions. BTC script had to exclude loops and other things for the same reason, and still many scripts can go bad. Fewer instructions might be useful in proving the algorithms are well-behaved.

I think not only limiting the types of instructions is needed, but probably also limiting how they can combine. This increases the risk of an ASIC optimization of the general class of algorithms you produce.

+, -, /, and * create something wider in scope than power series, which can approximate any function. But maybe it should be required to follow the rules of power series. Actually, power series are recursive like automata and DEs, so this fits in well with my desire that the only loop be external to the "function". I've been using "function" very unmathematically to mean "algorithm of any type", since I don't have those details worked out.

@zawy12's point that it [fewer instructions] makes future optimization less likely.

I don't think I said that. I don't view my scheme as more difficult for ASICs to "hack"; rather, I have more hope it will help the design process in creating an "algorithm generator" that has well-behaved output and similar execution time.

There's another potential advantage to doing recursive simple steps "on chip". When you have a large algorithm full of a wide range of instructions that are not going to be native to the chip, there are delays from going back and forth to RAM. If you have an instruction set of, say, 256, and if they all need to have 64 bit inputs and 64 bit outputs and follow each other sequentially (in order to be well behaved), then a lookup table 3 GB in size (or much less with compression) could be used instead of actually computing each sequence of 4 instructions. There may be ways around this, like shifting input data by 1 to 4 bits before each instruction based on something unpredictable, or, similarly, giving every instruction two inputs, where one is the output of a previous instruction and the other is seed data.

Fewer instructions have the same problem with this, but I'm hoping a chip can go through more instructions than can be pre-implemented in a lookup table, faster than the time for a lookup-memory transfer in an ASIC. I probably need to use as many on-chip instructions as possible, supporting @hyc's viewpoint. But my reasons for wanting to keep a limit on instructions remain.

GPUs as well as CPUs can use a lookup table and may not be improved upon by ASICs, but I feel like this is a problem that should be avoided if possible, and my starting point (same-size inputs and outputs going from 1 instruction to the next) is seductively simple in terms of getting constant compute time and good behavior implemented.

If one of the two ways to foil lookup tables works, an ASIC may still implement the actual logic of every hard-to-compute pair, triple, and maybe quadruple sequence of complex instructions. It can't do every pair due to the large instruction set; just implementing all non-redundant pairs of 256 instructions could be difficult. It would have a lot more flexibility in exploiting sequences of fewer instructions, so that's a +1 for @hyc. I mean it could easily optimize and implement every 5-instruction sequence of a 4-instruction set, naively giving it a 4x speedup, if it can be done as fast as on chip. So I'll say I'll make the instruction set as large as I can, as long as it stays on the chip, and as long as I can make the instructions predictable in computation time and otherwise well-behaved. So maybe I'm in tevador's ballpark.

My pseudocode (which I've been changing all day as I realized problems) applies as well to large sets as to small: I think having the recursion outside the algorithm, as the only loop allowed, is a requirement for this to work. Without it, the algorithm will have to be a lot more complex, which means a lot of complicated memory usage, which means too much room for GPU optimization. It's not that I'm that strongly against GPUs, but I don't like the long, drawn-out process of GPU devs slowly increasing hashpower for a percentage. It makes me think that if GPU optimization can be done, there's a loophole for ASICs. Assembly code doable solely on the chip seems like it has the nice possibility of never becoming optimizable.

If the algorithm generator can make algorithms with solve times within +/- 5%, then my 100 algos per nonce is not needed, and 1 algorithm per block might be sufficient. I don't think you can get a routine to optimize the algorithm and automatically reprogram an FPGA that fast. But since GPUs could, I might still prefer the per-nonce method.

hyc commented 6 years ago

GPU support is looking pretty bad, from what I can see. I think most of the work winds up being done on the host CPU. The generator could be translated into OpenCL, no problem. But we'd need something to transform the generated code into OpenCL as well, and the OpenCL compiler only runs on the host. I don't know any portable way to transform the generated code into GPU-native code on the GPU itself. (You want to use a vendor-supplied GPU compiler, because it knows all the device-specific details of each target.) But this isn't really my specialty, maybe someone else knows how to handle this.

Or: someone ports the entire VM into OpenCL, and the GPU just interprets the generated code. I suppose that's fine.

joijuke commented 6 years ago

@yinwang0 can you join the discussion?

ipbc-dev commented 6 years ago

@hyc why do you want to use a GPU? Think about devices without a GPU. @zawy12 please take a look here: https://github.com/ipbc-dev/TuringsNightmare/blob/master/TN_Explanation.md

Gingeropolous commented 6 years ago

I wouldn't worry about GPUs at this point. I'd say get a working thing up and running, and then do what tromp did with Cuckoo Cycle: claim that it's not GPU-mineable, slap a bounty on it, and see what happens.

Miner developers will find a way to mine on GPUs... that's sort of a given.

tevador commented 6 years ago

I did some research on the possibility of running general purpose code on a GPU.

Firstly, all AMD GPUs before the GCN architecture (everything older than the Radeon 7700 Series) do not support non-inlined function calls or recursion, so you could not even compile the javascript VM for these architectures (the compiler gives an "irreducible control flow" error).

CUDA and GCN graphics cards seem to have the necessary hardware support to actually run a full javascript (or any other) VM, although someone might have to write a new compiler for that.

The problem is that although these modern GPUs could theoretically run a full virtual machine, they lack support for speculative execution and branch prediction, so their performance per clock would be at best at the level of the first-generation Intel Atom CPUs. Combined with the fact that GPUs typically run at about half the core frequency of a modern CPU, their "per-core" performance would be about 6-7 times lower than a CPU core. This estimate is based on the Passmark CPU benchmark scores of the Intel Atom 230 vs the AMD Ryzen 1700, and it's a conservative estimate because this Intel Atom has a core frequency higher than most GPUs (1.6 GHz) and supports branch prediction. In reality, GPUs can be closer to 10 times slower.

To give a real example, the AMD Radeon RX 570 graphics card has 32 compute units (CUs). Each compute unit has a separate instruction scheduler, so they can be considered like "cores" on a CPU, since each CU could run a separate mining thread. The estimated mining performance would be similar to 3-5 CPU cores, or about half the performance of a Ryzen 7 1700, which costs 30% less than the GPU.

In conclusion, the performance per dollar of GPU mining would be around 3 times lower than CPU mining. I think this is enough to make it unprofitable.

Sources: https://community.amd.com/thread/167912 https://community.amd.com/message/1298840#comment-1298840 http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf https://en.wikipedia.org/wiki/Intel_Atom

zawy12 commented 6 years ago

Unless botnets become a problem, I don't think CPUs being competitive with GPUs is a drawback. It has historically been considered ideal for CPUs to be more competitive than GPUs, so that the network can remain diverse and therefore less subject to attack.

LordMajestros commented 6 years ago

GPU miners attach several GPUs to a motherboard; it is more difficult to scale this way with CPUs. I don't see the problem, unless of course botnets become a bigger issue.

hyc commented 6 years ago

@tevador thanks for digging into this. It might be acceptable to shut out pre-GCN GPUs. There's another aspect of this, a full JS interpreter will be a few MB of code itself. Lower end GPUs wouldn't have the memory for it anyway. (Most smartphone GPUs already lack the memory for CryptoNight so this isn't very surprising.)

hyc commented 6 years ago

(For the record, the spaghetti-code function in https://community.amd.com/thread/167912 can be rewritten equivalently as this:

void qwe() {
   int CUR = 2;
   int i=0;
   while (i<300) {
      i++;
      if (CUR > 0) {
         CUR--;
         continue;
      }
      break;
   }
   while (CUR <2) {
      CUR++;
   }
}

Pretty much any use of gotos can always be rewritten in structured form. If this was the only obstacle, that'd be OK. Probably not the only obstacle of course.)

zawy12 commented 6 years ago

@LordMajestros Does or can the motherboard build the assembly code for the CU and pass it off for on-the-CU processing? If not, then interrupts from all the CUs to the motherboard would slow it down to the speed of 1 CPU.

If botnets are a major concern and GPUs are preferred, then the simple instruction set could be optimized for GPU CUs. Then, for an ASIC to beat a GPU, it would basically have to be a larger set of CUs than a GPU. But ASICs will not be able to do it as electricity-efficiently as mass-produced GPUs that have smaller etching lines. Even if cell phones and CPUs can use the electricity more efficiently per hash, they can't scale, so if price rises faster than cell and CPU power comes online, GPUs would fill the gap. This is with an instruction set and algorithm size that can stay "on-chip".

Because of the lookup table problem, I would use the largest instruction set that can stay on chip. The sequence of instructions must have a reliable computation time, so I would group the instructions by execution time within +/- 10%. Let's say I have 256 instructions whose speeds fall into 16 different groups, and I selected the possible instructions based on this requirement so that there are 14 to 17 instructions in each group. The generated algorithm per nonce would be a sequence of 32 instructions drawn from the 256 possible, with only 2 instructions from each of the 16 similar-execution-time groups allowed in each algorithm. A 32-instruction algorithm would not need to access off-chip memory. This is hopefully complex enough that a GPU can't reduce it to a much simpler equation and feed the CUs the simpler code before the original code finishes and goes to the next nonce.

The 256 bit random number needed to select them would be the previous block's hash. A hash of that hash would provide the seed. The actual bit-size of the seed could be bigger or smaller. Each instruction would operate "nicely" on inputs; by this I mean its output must be the same size and not reduce the input's entropy too much. Each instruction might be a pair of instructions to meet the entropy criteria. For example, if you use a 256 bit input with an addition instruction, you might split the input data in half and add the halves, ending up with a 129 bit number and 127 leading zeros, reducing the input's entropy.

Such a simple algorithm requires the inner loop above, which just uses the algorithm's output to create the seed for the next iteration. The outer loop is not needed (since I'm keeping a tight constraint on solve times), but it could be used if the constraints can't be as tight as I specified.
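A sketch of the instruction-grouping idea above: 256 candidate instructions pre-sorted into 16 similar-execution-time groups, with the previous block's hash picking 2 from each group to form a 32-instruction program; the group sizes and selection rule are assumptions:

#include <cstdint>
#include <array>
#include <vector>

// Assume 256 candidate instruction IDs, sorted offline into 16 groups of 16
// by measured execution time (each group within roughly +/- 10%), so that
// instruction id belongs to group id/16. Build a 32-instruction program from
// a 256-bit block hash by picking 2 members from every timing group, keeping
// total runtime roughly constant regardless of which program is generated.
// (A real generator would also permute the final order of the 32 picks.)
std::vector<uint8_t> build_program(const std::array<uint64_t, 4>& block_hash) {
    std::vector<uint8_t> program;
    program.reserve(32);
    uint64_t r = block_hash[0];
    int used = 0;                                       // 4 bits of entropy per pick
    for (uint8_t g = 0; g < 16; ++g) {
        for (int pick = 0; pick < 2; ++pick) {
            uint8_t within_group = (uint8_t)(r & 15);   // which of the 16 group members
            program.push_back((uint8_t)(g * 16 + within_group));
            r >>= 4;
            if (++used == 16) {                         // refresh entropy from another hash word
                r = block_hash[g & 3] ^ 0x5A5A5A5A5A5A5A5AULL;
                used = 0;
            }
        }
    }
    return program;
}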

We do not have the expertise to prove it would be ASIC resistant, but maybe there is enough knowledge here to turn it into a functioning PoW. If my equal-solve-time criterion is met, I can get a CryptoNote clone or two to implement it live.