monero-project / monero

Monero: the secure, private, untraceable cryptocurrency
https://getmonero.org

Idea for ASIC resistance #3545

Closed zawy12 closed 3 years ago

zawy12 commented 6 years ago

If ASICs are going to be a recurring problem, can you change the POW to maybe 5 or 10 different options every block based on the hash value of the previous block? That way the software could be changed in GPUs whereas ASICs would hopefully require different hardware for each POW if they are logically and/or mathematically independent enough. Even better would be if a POW could change a single variable to require a different ASIC, then the previous hash would be used to set the variable for the next block.
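(For illustration, the per-block selection described here could be sketched as follows; everything in this snippet, including the placeholder PoW functions, is hypothetical.)

```javascript
// Hypothetical sketch: pick one of several logically independent PoW
// functions using the previous block's hash. The variants below are
// meaningless placeholders standing in for real, independent PoWs.
const powVariants = [
  x => (x * 31) % 1009,
  x => (x ^ 0x5bd1) >>> 1,
  x => (x + 977) % 65537,
];

function selectPow(prevBlockHashHex) {
  // Use the low byte of the previous block hash as the selector.
  const index = parseInt(prevBlockHashHex.slice(-2), 16) % powVariants.length;
  return powVariants[index];
}
```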

tevador commented 6 years ago

One more concern is determinism. The same JS program/JS engine combo must produce exactly the same results on all supported platforms (x86, ARM), OSes and compilers in order to be used for hashing. It's very hard to achieve this, especially with floating point calculations enabled in generated programs.

Yes, this must be taken care of. However, the ECMA specification requires floating point math to conform to IEEE 754, which in turn requires bit-exact rounding for basic operations (+, -, *, /). Operations with inexact results (sqrt, log, exp, etc.) must be handled in code by rounding manually. We have to decide the required precision that is supported by most platforms (it's one of the parameters of the generator).

AFAIK there was one bug in Chrome related to floating point precision, and it has been fixed. V8 now requires SSE2 support, which means it will not run on CPUs older than the Pentium 4 or Athlon 64 (15+ year old CPUs).

For example, positive feedback loop that increases rounding error exponentially. This can happen in random generated code. Existing PoW/hash functions don't use floating point for a reason.

This is not a problem as long as each intermediate result is correctly rounded (as required by IEEE 754). Then everybody will arrive at the same result (even if the value is "wrong" compared to a theoretical infinite-precision calculation).

Not only that: even future versions of the same engine (i.e. corrected for bugs or security issues) can change results.

Yes, this can happen. I guess if we decide in the future to update the V8 version, we will have to scan the whole blockchain with the new version to see if everything validates. If not, both versions will have to be included and the switch will happen at a predetermined block height.

I mean, if you aren't going to use more advanced functions. You'd have to apply the above to every function like SQRT and whatnot.

Currently I use this function in RandomJS:

function __prec(x) { return +x.toPrecision(__fpMathPrec); }

where __fpMathPrec is a constant (I tested values between 10 and 14). The maximum precision of a 64-bit float is 15-17 significant decimal digits.
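(A self-contained usage sketch of this helper, assuming __fpMathPrec = 12, one of the tested values:)

```javascript
// Usage sketch of the rounding helper above, with __fpMathPrec = 12.
const __fpMathPrec = 12;
function __prec(x) { return +x.toPrecision(__fpMathPrec); }

// ECMAScript allows Math library functions to be implementation-
// approximated, but truncating to 12 significant digits masks any
// last-bit differences between engines.
const rounded = __prec(Math.sqrt(2));
```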

SChernykh commented 6 years ago

@tevador Does the ECMA specification define a strict order of floating point operations for all cases? I'm thinking about compiler optimizations like `a*b + a*c -> a*(b+c)` which can change the result. Even if the V8 engine applies it the same way every time, different C++ compilers on different platforms will compile V8 and its floating point internals differently, giving unpredictable changes. So again, in theory it's all fine, the IEEE 754 and ECMA standards are respected, but actual tests are needed.
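(For illustration: floating point addition is not associative, so the kind of reassociation described above really does change bit-exact results.)

```javascript
// Reassociating a floating point expression changes the bit-exact result:
const sumLeft = (0.1 + 0.2) + 0.3;  // 0.6000000000000001
const sumRight = 0.1 + (0.2 + 0.3); // 0.6
// Equal in real arithmetic, unequal as IEEE 754 doubles.
```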

tevador commented 6 years ago

@SChernykh IEEE 754, chapter 10.4

A language standard should require that by default, when no optimizations are enabled and no alternate exception handling is enabled, language implementations preserve the literal meaning of the source code. That means that language implementations do not perform value-changing transformations that change the numerical results or the flags raised. A language implementation preserves the literal meaning of the source code by, for example:

  • Preserving the order of operations defined by explicit sequence or parenthesization.
  • Preserving the formats of explicit and implicit destinations.
  • Applying the properties of real numbers to floating-point expressions only when they preserve numerical results and flags raised:
    • Applying the commutative law only to operations, such as addition and multiplication, for which neither the numerical values of the results, nor the representations of the results, depend on the order of the operands.
    • Applying the associative or distributive laws only when they preserve numerical results and flags raised.
    • Applying the identity laws (0 + x and 1 × x) only when they preserve numerical results and flags raised.

So I think it should be safe as long as some exotic compiler flags are not used (such as -ffast-math in GCC).

Anyways, I agree that we need to test the algorithm carefully on at least the most common platforms.

SChernykh commented 6 years ago

I've tweaked and fine-tuned my Cryptonight modifications, making them more robust and harder for ASICs/FPGAs to crack. Also added a short description to README.md: https://github.com/SChernykh/xmr-stak-cpu

tevador commented 6 years ago

I tested RandomJS on a Raspberry Pi (armv7l) and it gives the same hashes as the x64 platform (except it runs 10x slower than a Core i5 laptop, which is expected).

baryluk commented 6 years ago

A language standard should require ...

Notice the word SHOULD. It is not a guarantee, and there is no mandate that JS implementations follow this guideline either (this is not C or FORTRAN). It is common even for C programs to violate IEEE 754, for example by using the FMA operation, which often has a smaller error than a separate multiply and add. In fact, IEEE 754-2008 mandates single rounding in that situation, which is an indirect violation of the previous standard.

hyc commented 6 years ago

ECMA spec says all numbers are IEEE 754-2008. It says that many functions in the Math library may return approximations, but it does not allow that for the standard arithmetic operators.

By the way, I've started looking at using MuJS instead of v8 - it's a smaller, simpler implementation. Might be more suitable for a reference implementation. https://artifex.com/mujs/

hyc commented 6 years ago

Fwiw I'm not as optimistic as @tevador about using Math.* functions and manually rounding the results. We'd have to insert rounding statements at every invocation, to ensure deterministic roundoff error across implementations. Doable, but annoying.

tevador commented 6 years ago

@baryluk Feel free to search for a platform which gives a different result: https://github.com/tevador/RandomJS/issues/3

By the way, I've started looking at using MuJS instead of v8 - it's a smaller, simpler implementation. Might be more suitable for a reference implementation. https://artifex.com/mujs/

Is there a performance difference compared to the V8?

It seems to be an ES5-only implementation, so it's unusable for my ES6 generator.

zawy12 commented 6 years ago

I'm thinking about compiler optimizations like `a*b + a*c -> a*(b+c)` which can change the result.

Good point. It seems something like my ceil() method is required (and not only for divisions like I was thinking). Effort would need to be made to make sure the rounding is done very efficiently compared to the scope it's protecting, or it would itself become a target for optimization.

tevador commented 6 years ago

I have committed a draft of the RandomJS generator documentation.

It would be best if someone could review it before I start implementing the generator in C++. I'm sure there is some room for improvements and I'd like to hear your comments.

BTW I'm planning to use this JavaScript interpreter for the reference implementation. It's a lot smaller than Chrome's V8 and has (almost) full support of ES6 (it's a fork of the KinomaJS engine).

SChernykh commented 6 years ago

Thanks, I'll have a look at the documentation.

That JS interpreter is rather new - the first commit is from October 2017 and there's been a lot of active development in it. It may contain a lot of bugs. Did you test it for "hashing" compatibility with V8?

tevador commented 6 years ago

@SChernykh they cloned the engine from here: https://github.com/Kinoma/kinomajs You can see it in issue 28 where one dev explains the original repo is no longer maintained. I'm not sure why they didn't keep the commit history, though.

I'm planning to compare the results to the V8 and also make some performance comparison.

baryluk commented 6 years ago

So, if the generated code is used in a prototype, but it doesn't actually make heavy use of JavaScript-specific functionality (libraries, prototypes, etc.), there seems to be nothing stopping me from rewriting the prototype to generate equivalent Lua code or Java bytecode that produces the same result, and running that instead: faster, with less memory, etc. Am I right?

baryluk commented 6 years ago

@tevador Oh, I see your EvalExpression with random content. That Is Evil.

tevador commented 6 years ago

@baryluk Yes, there are two things that will make that approach harder:

  1. The EvalExpression (with the default settings, there are about 60 of them in each program on average).
  2. The hash of the reference source code is part of the PoW, so you have to generate it anyways.

moneromooo-monero commented 6 years ago

The remaining ~77% is a SyntaxError.

Does that mean that a miner might choose to always claim a syntax error (IIRC returning just "SyntaxError" + thatstring) to avoid the load of processing this eval, at the cost of 23% of the hashes being incorrect? Whether it's a good choice depends on how much time that eval code takes compared to a typical whole hash.

moneromooo-monero commented 6 years ago

Also, this PoW is actually useful beyond PoW as a large scale fuzzer for javascript implementations. I'm starting to like it a bit more now :)

SChernykh commented 6 years ago

@moneromooo-monero An average random program does about 60 evals, so there is no way to avoid it.

In the meantime, some guys claim they achieved 14 KH/s @ 150 watts on FPGA for Cryptonight V1 and are going to send the first batch in August: https://bitcointalk.org/index.php?topic=3688965.0

hyc commented 6 years ago

@moneromooo-monero considering that the original randprog was used as a fuzzer for C compilers, that's not surprising

tevador commented 6 years ago

@moneromooo-monero

Does that mean that a miner might choose to always claim a syntax error (IIRC returning just "SyntaxError" + thatstring) to avoid the load of processing this eval, at the cost of 23% of the hashes being incorrect? Whether it's a good choice depends on how much time that eval code takes compared to a typical whole hash.

I was aware of this strategy, that's why I tried to find a character set which produces the lowest possible amount of SyntaxErrors.

Anyways, your comment made me run the numbers. I tested a simple regex.Replace to turn all EvalExpressions into string literals as if they all produced a SyntaxError. Using the default generator options, about 61% of programs have the same output as with the original code and the optimized code runs ~36% faster.

So effectively, a solo miner would reduce their chance of finding a valid block by ~17% with this optimization (0.61 * 1.36 ~ 0.83). Pool miners would get banned by the pool because of ~39% of invalid shares.
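(The arithmetic behind those figures, for clarity:)

```javascript
// ~61% of programs still produce the correct output with evals skipped,
// and skipping them makes hashing ~36% faster, so the effective valid
// hashrate of the shortcut is the product of the two:
const validFraction = 0.61;
const speedup = 1.36;
const effectiveRate = validFraction * speedup; // ~0.83, i.e. ~17% worse
```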

Before, I was also thinking to include the whole error message in the output. It would completely eliminate this optimization, but it would also force everyone to use the reference engine (or an engine which produces the same error messages).

baryluk commented 6 years ago

@tevador Could you dump somewhere (maybe into test vectors directory) sample of ~100 programs generated with input seeds and hashes, and timing distribution?

The hash of the reference source code is part of the PoW, so you have to generate it anyways.

That is not a problem, just a minor slowdown.

AFAIK because the eval sees very simple strings, a very simplified parser could be used; a full JS parser is definitely not needed. Also, converting it into another language wouldn't be hard. I believe I could still implement equivalent Lua code that would be able to process it.

What would be probably better is to:

1) Rewrite the generator in JS itself, or provide a JS function that calls back into the generator, so that the generated code can dynamically do code2, hash_of_code2 = generator(some_seed, max_depth) and then run it with (new Eval(code2))(additional_input_to_generated_code). The generator would recursively generate new code (possibly with more eval and generator calls, unless max_depth <= 0). some_seed would be a string or integer coming from the current level's execution, and max_depth would be the current level's max_depth minus 1. Generated code should use a mix of local variables, global variables shared with all other levels, and global variables shared only with the current level. Any given piece of generated code would be called exactly once, and each level would make between 0 and 2 calls to the generator (with 0 at the last level).

About 1), this still does not preclude me from doing a Lua implementation. Even if I need to generate code twice, the implementation would be about 100 times smaller than the full V8 engine.

2) To exercise the parser even better, generated code should use more features, including comments and regular expressions, as well as array manipulations, including array methods.

3) Highly recursive calls would be interesting.

4) How many evals with different strings are there on average in a generated program? If it is something like 10+, then you cannot statistically make them all SyntaxErrors and still be correct.

I still think the current implementation is way too heavy and hacky. It poses multiple problems: portability, security, maintenance, hacky handling of integers/floats, ad-hoc addition of a simple eval.

You can achieve all the same with Lua, plus it is much better defined, more portable to more operating systems and machines, and easier on verifiers, i.e. mobile phones. Also, you can easily run 30 Lua engines on one multi-core machine in a single process, without a big strain on memory or impact on other programs on the same machine.

tevador commented 6 years ago

Could you dump somewhere (maybe into test vectors directory) sample of ~100 programs generated with input seeds and hashes, and timing distribution?

For program generation: ./Tevador.RandomJS.exe > sample_program.js

For distribution of runtimes: ./Tevador.RandomJS.Test.exe --count 10000 --verbose --threads 2 (This will generate a histogram of runtimes.)

That is not a problem, just a minor slowdown.

Program generation can take at least 5-10% of one hash, which means all such optimizations start from a negative baseline.

AFAIK because the eval sees very simple strings, a very simplified parser could be used; a full JS parser is definitely not needed. Also, converting it into another language wouldn't be hard. I believe I could still implement equivalent Lua code that would be able to process it.

Yes, a full parser is not needed, but it still adds significant complexity if someone wanted to bypass the parser and generate bytecode directly. Anyways, I challenge you to make a very simple parser that can handle all the random eval strings.

As for your points:

  1. Interesting approach, but I think the generator would need to generate very simple code to meet the runtime targets.

About 1), this still does not preclude me from doing a Lua implementation. Even if I need to generate code twice, the implementation would be about 100 times smaller than the full V8 engine.

The reference implementation will not use the V8. I'm testing a lightweight interpreter for it. Also what matters for mining is the performance, not size.

  2. More features can certainly be added, but the problem is that all code paths must handle all types to keep the output entropy high, so adding more features will be harder and harder. Comments and regular expressions are already included in EvalExpression. If you manage to implement some additional features, you can make a pull request.

  3. Recursive calls already happen. I debugged some programs where a function was passed to itself for a constructor call. These cases are quite common. You can test it by disabling the call depth protection and you will see the programs crashing with call stack errors.

  4. There are about 60 evals per program on average (default ProgramOptions). But not all of them affect the output of the program.

I still think the current implementation is way too heavy and hacky. It poses multiple problems: portability, security, maintenance, hacky handling of integers/floats, ad-hoc addition of a simple eval.

Portability seems good so far. I'm not aware of any security issues. Can you elaborate on the "hacky handling of integers/floats"?

tevador commented 6 years ago

@moneromooo-monero

Also, this PoW is actually useful beyond PoW as a large scale fuzzer for javascript implementations. I'm starting to like it a bit more now :)

Yes, RandomJS already found 2 bugs in the XS interpreter.

SChernykh commented 6 years ago

@moneromooo-monero My latest and greatest shuffle modification: https://github.com/SChernykh/xmr-stak-cpu/commit/9169ef624250e8ab73ec362d7905abcb00ba91a4

Not only does it take advantage of 64-byte wide L1 cache accesses, it also takes advantage of the L1 cache size. This one actually needs to be tested on GPUs, because it also makes 2 times more random memory accesses. GPUs have caches too, so I guess people will eventually figure out how to do it without losing performance.

P.S. CPUs are fine, I've already tested it: only a 2-2.5% slowdown. P.P.S. It also makes use of the 4 least significant bits of the scratchpad index, which were previously unused. Nice!

Gingeropolous commented 6 years ago

@SChernykh, do you have a monerod and pool version of those mods for testing, or can I just run that miner on my GPUs to get hash results?

SChernykh commented 6 years ago

No, just CPU miner for now. It's an early prototype, it needs testing and tweaking. And I also need to make a GPU version of all these modifications.

P.S. You just run it in benchmark mode to test performance.

SChernykh commented 6 years ago

GPU version of the shuffle and division modifications: https://github.com/SChernykh/xmr-stak-amd I've tested it on a GTX 1060 6 GB. The division modification doesn't slow it down at all. The shuffle modification slows down the GPU a lot; it also forced me to lower the intensity. Hashing speed is ~1.8 times slower at the same intensity and ~2 times slower compared to the original Cryptonight running at max intensity. Maybe it's possible to fix this, but I don't know how at the moment.

@Gingeropolous @tevador @moneromooo-monero Can anyone with AMD card test it? Just compile it, play with config.txt settings and run it.

SChernykh commented 6 years ago

I found a way to squeeze two (!) integer square root calculations in addition to the division, without any troubles with rounding and without any additional slowdown! This starts to look very interesting, I'll test it on GPU tomorrow. Both division and square roots are good to fight ASIC because they are implemented on hardware level either as an iterative logic (slow, many clock cycles), or as a pipelined logic (fast, but occupies a lot of space on chip).
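(For illustration of why such operations are hardware-unfriendly: an integer square root is typically computed iteratively, e.g. by Newton's method, with each step depending on the previous one. A minimal sketch, not the actual Cryptonight code:)

```javascript
// Illustrative integer square root via Newton's method (BigInt).
// Each iteration depends on the previous one, which is why hardware
// implementations are either slow (iterative) or large (pipelined).
function isqrt(n) {
  if (n < 2n) return n;
  let x = n;
  let y = (x + 1n) / 2n;
  while (y < x) {
    x = y;
    y = (x + n / x) / 2n; // BigInt division truncates, giving floor(sqrt(n))
  }
  return x;
}
```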

moneromooo-monero commented 6 years ago

I asked hyc to have a look at what this does on ARM (some of them appear to be pretty good at hash/watt).

SChernykh commented 6 years ago

I've added square roots to both CPU and GPU test repos, also updated the description: https://github.com/SChernykh/xmr-stak-cpu https://github.com/SChernykh/xmr-stak-amd

Feel free to test. My tests on Ivy Bridge and Skylake CPUs show 3% slowdown with one division and two square roots per iteration. No slowdown at all on GTX 1060! Awesome.

P.S. It will most likely kill the performance on ARMs which don't have out-of-order execution, and all the energy-efficient ARMs lack it. P.P.S. Only high-performance ARMs like the Cortex A72-A75 can handle it, but they're not as energy efficient and have a relatively large die compared to other ARMs.

tevador commented 6 years ago

@SChernykh My results for AMD Radeon RX 550 on Ubuntu 16.04.

For reference, my cards make 480 H/s on cryptonight-v1 (latest xmr-stak) and never produce invalid hashes.

https://github.com/SChernykh/xmr-stak-amd All tests with intensity 600, worksize 8.

| | MATH_MOD OFF | MATH_MOD ON |
| --- | --- | --- |
| SHUFFLE_MOD OFF | 425 H/s | 170 H/s |
| SHUFFLE_MOD ON | 325 H/s * | 130 H/s * |

* With SHUFFLE_MOD ON, the cards produce some invalid hashes. I tested 6 different cards and all of them produce invalid hashes for the same nonce values, so it looks like a bug in the OpenCL code rather than hardware errors.

SChernykh commented 6 years ago

@tevador Yes, the CPU checking code doesn't have the shuffle mod yet; don't pay attention to those messages. Is it really that bad on AMD cards? I saw no change in hashrate on a GTX 1060, even though this was originally OpenCL code for AMD cards. I'll grab an RX 560 from my friend for testing today.

SChernykh commented 6 years ago

@tevador I've submitted shuffle mod for CPU checking code, you can pull and test again now. There shouldn't be any CPU/GPU mismatch errors anymore.

tevador commented 6 years ago

Results for AMD Ryzen (1 thread) with https://github.com/SChernykh/xmr-stak-cpu

| Mode | Hashrate |
| --- | --- |
| - | 71.1 H/s |
| shuffle | 69.3 H/s |
| shuffle+int_math | 67.0 H/s |
| int_math | 70.0 H/s |
| shuffle_with_lag | 69.1 H/s |

Reproducible to within 0.1 H/s.

SChernykh commented 6 years ago

So, it's confirmed now that shuffle and int_math mods can be handled fine by all modern CPUs. Ivy Bridge, Skylake and Ryzen tested so far. Nice.

tevador commented 6 years ago

@SChernykh I was able to increase the hashrate of my RX 550 with the int_math mod to ~315 H/s by increasing the worksize. It's still a significant drop in performance, though (~25%).

This is with intensity 760 (highest possible) and worksize 32.

[2018-06-15 18:46:36] : Compiling code and initializing GPUs. This will take a while...
[2018-06-15 18:46:36] : Device 3 work size 32 / 256.
[2018-06-15 18:46:36] : clBuildProgram options: -I. -DWORKSIZE=32 -DINT_MATH_MOD
[2018-06-15 18:46:41] : Running a 20x10 second benchmark...
[2018-06-15 18:46:41] : Starting GPU thread, no affinity.
[2018-06-15 18:46:51] : Average = 264.9 H/S, Current = 264.9 H/S
[2018-06-15 18:47:01] : Average = 292.1 H/S, Current = 316.6 H/S
[2018-06-15 18:47:11] : Average = 291.0 H/S, Current = 288.3 H/S
[2018-06-15 18:47:21] : Average = 302.7 H/S, Current = 336.6 H/S
[2018-06-15 18:47:31] : Average = 304.0 H/S, Current = 310.2 H/S
[2018-06-15 18:47:41] : Average = 312.9 H/S, Current = 357.3 H/S
[2018-06-15 18:47:51] : Average = 313.6 H/S, Current = 317.5 H/S
[2018-06-15 18:48:01] : Average = 310.9 H/S, Current = 289.6 H/S
[2018-06-15 18:48:11] : Average = 313.9 H/S, Current = 336.7 H/S
[2018-06-15 18:48:21] : Average = 313.3 H/S, Current = 307.5 H/S
[2018-06-15 18:48:31] : Average = 317.1 H/S, Current = 354.4 H/S
[2018-06-15 18:48:41] : Average = 316.9 H/S, Current = 314.7 H/S
[2018-06-15 18:48:51] : Average = 314.5 H/S, Current = 283.7 H/S
[2018-06-15 18:49:01] : Average = 315.7 H/S, Current = 331.1 H/S
[2018-06-15 18:49:11] : Average = 315.1 H/S, Current = 304.8 H/S
[2018-06-15 18:49:21] : Average = 317.6 H/S, Current = 355.5 H/S
[2018-06-15 18:49:31] : Average = 317.5 H/S, Current = 315.8 H/S
[2018-06-15 18:49:41] : Average = 315.9 H/S, Current = 286.4 H/S
[2018-06-15 18:49:51] : Average = 317.0 H/S, Current = 334.9 H/S
[2018-06-15 18:50:01] : Average = 316.5 H/S, Current = 307.5 H/S

SChernykh commented 6 years ago

@tevador I think it's just because the RX 550 doesn't have enough computing power, considering that the GTX 1060 works fine at the same hashrate. We need to test it on a Vega 56/64. I'll also test it on an RX 560 tomorrow.

SChernykh commented 6 years ago

@tevador I've tested RX 560 on Windows 10: all stock, monitor plugged in, intensity 1000, worksize 32:

| Mod | Hashrate |
| --- | --- |
| - | 379.9 H/s |
| INT_MATH_MOD | 383.1 H/s |
| SHUFFLE_MOD | 371.6 H/s |
| Both mods | 350.9 H/s |

I'll test on Linux and with no monitor plugged in tomorrow. But it already looks like the RX 550 is just too weak to handle all these divisions and square roots.

zawy12 commented 6 years ago

I do not want to interrupt the current thread, but I was thinking about my simple algorithm idea. I previously said ASICs and GPUs present the problem of being able to implement many cores to do the simple calculations, but they are not as efficient in terms of electricity use. Since electricity is half the cost of typical mining, ASICs and GPUs only provide a hardware-cost advantage, so they are at most 2x more efficient. But if many people run a miner alongside the wallet on their laptop, the hardware cost is effectively zero. Besides that, I could design a PoW that makes a laptop or desktop burn 50 W above idle. A 300 W GPU could then only do 6x more calculations before getting too hot; it can't use all its cores. The same goes for ASICs. So for these reasons they should not have anything near a 2x advantage, and maybe even a 2x disadvantage.

If I write a simple PoW that changes with every nonce to make my desktop run hot, can someone try to optimize it for a GPU and try to beat my hashes per unit of electricity? It seems like my idea could be implemented and tested a lot quicker, maybe in a day for someone who knows how to optimize for a GPU. That could be the starting point rather than the CPU: how do you make a GPU core burn the most electricity with the simplest class of iterative equations? Is the SQRT of a "random" seed enough to do it?

The idea of intentionally trying to burn the most electricity may have kept this route from being investigated for "moral" reasons, but as I described before, it's not really different from a hardware-expensive route.

SChernykh commented 6 years ago

@zawy12 They (ASICs) are very efficient in terms of performance/watt.

how do you make a GPU core burn the most electricity with the simplest class of iterative equations? Is the SQRT of a "random" seed enough to do it?

Just look at what LinX test does. It burns CPUs like hell. Lots of floating point math, AVX instructions. But I'm not sure it can be used for hashing.

Basically, you need to run as many FMA (fused multiply-add) instructions per clock cycle as possible. The closer you get to the theoretical FLOPS limit of the device, the better.

zawy12 commented 6 years ago

@SChernykh What do you think of my general idea and reasoning? Since electricity is half the cost, and this is so simple, isn't it probably a good route to follow?

zawy12 commented 6 years ago

@SChernykh

But I'm not sure it can be used for hashing.

A hash of (nonce + previous block hash) would be the seed for the simple N=10,000(?) loop that is required in order to get an output nonce, upon which the real hash is performed. Validation would repeat the process.

The extreme idea of changing every nonce is more than synergistic with the idea of making the algo simple but iterative. It's so simple it might cause disbelief, but I can't find an error with it.

I think the loop should require 10x more computation than the 2 hashes, so fast hashing provides only a minor benefit, and a 10x loop, as opposed to a 100x loop, would not be a burden to validate, although as far as I know 100x would not be a big burden either.

tevador commented 6 years ago

Final results for RX 550 (with 10 compute units).

| Mode | Intensity/Worksize | Hashrate |
| --- | --- | --- |
| - | 600/8 | 425 H/s |
| shuffle | 760/16 | 380 H/s |
| int_math | 760/32 | 315 H/s |
| shuffle+int_math | 760/32 | 255 H/s |

Hashrate is rounded to multiples of 5 H/s due to fluctuations.

There are also RX 550s with just 8 compute units which will fare even worse.

tevador commented 6 years ago

@zawy12 The problem is that with these static compute-intensive workloads, an ASIC will always be more efficient than general-purpose hardware.

moneromooo-monero commented 6 years ago

Is anyone familiar enough with ASIC design to estimate what impact having to add div and/or sqrt hardware might have on performance and cost?

SChernykh commented 6 years ago

@moneromooo-monero I've spent the last few days learning how FPGAs/ASICs work and how divisions and square roots are implemented in hardware. They're iterative algorithms with high execution latency, something the current Cryptonight lacks. The whole Cryptonight inner loop can be implemented in 1 cycle per iteration. Division and square root logic each take more logic elements (space on chip) than everything else in the loop, and they introduce latency. The loop must be unrolled and pipelined many more times to hide this latency (15-20x more space on chip), and of course it will require many times more parallel scratchpads to feed all this logic (which won't fit in on-chip memory).

I'm not a hardware designer though. We had @cloudHH and @rufus210 giving valuable feedback in Cryptonight V1 discussion: https://github.com/monero-project/monero/pull/3253#issuecomment-366142870 https://github.com/monero-project/monero/pull/3253#issuecomment-367946170

zawy12 commented 6 years ago

The algorithms for simple math should already be optimized for a given 64-bit width, and I don't think they can be improved upon by going to wider channels. They could create a massive number of miniature CPUs that perform only the simple operations, but those still have to go through the same number of physical transistor state changes (think FLOPs), which means the same amount of electricity usage (or more, for the reasons I mentioned).

It may have to be a single operation, or a sequence drawn from a wide range of operations. If only 4 operations were used, as I originally said, they could dedicate "simplified CPUs" to each of the 4^3 possible sequences of 3 operations. A sequence of 3 operations may have an optimization superior to a CPU optimized only for each of them individually.

But if a single operation is used, there might be a known optimization that uses a huge lookup table to replace calculations, which would not be feasible for generic CPUs.

SChernykh commented 6 years ago

@zawy12 Every single compute-intensive algorithm so far has failed to resist ASICs. Take away the instruction fetch and decoders, caches, out-of-order execution logic, branch prediction, etc. from a CPU, leave only the math, and you'll get 10x more hashes per watt.

zawy12 commented 6 years ago

@SChernykh that's only because the algos are not changing every nonce. [edit: or they were complicated enough to be optimizable]