ifdefelse / ProgPOW

A Programmatic Proof-of-Work for Ethash. Forked from https://github.com/ethereum-mining/ethminer
GNU General Public License v3.0

Issue 64 #52

Closed. AndreaLanfranchi closed this 4 years ago

AndreaLanfranchi commented 4 years ago

Amends https://github.com/ifdefelse/ProgPOW/issues/51

Rationale

The problem highlighted by Kik revealed a vulnerable condition which allows mining to be carried out while bypassing, almost completely, the memory-hardness characteristics embedded in the algo. It must be noted that this issue can be exploited only in combination with a custom node implementation, and that the current "public" mining infrastructure does not allow the threat vector to go through. Nevertheless the concern is present, and the code provided by Kik has proven it real.

Kik's discovery relies on two basic assumptions:

As an additional consideration, a "custom node" must be modified to accept a mining result from the work consumer with a modified header_hash (see below).

The two assumptions above are valid for both the ethash and progpow algorithms, even if the latter, due to the different width of its "seed", is far more affected. Both ethash and progpow share the same "macro" steps, with a difference in step 3:

The logical chain of operations in ProgPoW allows (Kik's code; a control-flow sketch follows the list):

1. Skip the first step and directly set an arbitrary value for "seed".
2. Obtain mix_hash (doing only one round of memory accesses).
3. Linearly iterate through "extraData" increments and build a new_header_hash (Keccak256) starting from the header_hash sent by the work provider.
4. Apply new_header_hash + seed + mix as input to the last keccak round and compare the first two words of the digest result to the target. If they match go ahead, otherwise goto 3.
5. We now have a new_header_hash and an "extraData" solving the algo. The one thing left to do is find a nonce which, combined with new_header in the initial Keccak round, produces a seed (and by consequence a mix_hash) equal to the one arbitrarily set.
6. To do so, simply scan the nonce range (2**64) doing as many Keccak rounds as needed to find the right one. Should the range be exhausted, goto 3.
7. Eventually the miner will notify the "custom node" that the hashing has found a solution (nonce), changing header_hash due to extraData.
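Below is a minimal sketch of that control flow. Everything here is a toy stand-in: the hash is a splitmix64-style mixer rather than Keccak/ProgPoW, the target is tiny, and the nonce search is weakened to a 16-bit match so the demo halts quickly; none of the names come from the real codebase.

#include <cstdint>
#include <cstdio>

// Toy stand-in for Keccak / the memory-hard mix; NOT the real functions.
static uint64_t toy_hash(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

int main() {
    const uint64_t header_hash = 0x1234;          // from the work provider
    const uint64_t target      = 1ULL << 48;      // toy difficulty
    const uint64_t seed = 42;                     // step 1: arbitrary seed
    const uint64_t mix  = toy_hash(seed);         // step 2: one memory pass only

    for (uint64_t extra = 0;; extra++) {          // step 3: iterate extraData
        uint64_t new_header = toy_hash(header_hash ^ extra);
        uint64_t final_hash = toy_hash(new_header ^ seed ^ mix);  // step 4
        if (final_hash >= target) continue;       // no match: next extraData

        // steps 5-6: scan nonces for one whose initial-hash output matches
        // the chosen seed. The real attack scans up to 2^64 for a full
        // 64-bit match; here only 16 bits are matched so the toy halts.
        for (uint64_t nonce = 0; nonce < (1ULL << 24); nonce++) {
            if ((toy_hash(new_header ^ nonce) & 0xFFFF) == (seed & 0xFFFF)) {
                printf("solution: extraData=%llu nonce=%llu\n",
                       (unsigned long long)extra, (unsigned long long)nonce);
                return 0;                         // step 7: report to custom node
            }
        }
    }
}

Note how the expensive memory pass (mix) is computed exactly once while all the searching happens in cheap hash calls - which is the essence of the bypass.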

The difference in the width of the seed and the different absorb phase for the last keccak round make the issue orders of magnitude more relevant in ProgPoW than in ethash, but nevertheless the same pattern applies. It goes without saying that this method is highly impractical and economically unviable on ethash as of today ... but it is there.

Basically, in both algos we have:

- seed  == F(header_hash, nonce)
- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)

This also means that, setting an arbitrary seed, we get:

- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)         // where I find a "new_header"
- nonce == F(new_header, seed)  // which can be iterated very quickly

Thus making it possible to brute-force keccak while bypassing memory accesses.

The purpose of this PR is to extend the dependency from header_hash to mix too, so we have:

- digest_init == F(header_hash, nonce)
- seed  == F(digest_init)
- mix   == F(seed)       // where memory accesses occur
- digest_final == F(digest_init, mix)

Thus the final hash depends on the 256-bit digest from the initial keccak, which in turn depends on the header and nonce.
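As a sketch of the proposed chaining (same toy mixer standing in for keccak_f800; function names are illustrative, not the kernel API):

#include <cstdint>
#include <cstdio>

static uint64_t toy_hash(uint64_t a, uint64_t b) {       // stand-in, not Keccak
    uint64_t x = (a + 0x9e3779b97f4a7c15ULL) ^ b;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    return x ^ (x >> 31);
}

static uint64_t final_094(uint64_t header_hash, uint64_t nonce) {
    uint64_t digest_init = toy_hash(header_hash, nonce); // initial keccak
    uint64_t seed        = digest_init;                  // first words of it
    uint64_t mix         = toy_hash(seed, 0);            // memory accesses here
    // The last absorb now takes digest_init (256 bits in the real algo)
    // instead of header_hash, so final depends on nonce through two paths.
    return toy_hash(digest_init, mix);
}

int main() {
    printf("%016llx\n", (unsigned long long)final_094(0x1234, 42));
    return 0;
}

Picking an arbitrary seed no longer helps: any header/nonce pair found by brute-forcing the last round changes digest_init, which invalidates the chosen seed.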

OhGodAGirl commented 4 years ago

@solardiz - Would appreciate your eyes on this.

@AndreaLanfranchi - Will review shortly. Are we able to get resulting profiler results for both AMD (perhaps we'll need assistance from a community member) as well as NVIDIA?

We will also need to ensure this is applied to libprogpow and libethash-cl.

AndreaLanfranchi commented 4 years ago

Benchmarking on Windows, EVGA GTX 1060 3GB.

0.9.3

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 56 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  15:37:30|cuda-0  |  JIT err:

 cu  15:37:30|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  15:37:30|cuda-0  |  done compiling
Trial 1...
8328502
Trial 2...
8814208
Trial 3...
8924497
Trial 4...
8833632
Trial 5...
8823921
Trial 6...
8818620
Trial 7...
8821270
Trial 8...
8927179
Trial 9...
8836046
Trial 10...
8828343
min/mean/max: 8328502/8795621/8927179 H/s
inner mean: 8837567 H/s

0.9.4 (proposed)

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  15:41:09|cuda-0  |  JIT err:

 cu  15:41:10|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  15:41:10|cuda-0  |  done compiling
Trial 1...
7634696
Trial 2...
7973961
Trial 3...
7973164
Trial 4...
7971569
Trial 5...
7971569
Trial 6...
7972366
Trial 7...
7972366
Trial 8...
7974759
Trial 9...
7974759
Trial 10...
7975558
min/mean/max: 7634696/7939476/7975558 H/s
inner mean: 7973064 H/s
solardiz commented 4 years ago

Hi all. I am impressed by how quickly all of you jumped in to try and fix this. That's great!

I will probably have more comments, but for starters, and most relevant to your current experiments: somehow in my experiments I was able to double the number of "ProgPoW VM" registers, and when done along with doubling the block size (from 256 to 512 bytes) this had very little performance impact - much less than we're seeing above. So maybe this is an opportunity to revisit that and fix the new issue on top of those changes, so that the revised ProgPoW would also require more registers in custom hardware?

lookfirst commented 4 years ago

Looks like a 9.78% hit on hashrate? (1 - 7973064/8837567, comparing the inner means)

solardiz commented 4 years ago

@AndreaLanfranchi Can you try settings like what I had in my 0.9.3m1 proposal? -

Parameter            0.9.2    0.9.3    0.9.3m1
PROGPOW_PERIOD       50       10       10
PROGPOW_LANES        16       16       16
PROGPOW_REGS         32       32       64
PROGPOW_DAG_LOADS    4        4        8
PROGPOW_CACHE_BYTES  16x1024  16x1024  16x1024
PROGPOW_CNT_DAG      64       64       32
PROGPOW_CNT_CACHE    12       11       22
PROGPOW_CNT_MATH     20       18       36

Due to merging of #35 back then, this should just work.

In 0.9.3m1+, I also had a code tweak to use groups of 4 sequential 512-byte blocks each, thus random reads of 2048-byte blocks, which helped Maxwell a lot. I guess that isn't essential for your testing on Pascal and newer, but is something we can try re-adding as well.

In the above, I set REGS to 64, but with more register pressure from the fix here I guess you can reasonably try e.g. 56 or 48 as well.
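For reference, a toy CUDA sketch of that grouped access pattern (sizes and names are illustrative, not the actual 0.9.3m1+ code; per-warp coordination is omitted): each thread picks a pseudo-random 2048-byte group and then walks its 4 sequential 512-byte blocks.

#include <cstdint>
#include <cuda_runtime.h>

#define BLOCK_WORDS  128   // 512 bytes / 4
#define GROUP_BLOCKS 4     // -> random reads of 2048-byte groups
#define N_GROUPS     1024

__global__ void grouped_reads(const uint32_t *dag, uint32_t *out)
{
    uint32_t acc = 0;
    uint32_t group = (threadIdx.x * 2654435761u) % N_GROUPS;  // pseudo-random group
    for (int b = 0; b < GROUP_BLOCKS; b++) {                  // 4 sequential blocks
        const uint32_t *blk =
            dag + ((size_t)group * GROUP_BLOCKS + b) * BLOCK_WORDS;
        for (int i = threadIdx.x & 31; i < BLOCK_WORDS; i += 32)
            acc ^= blk[i];                                    // coalesced per warp
    }
    out[threadIdx.x] = acc;
}

int main()
{
    const size_t dag_words = (size_t)N_GROUPS * GROUP_BLOCKS * BLOCK_WORDS;
    uint32_t *dag, *out;
    cudaMalloc(&dag, dag_words * sizeof(uint32_t));
    cudaMemset(dag, 0x5a, dag_words * sizeof(uint32_t));
    cudaMalloc(&out, 128 * sizeof(uint32_t));
    grouped_reads<<<1, 128>>>(dag, out);
    cudaDeviceSynchronize();
    cudaFree(dag); cudaFree(out);
    return 0;
}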

solardiz commented 4 years ago

The tweak I had proposed in #40 (with two versions of code changes that are ready for use) might also help regain some of the performance.

AndreaLanfranchi commented 4 years ago

@solardiz there you go

0.9.3m1

 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 96 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  17:12:54|cuda-0  |  JIT err:

 cu  17:12:54|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:12:54|cuda-0  |  done compiling
Trial 1...
7633847
Trial 2...
8287894
Trial 3...
8184622
Trial 4...
8287065
Trial 5...
8076458
Trial 6...
8075650
Trial 7...
7970771
Trial 8...
8076458
Trial 9...
8077266
Trial 10...
8080499
min/mean/max: 7633847/8075053/8287894 H/s
inner mean: 8103598 H/s

Worth mentioning that those tweaks have no effect at all on the mitigation of #51.

solardiz commented 4 years ago

@AndreaLanfranchi Is this the 0.9.3m1 settings (but not its code tweaks) + your code tweaks proposed here? In other words, a slight speedup over what you had? If so, can you also try REGS 56, 48, or such? Thanks!

AndreaLanfranchi commented 4 years ago

No. This is the knob tweaks from the third column applied over 0.9.3. Want me to apply those knobs to the proposed 0.9.4?

solardiz commented 4 years ago

@AndreaLanfranchi Hmm, that's a bigger slowdown than I expected. Maybe the move to 2048 byte blocks is needed to reduce the cost of those changes on Pascal as well. Can you try my final code from https://github.com/solardiz/ProgPOW (known as 0.9.3m2) on its own and along with your changes? Thanks!

AndreaLanfranchi commented 4 years ago

@solardiz I will as soon as I have a chance. You have my word. :)

This issue though is not about chasing hashrate but about voiding the "bypass memory hardness" issue. I would like all of us to focus on that.

solardiz commented 4 years ago

@AndreaLanfranchi Of course, we're primarily fixing the vulnerability here. We just want to do so with minimum impact on hashrate, and maybe also use the opportunity to merge other previously postponed tweaks that make better use of GPUs vs. specialized ASICs.

Also, I imagine it should be possible to write the code such that we explicitly spill the now-larger intermediate hash value to memory (maybe to shared aka local? or even to global+cache, which is probably better than having the compiler spill something more frequently used), and then load it back from memory when it's needed, so that there's no increase in register pressure during the main loop.
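A toy CUDA sketch of that manual-spill idea, assuming a state[25]-style array whose words 8..24 are cold during the main loop (kernel and buffer names are illustrative, not the actual CUDAMiner_kernel.cu):

#include <cuda_runtime.h>

// Words 8..24 of a Keccak-like state are parked in shared memory across a
// register-hungry loop, then reloaded for the final absorb.
__global__ void spill_demo(unsigned int *out)
{
    extern __shared__ unsigned int spill[];   // blockDim.x * 17 words

    unsigned int state[25];                   // per-thread Keccak-like state
    for (int i = 0; i < 25; i++)
        state[i] = threadIdx.x * 25u + i;

    // Manually spill the cold words (8..24) so the compiler is not forced
    // to hold them in registers across the hot loop.
    for (int i = 8; i < 25; i++)
        spill[threadIdx.x * 17 + (i - 8)] = state[i];

    unsigned int acc = 0;
    for (int h = 0; h < 16; h++)              // stand-in for the main loop
        acc ^= state[h % 8] + h;              // only words 0..7 stay live

    for (int i = 8; i < 25; i++)              // reload before the last absorb
        state[i] = spill[threadIdx.x * 17 + (i - 8)];

    out[threadIdx.x] = acc ^ state[24];
}

int main()
{
    unsigned int *d_out;
    cudaMalloc(&d_out, 128 * sizeof(unsigned int));
    spill_demo<<<1, 128, 128 * 17 * sizeof(unsigned int)>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}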

AndreaLanfranchi commented 4 years ago

@solardiz this proposal (CUDA only as yet) increases register use from 56 to 72, with no spills. There is no "larger" intermediate hash: state[25] was already needed to run any of the 22 keccak rounds, and the "only" modification is that the last keccak round absorbs not the header_hash but the carry-over (8 words) from the initial keccak round.

solardiz commented 4 years ago

@AndreaLanfranchi OK, I probably need to actually review/try it before speaking, but since I already started: from the source, it looks like the increase in register pressure inside the per-lane loop is only slight, by maybe 3 registers for hash_seed[1] and state[1]. 56 to 72 is 16 extra registers. Perhaps we can reduce that by manual spilling/reloading of some of state (index 2 and higher) for the duration of this loop?

AndreaLanfranchi commented 4 years ago

Inside the per-lane loop there is no increase in register usage. hash_seed in 0.9.3 was a 64-bit uint byteswapped from the first two words of the first keccak. In the proposed 0.9.4 it is simply those two words, not byteswapped, as endianness in the loop is totally irrelevant.
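For illustration, the difference is roughly the following (a hedged sketch, not the kernel source):

#include <cstdint>
#include <cstdio>

// 0.9.3: hash_seed was one 64-bit value built by byteswapping the first two
// words of the initial keccak output.
static uint64_t hash_seed_093(const uint32_t state[25]) {
    return ((uint64_t)__builtin_bswap32(state[0]) << 32) |
           __builtin_bswap32(state[1]);
}

// 0.9.4 proposal: the same two words, kept as-is; byte order is irrelevant
// inside the loop since they only feed the KISS99 initializer.
static void hash_seed_094(const uint32_t state[25], uint32_t hash_seed[2]) {
    hash_seed[0] = state[0];
    hash_seed[1] = state[1];
}

int main() {
    uint32_t state[25] = {0x11223344u, 0x55667788u};
    uint32_t hs[2];
    hash_seed_094(state, hs);
    printf("0.9.3: %016llx  0.9.4: %08x %08x\n",
           (unsigned long long)hash_seed_093(state), hs[0], hs[1]);
    return 0;
}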

solardiz commented 4 years ago

@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine.

solardiz commented 4 years ago

For me, the slowdown with these changes (as-is, except for the Hunter/Boost build fixes) on GTX 1080 for block 10M appears to be 3.1% using the Trial 10 speeds below:

@AndreaLanfranchi's master branch (cbd1c160031ce8d2d3e75493ff8f341fa42bbf14):

$ ethminer/ethminer -U -M 10000000 --cuda-devices 0 --benchmark-trials 10
[...]
 cu  17:09:48|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:09:48|cuda-0  |  Set Device to current
 cu  17:09:48|cuda-0  |  Resetting device
 cu  17:09:48|cuda-0  |  Allocating light with size: 60423872
 cu  17:09:48|cuda-0  |  Generating mining buffers
 cu  17:09:48|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1...
0
Trial 2...
 cu  17:10:03|cuda-0  |  Finished DAG
 cu  17:10:03|cuda-0  |  Compile log:
 cu  17:10:03|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 63 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
ptxas info    : Function properties for _Z11keccak_f8008hash32_tyS_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 cu  17:10:03|cuda-0  |  JIT err:

 cu  17:10:03|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:10:03|cuda-0  |  done compiling
11010048
Trial 3...
13424457
Trial 4...
14028211
Trial 5...
15104025
Trial 6...
15104025
Trial 7...
15104025
Trial 8...
15104025
Trial 9...
15104025
Trial 10...
15104025
min/mean/max: 0/12908686/15104025 H/s
inner mean: 14247855 H/s

@AndreaLanfranchi's Issue-64 branch (04f4bee1223bb59bd874ac1476ecf3d0b760509c):

$ ethminer/ethminer -U -M 10000000 --cuda-devices 0 --benchmark-trials 10
[...]
 cu  17:07:32|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:07:32|cuda-0  |  Set Device to current
 cu  17:07:32|cuda-0  |  Resetting device
 cu  17:07:32|cuda-0  |  Allocating light with size: 60423872
 cu  17:07:32|cuda-0  |  Generating mining buffers
 cu  17:07:32|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1...
0
Trial 2...
 cu  17:07:46|cuda-0  |  Finished DAG
 cu  17:07:47|cuda-0  |  Compile log:
 cu  17:07:47|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  17:07:47|cuda-0  |  JIT err:

 cu  17:07:47|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:07:47|cuda-0  |  done compiling
11272192
Trial 3...
13214700
Trial 4...
13766001
Trial 5...
14632024
Trial 6...
14632024
Trial 7...
14632024
Trial 8...
14632024
Trial 9...
14632024
Trial 10...
14632024
min/mean/max: 0/12604503/14632024 H/s
inner mean: 13926626 H/s
solardiz commented 4 years ago

For me, the below hack reduces the register usage from 72 to 64 (original was 63 here) and completely removes the performance impact on the GTX 1080 here:

+++ b/libethash-cuda/CUDAMiner_kernel.cu
@@ -159,12 +159,14 @@ progpow_search(
     // Force threads to sync and ensure shared mem is in sync
     __syncthreads();

-    uint32_t state[25];     // Keccak's state
     uint32_t hash_seed[2];  // KISS99 initiator
     hash32_t digest;        // Carry-over from mix output

+    uint32_t state8[8];
+{
     // Absorb phase for initial round of keccak
     // 1st fill with header data (8 words)
+    uint32_t state[25];     // Keccak's state
     for (int i = 0; i < 8; i++)
         state[i] = header.uint32s[i];
     // 2nd fill with nonce (2 words)
@@ -177,6 +179,10 @@ progpow_search(
     // Run intial keccak round
     keccak_f800(state);

+    for (int i = 0; i < 8; i++)
+        state8[i] = state[i];
+}
+
     // Main loop
     #pragma unroll 1
     for (uint32_t h = 0; h < PROGPOW_LANES; h++)
@@ -184,8 +190,8 @@ progpow_search(
         uint32_t mix[PROGPOW_REGS];

         // share the first two words of digest across all lanes
-        hash_seed[0] = __shfl_sync(0xFFFFFFFF, state[0], h, PROGPOW_LANES);
-        hash_seed[1] = __shfl_sync(0xFFFFFFFF, state[1], h, PROGPOW_LANES);
+        hash_seed[0] = __shfl_sync(0xFFFFFFFF, state8[0], h, PROGPOW_LANES);
+        hash_seed[1] = __shfl_sync(0xFFFFFFFF, state8[1], h, PROGPOW_LANES);

         // initialize mix for all lanes using first
         // two words from header_hash
@@ -217,6 +223,10 @@ progpow_search(
             digest = digest_temp;
     }

+    uint32_t state[25];     // Keccak's state
+
+    for (int i = 0; i < 8; i++)
+        state[i] = state8[i];

     // Absorb phase for last round of keccak (256 bits)
     // 1st initial 8 words of state are kept as carry-over from initial keccak
 cu  17:59:12|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
 cu  17:59:12|cuda-0  |  Set Device to current
 cu  17:59:12|cuda-0  |  Resetting device
 cu  17:59:12|cuda-0  |  Allocating light with size: 60423872
 cu  17:59:12|cuda-0  |  Generating mining buffers
 cu  17:59:12|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 3867147904  gridSize: 1024
Trial 1... 
0
Trial 2... 
 cu  17:59:27|cuda-0  |  Finished DAG
 cu  17:59:27|cuda-0  |  Compile log: 
 cu  17:59:28|cuda-0  |  JIT info: 
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  17:59:28|cuda-0  |  JIT err: 

 cu  17:59:28|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  17:59:28|cuda-0  |  done compiling
10485760
Trial 3... 
13319579
Trial 4... 
13962658
Trial 5... 
15156470
Trial 6... 
15104025
Trial 7... 
15156470
Trial 8... 
15104025
Trial 9... 
15156470
Trial 10... 
15104025
min/mean/max: 0/12854948/15156470 H/s
inner mean: 14174126 H/s

To have greater confidence that this is right, debugging output needs to be enabled (such as with #46) to make sure the computed hashes stay the same. Also, someone needs to implement the same changes in host-only code such as https://github.com/solardiz/c-progpow and in OpenCL, and make sure all 3 compute the same hashes.

AndreaLanfranchi commented 4 years ago

@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine.

I'm testing on block 0 (epoch 0)

AndreaLanfranchi commented 4 years ago

For me, the below hack reduces the register usage from 72 to 64 (original was 63 here) and completely removes the performance impact on the GTX 1080 here: [cut]

Nice catch, and I confirm it voids the performance impact. 0.9.4 (proposed), same environment as in my previous benchmarks:

ethminer -U -M 0 --benchmark-trials 10
  m  23:32:11|main    |  ethminer version 0.15.0.dev0
  m  23:32:11|main    |  Build: windows / release +git. 04f4bee
 cu  23:32:11|main    |  Using grid size 1024 , block size 512
Benchmarking on platform: CUDA
Preparing DAG for block #0
  i  23:32:12|cuda-0  |  No work.
  i  23:32:12|cuda-0  |  No work.
  i  23:32:12|cuda-0  |  No work.
Warming up...
  i  23:32:12|cuda-0  |  Initialising miner 0
 cu  23:32:13|cuda-0  |  Using device: GeForce GTX 1060 3GB  (Compute 6.1)
 cu  23:32:13|cuda-0  |  Set Device to current
 cu  23:32:13|cuda-0  |  Resetting device
 cu  23:32:14|cuda-0  |  Allocating light with size: 16776896
 cu  23:32:14|cuda-0  |  Generating mining buffers
 cu  23:32:14|cuda-0  |  Generating DAG for GPU # 0  with dagBytes: 1073739904  gridSize: 1024
 cu  23:32:20|cuda-0  |  Finished DAG
 cu  23:32:20|cuda-0  |  Compile log:
 cu  23:32:20|cuda-0  |  JIT info:
 ptxas info    : 0 bytes gmem, 96 bytes cmem[3]
ptxas info    : Compiling entry function '_Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb' for 'sm_61'
ptxas info    : Function properties for _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 16384 bytes smem, 385 bytes cmem[0], 4 bytes cmem[2]
 cu  23:32:20|cuda-0  |  JIT err:

 cu  23:32:20|cuda-0  |  Mangled name: _Z14progpow_searchy8hash32_tyPK5dag_tPV14search_resultsb
 cu  23:32:20|cuda-0  |  done compiling
Trial 1...
8448739
Trial 2...
8915570
Trial 3...
8811563
Trial 4...
8813326
Trial 5...
8816855
Trial 6...
8822153
Trial 7...
8827458
Trial 8...
8831884
Trial 9...
8836314
Trial 10...
8839861
min/mean/max: 8448739/8796372/8915570 H/s
inner mean: 8824926 H/s

Nevertheless, optimizations aside, I'd like a comment on the effectiveness of this change with respect to the issue raised by Kik. Thank you.

solardiz commented 4 years ago

I'd like a comment on the effectiveness of this change with respect to the issue raised by Kik.

This wasn't something I wanted to speak on without having thought of it while being fully concentrated on the problem and its potential variations. The comments I posted so far were without being fully concentrated on this - I was being distracted. I still intend to think of this properly and comment. We'd also need to test this with host-only code. That said, at first glance your approach looks valid. It would also be great to have a review by @kik.

Meanwhile, it's in this context that I brought up #54 - if we choose to address that issue by increasing the seed size, then merging your changes could become irrelevant, and with it the need to analyze them more thoroughly.

AndreaLanfranchi commented 4 years ago

I thought a lot about increasing the seed size (it was also the initial suggestion/wish from Kristy), but my considerations were:

  1. seed is used as an initiator of the RNG
  2. 64 bits are relatively light to shuffle among threads (2 ops per lane instead of 8; see the sketch below)
  3. KISS99 and FNV1a are not cryptographic functions ... Keccak is

Due to the above I really see no point in widening the seed (also considering the amplification effect of NUM_LANES).
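Point 2 above can be illustrated with a toy kernel (illustrative names; the real kernel differs): broadcasting a 64-bit seed costs two __shfl_sync operations per lane, where a 256-bit seed would cost eight.

#include <cuda_runtime.h>

__global__ void shuffle_cost_demo(unsigned long long *out)
{
    unsigned int state[8];                      // per-lane keccak-like output
    for (int i = 0; i < 8; i++)
        state[i] = threadIdx.x + i;

    unsigned long long acc = 0;
    for (int h = 0; h < 16; h++) {              // one pass per lane, as in ProgPoW
        // 64-bit seed: two shuffles per lane...
        unsigned int s0 = __shfl_sync(0xFFFFFFFF, state[0], h, 16);
        unsigned int s1 = __shfl_sync(0xFFFFFFFF, state[1], h, 16);
        acc += ((unsigned long long)s1 << 32) | s0;
        // ...a 256-bit seed would need eight shuffles here (state[0..7]).
    }
    out[threadIdx.x] = acc;
}

int main()
{
    unsigned long long *d_out;
    cudaMalloc(&d_out, 32 * sizeof(*d_out));
    shuffle_cost_demo<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}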

solardiz commented 4 years ago

All good points, @AndreaLanfranchi. However, I think we do need to address #54, which these changes don't (at least not on their own).

Now, you wanted my review. Here goes:

I think your description of this PR isn't entirely correct. You describe the original algo as follows:

- seed  == F(header_hash, nonce)
- mix   == F(seed)              // where memory accesses occur
- final == F(seed, mix)

but it actually is more like:

- seed  == F(header_hash, nonce)
- mix   == F(seed)              // where memory accesses occur
- final == F(header_hash, seed, mix)

Then you describe the changed algo as:

The purpose of this PR is to extend the dependency from header_hash to mix too, so we have:

but the PR leaves the (limited) dependency of mix on header_hash unchanged - it's still only the same indirect dependency via seed. The good news is this doesn't matter for #51. The way you prevent that attack is by extending the dependency from nonce to final (via digest in the algo description I quote below), so perhaps that's what you meant to write in the sentence I quoted above?

- digest == F(header_hash, nonce)
- seed  == F(digest)
- mix   == F(seed)       // where memory accesses occur
- final == F(digest, mix)

The way this addresses #51 is by making what you described as step 5 invalidate the work done on steps 3 and 4:

4. Apply new_header_hash + seed + mix as input to the last keccak round and compare the first two words of the digest result to the target. If they match go ahead, otherwise goto 3.
5. We now have a new_header_hash and an "extraData" solving the algo. The one thing left to do is find a nonce which, combined with new_header in the initial Keccak round, produces a seed (and by consequence a mix_hash) equal to the one arbitrarily set.

On step 5, we'll most likely arrive at a different nonce, and now that final has a dependency on nonce not only via mix, but also via digest, it will most likely no longer meet target.

A variation of the attack that would still work would involve re-doing steps 3 and 4, then 5 again, etc. until both conditions are met at once. This effectively multiplies the 2^64 search at step 5 by the number of tries needed to meet network difficulty. Since 2^64 Keccak computations are obviously much more costly than one computation of mix (as would be needed in an equivalent loop in normal mining), this variation of the attack is clearly impractical (despite it bypassing the memory hardness).

Summary: the description is wrong, the changes appear to address #51 anyway (allowing only for an obviously impractical variation of it), but they don't address #54 or its possible variations (e.g., potential combinations of tricks from #51 and #54, which might be more efficient than #54 alone).

solardiz commented 4 years ago

Regarding this point:

  3. KISS99 and FNV1a are not cryptographic functions ... Keccak is

I agree it's desirable to have a full "entropy bypass" using solely cryptographic hash functions from all inputs to the final output. This means including the nonce, which ProgPoW misses (without this PR). However, I think this is much simpler and more efficiently achieved by changing this:

    return keccak_f800_progpow(header, seed, digest);

to:

    return keccak_f800_progpow(header, nonce, digest);

I see no reason to have included seed in there: it's just a function of header and nonce, which with my suggested change above we include directly.

I think this trivial change would be just as effective at addressing #51 as the changes you proposed here.

It would similarly not address #54. So we also need to increase seed size along with making that change.

EtherCoreAdmin commented 4 years ago

Why don't we fix this also? https://github.com/ifdefelse/ProgPOW/issues/49

AndreaLanfranchi commented 4 years ago

@ethcoreorg because it's outside the scope of this PR. Feel free to propose a new PR.

AndreaLanfranchi commented 4 years ago

@blondfrogs you're absolutely right. Amending

AndreaLanfranchi commented 4 years ago

@solardiz

    return keccak_f800_progpow(header, seed, digest);

to:

    return keccak_f800_progpow(header, nonce, digest);

This IMHO doesn't solve the issue and keeps us in the situation of 0.9.3 as, again, we can pick an arbitrary seed value (64 bits) which produces a digest, brute-force the last keccak round to find a header (+ extraNonce + nonce) producing a result matching the target and with a seed output identical to the one chosen. We're still brute-forcing Keccak in search of a 64-bit value.

With the carry-over from the initial keccak instead of the header (pretty much like in ethash) we force a linear search for a header+nonce which produces the same keccak squeeze on a 256-bit basis.

AndreaLanfranchi commented 4 years ago

I'd personally add more ... in ethash the first keccak absorb is constrained (10th word is 0x00000001 and 17th word is 0x80000000) and the second too (24th word is 0x00000001 and 37th word is 0x80000000).

Constraining the Keccak absorb (initial and last) further reduces (practically voids) the likelihood of a successful brute-force attempt.
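A sketch of how such a constraint narrows the search, using the first-absorb word positions cited above (taken as 0-indexed; an illustrative check, not ethash itself):

#include <cstdint>
#include <cstdio>

// With fixed padding words, a forged 18-word absorb block (72-byte rate)
// loses those degrees of freedom: any candidate failing the check can be
// rejected outright.
static bool absorb_constraints_ok(const uint32_t words[18]) {
    return words[10] == 0x00000001u && words[17] == 0x80000000u;
}

int main() {
    uint32_t block[18] = {0};
    block[10] = 0x00000001u;   // padding start, per the word positions above
    block[17] = 0x80000000u;   // padding end (high bit of the last rate word)
    printf("%s\n", absorb_constraints_ok(block) ? "valid" : "rejected");
    return 0;
}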

solardiz commented 4 years ago

This IMHO doesn't solve the issue and keeps us in the situation of 0.9.3 as, again, we can pick an arbitrary seed value (64 bits) which produces a digest, brute-force the last keccak round to find a header (+ extraNonce + nonce) producing a result matching the target and with a seed output identical to the one chosen. We're still brute-forcing Keccak in search of a 64-bit value.

This is what I described as "variation of the attack that would still work would involve re-doing steps 3 and 4, then 5 again, etc. until both conditions are met at once" and explained why it's "clearly impractical". That said, I agree we'd better defeat even this possibility as well, which we'd do by increasing seed size as part of addressing #54.

With the carry-over from the initial keccak instead of the header (pretty much like in ethash) we force a linear search for a header+nonce which produces the same keccak squeeze on a 256-bit basis.

I don't understand this. To me, your changes are equivalent to the trivial change I proposed. The variation of attack described above is possible for both. The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making.

AndreaLanfranchi commented 4 years ago

The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making.

... in search of an identical 256-bit value instead of one of 64 bits.

solardiz commented 4 years ago

... in search of an identical 256-bit value instead of one of 64 bits.

I don't see your proposal achieve that. Given your own description:

- digest == F(header_hash, nonce)
- seed  == F(digest)
- mix   == F(seed)       // where memory accesses occur
- final == F(digest, mix)

Here, let's assume seed and thus mix are fixed. To make final meet target, we'd be altering header_hash and nonce, and we'd also require that seed stays the same. This doesn't imply that digest stays the same - it will freely keep changing.

AndreaLanfranchi commented 4 years ago

I don't see your proposal achieve that. Given your own description:

Yes ... sorry ... please focus on code ... description is messed up.

solardiz commented 4 years ago

please focus on code ... description is messed up.

I see no relevant differences in code, and that part of the description actually looks good enough for the present discussion, despite the simplifications. The part of the description that was relevantly wrong and unusable for the discussion was "extend the dependency from header_hash to mix too".

AndreaLanfranchi commented 4 years ago

NVIDIA profiling of this PR (0.9.4): [profiler screenshot]

This is the previous release (0.9.3): [profiler screenshot]

ifdefelse commented 4 years ago

This looks good to us. It would be good to get profiling and performance results for an HBM2 card and a GDDR5X card.

lookfirst commented 4 years ago

0.88% L2 is ok?

gcolvin commented 4 years ago

It's been a long time. Is there a reason this cannot be merged, or at least referenced as correct? I want to get a nearly-final EIP-1057 to the other core devs soon.

@AndreaLanfranchi @ifdefelse @OhGodAGirl @lookfirst @solardiz @blondfrogs

gcolvin commented 4 years ago

Thanks @ifdefelse !