ifdefelse / ProgPOW

A Programmatic Proof-of-Work for Ethash. Forked from https://github.com/ethereum-mining/ethminer
GNU General Public License v3.0

Make cache content vary per-hash #41

Open solardiz opened 5 years ago

solardiz commented 5 years ago

Right now, ProgPoW's cache is the first 16 KiB of DAG. This has some drawbacks:

  1. It's the same for all hashes being computed, including those concurrently computed on a GPU, which wastes GPU on-die memory on many copies of this data instead of using that memory for different data.

  2. Even inside each SM's shared memory or each CU's LDS (64 KiB) we may have a few copies of this data. We're giving an ASIC flexibility to provide a 16 KiB SRAM with more read ports and/or banks (or accept more bank conflict stalls) instead of the 64+ KiB SRAM that we have in the GPU.

  3. The very beginning of the DAG might be especially susceptible to time-memory trade-off (TMTO) attacks, even though those are probably impractical because 16 KiB of SRAM is cheap enough as it is.

We might partially mitigate 3 by using a later portion of the DAG, or something else entirely, but addressing 1 and 2 is not trivial. Ideally, we'd use different cache content for each hash computation, such as different portions of the DAG quickly determined by a fast hash of the block header. However, our current random reads from the DAG total only 16 KiB per hash computed, so also reading the 16 KiB cache from a random offset would cost us half of the global memory bandwidth, halving what remains for the tiny random DAG reads.
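A minimal sketch of what "different portions of the DAG as quickly determined by a fast hash of the block header" could look like. This is illustrative only: `cache_offset` and its parameters are hypothetical names, not from the actual codebase, and I'm just using FNV1a (a mixing primitive ProgPoW already uses) as the fast hash:

```cpp
#include <cstdint>
#include <cstddef>

constexpr uint32_t FNV_PRIME  = 0x1000193u;
constexpr uint32_t FNV_OFFSET = 0x811c9dc5u;

static uint32_t fnv1a(uint32_t h, uint32_t d) {
    return (h ^ d) * FNV_PRIME;
}

// Hypothetical: pick a 16 KiB-aligned word offset into the DAG for this
// hash computation, from the header hash and nonce, so that concurrent
// hash computations load (mostly) different cache windows.
uint64_t cache_offset(const uint32_t header_hash[8], uint64_t nonce,
                      uint64_t dag_words) {
    uint32_t h = FNV_OFFSET;
    for (int i = 0; i < 8; i++)
        h = fnv1a(h, header_hash[i]);
    h = fnv1a(h, (uint32_t)nonce);
    h = fnv1a(h, (uint32_t)(nonce >> 32));
    const uint64_t cache_words = 16 * 1024 / sizeof(uint32_t); // 4096 words
    const uint64_t windows = dag_words / cache_words;
    return (h % windows) * cache_words; // word offset of this hash's cache
}
```

Note that keying the offset on the nonce (not just the header) is what makes each concurrent hash computation use a different window; keying on the header alone would only vary the cache per block.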

Maybe we should consider a ProgPoW revision/mode with a much higher PROGPOW_CNT_DAG (number of loop iterations), so that it'd read a lot more data from the DAG per hash computed (and would have a much lower nominal hashrate as a side effect). It could then easily afford to also read the 16 KiB cache from a random DAG offset without much effect on memory bandwidth usage (and without much additional effect on the hashrate). This would result in slower PoW verification, but maybe even 100x slower is acceptable, so that reading the 16 KiB caches from random DAG offsets would cost only 1% of total bandwidth? Of course, it'd also go against #36, but then at least having hashrates different from Ethash's would be justified by an actual advantage rather than being arbitrary.

Or maybe we should add cache writes as well, so that the caches would become at least somewhat different as ProgPoW runs. (This would also mitigate 3.) Right now, we read approximately 3x the cache size's worth of data from each cache, so perhaps we can afford to write 1x the cache size's worth of data as well (e.g., read 2x the size and write 1x the size, keeping the total cache access count the same as we currently have)? This is probably more practical than my suggestion above, since it preserves fast verification and even allows implementing #36, but it's a lower-level change.
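To illustrate the 2x-read / 1x-write budget, here is a toy sketch, not the real ProgPoW schedule: each iteration does two random cache reads and one random cache write, and iterating once per cache word gives exactly 2x the cache size read and 1x written, keeping the total access count at the current ~3x. The `cache_mix` name, the LCG address generator, and the FNV-style merge are all placeholders:

```cpp
#include <cstdint>
#include <cstddef>

constexpr size_t CACHE_WORDS = 16 * 1024 / sizeof(uint32_t); // 4096 words

static uint32_t next_addr(uint32_t &s) {   // toy LCG standing in for KISS99
    s = s * 1664525u + 1013904223u;
    return s % CACHE_WORDS;
}

// Toy sketch: 2 reads + 1 write per slot, CACHE_WORDS slots per hash,
// so the cache content diverges per-hash as the loop runs.
void cache_mix(uint32_t cache[CACHE_WORDS], uint32_t mix[32], uint32_t seed) {
    uint32_t s = seed;
    for (size_t i = 0; i < CACHE_WORDS; ++i) {
        uint32_t r0 = cache[next_addr(s)];           // read #1
        uint32_t r1 = cache[next_addr(s)];           // read #2
        uint32_t &m = mix[i % 32];
        m = (m ^ r0) * 0x1000193u + r1;              // FNV-style merge
        cache[next_addr(s)] = m;                     // write #1
    }
}
```

Since the writes depend on the seed (header/nonce in a real design), two concurrent hash computations would quickly diverge in cache content, which is the property drawbacks 1 and 2 are missing today.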

I'd appreciate any comments.