ifdefelse / ProgPOW

A Programmatic Proof-of-Work for Ethash. Forked from https://github.com/ethereum-mining/ethminer
GNU General Public License v3.0

Make greater use of MADs #34


solardiz commented 5 years ago

In #26, I obtained/confirmed some benchmark results for ProgPoW 0.9.2, including e.g. 22.7 MH/s at block 7M on a Vega 64 that was actually running at 1401 MHz during that benchmark.

Simulating this version with c-progpow, I get 36864 Merge and 20480 Math operations per hash computed. This gives 22.7M*(20480+36864) = ~1302G of those operations per second on Vega 64. While this is a sane number considering that many of those operations correspond to multiple instructions, let's recall that a Vega 64 at 1401 MHz, with 4096 ALUs each completing one operation per cycle, is capable of 4096*1401M = ~5738G operations per second (or virtually ~11477G FP32 FLOPS, due to how those are traditionally counted with the MUL and ADD halves of a MAD as separate FLOPS).

Besides register file and local memory lookups, MULs or MADs are pretty much the only other operations we have on GPUs that consume relatively non-trivial resources in custom hardware. As once again confirmed in a recent comment in #24, all other operations of ProgPoW add relatively little.

I think ProgPoW would be greatly improved by cutting down on its limited programmability and instead making greater use of MADs. Even if we count a MAD as just one operation (unlike how it's historically been counted for FLOPS), we have up to a 4.4x potential improvement here (from ~1302G to ~5738G).

Further, Math only performs MULs 2/11 of the time, and Merge only performs specialized multiplication by 33 (equivalent to shifting the number left by 5 bits and adding that to itself). With this, ProgPoW currently performs only 22.7M*20480*2/11 = ~84.5G arbitrary MULs per second. The potential improvement in terms of arbitrary MUL/s is thus up to 68x (~84.5G to ~5738G).
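
To illustrate why that merge() multiply costs so little in hardware, here is the shift-and-add equivalence in C (a standalone illustration, not code from the ProgPoW sources):

```c
#include <assert.h>
#include <stdint.h>

/* merge()'s multiply-by-33 reduces to one shift and one add -
   a far cheaper circuit than an arbitrary 32x32 multiplier. */
static uint32_t mul33(uint32_t x) {
    return (x << 5) + x; /* x*32 + x == x*33 (mod 2^32) */
}

int main(void) {
    for (uint32_t x = 0; x < 1000000; x++)
        assert(mul33(x) == x * 33u);
    return 0;
}
```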

This is much more important than the limited programmability that ProgPoW now has. That said, some programmability can be retained (thus justifying the name) - it will remain in routing of different registers as inputs and outputs of each MAD. Such routing, and not the choice of instruction (which is just a MUX or such), is the relatively heavier component.

I recognize that we might not be able to run a lengthy stream consisting exclusively of MADs without many intermediate values degrading to zeroes, etc. too often (although frequent register file and local memory lookups should help here - we have lots of internal state, so it won't degrade fully too quickly). So the actual improvement we can achieve will be somewhat less than the theoretical maximums I listed, but we can probably get close to them - certainly much closer than we are now.

We might also want to revisit use of FP32 MADs in addition to integer ones. (IEEE might provide us sufficient guarantees.) This should double the throughput on NVIDIA Turing and Volta, but then we'd need to decide whether we'd make use of that throughput (maybe not fully?) so as not to slow down other GPUs too much. On NVIDIA's diagrams, the FP32 units are shown as occupying 2x the area of the INT32 ones; if that corresponds to actual die area, that's yet another reason to use them.

Edit: changed "M" (was an error) to "G" in op/s figures.

solardiz commented 5 years ago

I need to add that I'm aware of the discussion in #16 and the corresponding explanation in a Medium post (which I need to rediscover and re-read). So yes, the 32-bit integer MULs currently performed may map to multiple underlying MUL instructions on some GPUs, meaning that at the instruction level the current MUL/s throughput might be better than what I estimated from the hashrate. However, with my proposed much greater focus on MADs in general, I think we also need to revisit this and once again try to switch to MAD or MUL operation(s) that has/have full throughput (one per cycle). Latency doesn't matter much.
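
For context, the usual workaround on GPUs where full 32-bit multiply is slow is to decompose it into full-rate 24-bit multiplies. A sketch of that decomposition in C - mul24() here is a stand-in for instructions like AMD's v_mul_u32_u24, not an existing API:

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of the product of two 24-bit values - what mul24-style
   GPU instructions return at full rate. */
static uint32_t mul24(uint32_t a, uint32_t b) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}

/* 32x32 -> low-32 multiply built from 24-bit pieces:
   a = a_lo + (a_hi << 24), b = b_lo + (b_hi << 24);
   the a_hi*b_hi term is shifted entirely out of the low 32 bits. */
static uint32_t mul32_via_mul24(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFFFu, a_hi = a >> 24;
    uint32_t b_lo = b & 0xFFFFFFu, b_hi = b >> 24;
    return mul24(a_lo, b_lo) + ((mul24(a_lo, b_hi) + mul24(a_hi, b_lo)) << 24);
}

int main(void) {
    assert(mul32_via_mul24(0xDEADBEEFu, 0xCAFEBABEu) == 0xDEADBEEFu * 0xCAFEBABEu);
    return 0;
}
```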

solardiz commented 5 years ago

Regarding floating-point, I don't buy the argument about NaNs given in https://medium.com/@ifdefelse/the-cost-of-asic-design-a44f9a065b72 - it would apply to an arbitrary mix of operations including floating-point ones, but we can instead limit ourselves to FP32 MADs and occasionally (such as once per full sequence of tens of Math/Merge) limit the range (the exponents) just enough for special values to never occur. I think use of FP is worthy of research and experimentation, and it's wrong to dismiss it just yet.

ifdefelse commented 5 years ago

try to switch to MAD or MUL operation(s) that has/have full throughput (one per cycle).

As I said in #16, I'm pretty sure such an operation doesn't exist across AMD and Nvidia. We could easily target Turing's single-cycle IMAD instruction, but that would hurt Polaris GPUs, which are already compute limited. If you have a suggested change that includes perf numbers, we'd be happy to consider it.

I don't buy the argument about NaNs <...> occasionally (such as once per a full sequence of tens of Math/Merge) limit the range (the exponents) just enough for special values to never occur.

The random values in the cache will naturally include some NaNs. This means that either the floating-point instruction sequence can't interact with the cache loads, significantly undermining the point of the cache loads, or every cache load will need to be checked for NaN and converted.
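
To make that cost concrete, here is roughly what such a per-load fixup could look like in C (a hypothetical sanitize_load() helper, not something in the ProgPoW sources):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-load fixup: test each 32-bit cache word for
   NaN/Inf/denormal/zero and repair it before it can enter an FP
   sequence. These extra instructions are paid on every cache load. */
static float sanitize_load(uint32_t word) {
    float f;
    memcpy(&f, &word, sizeof f);
    if (!isnormal(f)) /* NaN, Inf, denormal, or zero */
        word = (word & 0x807FFFFFu) | 0x3F800000u; /* force exponent to 127 */
    memcpy(&f, &word, sizeof f);
    return f;
}
```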

Even if a range check is done only once per loop, it needs to be done per register - that's a lot of additional instructions. An ASIC could trivially implement the clamp as an input modifier to its FP ALU, which would be significantly higher performance than the equivalent GPU sequence.

solardiz commented 5 years ago

Thank you for the comments, @ifdefelse. I think we're on the same page.

Yes, it's tough to target multiple GPU architectures with integer MADs - they might very well need to be of different sizes for maximum throughput. To me, that's one more reason to consider FP32. The published FP32 GFLOPS figures for modern GPUs are universally the number of 32-bit "cores" times 2 times the clock rate, suggesting these instructions have a throughput of one per cycle across all modern GPUs. And IEEE gives us more similarity between different GPUs than we have for integer multiplication.
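
(For example, with the Vega 64 figures from earlier in this thread: 4096 cores * 2 * 1401 MHz = ~11477 GFLOPS, matching the ~11477G number quoted above.)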

Yes, I also already arrived at the "cache issue" - if we want to stay within valid and stable floating-point, then we can't efficiently use the content of the existing cache in those computations. I have two ideas on how to tackle this issue:

  1. We can have two separate streams of instructions, on two non-overlapping subsets of the registers, for FP and integer. We might e.g. try to match NVIDIA's reported-typical ratio of 100% FP32 load combined with 36% INT32 load - I guess we'd probably bump into the power limit if we tried 100% load for both at once, and even if not, our dependency on that amount of compute would hurt other GPUs (without separate FP and integer units) too much. Then we'd have to merge the contents of those two separate subsets of registers eventually, which would cost, but would probably not have to be frequent.

  2. We can change the content of cache to be valid FP. Do it just once per DAG generation, and store that cache separately. (I also have unrelated half-baked thoughts on the current cache content and how that might allow for avoiding the need to have the "gather load" circuitry in an ASIC. So there will likely need to be changes in this area anyway, including having the cache content change more frequently.)

I think I like the second idea much better. (And we can then approach possibly re-adding some INT32 processing to the mix as a separate task.)
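
A minimal sketch of idea 2 in C, assuming a hypothetical once-per-DAG-generation pass over the cache (the hypothetical make_cache_fp() helper and the mask constants are illustration only):

```c
#include <stddef.h>
#include <stdint.h>

/* Idea 2 as a one-time conversion: rewrite each 32-bit cache word so
   it decodes to a normal FP32 value in [1.0, 2.0) - sign 0, exponent
   127, random mantissa. Done once per DAG generation, so no per-load
   NaN checks are ever needed. */
static void make_cache_fp(uint32_t *cache, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        cache[i] = (cache[i] & 0x007FFFFFu) | 0x3F800000u;
}
```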

Yes, I also realized that "one per loop" is actually costly since we use many registers. We'd need to look into a way to make this "range check" less frequent. One way to achieve that might be through tracking worst-case minimum and maximum exponent per register during our code generation, and adjusting the current set of random numbers until the generated instruction is such that it'd fit the constraints on never overflowing until our planned next "range check" (a do-while loop per instruction generated).
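
A rough sketch of that generator-side tracking, with invented names (exp_range, gen_next()) - none of this exists in ProgPoW, and a real generator would also bound the retries:

```c
#include <stdlib.h>

/* Track a worst-case unbiased-exponent interval per register and
   re-roll candidate instructions until the result provably stays in
   a safe band until the next planned range check. */
typedef struct { int emin, emax; } exp_range;

enum { OP_FMUL, OP_FADD, NUM_OPS };

static int safe(exp_range r) { return r.emin > -120 && r.emax < 120; }

static exp_range worst_case(int op, exp_range a, exp_range b) {
    exp_range r;
    if (op == OP_FMUL) { /* exponents add under multiplication */
        r.emin = a.emin + b.emin;
        r.emax = a.emax + b.emax;
    } else {             /* FADD of positives: at most max + 1 */
        r.emin = a.emin < b.emin ? a.emin : b.emin;
        r.emax = (a.emax > b.emax ? a.emax : b.emax) + 1;
    }
    return r;
}

/* Pick the next instruction writing register dst - the do-while loop
   per generated instruction described above. */
static int gen_next(exp_range regs[], int nregs, int dst) {
    int op, a, b;
    exp_range r;
    do {
        op = rand() % NUM_OPS;
        a = rand() % nregs;
        b = rand() % nregs;
        r = worst_case(op, regs[a], regs[b]);
    } while (!safe(r));
    regs[dst] = r;
    return op;
}
```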

The runtime "range check" should not actually be a check - not in the form of an "if this do that" - but rather some kind of masking (easy at low-level, not so great at CUDA and OpenCL level where we'd have to use a union or such, which might confuse the optimizer, or use inline asm) or maybe conversion to/from integer (a better fit at CUDA and OpenCL level, where we wouldn't have to mix types nor use inline asm). We just need to ensure we'd never reach exponent values 0 (where denormals are) and 255 (where infinity and NaNs are), for which we'd need to start with an exponent near the center of this range and proceed with a sequence of operations that guarantees not reaching either limit too soon. Yes, this "range check" would be extremely cheap in an ASIC, so we should aim to have only infrequent "range checks".

I see no show-stoppers there - it's just significant effort to design this well, then test, tweak, then rinse and repeat.

ifdefelse commented 5 years ago

Another issue with a chain of FP operations is that FADD with random inputs will normally do nothing. For an FP32 operation the exponent is 8 bits, so it can have 256 values. When calculating FADD(A, B), if the exponent of B is more than +/-24 away from the exponent of A, then the smaller value gets squashed to 0 and the result is just MAX(A, B). Actual addition will only happen 48/256, or <20%, of the time.
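
A quick Monte Carlo in C supports this figure - random positive FP32 values with uniform exponents over the normal range, counting how often FADD actually changes the larger input (positivity and the uniform-exponent distribution are modeling assumptions):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Random positive normal FP32: uniform exponent in 1..254, random mantissa. */
static float rand_float(void) {
    uint32_t bits = ((uint32_t)(rand() % 254 + 1) << 23) |
                    ((uint32_t)rand() & 0x007FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    int changed = 0, trials = 1000000;
    srand(1);
    for (int i = 0; i < trials; i++) {
        float a = rand_float(), b = rand_float();
        if (a + b != fmaxf(a, b)) /* did the addition actually do anything? */
            changed++;
    }
    /* prints roughly 19%, in line with the 48/256 estimate above */
    printf("FADD changed the result in %.1f%% of trials\n",
           100.0 * changed / trials);
    return 0;
}
```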

We handle this in ProgPoW by structuring the merge() functions to produce good, random values in registers even if the input to the merge has low entropy (is all 1s, all 0s, or just a pass-through of one input). I'm not sure creating an equivalent merge() function in FP is possible.

A sequence of FP ops can also easily accumulate a bias, causing certain elements of the DAG to be accessed more often than others.

solardiz commented 5 years ago

@ifdefelse All good points, thanks! I think we're effectively starting to compile a list of constraints or a checklist for a good design using FP, which will be far from trivial to come up with, yet maybe practical.

ifdefelse commented 5 years ago

Likewise unconstrained random FMUL(A, B) has a ~25% chance of over/under flowing.

If exp_a is 127 then any exp >= 1 will cause overflow - 127/256
If exp_a is 126 then any exp >= 2 will cause overflow - 126/256
<...>
If exp_a is 2 then any exp >= 126 will cause overflow - 2/256
If exp_a is 1 then any exp >= 127 will cause overflow - 1/256
If exp_a is 0 then it can't over/underflow
If exp_a is -1 then any exp <= -127 will cause underflow - 1/256
If exp_a is -2 then any exp <= -126 will cause underflow - 2/256
<...>
If exp_a is -126 then any exp <= -2 will cause underflow - 126/256
If exp_a is -127 then any exp <= -1 will cause underflow - 127/256

I might be off-by-1 in this, but you get the idea. The average comes out to about 25% of random FMULs will over/underflow.
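
A brute-force enumeration of that model in C (uniform unbiased exponents in [-127, 127], product exponent = sum of the two) lands right at the quoted figure:

```c
#include <stdio.h>

/* Enumerate all exponent pairs in [-127, 127] and count how often
   the product's exponent e_a + e_b leaves that representable range. */
int main(void) {
    long bad = 0, total = 0;
    for (int ea = -127; ea <= 127; ea++)
        for (int eb = -127; eb <= 127; eb++) {
            total++;
            if (ea + eb > 127 || ea + eb < -127)
                bad++;
        }
    printf("over/underflow: %.1f%%\n", 100.0 * bad / total); /* ~25.0% */
    return 0;
}
```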

This is why you can't clamp once per loop. If you clamp exponents to [-63, 63], it just takes 2 sequential MULs to get over/underflow.

You could do something like clamp the exponent to an extremely short range, like [-7, 7]. However, that would drastically simplify a hardware floating point unit, providing a significant ASIC optimization opportunity.

solardiz commented 5 years ago

Right. So we should have few sequential MULs (a code generator constraint) and/or clamp to a short range of initial exponents, which I think isn't that bad if our alternative is building upon at-most-24-bit integer MULs of the same throughput. A more advanced code generator tracking worst cases might end up e.g. interleaving MULs on registers that it knows were initialized with positive vs. negative exponents, to allow for slightly longer sequences. Tricky stuff indeed.

solardiz commented 5 years ago

So I experimented with FP in a ProgPoW hack a little bit. I am currently leaning towards the following design aspects if we actually switch to FP:

I'd appreciate comments.

solardiz commented 5 years ago

JFYI: Staying with integers but maximizing the ratio of arbitrary 32-bit MULs, I am able to get up to ~630G MUL/s on the Vega 64 while also doing the cache lookups, having merge() maintain entropy of its destination, and maintaining adequate memory bandwidth usage (slightly lower than default ProgPoW's, though). This is 7x+ higher than ProgPoW 0.9.2's defaults, but 9x+ lower than the theoretical maximum throughput for FP32 MULs.