regolith-labs / drillx

CPU-friendly hash function for cryptocurrency mining.

Feedback on DrillX #5

Open OrenHash opened 6 months ago

OrenHash commented 6 months ago

@HardhatChad I admire the design goal of thwarting ASICs so we can have PoW without obscene energy consumption. However, if that's the objective, then I would strongly advise against taking the "random operations all over the place" approach of RandomX. The only tool you need is an excessive memory footprint with random access patterns. It's OK if the instructions are easy to implement in an ASIC; what matters is how many transistors I need to construct that ASIC. It seems like you have about a MB of state? That can be tiled many times over in a commodity ASIC (or GPU, for that matter), affording parallelism. What if you required a GB of state but you had a really simple hash algorithm that was much easier to scrutinize in a security audit? Or better yet, what if the state size were programmable, so for instance it could scale with Moore's Law, whether programmatically or manually?

Why do I care? Because I think the long term value of Ore is to show the world that we don't need national energy budgets to secure financial transactions, even if one refuses to consider PoS. Ore doesn't necessarily need to be recognized as money in order to do so. Its value is substantial even as a gamified demonstration of this concept, and as such is worth the effort.

I'd be happy to write you the sort of hash I've described above, using well-characterized oscillators and a scalable footprint, but if you really want to go the RandomX route, then there are some potential problems with the code that need review.

shekohex commented 6 months ago

> What if you required a GB of state but you had a really simple hash algorithm that was much easier to scrutinize in a security audit? Or better yet ...

Remember that to validate the PoW, you will need to run drillx once with the provided nonce. The main limitation is that this needs to run inside a Solana program (https://solana.com/docs/programs/faq#heap-size), and the max size of a program is 10MB. So yeah, it is a limitation of the runtime here.

I imagine one way to increase the size of the state (or the noise) is by making the noise its own programs. So in theory, to have 1 GB of noise, you would have to deploy 1 GB / 10 MB ~= 100 programs! Which is insane!

HardhatChad commented 6 months ago

@OrenHash This is great feedback. Thanks for sending. And @shekohex's reply is spot on.

We're ultimately limited by the Solana runtime, so a GB scratch space is unfortunately not a viable approach for our use case. The intent behind the random address reads is simply to hit different cache lines on each noise read. With a 1MB noise buffer, this should certainly be happening. In current benchmarks, the reads are 90-99% of the overall work, which was the original goal. Given our constraints, I think it's generally okay if the noise chunk can be tiled and nonces are parallelized across cores. We're basically already doing this in the CLI with CPU multithreading and a native CUDA implementation.

Would be curious to learn more about why you think the randomized ops are ineffective. My understanding has been that it's really the switch statement in the op loop that makes ASIC design difficult, because you get unpredictable branching at the assembly level. You can neither brute force the ops (7^64 potential combinations on each round) nor predict them ahead of time (the result of one op determines the opcode and inputs of the next op). If this won't make hardware optimization difficult, I'd love to understand why.
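
For concreteness, here is a rough sketch of the kind of data-dependent op loop being described (illustrative only, not the actual drillx source; the opcode set, the opcode-selection rule, and the mixing step are assumptions), showing why the branch taken on each iteration can't be precomputed:

```rust
// Illustrative sketch: the result of each op selects the next opcode, so the
// sequence of branches depends on the evolving state and cannot be predicted
// or brute forced ahead of time.
fn op_loop(mut a: u64, mut b: u64, rounds: usize) -> u64 {
    for _ in 0..rounds {
        // The next opcode is derived from the current state.
        let opcode = (a ^ b) % 7;
        a = match opcode {
            0 => a.wrapping_add(b),
            1 => a.wrapping_sub(b),
            2 => a.wrapping_mul(b | 1),
            3 => a ^ b,
            4 => a >> (b % 8),
            5 => a << (b % 8),
            _ => a.wrapping_div(b | 1), // force the divisor odd to avoid div-by-zero
        };
        // Fold the result back in so both inputs of the next op depend on it.
        b = b.rotate_left(7) ^ a;
    }
    a
}
```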

OrenHash commented 6 months ago

@shekohex Thanks for pointing that out. If you're stuck with a 10MB ceiling, and the "winning nonce" verifier has to run as an actual Solana program, then there are a few possible ways to improve on the situation:

  1. Use the social leverage of Ore Supply to push the Solana team to lift the ceiling. More compute budget, more fees, more excitement around the whole affair. It's probably only a matter of changing a constant somewhere, but I don't know for sure.

  2. Admit that this should never be done in a Solana program at all, and find a way to achieve consensus on who is owed how much ORE for finding the solution. This consensus would ride on top of Solana, essentially as an L2. Perhaps this could be an ORE v3 thing. It would be really exciting because ORE would start to look a lot more like BTC but with far less energy consumption per dollar of marketcap.

  3. Give up fighting off ASICs and stick with something under 10MB (say 8MB?). Not great, but better than 1MB and a quick fix.

OrenHash commented 6 months ago

@HardhatChad First of all, I guess there are really two goals here, which are subtly different.

One is to prevent ASICs (or ideally even GPUs) from having a multifold advantage over CPUs per hardware dollar invested. If you succeed in that goal, then you maximize holder growth because some poor kid in Africa can actually mine with his commodity phone and at least collect rewards here and there. If you fail, then you're just an also-ran version of Bitcoin with the "centralization of SQLana".

The second goal is to minimize joules per dollar's worth of reward. This forces, by design, a lack of investment in custom hardware, which would otherwise fuel the "big boys" centralization (and landfill) waiting to happen. Instead, I think you're trying to incentivize the use of mobile devices and home PCs that already exist.

Am I wrong about this?

I think you realize that maximizing memory footprint (within the constraints of the platform you have) will minimize the variance in performance-per-hardware-dollar. A GPU might still do 10X as well as a CPU by this metric, but probably not 100X. But it all depends on how you leverage that footprint.

If you use a readonly noise table, then you're opening yourself up to an ASIC design wherein a memory crossbar (or a tree of them) allows many hashes to be tried simultaneously by breaking the table across all the cores (and facilitating peer-to-peer memory forwarding). In other words, it costs the ASIC designer 1MB to store your table. But then they can have N individual cores, each owning an equal fraction of that 1MB. It only reaches saturation when the total transistor count in those cores becomes comparable to the transistor count of the 1MB table.

Instead of having a readonly table, ideally you would want to write out 1MB of nonce-dependent noise, then go back and read it in nonce-dependent random order to produce the final hash. There's an elegant way to do this, reading back every byte of that synthetic table exactly once, using a known oscillator. This destroys the foregoing ASIC crossbar approach and ensures that writes equal reads, for maximum energy efficiency. Now the ASIC designer needs 1MB for every single core! Or maybe 8MB if you buy number 3 above.
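
A minimal sketch of this write-then-read-back idea, under stated assumptions: a splitmix64-style generator for the nonce-dependent fill, and a full-period power-of-two LCG standing in for the "known oscillator" (the Hull-Dobell conditions guarantee every byte of the synthetic table is read exactly once):

```rust
// Sketch only; the fill generator, LCG constants, and mixing are assumptions.
const TABLE_BYTES: usize = 1 << 20; // 1 MiB of nonce-dependent noise

// splitmix64: a simple, well-characterized 64-bit generator for the fill.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

fn hash(challenge: u64, nonce: u64) -> u64 {
    // 1. Write out TABLE_BYTES of nonce-dependent noise.
    let mut seed = challenge ^ nonce;
    let mut table = vec![0u8; TABLE_BYTES];
    for chunk in table.chunks_exact_mut(8) {
        chunk.copy_from_slice(&splitmix64(&mut seed).to_le_bytes());
    }

    // 2. Read every byte back exactly once in nonce-dependent order. For a
    //    power-of-two modulus, an LCG with an odd increment and a multiplier
    //    congruent to 1 mod 4 has full period (Hull-Dobell), i.e. it is a
    //    permutation of all indices.
    let mult: usize = 4 * 987_654 + 1;
    let inc: usize = (nonce as usize) | 1;
    let mut idx: usize = (challenge as usize) % TABLE_BYTES;
    let mut acc: u64 = seed;
    for _ in 0..TABLE_BYTES {
        idx = idx.wrapping_mul(mult).wrapping_add(inc) % TABLE_BYTES;
        acc = (acc.rotate_left(9) ^ table[idx] as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    }
    acc
}
```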

Randomized ops do exactly what you said they do: thwart branch prediction. That actually gives the ASIC designer even more of an advantage, because ASICs don't generally incorporate branch prediction at all. (To some extent, this could be mitigated on CPUs by implementing a branch target lookup table in assembly language; see the sketch below.) Granted, you can't do operations in parallel because your next opcode depends on your previous one. But you still get the crossbar attack above. And by the way, an ASIC would probably do all your ops at once, then issue only the one that was actually supposed to execute, so there's no branching at all, just predication. (The divide would probably be implemented with a lookup table.)
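
As a rough illustration of that mitigation (in Rust rather than assembly, with a toy op set; names are hypothetical), dispatch can go through an indexed table of function pointers, so selecting an op is a single indirect jump rather than a chain of compare-and-branch instructions:

```rust
// The table is the software analogue of a branch target lookup table.
type Op = fn(u64, u64) -> u64;

fn op_add(a: u64, b: u64) -> u64 { a.wrapping_add(b) }
fn op_xor(a: u64, b: u64) -> u64 { a ^ b }
fn op_right(a: u64, b: u64) -> u64 { a >> (b % 8) }
fn op_left(a: u64, b: u64) -> u64 { a << (b % 8) }

const OPS: [Op; 4] = [op_add, op_xor, op_right, op_left];

fn dispatch(opcode: u64, a: u64, b: u64) -> u64 {
    // One indexed load plus one indirect call, no data-dependent branching.
    OPS[(opcode % OPS.len() as u64) as usize](a, b)
}
```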

But the real reason to avoid random ops has nothing to do with ASICs. It's a cryptographic sanitation issue. You want security people to be able to audit your code and make cogent statements about its security in terms of what is known and not known. If you jump all over the place with random ops, it becomes impossible to make any useful statements at all. And sometimes complexity is actually "less random" than simplicity. Case in point:

```rust
Opcode::Right => a >> (b % 8),
Opcode::Left => a << (b % 8),
Opcode::Div => {
```

Entropy is being lost in each of these. All else being equal, I would expect this to reduce your limit cycle length. Does it matter? Maybe, maybe not. Dealing with a known oscillator would preempt such concerns.
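
A quick way to see the entropy loss is to count distinct outputs of a fixed shift over all inputs (minimal sketch; the 8-bit width and shift amount are chosen arbitrarily for illustration, but the same argument applies to the 64-bit shifts above):

```rust
use std::collections::HashSet;

// A right shift is many-to-one: all 256 possible 8-bit inputs collapse to
// just 32 distinct outputs under `a >> 3`, i.e. 3 bits of entropy are gone.
fn main() {
    let outputs: HashSet<u8> = (0u8..=255).map(|a| a >> 3).collect();
    println!("{} distinct outputs from 256 inputs", outputs.len()); // 32
}
```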

You're correct that reading with random access patterns will touch different cache lines and defeat the memory prefetcher. But again, this just takes away one more would-be advantage of CPUs. 1MB will at least blow the L1, although L2/L3 capacity is on the order of 10MB to 100MB on a modern CPU. That's yet another way in which CPUs lose to ASICs, which don't have the overhead of dealing with a cache hierarchy.

All in all, there's a huge opportunity here to blow this up way beyond your existing holder population. But there's a reason that rockstar cryptosystems such as Bitcoin, AES, and elliptic-curve cryptography all use math that's as simple as possible to get the job done: it allowed them to get more buy-in from the security community.

I'm a big fan because this might eventually draw people away from excessive energy waste. I'm aware of "diet" approaches such as Litecoin but they don't have all the advantages of Solana. But there's a major fork in the road here.

I'll be back in a few days.

HardhatChad commented 6 months ago

This is incredibly insightful.

You're right on the basic goal. The intent is not to be "GPU resistant". There are tons of people with GPUs lying around and I'd like them to be usable for Ore mining. By shifting most of the work to memory reads, the idea was that we could close the performance gap between CPUs and GPUs.

A nonce-specific noise file would be nice, but generating multiple MB of noise just becomes infeasible to do in a single Solana tx for verification. We can certainly advocate for changes in Solana, but these are unlikely to materialize on any timeline relevant to v2. One idea I had originally was to periodically change the noise (saving it as an on-chain account and updating it every so often). Users would have to pull the diffs to generate correct hashes. I'm not sure that would be beneficial here, as it would still be readonly and global at any given point in time.

Your point on brute forcing through the ops makes sense. This is different from how I had originally conceptualized it, but it totally makes sense. I would definitely like to simplify as much as possible to make reviews easier. Do you have any suggestions for how to implement branching in a way that can't be brute forced?

HardhatChad commented 6 months ago

Your L2 concept is interesting. If we want to get really creative with the security model, we don't necessarily have to recompute the hash on-chain. We could have a stake-based voting system to validate user hashes. This would also allow us to change the hash function without needing to redeploy the contract.

RajeshRk18 commented 6 months ago

Given the constraints we have, I think we have to look for a better way to verify that the nonce is valid. An approach along the lines of "recomputing the hash is reduced to verifying something that is efficient and feasible on-chain" could be explored.

HardhatChad commented 6 months ago

@OrenHash are you familiar with Equihash? I just met with one of the authors this morning and realized this may be exactly what we're looking for. As @RajeshRk18 suggests, Equihash is intentionally designed as an asymmetric, memory-hard proof of work where verification is cheap. I just used Zcash's implementation of verification in a sample Solana program with k=5 and n=96, and verification only uses 185k CUs (which is very good).

HardhatChad commented 6 months ago

Based on tevador's blog post, I'm currently testing Equix (a variation of Equihash where k=3, n=60). I'm only running on CPU atm, but this appears very promising.

@OrenHash would be curious to get your feedback on tevador's assertion that reducing the memory requirement to 1.8MB increases CPU friendliness, as it means the entire footprint can fit in CPU cache (thereby allowing CPUs to compete with GPUs on bandwidth).

shekohex commented 6 months ago

Interesting, TIL about Equix! Also, I found a Rust implementation: https://docs.rs/equix/latest/equix/

@HardhatChad I believe we should give it a try. People always say don't roll your own crypto; maybe they are right. I believe in making it fair for everyone, regardless of what hardware they are using. The nice thing about Ore v2 is that we still have time to experiment. Let's go with Equix and see how it goes on devnet.

HardhatChad commented 6 months ago

I just created a branch hardhat/equihash using the equix crate. Seems to be working pretty well. On-chain verification is currently coming in at 460k CUs, which isn't amazing but isn't terrible either. Maybe slightly better than current drillx. The only thing I don't understand is why it seems to be generating 8 solutions. I'm just picking the first one atm, but I wonder if it'd be cheaper to only generate 1 solution.

shekohex commented 6 months ago

The output of the hashx function is the PoW; you should calculate the difficulty from there rather than hashing it again with keccak256.

HardhatChad commented 6 months ago

The problem is that I was printing out the difficulty distribution of the equix hashes, and there are weird statistical anomalies. As I understand it, the output proof of equix is eight 16-bit numbers. When you stack these together into a single 128-bit buffer, an 8-bit difficulty comes up very infrequently and is far rarer than a 9-bit difficulty. This is odd and would mess up our reward rate calculations. The keccak at the end is there to get a difficulty distribution where p(N) = 2^-N.

shekohex commented 6 months ago

Let me clarify what I wanted to say. In the hashx section here: an interactive client puzzle is a good start for us, and verification is very cheap too! Also see this issue from equix, https://github.com/tevador/equix/issues/4, which could be another idea for us!

These people are really smart, damn! 😅

HardhatChad commented 6 months ago

I see. I could be wrong, but I think the hardhat/equihash branch is doing almost exactly this, but with keccak instead of blake2b. Blake might be a little more CPU friendly than keccak. There is a Solana syscall for blake3, so we could potentially use that.

As of right now, the complete hash is: keccak(equix(challenge, nonce), nonce)

The miner would have to submit (nonce, equix(challenge, nonce)). On chain, we would verify the provided equix(challenge,nonce) proof is correct and then compute the keccak with a syscall to get the difficulty for the reward payout.

shekohex commented 6 months ago

Yeah, I guess we should switch to blake3; it is very optimized for CPUs and supported by the Solana runtime. Also, most of the PoW is done in equix, so there is no need to do extra work with keccak.

shekohex commented 6 months ago

What stops someone from running E = equix(challenge, nonce) only once and then using blake3(E, nonce) with different nonces until they find a higher difficulty? I guess we would enforce that you submit the nonce + E to the chain, and we would calculate it there, right?

HardhatChad commented 6 months ago

The extra hash (whether keccak or blake) is needed to get a difficulty distribution where p(N) = 2^-N. Without this, difficulty based payouts don't make sense.

A miner has to submit (E, N). On-chain, we verify E is valid and then calculate blake3(E, N) in the contract to get the difficulty value. A user can search blake3(E, N*) all they want, but those hashes are useless if N* != N.
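
A sketch of that flow (hypothetical names: `verify_equix_proof` is a placeholder rather than the real equix crate API, and the leading-zero-bits difficulty rule is an assumption for illustration):

```rust
// Sketch of the submit-and-verify flow described above.
fn check_submission(challenge: &[u8; 32], nonce: u64, proof: &[u8]) -> Option<u32> {
    // 1. Verify the equix proof E is valid for (challenge, nonce). A miner who
    //    grinds blake3(E, N*) with N* != N fails here, because E is only a
    //    valid proof for the nonce it was solved with.
    if !verify_equix_proof(challenge, nonce, proof) {
        return None;
    }

    // 2. Only then hash (E, N) to get a difficulty with p(>= N bits) = 2^-N.
    let mut hasher = blake3::Hasher::new();
    hasher.update(proof);
    hasher.update(&nonce.to_le_bytes());
    let digest = hasher.finalize();

    // 3. Difficulty = number of leading zero bits of the digest.
    let mut difficulty = 0u32;
    for byte in digest.as_bytes() {
        if *byte == 0 {
            difficulty += 8;
        } else {
            difficulty += byte.leading_zeros();
            break;
        }
    }
    Some(difficulty)
}

// Placeholder: the real check would go through the equix crate off-chain and a
// syscall-friendly verifier in the program.
fn verify_equix_proof(_challenge: &[u8; 32], _nonce: u64, _proof: &[u8]) -> bool {
    unimplemented!("hypothetical stand-in for equix verification")
}
```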

OrenHash commented 6 months ago

I'll try to address all the questions I see (while hopefully making it clear when L1/L2 refer to blockchains vs CPU caches):

shekohex commented 6 months ago

@OrenHash we started rolling out Equix inside drillx, and we are curious to hear your thoughts on it.

OrenHash commented 6 months ago

> @OrenHash we started rolling out Equix inside drillx, and we are curious to hear your thoughts on it.

Well, in addition to the above, I think the biggest problem is that BLAKE3 (if that's what you've rolled out) is designed to be more easily parallelized, which is by definition an impediment to hardware democratization. (By "hardware democratization", I mean that it doesn't much matter what kind of circuit you spend $1 on; it won't make much difference to the rate at which you can try nonces; only buying more entire machines will do that. If that's actually not something you want, then so be it, but then you effectively become an also-ran heat generator like Bitcoin Cash or something.)

Anything that could be done by plugging an off-the-shelf hash into Equihash could be done more efficiently with a monolithic, fit-for-purpose iterator. Just because you can't afford GB-scale memory footprints doesn't mean that you can't make the task arbitrarily hard. For example, you could iterate over the same MB several times. It would just be more subject to ASIC optimization (but not nearly so much as starting with a parallelizable hash).

If I'm misunderstanding something then please explain.

HardhatChad commented 6 months ago

The blake3 hash is less than 0.01% of the work. The equix step is 7000x to 9000x more expensive (in time) than the blake3 step in benchmarks. Blake3 is just to get a difficulty measurement that makes sense for token payouts.

OrenHash commented 6 months ago

> The blake3 hash is less than 0.01% of the work. The equix step is 7000x to 9000x more expensive (in time) than the blake3 step in benchmarks. Blake3 is just to get a difficulty measurement that makes sense for token payouts.

If that's accurate, then nevermind the issues with BLAKE3; my concern would be Equihash itself. It would probably be an improvement on Ore Supply v1, but you won't get hardware democratization: "Antminer Z15, is designed for Equihash mining and is compatible with all cryptocurrencies based on the Equihash algorithm"

https://www.antminertech.com/shop/antminer-z15-equihash-miner

Perhaps their engineers read this 2017 paper, which seems about right, considering that their first Equihash ASIC was evidently launched back in 2018.

https://dl.acm.org/doi/10.1145/3140649.3140652

shekohex commented 6 months ago

> If that's accurate, then nevermind the issues with BLAKE3; my concern would be Equihash itself. It would probably be an improvement on Ore Supply v1, but you won't get hardware democratization: "Antminer Z15, is designed for Equihash mining and is compatible with all cryptocurrencies based on the Equihash algorithm"

Equihash has a pluggable hash function. What @tevador did was basically a modified version where he swapped the blake hash function from the original Equihash for his new hash function "hashx", which in turn is a variant of his original RandomX, modified for faster verification and a smaller memory footprint.

I always say this: Equix is like if RandomX and Equihash had a baby.

Someone on Discord benchmarked the different components of drillx, and it turns out that over ~80% of the work is in hashx; the remainder is in the Equihash wrapper, with a tiny bit in the final Blake3 on top. So I don't think even having a GPU implementation of Equix (or basically hashx) would make a huge difference between a GPU and a CPU. Building an ASIC is even more difficult, if not impossible, due to how hashx works.

HardhatChad commented 6 months ago

Yeah, there are no ASICs for Equix afaik. All Equihash ASICs I've seen are configured for the Zcash variation using n=200, k=9. According to the data in the tevador blog post, a CPU can do 300 sol/s on the Zcash variation. Let's assume this is on 125 watts of power. The Antminer you linked advertises 420k sol/s on 1510W. So the ASIC is only doing ~100x more hashes per watt than the CPU. This alone would be a major improvement over the keccak we used in v1.
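
Spelling out that arithmetic, with the assumed 125 W CPU figure from above:

$$
\frac{420{,}000\ \mathrm{sol/s}}{1510\ \mathrm{W}} \approx 278\ \mathrm{sol/s/W},
\qquad
\frac{300\ \mathrm{sol/s}}{125\ \mathrm{W}} = 2.4\ \mathrm{sol/s/W},
\qquad
\frac{278}{2.4} \approx 116 \approx 100\times
$$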

But regardless, the math in Equix is different. They swapped out blake for hashx, changed an xor to a sum, and use k=3, n=60. The claim is that all of this should make it harder to parallelize and even more friendly to the CPU.

(Screenshot attached: "Screenshot 2024-05-17 at 08 05 20")

OrenHash commented 6 months ago

@HardhatChad @shekohex First of all, I must confess that I read too fast and mistook your "equix step" for an "Equihash step" in my previous comment. Nevertheless, I stand by the comment, even though it's not so relevant to EquiX.

Anyhow my take on EquiX (from tevador's blog) is that it's probably not too bad for the purposes here. It's just way more complicated than necessary. This in particular was interesting:

"Some small fraction of HashX instances produce a significant number of hash collisions."

This points to a short limit cycle, which should never happen in a well-designed hash (not that tevador claimed it to be cryptographically hard). It's just a subjective comment, but such behavior shouldn't be easily detectable, if at all. Replacing XOR with addition (which does indeed enhance security) simply obfuscated this behavior without addressing the root cause (likely entropy dropping due to shifting or whatnot).

Not to be annoying but I'm not on board with the whole random instruction thing (the "X" in "EquiX"). That just makes such entropy losses harder to detect. It also (maybe counterintuitively) favors ASICs, and more straightforwardly FPGAs, which ironically tevador aimed to disadvantage. To be clear, ASICs can handle instruction variance by doing all the possibilities at once, then only finalizing the result that turns out to have been needed. They can even abort longer instructions when the right result becomes known. No need to waffle around with branch prediction, or branching at all. But CPUs have no choice, so they're disadvantaged. And as tevador pointed out, ASICs can do XOR faster because it's bitwise parallel. Furthermore, random instructions exercise only a small fraction of the logic in a modern CPU, so an ASIC designer can throw out everything else in exchange for packing more cores into the same transistor count.

But who cares because EquiX is memory-bound, right? The problem is that, as Bitmain has proven, that's a qualitative observation about Equihash (not EquiX specifically) which is not rigorously true.

Both realizations above jointly imply that one could achieve (maybe tremendous) performance gains using an FPGA, say just a rented one on Azure.

But rather than just sitting here complaining, I should probably ask: what's the stickiness of this hash algorithm commitment? Would you consider replacing it wholesale in v3, burning all the FPGA investors? Are only parameters going to change going forward, or the whole kitchen sink? It's your baby and I won't argue with your roadmap.