tromp / cuckoo

a memory-bound graph-theoretic proof-of-work system

is memory hard PoW possible? #102

Closed zack-bitcoin closed 4 years ago

zack-bitcoin commented 4 years ago

This paper gives an explanation for why memory hard PoW is not possible. Is there some reason that this problem doesn't apply to cuckoo cycle? Or is cuckoo cycle broken?

https://medium.com/@jeffrey.emanuel/loaded-pow-a-new-direction-in-proof-of-work-algorithms-ae15ae2ae66a

"Unfortunately, the “memory hard” approach suffers from an obvious flaw: if you could find a way to store all that data in memory once, using a lot of expensive DRAM, but then share this data across a large group of inexpensive processors, you would effectively share the cost across a large number of processors and thus undermine the supposed difficulty of the problem."

For a mining algorithm to be secure, someone who invests Nx more resources should earn only Nx more profit. Otherwise we are over-rewarding the richest miners, which causes centralization.

Since a single copy of the memory used for mining can be shared between an unlimited number of computational units running concurrently, it seems like a memory hard algorithm will always over-reward the richest miners.
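
To make the sharing concern concrete, here is a minimal C++ sketch (illustrative only, not ethash; the dataset size and worker loop are made up for the example) of one read-only dataset amortized across arbitrarily many cheap workers:

```cpp
// Sketch: why read-only memory sharing undermines memory hardness.
// One dataset is built once; any number of workers hash against it
// concurrently, so the memory cost is amortized across all of them.
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  // One shared, read-only dataset (tiny here; think gigabytes in practice).
  const std::vector<uint64_t> dataset(1 << 20, 0x9e3779b97f4a7c15ULL);

  auto worker = [&dataset](unsigned id) {
    uint64_t acc = id;
    // Each worker does read-only, data-dependent lookups into the SAME
    // memory; no worker needs a private copy.
    for (int i = 0; i < 1000000; i++)
      acc ^= dataset[acc % dataset.size()] + i;
    std::printf("worker %u: %016llx\n", id, (unsigned long long)acc);
  };

  std::vector<std::thread> workers;
  for (unsigned id = 0; id < 8; id++)  // scale the worker count freely...
    workers.emplace_back(worker, id);  // ...the memory cost stays fixed
  for (auto &w : workers) w.join();
}
```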

burdges commented 4 years ago

I doubt it. You might ask if memory-bandwidth-hard PoW is possible. I doubt that too, because all those different CPUs could pull data from different locations in the same memory array faster than one CPU could, but at least then you're asking for a bespoke memory controller.

It sounds plausible that an isogeny-based verifiable delay function could be made memory-bandwidth hard, but only proof-of-stake systems can exploit a verifiable delay function, not proof-of-work systems.

tromp commented 4 years ago

"This paper gives an explanation for why memory hard PoW is not possible."

No, it doesn't. It only points out that an ethash-style PoW, in which memory is used read-only, suffers from being shareable by many processors.

In Cuckoo Cycle, every instance (i.e. random bipartite graph) needs to completely rewrite memory. So every instance needs its own memory.
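
A minimal sketch of that point (toy sizes, and a placeholder mixer standing in for siphash-2-4, so this is illustrative rather than the reference miner): every trimming round rewrites the edge bitmap in place, so two graph instances cannot share memory the way ethash workers share a read-only dataset.

```cpp
// Sketch of lean edge trimming: the bitmaps are REWRITTEN every round,
// and a fresh random graph (new key) arrives every few hundred ms,
// so each instance needs its own writable memory.
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t sip(uint64_t key, uint64_t x) {  // placeholder mixer, NOT siphash
  x ^= key; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
  x *= 0xc4ceb9fe1a85ec53ULL; return x ^ (x >> 33);
}

int main() {
  const uint32_t NEDGES = 1 << 16, MASK = NEDGES - 1;  // real miners: 2^31+
  uint64_t key = 0x0123456789abcdefULL;      // would come from the header hash
  std::vector<bool> alive(NEDGES, true);     // edge bitmap: rewritten each round
  for (int round = 0; round < 20; round++) {
    for (int uorv = 0; uorv < 2; uorv++) {   // trim on the U side, then V side
      std::vector<uint8_t> deg(NEDGES, 0);   // node degree counters (saturate at 2)
      for (uint32_t e = 0; e < NEDGES; e++)
        if (alive[e]) {
          uint32_t node = sip(key, 2 * e + uorv) & MASK;
          if (deg[node] < 2) deg[node]++;
        }
      for (uint32_t e = 0; e < NEDGES; e++)  // kill edges with a degree-1 endpoint
        if (alive[e] && deg[sip(key, 2 * e + uorv) & MASK] < 2)
          alive[e] = false;
    }
  }
  uint32_t survivors = 0;
  for (uint32_t e = 0; e < NEDGES; e++) survivors += alive[e];
  std::printf("%u of %u edges survive 20 rounds\n", survivors, NEDGES);
}
```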

zack-bitcoin commented 4 years ago

How much time is spent analyzing a random bipartite graph before you realize it doesn't have a solution and move on to the next? I guess you have to look at every byte at least once, and it's something like 3.5 gigabytes, right? So I'd guess it takes at least about 0.5 seconds to scan that much RAM with a single CPU.

It seems like if you had 10 CPUs sharing the same memory, they could each scan different bytes at the same time and finish in more like 0.05 seconds.
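
To make the arithmetic explicit (assuming a single core streams roughly 7 GB/s; an assumption, not a measurement):

$$ t \approx \frac{3.5\ \text{GB}}{7\ \text{GB/s}} = 0.5\ \text{s}, $$

and ten cores sharing the same array would cut this to about 0.05 s, provided the memory system could actually sustain the combined 70 GB/s.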

zack-bitcoin commented 4 years ago

Is it expected that just loading the data into RAM is the expensive step, and finding whether it has a solution is relatively cheap?

tromp commented 4 years ago

A really fast GPU needs about 1/3 of a second to analyze a Cuckatoo31 graph, which involves writing 8-10.5 GB of DRAM multiple times. CPUs have way less memory bandwidth and are an order of magnitude slower.

ASICs can use a more efficient algorithm that needs only 512 MB of SRAM, such as this one: https://obelisk.tech/products/grn1.html
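
As a rough consistency check with assumed numbers (not measurements): if trimming one graph moves $D$ bytes of total DRAM traffic (all reads plus all writes over all rounds) on a card with memory bandwidth $B$, the memory-bound time is

$$ t \approx \frac{D}{B}, \qquad \text{e.g.}\ \frac{150\ \text{GB}}{450\ \text{GB/s}} \approx \frac{1}{3}\ \text{s}. $$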

zack-bitcoin commented 4 years ago

So what portion of the time is spent on writing data to memory vs reading/processing that data?

tromp commented 4 years ago

Everything that is written is also read, so writing and reading are pretty balanced. Hash computation is a much smaller fraction; maybe around 10-20%.
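
For context on that 10-20%: each edge costs exactly two siphash-2-4 calls (one per endpoint), and everything else is memory traffic. A self-contained sketch of the endpoint derivation, written from memory to mirror siphash.h/cuckoo.h in this repo (treat the details as approximate; check the source for the authoritative version):

```cpp
#include <cstdint>
#include <cstdio>

struct siphash_keys { uint64_t k0, k1, k2, k3; };  // derived from the header hash

static inline uint64_t rotl(uint64_t x, int b) { return (x << b) | (x >> (64 - b)); }

// SipHash-2-4 of a single 8-byte nonce, roughly as Cuckoo Cycle uses it
uint64_t siphash24(const siphash_keys &k, uint64_t nonce) {
  uint64_t v0 = k.k0, v1 = k.k1, v2 = k.k2, v3 = k.k3 ^ nonce;
  auto sipround = [&] {
    v0 += v1; v2 += v3; v1 = rotl(v1, 13); v3 = rotl(v3, 16);
    v1 ^= v0; v3 ^= v2; v0 = rotl(v0, 32); v2 += v1;
    v0 += v3; v1 = rotl(v1, 17); v3 = rotl(v3, 21);
    v1 ^= v2; v3 ^= v0; v2 = rotl(v2, 32);
  };
  sipround(); sipround();                          // 2 compression rounds
  v0 ^= nonce; v2 ^= 0xff;
  sipround(); sipround(); sipround(); sipround();  // 4 finalization rounds
  return (v0 ^ v1) ^ (v2 ^ v3);
}

const uint32_t EDGEBITS = 31;                      // Cuckatoo31
const uint32_t NODEMASK = ((uint32_t)1 << EDGEBITS) - 1;

// endpoint of edge `nonce` in partition U (uorv = 0) or V (uorv = 1)
uint32_t sipnode(const siphash_keys &k, uint32_t nonce, uint32_t uorv) {
  return siphash24(k, 2 * (uint64_t)nonce + uorv) & NODEMASK;
}

int main() {
  siphash_keys k = {1, 2, 3, 4};  // in reality: derived from the block header
  std::printf("edge 0: u=%08x v=%08x\n", sipnode(k, 0, 0), sipnode(k, 0, 1));
}
```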

zack-bitcoin commented 4 years ago

Is there no way to parallelize the reading/writing of this data to a shared memory?

zack-bitcoin commented 4 years ago

If I have a giant mining farm and spend thousands of dollars on electricity every day, would I need to store more than one copy of the cuckoo cycle work in different memory at the same time?

tromp commented 4 years ago

You would create/process an instance every 1/3 second on a fast GPU, taking up pretty much all of its memory.

zack-bitcoin commented 4 years ago

Yes, I know people are using GPUs to mine for now. I'm trying to imagine what the ideal cuckoo mining hardware would look like, and whether it would be significantly more efficient than a GPU.

zack-bitcoin commented 4 years ago

More importantly, I want to know whether ASIC mining of cuckoo cycle scales linearly with the amount of money the owner invests in mining hardware and electricity.

tromp commented 4 years ago

The ideal cuckoo mining ASIC has hundreds of MB of SRAM on a chip for the node bitmaps. The total graphrate is linear in the number of ASICs and the power used. Of course, a more expensive ASIC at a smaller process node (7 nm instead of 12 nm) would draw less power, so there's always a tradeoff between hardware cost and electricity cost.
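
For scale (Cuckatoo31 has $2^{31}$ nodes per side, so at one bit per node):

$$ 2^{31}\ \text{bits} = 2^{28}\ \text{bytes} = 256\ \text{MB per side}, $$

which is where "hundreds of MB" comes from, consistent with the 512 MB SRAM figure above.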

zack-bitcoin commented 4 years ago

Thanks for the quick answers to my questions. You are doing a great job on cuckoo cycle. The documentation is clear.

tromp commented 4 years ago

"Is there no way to parallelize the reading/writing of this data to a shared memory?"

This is already extremely parallelized in existing GPU miners, with thousands of threads active at every stage...

tromp commented 4 years ago

"Thanks for the quick answers to my questions."

You're welcome. And thanks for the praise. Feel free to close the issue if you have no further questions...

zack-bitcoin commented 4 years ago

OK. I have a link pointing here: https://twitter.com/zack_bitcoin/status/1163737778791862273?s=20 So if you have more ideas on this topic, you might want to add them to the closed thread.

zack-bitcoin commented 4 years ago

It seems like if it is possible to parallelize any aspect of the PoW computation, then we can still make a better ASIC. And if it is possible to make a better ASIC, then there will always be some people getting better-than-linear returns on their investment, which would be a centralizing force on the network.

And no matter how much we parallelize cuckoo cycle, it seems it is always possible to make it faster still by parallelizing even more.

I feel like we need some sort of proof that a given hardware design is optimal, and that the open source version is within, say, 5% of optimal, or some similarly small bound.

tromp commented 4 years ago

You cannot always parallelize more. You're limited by die space and memory bandwidth. C32 computes 4 billion siphashes in every trimming round, but you wouldn't build a chip with millions of siphash circuits. Similarly, you cannot do millions of simultaneous memory accesses; your on-chip memory buses cannot handle that. Once your ASIC reaches some number of TB/s of internal memory bandwidth, the cost of doubling that again more than doubles, and it makes more sense to just build more smaller, simpler units.
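
Illustrative arithmetic (assuming, say, 4 bytes of on-chip traffic per edge per round): the first trimming round of C32 touches all $2^{32}$ edges, moving about

$$ 2^{32} \times 4\ \text{B} \approx 17\ \text{GB}, $$

so trimming whole graphs at a useful rate already implies internal bandwidth in the high hundreds of GB/s to TB/s range, which is exactly where doubling again starts costing more than doubling the number of units.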