tromp / cuckoo

a memory-bound graph-theoretic proof-of-work system

Support for N > 32 (EDGEBITS > 32) #52

Closed dlongley closed 6 years ago

dlongley commented 6 years ago

@tromp,

Thanks for creating the Cuckoo Cycle and providing this excellent implementation!

This library has some defines to handle EDGEBITS > 32 but, for example, neither the lean nor the mean implementation compiles if you set it to 33. Here's some output from attempting this with the mean implementation:

g++ -march=native -std=c++11 -Wall -Wno-format -Wno-deprecated-declarations -D_POSIX_C_SOURCE=200112L -O3 -DPREFETCH -I.  -pthread -o mean34x1 -DNSIPHASH=1 -DEDGEBITS=33 mean_miner.cpp -L. -lblake2b
mean_miner.cpp: In function ‘int main(int, char**)’:
mean_miner.cpp:94:54: error: cannot convert ‘u32* {aka unsigned int*}’ to ‘edge_t* {aka long unsigned int*}’ for argument ‘1’ to ‘int verify(edge_t*, siphash_keys*)’
       int pow_rc = verify(prf, &ctx.trimmer->sip_keys);

Hacking the appropriate line to use a 64-bit type doesn't solve the problem, so before going further down the rabbit hole I figured I'd ask if EDGEBITS > 32 is intended to work without significant changes.
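
For reference, the hack I tried was along these lines; this just illustrates the width mismatch and is clearly not a complete fix (it compiles as a fragment against the repo's verify()):

// With EDGEBITS=33 the headers already widen edge_t to 64 bits (that's what
// the error message shows), but the proof buffer that main() hands to
// verify() is still declared as u32.
#include <cstdint>
typedef uint32_t u32;
typedef uint64_t u64;
typedef u64 edge_t;               // what the error message says edge_t is at EDGEBITS=33
#define PROOFSIZE 42

struct siphash_keys;              // opaque here; only the pointer type matters
int verify(edge_t *proof, siphash_keys *keys);

void attempt(siphash_keys *keys) {
  // u32 prf[PROOFSIZE];          // original declaration: triggers the conversion error
  edge_t prf[PROOFSIZE] = {0};    // widened: that error goes away, but the solver
  verify(prf, keys);              // internals still assume 32-bit edge indices,
}                                 // so EDGEBITS=33 still doesn't build and run cleanly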

For context: I'm interested in very long running (runtime of hours) memory-hardened PoWs. This is not for a consensus/mining mechanism, but as an authorization mechanism for writing data to a blockchain. It seems like Cuckoo Cycle could be a good fit. An option being considered is chaining solutions for multiple proofs together to enable a much longer running PoW without prohibitively high memory requirements. This causes a larger aggregate solution size -- something that I'd also like to minimize by finding the right balance. Hence, I'd like to know if this implementation supports larger graphs. I suppose requiring a longer cycle is also something that could be played with.

tromp commented 6 years ago

dear Dave,

Thanks for creating the Cuckoo Cycle and providing this excellent implementation!

You're welcome:-)

This library has some defines to handle EDGEBITS > 32 but, for example, neither the lean nor the mean implementation compiles if you set it to 33. Here's some output from attempting this with the mean implementation:

g++ -march=native -std=c++11 -Wall -Wno-format -Wno-deprecated-declarations -D_POSIX_C_SOURCE=200112L -O3 -DPREFETCH -I.  -pthread -o mean34x1 -DNSIPHASH=1 -DEDGEBITS=33 mean_miner.cpp -L. -lblake2b
mean_miner.cpp: In function ‘int main(int, char**)’:
mean_miner.cpp:94:54: error: cannot convert ‘u32* {aka unsigned int*}’ to ‘edge_t* {aka long unsigned int*}’ for argument ‘1’ to ‘int verify(edge_t*, siphash_keys*)’
       int pow_rc = verify(prf, &ctx.trimmer->sip_keys);

Hacking the appropriate line to use a 64-bit type doesn't solve the problem, so before going further down the rabbit hole I figured I'd ask if EDGEBITS > 32 is intended to work without significant changes.

No; it's not. It would require a lot of expertise to figure out appropriate settings to make it work. And it may still require actual source code changes.

For context: I'm interested in very long running (runtime of hours) memory-hardened PoWs. This is not for a consensus/mining mechanism, but as an authorization mechanism for writing data to a blockchain. It seems like Cuckoo Cycle could be a good fit. An option being considered is chaining solutions for multiple proofs together to enable a much longer running PoW without prohibitively high memory requirements. This causes a larger aggregate solution size -- something that I'd also like to minimize by finding the right balance. Hence, I'd like to know if this implementation supports larger graphs. I suppose requiring a longer cycle is also something that could be played with.

What exactly are your requirements for running time and memory use? Upper bounds? Lower bounds? And what aggregate proof size is acceptable or desirable?

regards, -John

dlongley commented 6 years ago

No; it's not. It would require a lot of expertise to figure out appropriate settings to make it work. And it may still require actual source code changes.

Thanks for the response! Good to know.

What exactly are your requirements for running time and memory use? Upper bounds? Lower bounds?

We have a target run time of 10 hours +/- 2 hours. Memory requirements are largely dependent on the goals of enabling the use of commodity hardware and preventing centralized systems with specialized hardware from circumventing or reducing the spam protection in a negative way. We've estimated 1GB-2GB but welcome other expert advice. The time/memory tradeoffs of Cuckoo Cycle here are also attractive (~order of magnitude less memory => ~order of magnitude slower).

And what aggregate proof size is acceptable or desirable?

Acceptable is 1KB with 2KB pushing it. Desirable is 0! :)

My understanding is that the proof sizes are under 200 bytes, so naively chaining roughly five of them together would be acceptable.

Do you think that implementing a difficulty factor (as described in the whitepaper) would be the way to go here?

tromp commented 6 years ago

dear Dave,

What exactly are your requirements for running time and memory use? Upper bounds? Lower bounds?

We have a target run time of 10 hours +/- 2 hours.

Is that on a modern Intel CPU with 4 or more threads? There will be some slowdown on systems without AVX2 (and they need to run a different executable).

Memory requirements are largely dependent on the goals of enabling the use of commodity hardware and preventing centralized systems with specialized hardware from circumventing or reducing the spam protection in a negative way. We've estimated 1GB-2GB.

The default mean30x8 miner uses 2.3 GB. There is no Makefile target for mean29x8, but the one for mean30x8 can be trivially adapted (using -DEDGEBITS=28) to make one. That will use only half as much memory.

A single run of mean30x8 with 4 threads takes 4 seconds on my iMac. Getting to 10 hours at 15 attempts per minute means requiring 10*60*15 = 9000 attempts. On average 214 of those will result in 42-cycles, but as you note this number can be arbitrarily reduced by putting a threshold on the cyclehash.
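
Concretely, such a threshold check could look something like this; just a sketch of the idea, not code from this repo, and it assumes the reference blake2b() API that the miners already link against:

// Hash the 42 recovered edge indices (the cyclehash) and only accept the
// attempt if the hash falls below a difficulty target, discarding the rest.
#include <cstdint>
#include <cstring>
#include "blake2.h"

bool meets_difficulty(const uint32_t proof[42], uint64_t target) {
  uint8_t hash[32];
  blake2b(hash, sizeof(hash), proof, 42 * sizeof(uint32_t), 0, 0);
  uint64_t prefix;
  std::memcpy(&prefix, hash, sizeof(prefix));  // leading 64 bits of the cyclehash
  return prefix <= target;                     // e.g. target = UINT64_MAX / 4 keeps 1 in 4 cycles
}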

The time/memory tradeoffs of Cuckoo Cycle here are also attractive (~order of magnitude less memory => ~order of magnitude slower).

And what aggregate proof size is acceptable or desirable?

Acceptable is 1KB with 2KB pushing it. Desirable is 0! :)

1KB only allows for 6 proofs of a 42-cycle (one proof is 168 bytes, or 153 bytes compressed).

Do you think that implementing a difficulty factor (as described in the whitepaper) would be the way to go here?

The problem you have is that many attempts can be run in parallel. The proofs you find can be chained in sequence, but that only cuts down potential for parallelism to 9000 / #proofs, which is still huge. If we shrink cycles to just length 8, then a proof fits in 29 bytes, and 1KB allows for a chain of 35 proofs. But we can still expect to find one proof after running 257 attempts in parallel.
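
The proof sizes above are just bit-packing arithmetic, by the way (a throwaway calculation, not code from this repo):

// PROOFSIZE edge indices of EDGEBITS bits each: stored as u32 they take
// PROOFSIZE * 4 bytes; bit-packed they take ceil(PROOFSIZE * EDGEBITS / 8).
#include <cstdio>
#include <initializer_list>
int main() {
  const int edgebits = 29;   // the mean30 setting discussed above
  for (int proofsize : {42, 8})
    printf("cycle %2d: %3d bytes raw, %3d bytes packed\n",
           proofsize, proofsize * 4, (proofsize * edgebits + 7) / 8);
  // cycle 42: 168 bytes raw, 153 bytes packed
  // cycle  8:  32 bytes raw,  29 bytes packed
}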

How bad is parallelism for your use case?

regards, -John

dlongley commented 6 years ago

Is that on a modern Intel CPU with 4 or more threads? There will be some slowdown on systems without AVX2 (and they need to run a different executable).

There's no assumption that it will be on a modern Intel CPU with 4 threads. But given our use case and that the target time is so long, we don't think that being off by a couple of hours is a problem. It's effectively: "Start your PoW and go to sleep. It will be ready in the morning." There is no direct reward for being the fastest to solve a proof, and the only potential indirect reward must be balanced against an alternative instant authorization mechanism for writing data to the ledger. I describe that a bit below ... I'm trying to keep it short and relevant to Cuckoo Cycle but some context may be required. :)

How bad is parallellism for your use case?

Well, it somewhat depends on the specific PoW. Some more context:

  1. The data that is written to the ledger is essentially a set of public keys and service information (see: DID Spec if you're interested). We do not expect significantly heavy writes in the same way you see with, for example, cryptocurrency ledgers.

  2. Obtaining authorization to write to the ledger can happen through two mechanisms: PoW or an authorization obtained from a third party that has agreed to pay a non-profit organization X USD per byte written to the ledger. That non-profit pays money to the entities that run elector nodes in the ledger and to the software maintainers. We expect users to pay Y USD to these third parties to get authorization (or become a third party themselves).

These two methods are essentially in conflict with one another -- the former starves the ledger of its funding and the latter replenishes it. The PoW is to ensure that permissionless writes are always possible; however, the goal is for the majority of writes to go through an entity that pays to help support the ledger. The system intentionally does not involve any tokens to avoid speculation and the debt incurred by ICO models. The goal is to create a relatively cheap public utility.

So, the PoW characteristics should help drive a strong preference for the mechanism that pays for the ledger whilst still allowing the PoW mechanism. A long running PoW pitted against an instant low cost alternative is likely sufficient to address this goal when considering individual users. However, the PoW requirements should also deter entities from setting up bulk PoW services, i.e., the PoW cost should be greater than the instant alternative.

If a bulk PoW service can significantly reduce the PoW time whilst beating the cost of the instant alternative, that could begin to starve the ledger's funding. So, this is primarily where parallelism could be a problem.

Other factors include that the proof solutions are stored on ledger (so we would prefer to minimize their size) and that a bulk PoW service could potentially precompute PoWs. The latter is at least partially mitigated by the fact that private key material must be paired against the data that is the input to the PoW, leading to trust issues with such a "precompute" service (as it would have to generate and hold user private keys until distributed). The non-profit, with the consent of the ledger nodes, also has the option to change the PoW (or potentially bring a lawsuit) should such a service become trusted and popular.

It could be that the best mechanism for preventing bulk PoW services from arising is to change the entire PoW mechanism often enough, though it would be ideal to avoid that kind of churn with simpler tuning for as long as possible.

Given the above, do you think a better approach is to turn the difficulty threshold on the cyclehash up very high such that we only need a single proof, or to choose something lower coupled with lowering the cycle length and increasing the number of chained proofs required? Or perhaps something else?

tromp commented 6 years ago

dear Dave,

If a bulk PoW service can significantly reduce the PoW time whilst beating the cost of the instant alternative, that could begin to starve the ledger's funding. So, this is primarily where parallelism could be a problem.

I see; you don't need to limit parallelism, you just need to lower bound the cost of obtaining a proof.

Your first worry should be that GPUs are an order of magnitude more efficient at producing proofs.

The second worry is that production of Cuckoo Cycle ASICs is imminent, what with Grin launching near year end with Cuckoo Cycle as PoW, yielding yet another order of magnitude reduction in cost per proof. The second worry can be mitigated though by just tweaking Cuckoo Cycle a little.

Given the above, do you think a better approach is to turn the difficulty threshold on the cyclehash very high such that we only need a single proof or to choose something lower coupled with lowering the cycle length and increasing the number of chained proofs required? Or perhaps something else?

The shorter the proof chain, the more variance you have in total time needed. You could be running 500 attempts and still be unlucky enough never to hit on a 42-cycle. Or, with difficulty threshold, you could be running 2000 attempts and be similarly unlucky never to hit a 42-cycle passing the 1 out of 4 difficulty threshold. So make proof chain long enough to give acceptable variance, then shorten cycle to get acceptable proof chain size, and finally pick threshold to get the right amount of total effort needed.
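
As a back-of-envelope model (my rough numbers, assuming an L-cycle turns up in roughly 1 in L attempts, which matches the 214-out-of-9000 figure above for L = 42):

// Expected total effort and spread for a chained proof with cycle length L,
// chain length n, and a cyclehash threshold keeping a fraction t of cycles.
#include <cmath>
#include <cstdio>

int main() {
  const double secs_per_attempt = 4.0;       // mean30x8 with 4 threads, roughly
  const int    cycle_len        = 8;         // L
  const int    chain_len        = 35;        // n, number of chained proofs
  const double threshold        = 1.0 / 32;  // t, fraction of cycles passing the cyclehash threshold

  double attempts_per_link = cycle_len / threshold;        // expected attempts per accepted proof
  double total_attempts    = attempts_per_link * chain_len;
  double hours             = total_attempts * secs_per_attempt / 3600;
  // attempts per link are roughly geometric, so the relative spread of the
  // total shrinks like 1/sqrt(n)
  double rel_spread        = 1.0 / std::sqrt((double)chain_len);

  printf("expected %.0f attempts, about %.1f hours, +/- roughly %.0f%%\n",
         total_attempts, hours, 100 * rel_spread);
}

Longer chains tighten the spread, shorter cycles shrink the stored proofs, and the threshold then scales the total effort; that's just the recipe above in numbers.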

But getting back to the earlier issue: how do you want to deal with GPUs?

regards, -John

dlongley commented 6 years ago

The second worry is that production of Cuckoo Cycle ASICs is imminent, what with Grin launching near year end with Cuckoo Cycle as PoW, yielding yet another order of magnitude reduction in cost per proof. The second worry can be mitigated though by just tweaking Cuckoo Cycle a little.

When you say tweaking a little -- are you referring to the tweaks we've already discussed or something more?

The shorter the proof chain, the more variance you have in total time needed. You could be running 500 attempts and still be unlucky enough never to hit on a 42-cycle. Or, with difficulty threshold, you could be running 2000 attempts and be similarly unlucky never to hit a 42-cycle passing the 1 out of 4 difficulty threshold. So make proof chain long enough to give acceptable variance, then shorten cycle to get acceptable proof chain size, and finally pick threshold to get the right amount of total effort needed.

Thanks, John, that's very useful. I think we can figure out the parameters we need.

But getting back to the earlier issue: how do you want to deal with GPUs?

Short answer: Wait until a problem arises, then increase the difficulty, potentially eliminating the option to use CPUs, or switch to a different PoW. We estimate that using cloud computing resources (renting GPUs) to do these proofs in 1 hour will not be at a lower total cost than the instant authorization method. Additional costs are incurred by a bulk PoW service via other mechanisms.

Longer answer: Our primary concern is with the threat of a bulk PoW service that can sell "instant" PoWs. If the service can't deliver in a near instant time frame, we believe that most users will opt for the actually instant mechanism that funds the ledger because the cost is already so low. To that end, we believe we've come up with a way to make it significantly difficult for a PoW service to efficiently provide precomputed PoWs -- such that whether they are using GPUs or not may be a separable issue.

By requiring the chained proof to mix in proofs of existence (hashes from contemporaneous blocks in the blockchain) we can cause ledger nodes to reject PoWs that do not fall into an acceptable window. We can also keep the proof size relatively small by merely referencing these hashes via block height in the proof itself.
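
To make that concrete, here's a rough sketch of the link structure we have in mind (hypothetical names, nothing that exists yet; the hash is a stand-in just to keep the sketch self-contained):

// Each link's header commits to the data being authorized, to a
// contemporaneous block's hash, and to the previous link's cycle, so links
// can only be produced in order. Only the block height is stored on ledger;
// verifiers re-derive the block hash and reject links whose referenced block
// falls outside the acceptable recency window.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct ProofLink {
  uint64_t blockHeight;           // block whose hash was mixed into this link's header
  std::vector<uint32_t> cycle;    // Cuckoo Cycle solution found for that header
};

// Placeholder hash; in practice this would be blake2b or similar.
static std::string toyHash(const std::string &bytes) {
  return std::to_string(std::hash<std::string>{}(bytes));
}

std::string linkHeader(const std::string &dataDigest, const std::string &refBlockHash,
                       const ProofLink *prev) {
  std::string buf = dataDigest + refBlockHash;
  if (prev)
    buf.append(reinterpret_cast<const char *>(prev->cycle.data()),
               prev->cycle.size() * sizeof(uint32_t));
  return toyHash(buf);
}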

While it is true that, for each link in the chain, some hardware may more efficiently compute a solution (and therefore be freed up to be used for other work), the total chain cannot be successfully submitted until the requisite time period has passed (Note: there are exceptions to this rule when the ledger is idle, but submitting just one PoW to the ledger removes this idleness property, and the conditions for even attempting to make a profit here require a non-idle ledger).

This forces a viable bulk PoW service to precompute PoWs -- there is no option to compute them instantly. This means that the PoW service must do two things: first, cover the cost to securely generate, store (and destroy) private keys and, second, mitigate wasted effort (generated PoWs that never get sold). It must either drive the cost of the PoWs to near zero or develop sophisticated logistics software to maintain a profitable level of efficiency. If a PoW service is successful at either of these, it can be combated by increasing the difficulty required or changing the PoW.

Under certain conditions, the difficulty may need to reach a threshold that prohibits individuals from easily using the current PoW. It's possible that this will be an acceptable outcome at that time. If not, it could potentially be addressed by switching to a new PoW. Perhaps more ideally, the need for a PoW at the ledger layer could be removed entirely because a healthy ecosystem of "instant" authorization mechanisms has surfaced in the meantime. Given that sort of ecosystem, experimenting with PoWs or accepting various other forms of payment can be pushed out to the edges (to the decentralized entities that can hand out instant authorizations).

tromp commented 6 years ago

dear Dave,

The second worry is that production of Cuckoo Cycle ASICs is imminent, what with Grin launching near year end with Cuckoo Cycle as PoW, yielding yet another order of magnitude reduction in cost per proof. The second worry can be mitigated though by just tweaking Cuckoo Cycle a little.

When you say tweaking a little -- are you referring to the tweaks we've already discussed or something more?

Something more; you'll need to solve a different class of graphs. Which you'd do by slightly changing the mapping from indices to endpoints. You could for instance use siphash_3_4 or siphash_2_5 instead of siphash_2_4.
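
For illustration, a round-parameterized siphash on a single 64-bit nonce looks roughly like this (it follows the reference SipHash round function; the repo's siphash.h differs in some details):

#include <cstdint>

#define ROTL(x,b) ((uint64_t)(((x) << (b)) | ((x) >> (64 - (b)))))

// One SipRound from the reference SipHash description.
static inline void sipround(uint64_t &v0, uint64_t &v1, uint64_t &v2, uint64_t &v3) {
  v0 += v1; v1 = ROTL(v1,13); v1 ^= v0; v0 = ROTL(v0,32);
  v2 += v3; v3 = ROTL(v3,16); v3 ^= v2;
  v0 += v3; v3 = ROTL(v3,21); v3 ^= v0;
  v2 += v1; v1 = ROTL(v1,17); v1 ^= v2; v2 = ROTL(v2,32);
}

// siphash_C_D of one 64-bit word (the edge index), keyed by k0,k1: C
// compression rounds and D finalization rounds. 2,4 is the usual choice;
// 3,4 or 2,5 changes the index-to-endpoint mapping, i.e. the graph class.
template<int C, int D>
uint64_t siphash_cd(uint64_t k0, uint64_t k1, uint64_t nonce) {
  uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
  uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
  uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
  uint64_t v3 = k1 ^ 0x7465646279746573ULL;
  v3 ^= nonce;
  for (int i = 0; i < C; i++) sipround(v0, v1, v2, v3);
  v0 ^= nonce;
  v2 ^= 0xff;
  for (int i = 0; i < D; i++) sipround(v0, v1, v2, v3);
  return v0 ^ v1 ^ v2 ^ v3;
}

A verifier that expects the 2,4 mapping will reject proofs built with the 3,4 one, which is the point of the tweak.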

Longer answer: Our primary concern is with the threat of a bulk PoW service that can sell "instant" PoWs. If the service can't deliver in a near instant time frame, we believe that most users will opt for the actually instant mechanism that funds the ledger because the cost is already so low. To that end, we believe we've come up with a way to make it significantly difficult for a PoW service to efficiently provide precomputed PoWs -- such that whether they are using GPUs or not may be a separable issue.

By requiring the chained proof to mix in proofs of existence (hashes from contemporaneous blocks in the blockchain) we can cause ledger nodes to reject PoWs that do not fall into an acceptable window. We can also keep the proof size relatively small by merely referencing these hashes via block height in the proof itself.

While it is true that, for each link in the chain, some hardware may more efficiently compute a solution (and therefore be freed up to be used for other work), the total chain cannot be successfully submitted until the requisite time period has passed (Note: there are exceptions to this rule for cases when the ledger is idle, but submitting just one PoW to the ledger removes this idleness property, and, the conditions for even attempting to make a profit here require a non-idle ledger).

This forces a viable bulk PoW service to precompute PoWs -- there is no option to compute them instantly.

I don't quite follow. Why couldn't I buy a proof from the bulk provider just prior to the end of the requisite time period? At that point the bulk provider can insert proofs of existence of any desired time in the past.

regards, -John

dlongley commented 6 years ago

Something more; you'll need to solve a different class of graphs. Which you'd do by slightly changing the mapping from indices to endpoints. You could for instance use siphash_3_4 or siphash_2_5 instead of siphash_2_4.

Ok, sounds good.

I don't quite follow. Why couldn't I buy a proof from the bulk provider just prior to the end of the requisite time period? At that point the bulk provider can insert proofs of existence of any desired time in the past.

If the bulk provider can't generate a proof for you in near instant time, the presumption is that you will buy an authorization from someone else (for a marginally higher price) that can. The alternative instant method provides this option. Knowing proofs of existence from the past does not enable the bulk provider to generate proofs more quickly. Rather, it forces precomputed PoWs to expire.

In order for the bulk provider to generate the proof quickly enough, they must have vastly superior hardware (which raising the difficulty threshold can potentially address) or they must have actually precomputed it. To use a precomputed value means to trust the bulk provider to have securely generated and stored your private keys -- and to destroy them once they hand over the proof to you. Furthermore, the bulk provider must ensure that enough users show up to purchase the precomputed proofs they've already spent resources on -- or they will be considered losses, as the proof of existence mechanism forces them to expire.

tromp commented 6 years ago

In order for the bulk provider to generate the proof quickly enough, they must have vastly superior hardware (which raising the difficulty threshold can potentially address)

Yes; I imagined a big GPU mining farm with tens of thousands of GPUs that could use a small fraction of them to generate the required proof chain in a few seconds.

But perhaps that is too much overhead for them given the tiny proceeds...

regards, -John

dlongley commented 6 years ago

But perhaps that is too much overhead for them given the tiny proceeds...

That's the gamble. If we begin to see that that gamble was wrong then we can adjust the difficulty threshold. Using a big GPU mining farm for this task is probably a dramatic misappropriation of resources. I would expect using them to mine cryptocurrency would yield a more reasonable reward -- and I believe we have the levers to make that true if we begin to find out otherwise.

tromp commented 6 years ago

dear Dave,

do you mind if I mention your project at the bottom of the Cuckoo Cycle project page? If not, then please provide a URL to link to. Thanks!

regards, -John

dlongley commented 6 years ago

do you mind if I mention your project at the bottom of the Cuckoo Cycle project page?

Not at all. Thanks again for creating Cuckoo Cycle!

Link: Veres One

tromp commented 5 years ago

dear Dave,

do you mind if I mention your project at the bottom of the Cuckoo Cycle project page?

Not at all. Thanks again for creating Cuckoo Cycle!

Link: Veres One

I tried to find mention of Cuckoo Cycle anywhere on the website, in vain:-( Is Veres One still using Cuckoo Cycle?

regards, -John

dlongley commented 5 years ago

@tromp,

I tried to find mention of Cuckoo Cycle anywhere on the website, in vain:-( Is Veres One still using Cuckoo Cycle?

We still plan on using some variant of Cuckoo Cycle, but we're working out which one. We haven't updated the Veres One site in a while as we've been heads down doing a big development push (on many items, not just related to PoW). We saw that there are now Cuckatoo and Cuckaroo variants in the works. We've written some libraries to add proof chaining for our use case here:

https://github.com/digitalbazaar/cuckoo-cycle/tree/dev

And some node.js wrappers for your solver engine here:

https://github.com/digitalbazaar/cuckoo-cycle-engine-tromp/tree/dev

And we're putting together an implementation to make some variant of cuckoo cycle a Linked Data Proof here:

https://github.com/digitalbazaar/cuckoo-ldp/tree/implementation

At the moment, the main thing that's actually driving which variant we'll use is how well supported we think the code base will be and how easily we can run that code on consumer systems and via platforms like node.js. We were doing some work in our fork to make the code more library friendly here, but my understanding is that some of that may be getting incorporated already:

https://github.com/digitalbazaar/cuckoo/tree/lib

Anyway, all of this has us trending away from "cuckoo" and towards either "cuckatoo" or "cuckaroo". I think we'd probably prefer "cuckaroo" as it targets GPUs, but it seems like we may use "cuckatoo" to begin with since the code appears to be more ready given our timeline.

Our model has changed slightly to allow for more agility with respect to which PoW we go with, which is especially vital given the recent flux we're seeing with cuckoo cycle. We will certainly have updates to our Veres One site once we get through our current development push. We will include details on what PoW we've chosen -- and will certainly give attribution to you!

tromp commented 5 years ago

dear Dave,

We still plan on using some variant of Cuckoo Cycle, but we're working out which one. We haven't updated the Veres One site in a while as we've been heads down doing a big development push (on many items, not just related to PoW). We saw that there are now Cuckatoo and Cuckaroo variants in the works. We've written some libraries to add proof chaining for our use case here:

Thanks for the update. Nice to see Cuckoo Cycle code in development.

Anyway, all of this has us trending away from "cuckoo" and towards either "cuckatoo" or "cuckaroo". I think we'd probably prefer "cuckaroo" as it targets GPUs,

It just aims to prevent lean mining, which is easier for ASICs. Grin ASICs will target cuckatoo, while Aeternity ASICs will target their choice of plain Cuckoo Cycle. So cuckaroo is indeed a good choice for limiting the attainable advantage over GPUs.

regards, -John