tromp / cuckoo

a memory-bound graph-theoretic proof-of-work system
Other
822 stars 173 forks source link

Speed challenge #18

Closed xenoncat closed 7 years ago

xenoncat commented 7 years ago

Performance summary: 3x speed at the cost of 20x memory usage. The speed advantage can be 4x or higher, depending on your computer and multithreading mode.

This work does not actually follow the bounty requirements.

  1. Only cuckoo28 and cuckoo30 are prepared. cuckoo32 is not available mainly because it is estimated to require 16GB memory and my computer does not have that.
  2. There is no SHA-256 hashing to convert 80-byte header to (k0,k1). Instead, you supply (k0,k1) directly from the command line. The reason is because I want to reduce clutter and focus on the Cuckoo Cycle.
  3. No mining mode because there is no SHA-256 hashing. The program only does one-shot cycle finding for a (k0,k1) input.
  4. You should trust the timing printed by my program. It excludes memory allocation time, and it makes sense to do so to gauge mining performance. You can still use Linux time command to see that the difference is reasonable.

Despite the above, this submission is interesting for the following reasons.

  1. I think the DRAM latency problem is bypassed and the performance is somewhat limited by memory bandwidth instead, but I could be wrong.
  2. Your reference implementation now becomes the TMTO.
  3. It sounds cool to perform Cuckoo Cycle PoW without Cuckoo hashing. The algorithm used is extreme form of edge trimming followed by naive traversing of graph to find cycle. The main ingredient for the edge trimming is bucket sorting to detect collision.
  4. This program contains multiple techniques put together to make it fast.

A description of the algorithm is still in progress. It may take some time, as I am busy with full-time job. The program currently outputs a lot of progress and debug information. Expect it to be buggy.

https://github.com/xenoncat/cuckoo_pow

tromp commented 7 years ago

Thanks for taking up the challenge, xenoncat! Testing on my iMac, I see a speedup on cuckoo28 from 11s to 4s with linux time. Mac OSX doesn't support the clock_gettime() so I can't get your alloc-free time yet. I changed my program to print out the (k0,k1) keypair so I can match results, and they indeed have matching solutions. Looks you have a winner here, at least for the x2 bounty, and possibly for the 4x one. I'll do more testing on a Linux box later this weekend.

Could you prepare a cuckoo32 version as well for me to try, even without being able to test it yourself?

On a 4GHz Intel Core i7-4790K, the minimum speedup I measured is 2.88x (from 7.46s to 2.59s). I expect a cuckoo30st and cuckoo32*, if implemented, will show a greater speedup. Since your average speedup is well beyond this, I shall award you $5000 (which happens to equal my Runner Up prize for the Zcash Open Source Miner Challenge). Would you like this paid in USD or in BTC?

xenoncat commented 7 years ago

Cuckoo32 uploaded. 16GB is indeed an accurate estimation. Very limited testing was done. On a borrowed machine, Core i7-4770K: cuckoo32stx8 vs cuckoo32_1t: 251s to 53.9s cuckoo32x8 vs cuckoo32_8t: 88.6s to 14.6s I currently don't have an idea on how to get a good speedup on single thread. cuckoo28_st and cuckoo28_1t perform similarly.

For your information, "dead end" messages are harmless debug messages expected when the edge trimming is not fully completed. Incomplete edge trimming is signaled through "24 loops" in cuckoo28 and cuckoo30, or "30 loops" in cuckoo32. When you get "dead end" messages, you may also get "unexpected degree-2 traversal". I should have suppressed the message but forgot to do so.

Please send BTC to 1DAUXeA3GKbTCrHCRQLCrbzQZwBA7hYVjD Thanks.