mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0

Creation of cuckoo-miner library and integration into grin #70

Closed yeastplume closed 7 years ago

yeastplume commented 7 years ago

Grin is going to need a Rust library to call into John Tromp's Cuckoo mining code, for access to the latest and greatest Cuckoo mining algorithms, and the most logical way to do this is to maintain a separate library/Rust crate to provide it. A project has been started to address this here:

https://github.com/mimblewimble/cuckoo-miner

For design goals, the library should try to maintain compatibility with what's already in the Cuckoo repository as much as possible, to make it relatively easy to update with any changes or new algorithms that come along. The caller should also be able to select which algorithm/cuckoo version it wants to use, and be able to pass in any parameters needed.

Once this library exists, grin's miner should be changed to use it (possibly removing the existing simple-miner implementation), and options need to be provided to make the miner implementation selectable.

yeastplume commented 7 years ago

Now unfortunately, although the above design goals are a nice ideal, after reviewing the cuckoo repository it's clear there's going to be some pain here.

The miners contained in the current cuckoo project are clearly (and understandably) designed as reference implementations, optimised for speed and efficiency rather than re-use. Parameters to the algorithms (primarily EDGEBITS, which Grin manipulates dynamically) are set via preprocessor directives, with values specified in a makefile and separate executables built for each size. It also won't be possible to include all of the code in a single library as-is, as different versions of the miner contain conflicting function names.

The implementations tend to be C++, which can't be called directly from Rust, so C wrappers are going to be needed regardless; any main functions currently there would basically have to be replaced with C wrappers callable from Rust.

So, a few options here:

1) Convert the algorithms to Rust outright, which I don't think is a runner, given that it would be good to be able to (relatively) easily integrate changes or algorithms from any new findings.

2) Maintain a separate version of the repository, with files modified to ensure different implementations are in unique namespaces, current preprocessor directives changed to dynamic values, and main functions replaced with Rust-callable C wrappers. This would obviously have to be done in a way that minimises divergence from the code in the existing cuckoo repository, to make it relatively straightforward to manually merge changes as and when they happen. This could produce a single static library allowing Rust to select and call the various versions of the miners dynamically.

3) Try to compile all of the many variations of the miner currently found in cuckoo's makefile into separate DLLs, then create a C wrapper to expose whatever functions are needed and call into each. A top-level Rust-callable library would then instantiate and invoke each library as requested by the caller. This approach would probably go the furthest towards staying compatible with the existing source, but is also going to be the messiest from a build standpoint. Also, if we're trying to avoid changing the base code, we'll need a separate lib for each size variation of each miner.

I'm very happy to hear more thoughts or suggestions here, but I think option 2 would ultimately be the least painful. We'd have to maintain our own distinct versions of the miners, but changing preprocessor values to values read at runtime would provide the flexibility the library needs (probably with a mild performance hit), and giving all the functions unique namespaces would make integration far easier. Although updates would be a very manual process, you'd hope they'd be straightforward, as the code would look more or less the same.

ignopeverell commented 7 years ago

I agree number 2 is likely the best choice. Number 1 is appealing but it'd be a fair amount of duplicated effort to end up with something probably slower.

Btw, the final solution will also have to be fairly GPU-aware, parallelizing Cuckoo computations on each detected card. Version 1 could simply parse the output of a CUDA command line, though (not sure if there are good Rust CUDA libs).

yeastplume commented 7 years ago

Yes, I have the cuda miner on my distant radar. I think I'll start off by converting and integrating the reference miner found in cuckoo_miner.cpp and get that working end to end before moving on to integrating others.

urza commented 7 years ago

Something to be aware of: some of the miners in Tromp's repo (mean_miner specifically) contain inline assembly with very specific instructions (AVX2) that only relatively recent CPUs (Haswell+) are able to execute. There is a directive to compile without these instructions, at the price of a 2x slowdown (by our measurements). It should probably be possible to optimise this with SSE2 instructions for older CPUs to avoid the 2x slowdown.

yeastplume commented 7 years ago

Okay, thanks for pointing that out. For this reason and others (mainly that the max edge and node space is typedefed in at compile time based on the cuckoo size), we're probably going to end up with a couple of different versions of the library, possibly a main library that loads the appropriate DLL based on whatever is detected on the user's system. I'll get the first few cases working anyhow, and we can consider this in a refactor later.

yeastplume commented 7 years ago

Bit of an update: it's actually looking like the 3rd option, i.e. a range of DLLs implementing the various size options and other flags (such as the AVX2 instructions above), is definitely the best way to go to avoid extensive changes to the source libs. The problem is that they're very const- and preprocessor-heavy (again, not a criticism; they're just designed for speed), and doing something as simple as changing a const to a variable has waterfall effects: an array size can no longer be deduced at compile time and needs to become dynamic, changing the structure, and so on. I think I'm going to go with this strategy for a first out, simply to minimise changes to the originals.

ignopeverell commented 7 years ago

Won't that be really difficult to port from one OS to another? Even between Linux/Mac?

yeastplume commented 7 years ago

No, I don't think so. I work on Mac some of the time, and even the current checked-in version supports both. Cargo is what's actually producing the DLLs and exposing the interface, and it automatically builds .dylib on Mac, .so on Linux, and .dll on Windows. So the C libs themselves are built statically, and Cargo links them and puts the DLL wrapper around them. I already have a library in place to load/unload them, and the project now builds a few variations on the C source, so it seems the architecture will work.
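For reference, the Cargo side of this could be as simple as a manifest asking for a C-compatible dynamic library; Cargo then emits the platform-native format (.so/.dylib/.dll) automatically. The crate name below is hypothetical.

```toml
# Hypothetical plugin crate manifest; Cargo picks the platform-native
# dynamic library extension at build time.
[package]
name = "cuckoo_simple_plugin"
version = "0.1.0"

[lib]
crate-type = ["cdylib"]   # C-compatible dynamic library
```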

The more I get into this now, the more I think this approach is better, and it's starting to take shape in my head. Think of it as a plugin architecture: each DLL can expose a set of interface functions reporting its name/version and whether it can run on the current platform, a function for setting parameters, and of course a mine function. These can all sit in a particular directory of the installation, and the library can query them to expose what miners it can currently use. This will cover the specific-instruction issue above, as well as allow sending parameters into miners when needed.

Then adding a new miner is trivial: just produce a dll exposing that interface (using whatever language they like), drop it in the directory, have the system select it somehow and nothing else has to change.

yeastplume commented 7 years ago

Right now cuckoo-miner works when integrated into grin, and the simple and edge_trim variants work quite well if dropped in instead of the default grin implementation. I've been able to successfully test the CUDA version, which finds solutions for cuckoo30 in about a second (on a 980ti). Instructions on how to play with this are in the cuckoo-miner docs.

Just an update on what I have planned here for the next couple of weeks, in roughly the order I intend to do them. Mostly about refining the architecture some and making what's there more robust.

- Changing the interface into the plugin DLLs to allow them to report what parameters they accept, and implementing a 'setparameter' function on each one, as the parameters can differ between implementations and I want to keep the main 'mine' function clean and standardised.
- More changes to the mining setup in grin to allow mining plugins and their parameters to be specified in the new config file.
- The 'is_test', cuckoo_size, TEST_SIZESHIFT, etc. parameters in grin are starting to get a bit muddled; I want to change this to ensure the cuckoo size comes from one place and is consistent across everything that uses it.
- Implementation of mean_miner and tomato_miner DLL variants (the mean one in particular might be challenging, as it tends to exit(0) without cleaning up its threads in a lot of cases).
- Tests for the currently implemented plugin variants (simple, edgetrim, CUDA) to ensure they're not leaking, segfaulting, etc.
- Perhaps extending the plugin interface to allow plugins to report errors (especially important in the case of CUDA).
- More tests to exercise the plugins and ensure the results are consistent across implementations.
- Packaging of the cuckoo-miner crate, and uploading an initial 0.1.0 version to crates.io to allow for seamless inclusion in a developer's grin build.

yeastplume commented 7 years ago

Just some more thoughts here on this, in case anyone has any comments.

The design of the current version, as integrated into Grin (and tagged as grin_integration), was implemented fairly simply to keep backwards compatibility with what was already there and to allow for some testing. The main interface is fairly simple, mostly just a 'mine' function that accepts a header and spits back the result of its computation. This is fine for the time being and a first out, but it's fairly evident now that cuckoo cycle mining in the real world using cuckoo 30 or the like is going to be an entirely GPU-based affair.

If cuckoo 30 is used, then depending on the eventual implementation, the number of searches per second will be on the order of low single digits on a bleeding-edge card. I might be wrong depending on the results of John's current rewrite, but I don't think I'm off by an order of magnitude. The rate will be higher if a smaller graph size is used, but I'm not sure that's desirable.

In any case, the workflow currently involves:

- grin creating a hash from the chain,
- grin sending the hash to cuckoo-miner for a search, and having the result returned from cuckoo-miner,
- if a result is found, doing a difficulty check,
- if no joy, creating another hash and repeating,
- repeating this for a set interval, then collecting transactions again and trying again.

This is fine in the simple case of an extremely casual CPU mine, but falls apart where you have multiple GPUs and want to mine with them all. The idea is obviously that you want to keep all of the GPUs occupied as close to 100% of the time as you can, and only stop when you have a valid solution. There are a few ways this can be done; I'd like a fairly simple solution that avoids awkward threading and callbacks, and is also able to collect transactions from the pool more or less continually (because collecting transactions from the pool is hopefully going to be many orders of magnitude quicker than a cuckoo cycle attempt at cuckoo 30). One possibility is:

On the plugin side:

- An input queue of possible hashes is continually fed by grin (up to a predefined limit), along with a potential block ID used to identify it back in Grin if a solution is found (and a difficulty target).
- Hashes are read from the queue and given to the next available GPU as and when GPUs come free.
- Results for each hash are placed on a result queue (with the vast majority being "no-hit").
- When a 42-cycle is found, the difficulty calculation is performed within the plugin (instead of in grin). If the difficulty check doesn't pass, everything continues as normal.
- Once a difficulty-appropriate solution is found, all GPU activity stops, the queue is cleared, and the solution is placed in the output queue.

The grin mining thread/loop is then changed to a simple:

- If the plugin's input queue isn't full, create a possible block with an up-to-date transaction set, give it an ID, store it temporarily, and send the hash, ID, and difficulty target to the plugin.
- If there's something in the plugin's output queue, pull out the result and check it.
- If the result is 'no solution', remove the corresponding block from temporary storage.
- If there is a solution, send it off for validation, move on to the next block, and clear all storage.

That's a first pass at how it might work. If anyone has any other thoughts or opinions on ways to do this more simply or more efficiently, I'd appreciate hearing them, especially around my assumptions on how long collecting transactions for each attempt is likely to take.

ignopeverell commented 7 years ago

At a high level it sounds good to me. You may want to consider passing the whole header to the plugin, instead of just the hash. It may be beneficial to allow the plugin to mutate the nonce and do the hashing.

I would also double-check that nothing in there is incompatible with Stratum; it'd be nice not to have to re-implement most of this for pool mining. The queuing logic seems like it'd play nice with Stratum, at least.

yeastplume commented 7 years ago

I'd considered passing in the whole header and allowing the plugin to mutate it, but that would also have the effect of making the library very grin-specific (which may not be that bad a thing for now). Still, I don't want to make it sub-optimal just to keep it generic.

Actually, mining.notify in Stratum seems like a good model for how this should work. I'll try and keep the implementation as close to that as I can.

yeastplume commented 7 years ago

Looking at Stratum further, it looks like it's currently very bitcoin-specific, but for a 'job' I think we can do something similar to Stratum: send in pre_hash and post_hash, where pre_hash is the serialised block header up to and not including the nonce, and post_hash is the rest of it after the nonce (not including the POW, obviously). The miner should then be able to mutate the nonce as needed and send units of work to the plugin. That should keep it reusable.

Could probably do the queue and job management on the rust side and leave the plugins themselves as dumb as possible, just reporting when they're free for a job.

If we look at this as job-based, rather than continually feeding a queue with up-to-the-millisecond transaction sets as I had suggested earlier, then there's still the notion of when to stop all jobs to let new transactions in. In practice, you'd probably want to stop all jobs for a new transaction set once you've detected a 'more profitable' combination of transactions (perhaps on another thread, based on some threshold), where it's calculated to be 'more profitable' to stop everyone and work on a job with the new transaction set instead. I know we don't yet know what 'more profitable' means in the context of grin mining, but it's something to keep in mind.

Might hit the PlantUML before implementing.

yeastplume commented 7 years ago

I think perhaps we can close this particular issue on the grin side. I've added a list of issues that reflect planned work on the cuckoo-miner side:

https://github.com/mimblewimble/cuckoo-miner/issues