poanetwork / hbbft

An implementation of the paper "Honey Badger of BFT Protocols" in Rust. This is a modular library of consensus.
Other
357 stars 96 forks source link

Enable DynamicHoneyBadgers to rejoin after connection loss #366

Closed sgeisler closed 5 years ago

sgeisler commented 5 years ago

Implementing an epoch setter for the DynamicHoneyBadgerBuilder enables the creation of a DynamicHoneyBadger that will join the consensus at a given epoch (fixes #365 ).

I'm not yet familiar enough with your code base to know what testing would be expected for that change and where to put the tests. You seem to mainly rely on integration tests, so I guess I could patch TestNetwork to support late joiners and then add a test with one?

TODO:

vkomenda commented 5 years ago

I need to think about the problem more. The part that I understood could be done at the networking layer as I wrote in #365.

afck commented 5 years ago

Did you try out whether this change would actually work, even after some nodes joined and left? My concern is this: The epoch exposed to the user is the sum of the internal era and epoch. Every time the set of validators changes, the internal HoneyBadger instance is restarted at epoch 0, and the era is set to the current epoch + 1. Rejoining will only work if the new node's era and epoch match the network's. It doesn't suffice if the (user-visible) sum matches.

afck commented 5 years ago

Regarding the tests, our plans in the long term are to replace tests/network with tests/net, so it would be better to use VirtualNet.

sgeisler commented 5 years ago

The epoch exposed to the user is the sum of the internal era and epoch

Oh, I didn't notice that, I always thought of them as two different things (era for key sets, epoch for rounds in an era). Actually my use case stays at era 0 forever, so I never tried to get the current era :smile: I just wanted to use the queuing HB. Since that's probably not the standard use case there should be two "getters" imo: one for the era and one for the epoch.

If that's too much of a change for you I'm ok with just forking and maintaining that diff myself and maybe try to build my own queuing HB on top of HB instead of dynamic HB later.

mbr commented 5 years ago

I can't see at the moment how manually setting the epoch on construction could result in an invalid state (you may end up with a confused honey badger, but not an invalid one). So I guess this is okay?

When rejoining after a connection loss, would it not be prudent to keep the old algorithm instance around? After all, it's built to be asynchronous; as @vkomenda said, it sounds like a networking problem (the honey badger would ideally sort itself out).

Crashing the program is a different matter however, so there's definitely merit in recovering from that by storing state and picking up where one left off.

I'll leave it up @afck and @vkomenda to judge if these changes make sense, purely from an implementation standpoint they look good to me =).

@sgeisler: We recently (just now!) changed the API regarding the passing of random number generators, those lines are in close proximity to your changes, so you will have to rebase. Sorry.

Also, the TestNetwork is, as @afck pointed out, being phased out in favor of VirtualNet. New-style tests use proptest and the net module; see tests/README.md for a detailed intro of how the new framework works. It's also fairly well documented (I hope).

afck commented 5 years ago

@mbr: Yes, in case of a short-term connection loss, the Honey Badger should just keep going! The application logic must make sure that all messages are delivered after reconnecting. SenderQueue does that automatically. After a crash or a long disconnect, however, this makes sense to me.

I'm fine with merging this, if @vkomenda is.