mosaicnetworks / babble

Distributed Consensus Middleware
MIT License
478 stars 95 forks source link

BadgerStore #127

Closed arrivets closed 5 years ago

arrivets commented 5 years ago

On the Monet testnet - Camille-3 - we saw evidence that there is a bug in BadgerStore whereby some Events are not found when responding to a pull request.

We started a node without a database (so no bootstrap), forcing it to pull the entire hashgraph from other peers, and the peers would encounter and error like this:

[0124] ERROR babble: requestSync() error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found [0124] ERROR babble: gossip pull error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found [0125] DEBUG babble: requestSync() duration=1353432 [0125] ERROR babble: requestSync() error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found [0125] ERROR babble: gossip pull error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found This suggests that BadgerStore is not finding some "old" events, which is potentially a bug with the caching system.

It could be useful to reproduce this issue with a very small cache-size setting in Babble.

arrivets commented 5 years ago

Fixed in v0.5.9

arrivets commented 5 years ago

We fixed the GetEvent method, but we should also fix the GetParticipantEvents method.

Just looking at the core.EventDiff method, which is used by nodes to respond to sync requests, we can see that it uses GetParticipantEvents which will eventually start failing if we don't use the persistant storage for it.

More generally, there is a case for cleanly separating store methods for pure hashgraph objects, from consensus objects. The hashgraph objects are just enough to construct and gossip the DAG, whereas the consensus objects are used by the consensus methods. Hashgraph objects need to be persisted, because they are used by the sync methods to enable people to join the network. But consensus objects can live in the inmem store, as they are recalculated locally by every node.

arrivets commented 5 years ago

Working on branch issue127

Having updated GetParticipantEvents to read from the database, I am still seeing sync errors.

Here's a way to reproduce the issue locally with the docker demo (note that cache-limit is set to 2000):

make 
make bombard txs=1000
... repeat bombard until consensus_events reaches more than 2000 
make join index=5

node5 cannot sync - it keeps throwing Creator 0 not found errors when attempting to sync from other nodes.

The fact that the error only occurs when there are more than 2000 events confirms that this is a cache problem.

This happens when reading a WireEvent. The WireEvents it receives from other nodes have 0 values for selfParentIndex, otherParentCreatorID, selfParentIndex, and creatorID. This suggests that the other nodes are not properly converting Events to WireEvents.

When Events are read from the DB, all the private fields are lost. This explains why they all have the value 0 when Events are converted to WireEvents.

We need to call SetWireInfo when we read Events from the DB, before converting them to WireEvents.

arrivets commented 5 years ago

Now that WireInfo is correctly set/read we have another problem because topologicalIndex is also lost.

The Events are not returned in topological order, which leads to errors when trying to insert (or convert from wire representation) an event whose parents are not yet inserted.

We need to preserve the topological order when reading Events from DB.