Closed arrivets closed 5 years ago
Fixed in v0.5.9
We fixed the GetEvent
method, but we should also fix the GetParticipantEvents
method.
Just looking at the core.EventDiff
method, which is used by nodes to respond to sync requests,
we can see that it uses GetParticipantEvents
which will eventually start failing if we don't use
the persistant storage for it.
More generally, there is a case for cleanly separating store methods for pure hashgraph objects, from consensus objects. The hashgraph objects are just enough to construct and gossip the DAG, whereas the consensus objects are used by the consensus methods. Hashgraph objects need to be persisted, because they are used by the sync methods to enable people to join the network. But consensus objects can live in the inmem store, as they are recalculated locally by every node.
Working on branch issue127
Having updated GetParticipantEvents to read from the database, I am still seeing sync errors.
Here's a way to reproduce the issue locally with the docker demo (note that cache-limit is set to 2000):
make
make bombard txs=1000
... repeat bombard until consensus_events reaches more than 2000
make join index=5
node5 cannot sync - it keeps throwing Creator 0 not found
errors when attempting to sync from other nodes.
The fact that the error only occurs when there are more than 2000 events confirms that this is a cache problem.
This happens when reading a WireEvent. The WireEvents it receives from other nodes have 0 values for selfParentIndex
, otherParentCreatorID
, selfParentIndex
, and creatorID
. This suggests that the other nodes are not properly converting Events to WireEvents.
When Events are read from the DB, all the private fields are lost. This explains why they all have the value 0 when Events are converted to WireEvents.
We need to call SetWireInfo when we read Events from the DB, before converting them to WireEvents.
Now that WireInfo is correctly set/read we have another problem because topologicalIndex
is also lost.
The Events are not returned in topological order, which leads to errors when trying to insert (or convert from wire representation) an event whose parents are not yet inserted.
We need to preserve the topological order when reading Events from DB.
On the Monet testnet - Camille-3 - we saw evidence that there is a bug in BadgerStore whereby some Events are not found when responding to a pull request.
We started a node without a database (so no bootstrap), forcing it to pull the entire hashgraph from other peers, and the peers would encounter and error like this:
[0124] ERROR babble: requestSync() error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found [0124] ERROR babble: gossip pull error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found [0125] DEBUG babble: requestSync() duration=1353432 [0125] ERROR babble: requestSync() error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found [0125] ERROR babble: gossip pull error=EventCache, 0X0EF22788B30904B2B46488E4BFF41160A4A6C74D9607377E1FD4E1B1FE81AB93, Not Found
This suggests that BadgerStore is not finding some "old" events, which is potentially a bug with the caching system.It could be useful to reproduce this issue with a very small cache-size setting in Babble.