vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0
37 stars 22 forks source link

Test for 0069-VCBS-043 fails to add tendermint node #5918

Closed claudiumilea closed 2 years ago

claudiumilea commented 2 years ago

While trying to setup a network with 7 slots and 5 tendermint validators, when trying to announce 2 more nodes and self-delegate only 1 node we are getting an error.

Test steps - 0069-VCBS-043

  1. Set up a network with 7 slots for Tendermint validators and 5 actual Tendermint validators.
  2. Self-delegate to all of them.
  3. Announce 2 new nodes but self-delegate only to one of them.
  4. Verify that, after 1000 blocks and on the following epoch, only the validator to which we self-delegated got promoted, and we now have 4 Tendermint validators and 1 pending validator.

PR for the test here Test here

For validator-7 we get sdterr: failed to start state sync: failed to set up light client state provider: post failed: Post "http://127.0.0.1:26667": context deadline exceeded

Image

This does not happen at all runs - the test is flaky - one run 1 was successful but another run 2 failed.

and the stacktrace of the test suggests that the test failed because it was not able to find the node to self-delegate.

wwestgarth commented 2 years ago

This is due to me switching on state-syncing in new nodes when they are announced in this commit to the system-tests: https://github.com/vegaprotocol/system-tests/commit/8fd487b623c5279d184bdb396055df37b8606569

The problem isn't related to anything in core, but more to do with Tendermint. My hunch is that the test fixture is not setting the state-sync options properly.

From what I've seen it always works fine when adding one node, and the problem only occurs when adding two at roughly the same time. The node added second can time-out when trying to connect to another node and I wonder if this is it trying to connect to the node added first the isn't quite up yet -- if vegacapsule is just templating the tendermint config then the second node will have the RPC addresses of the first node in its config even though in-real-life the second node shouldn't know about other nodes in the process of syncing in. We should probably have the system-tests just remove those.

To unblock I will switch the state-sync snapshot stuff off in the network infra tests while I investigate and move this ticket over to the system-test repo.

vega-paul commented 2 years ago

Thanks @wwestgarth Just confirming that concurrent announcing of nodes is supported by the core?

wwestgarth commented 2 years ago

Thanks @wwestgarth Just confirming that concurrent announcing of nodes is supported by the core?

Yup it is.

wwestgarth commented 2 years ago

Been playing with this today and when state-syncing multiple nodes at the same time, if I exclude the new-still-starting nodes RPC address from the state-sync rpc option in the tendermint config then I see no issues with nodes starting (I've run it multiple times to check). And this kind of make sense because in real life if two independent people are state-syncing in how would one even know about that other, and even if they did why would they list them.

So I'm going to close this ticket since its not a core issue, and open one in the system-tests to fix up the state-sync code in the node-generation fixture.