vegaprotocol / data-node

A rich API server for Vega Protocol
https://vega.xyz

Data node validator disconnects during full system test run #681

Closed jgsbennett closed 2 years ago

jgsbennett commented 2 years ago

Problem encountered

For the better part of a week, possibly with increasing frequency, we've seen the data node validator disconnecting during the full test run. After that point the data node APIs only return errors, and the remaining tests all fail, either with odd test errors (where the test didn't cope gracefully) or with references to data node API errors. See, for instance, this job: https://jenkins.ops.vega.xyz/job/common/job/system-tests/2603/ The test run finishes around 01:09:00, which is when Jenkins kills any remaining live nodes, but you can see the disconnect in the network log ~8-9 minutes earlier:

Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:142 Connection error [err error reading message: read tcp 127.0.0.1:26628->127.0.0.1:34484: use of closed network connection]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:142 Connection error [err error reading message: read tcp 127.0.0.1:26628->127.0.0.1:34486: use of closed network connection]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:142 Connection error [err error reading message: read tcp 127.0.0.1:26628->127.0.0.1:34488: use of closed network connection]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:142 Connection error [err error reading message: read tcp 127.0.0.1:26628->127.0.0.1:34490: use of closed network connection]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:150 Error closing connection [err connection 0 does not exist]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:150 Error closing connection [err connection 1 does not exist]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:150 Error closing connection [err connection 2 does not exist]
Job: testnet-nodeset-full-2, Task: vega-full-2: 2022-06-14T01:00:18.532Z    ERROR   tm.abci.socket-server   server/socket_server.go:150 Error closing connection [err connection 3 does not exist]

In this particular case, all of this seems to occur during the withdrawal step of the deposit/withdrawal test case, and every test after that fails:

2022-06-13 16:02:56.745:[INFO] MainProcess contractMod.py:get_withdrawal_vega_approval:1793 ----- Submitting core request to withdraw tokens using id 56fded67ecfede1b4f9b0e4918b2011d0d40cca5aebff1f1d218d92f5e64c83a
2022-06-13 16:02:56.746:[INFO] MainProcess contractMod.py:get_withdrawal_vega_approval:1795 ----- Waiting for core response to withdraw tokens using id 56fded67ecfede1b4f9b0e4918b2011d0d40cca5aebff1f1d218d92f5e64c83a
2022-06-13 16:02:56.748:[INFO] MainProcess contractMod.py:get_withdrawal_vega_approval:1808 ----- GRPC error ERC20WithdrawalApproval-><_InactiveRpcError of RPC that terminated with:
    status = StatusCode.NOT_FOUND
    details = "NotFound error"
    debug_error_string = "{"created":"@1655136176.748409919","description":"Error received from peer ipv4:127.0.0.1:3027","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"NotFound error","grpc_status":5}"
>
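
For context, the failing call above is the system test polling the data node for the ERC20 withdrawal approval. The sketch below is a rough illustration of that flow, not the actual contractMod.py code; the generated module and stub names (data_node_pb2, data_node_pb2_grpc, TradingDataServiceStub) and the request field are assumptions for illustration only. The relevant point is that a brief NOT_FOUND is expected while the approval is being indexed, whereas in these runs the errors never clear.

# Minimal sketch, assuming hypothetical generated gRPC bindings for the data node API.
import time
import grpc

import data_node_pb2        # assumed generated module name
import data_node_pb2_grpc   # assumed generated module name

DATA_NODE_GRPC = "127.0.0.1:3027"  # port taken from the error output above


def get_withdrawal_vega_approval(withdrawal_id: str, timeout_s: float = 30.0):
    """Poll the data node until the withdrawal approval is available.

    A NOT_FOUND response is expected briefly while the withdrawal is being
    processed; any other gRPC status (e.g. INTERNAL once the data node has
    disconnected) is treated as a hard failure.
    """
    channel = grpc.insecure_channel(DATA_NODE_GRPC)
    stub = data_node_pb2_grpc.TradingDataServiceStub(channel)
    request = data_node_pb2.ERC20WithdrawalApprovalRequest(withdrawal_id=withdrawal_id)

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return stub.ERC20WithdrawalApproval(request)
        except grpc.RpcError as err:
            if err.code() == grpc.StatusCode.NOT_FOUND:
                time.sleep(0.5)  # approval not indexed yet, retry
                continue
            raise  # INTERNAL etc. -> the data node is likely unhealthy
    raise TimeoutError(f"no withdrawal approval for {withdrawal_id} within {timeout_s}s")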

I don't remember noticing this in the previous instances I spotted last week, but I'll now go back and check.

Observed behaviour

~300 failures. No panics in any of the network logs, but if you grep the network logs for "full-2" and "ERROR" around the time of the first failure, you'll see the node disconnect.
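
If it helps anyone reproducing this, here is a small sketch of that grep in Python; the log directory path and file extension are assumptions about how network.zip unpacks.

from pathlib import Path

LOG_DIR = Path("network")  # assumed extraction directory of network.zip

# Print every log line that mentions both the "full-2" node set and "ERROR",
# so the disconnect shows up without reading the whole log.
for log_file in sorted(LOG_DIR.rglob("*.log")):
    for line in log_file.read_text(errors="replace").splitlines():
        if "full-2" in line and "ERROR" in line:
            print(f"{log_file.name}: {line}")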

Expected behaviour

No failures

Steps to reproduce

Run the full test run. (It's possible that we also sometimes see this on the smoke run. There is some evidence from this week that may support that, but it is currently unconfirmed. If we do see it there, it's much rarer.)

Evidence

Logs

Network logs: network.zip
System test logs: system test log.txt

Additional context

This appears to be an example of a run which behaved similarly, but which didn't fail after a withdrawal test. Just part way through the run, everything starts returning errors like:

failed on setup with "grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "Internal error"
    debug_error_string = "{"created":"@1654861056.559943731","description":"Error received from peer ipv4:127.0.0.1:3027","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Internal error","grpc_status":13}"
>"

This is an example of a smoke run which happens to go bad after a withdrawal test, similar to the example I listed in the main part of the ticket.

Definition of Done

ℹ️ Not every issue will need every item checked; however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

After Merging