vegaprotocol / data-node

A rich API server for Vega Protocol
https://vega.xyz

Data-node possible crash seen on full run #757

Closed wwestgarth closed 2 years ago

wwestgarth commented 2 years ago

Problem encountered

A suspected data-node crash may have occurred on this test run: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/3404/pipeline/

Any further details are limited because after the crash vegacapsule restarted the job, the new core/data-node caught up, and everything was fine. The logs from the initial task were lost.

The only reason I know something went wrong with that node is that grepping the tendermint logs shows:

Job: testnet-nodeset-full-2-full, Task: tendermint-full-2: I[2022-06-29|17:58:26.353] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=6180 stateHeight=6180
Job: testnet-nodeset-validators-0-validator, Task: tendermint-validator-0: I[2022-06-29|15:43:28.784] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=0 stateHeight=0
Job: testnet-nodeset-validators-1-validator, Task: tendermint-validator-1: I[2022-06-29|15:43:28.797] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=0 stateHeight=0

Tendermint for testnet-nodeset-full-2-full started 2 hours after the other nodes, with 6180 blocks in its block-store.
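As a side note, this mismatch can be spotted mechanically when grepping future runs. A minimal sketch (the log format is taken from the grep output above; `replay_mismatch` is a hypothetical helper, not part of the test suite) that parses the `ABCI Replay Blocks` line and flags nodes whose app height lags the block store:

```python
import re

# Matches the height fields in Tendermint's "ABCI Replay Blocks" log line,
# as seen in the grep output above.
PATTERN = re.compile(r"appHeight=(\d+) storeHeight=(\d+) stateHeight=(\d+)")

def replay_mismatch(line):
    """Return (appHeight, storeHeight) if the app is behind its block store,
    i.e. the node restarted with state but an empty application."""
    m = PATTERN.search(line)
    if not m:
        return None
    app, store, _state = (int(g) for g in m.groups())
    return (app, store) if app < store else None

line = ("I[2022-06-29|17:58:26.353] ABCI Replay Blocks "
        "module=consensus appHeight=0 storeHeight=6180 stateHeight=6180")
print(replay_mismatch(line))  # (0, 6180): the app restarted from scratch
```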

I'm not holding out much hope on this one, but running the event file through the data-node in a loop might show up some instability.

Update I've seen this three times now, and it always falls over in the test test_funding_reward_accounts_oneoff with an internal error when getting a party account balance:

tests/rewards/trading_rewards_test.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
modules/smartContractsMod/contractMod.py:719: in erc20_deposits
    accountsMod.WaitUpdatePartyAccountBal(context, trader, 'General', erc20_asset_id, expected_wallet_balance)
modules/acctsMod/accountsMod.py:352: in WaitUpdatePartyAccountBal
    accbal = (GetPartyAccountBal(context, party, accType, assetId))
modules/acctsMod/accountsMod.py:266: in GetPartyAccountBal
    response = grpc_stub_trd_data().PartyAccounts(request)
/usr/local/lib/python3.8/dist-packages/grpc/_channel.py:923: in __call__
    return _end_unary_response_blocking(state, call, False, None)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

state = <grpc._channel._RPCState object at 0x7f914df8fd30>
call = <grpc._cython.cygrpc.SegregatedCall object at 0x7f915774b540>
with_call = False, deadline = None

    def _end_unary_response_blocking(state, call, with_call, deadline):
        if state.code is grpc.StatusCode.OK:
            if with_call:
                rendezvous = _MultiThreadedRendezvous(state, call, None, deadline)
                return state.response, rendezvous
            else:
                return state.response
        else:
>           raise _InactiveRpcError(state)
E           grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
E              status = StatusCode.INTERNAL
E              details = "Internal error"
E              debug_error_string = "{"created":"@1656611856.035398418","description":"Error received from peer ipv4:127.0.0.1:3027","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Internal error","grpc_status":13}"
E           >

/usr/local/lib/python3.8/dist-packages/grpc/_channel.py:826: _InactiveRpcError
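For what it's worth, the test helpers currently let the raw `_InactiveRpcError` propagate and kill the run on the first `INTERNAL`. A client-side retry would at least distinguish a transient blip from a hard data-node failure. A hedged sketch below, with the actual gRPC machinery stubbed out (`FakeRpcError` and `flaky_call` are stand-ins; real code would catch `grpc.RpcError` and compare `e.code()` against `grpc.StatusCode.INTERNAL`):

```python
import time

class FakeRpcError(Exception):
    """Stand-in for grpc.RpcError so the sketch is self-contained."""
    def __init__(self, code):
        self._code = code
    def code(self):
        return self._code

def call_with_retry(call, retries=3, delay=0.0):
    """Invoke `call`, retrying when it fails with an INTERNAL-style RPC
    error; re-raise on any other code or once retries are exhausted."""
    for attempt in range(retries):
        try:
            return call()
        except FakeRpcError as e:
            if e.code() != "INTERNAL" or attempt == retries - 1:
                raise
            time.sleep(delay)

# Usage: a call that fails once with INTERNAL, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise FakeRpcError("INTERNAL")
    return 42

print(call_with_retry(flaky_call))  # 42
```

Whether retrying is even desirable here is debatable: if the data-node really did crash, masking the first `INTERNAL` would only delay detection, so any retry should log loudly.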

Observed behaviour

Data-node may or may not have crashed.

Expected behaviour

Data-node did not crash, maybe?

Automation

Link to automation and explanation on how to run it to reproduce the problem/bug

Evidence

Logs

If applicable, add logs and/or screenshots to help explain your problem.

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

After Merging

wwestgarth commented 2 years ago

This has just happened a second time: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/3422/pipeline/

wwestgarth commented 2 years ago

And a third time: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/3467/pipeline

wwestgarth commented 2 years ago

The latest on this is:

I am out of ideas, other than hardcoding core to write its logs to a file in the home directory, circumventing VC's log collection entirely.

gordsport commented 2 years ago

From planning today:

gordsport commented 2 years ago

@jgsbennett @MuthuVega - have we seen data-node crash on the full runs during this sprint?

gordsport commented 2 years ago

The full runs have been passing green, closing this issue for now