vegaprotocol / data-node

A rich API server for Vega Protocol
https://vega.xyz

Data-node possible crash seen on full run #757

Closed wwestgarth closed 2 years ago

wwestgarth commented 2 years ago

Problem encountered

A suspected data-node crash may have occurred on this test run: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/3404/pipeline/

Any further details are limited because after the crash vegacapsule restarted the job, the new core/data-node caught up, and everything was fine. The logs from the initial task were lost.

The only reason I know something went wrong with that node is that grepping the tendermint logs shows:

Job: testnet-nodeset-full-2-full, Task: tendermint-full-2: I[2022-06-29|17:58:26.353] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=6180 stateHeight=6180
Job: testnet-nodeset-validators-0-validator, Task: tendermint-validator-0: I[2022-06-29|15:43:28.784] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=0 stateHeight=0
Job: testnet-nodeset-validators-1-validator, Task: tendermint-validator-1: I[2022-06-29|15:43:28.797] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=0 stateHeight=0

Tendermint for testnet-nodeset-full-2-full started 2 hours after the other nodes, with 6180 blocks in its block-store.
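As a side note, this mismatch can be spotted mechanically when grepping future runs. A minimal sketch (the log format is taken from the grep output above; `replay_mismatch` is a hypothetical helper, not part of the test suite) that parses the `ABCI Replay Blocks` line and flags nodes whose app height lags the block store:

```python
import re

# Matches the height fields in Tendermint's "ABCI Replay Blocks" log line,
# as seen in the grep output above.
PATTERN = re.compile(r"appHeight=(\d+) storeHeight=(\d+) stateHeight=(\d+)")

def replay_mismatch(line):
    """Return (appHeight, storeHeight) if the app is behind its block store,
    i.e. the node restarted with state but an empty application."""
    m = PATTERN.search(line)
    if not m:
        return None
    app, store, _state = (int(g) for g in m.groups())
    return (app, store) if app < store else None

line = ("I[2022-06-29|17:58:26.353] ABCI Replay Blocks "
        "module=consensus appHeight=0 storeHeight=6180 stateHeight=6180")
print(replay_mismatch(line))  # (0, 6180): the app restarted from scratch
```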

I'm not holding out much hope on this one, but running the event file through the data-node in a loop might show up some instability.

Update I've seen this three times now, and it always falls over in the test test_funding_reward_accounts_oneoff with an internal error when getting a party account balance:

tests/rewards/trading_rewards_test.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
modules/smartContractsMod/contractMod.py:719: in erc20_deposits
    accountsMod.WaitUpdatePartyAccountBal(context, trader, 'General', erc20_asset_id, expected_wallet_balance)
modules/acctsMod/accountsMod.py:352: in WaitUpdatePartyAccountBal
    accbal = (GetPartyAccountBal(context, party, accType, assetId))
modules/acctsMod/accountsMod.py:266: in GetPartyAccountBal
    response = grpc_stub_trd_data().PartyAccounts(request)
/usr/local/lib/python3.8/dist-packages/grpc/_channel.py:923: in __call__
    return _end_unary_response_blocking(state, call, False, None)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

state = <grpc._channel._RPCState object at 0x7f914df8fd30>
call = <grpc._cython.cygrpc.SegregatedCall object at 0x7f915774b540>
with_call = False, deadline = None

    def _end_unary_response_blocking(state, call, with_call, deadline):
        if state.code is grpc.StatusCode.OK:
            if with_call:
                rendezvous = _MultiThreadedRendezvous(state, call, None, deadline)
                return state.response, rendezvous
            else:
                return state.response
        else:
>           raise _InactiveRpcError(state)
E           grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
E              status = StatusCode.INTERNAL
E              details = "Internal error"
E              debug_error_string = "{"created":"@1656611856.035398418","description":"Error received from peer ipv4:127.0.0.1:3027","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Internal error","grpc_status":13}"
E           >

/usr/local/lib/python3.8/dist-packages/grpc/_channel.py:826: _InactiveRpcError
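For what it's worth, the test helpers currently let the raw `_InactiveRpcError` propagate and kill the run on the first `INTERNAL`. A client-side retry would at least distinguish a transient blip from a hard data-node failure. A hedged sketch below, with the actual gRPC machinery stubbed out (`FakeRpcError` and `flaky_call` are stand-ins; real code would catch `grpc.RpcError` and compare `e.code()` against `grpc.StatusCode.INTERNAL`):

```python
import time

class FakeRpcError(Exception):
    """Stand-in for grpc.RpcError so the sketch is self-contained."""
    def __init__(self, code):
        self._code = code
    def code(self):
        return self._code

def call_with_retry(call, retries=3, delay=0.0):
    """Invoke `call`, retrying when it fails with an INTERNAL-style RPC
    error; re-raise on any other code or once retries are exhausted."""
    for attempt in range(retries):
        try:
            return call()
        except FakeRpcError as e:
            if e.code() != "INTERNAL" or attempt == retries - 1:
                raise
            time.sleep(delay)

# Usage: a call that fails once with INTERNAL, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise FakeRpcError("INTERNAL")
    return 42

print(call_with_retry(flaky_call))  # 42
```

Whether retrying is even desirable here is debatable: if the data-node really did crash, masking the first `INTERNAL` would only delay detection, so any retry should log loudly.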

Observed behaviour

Data-node may or may not have crashed.

Expected behaviour

Data-node did not crash, maybe?

Automation

Link to automation and explanation on how to run it to reproduce the problem/bug

Evidence

Logs

If applicable, add logs and/or screenshots to help explain your problem.

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

After Merging

wwestgarth commented 2 years ago

This has just happened a second time: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/3422/pipeline/

wwestgarth commented 2 years ago

And a third time: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/3467/pipeline

wwestgarth commented 2 years ago

The latest on this is:

I am out of ideas, other than hardcoding core to write its logs to a file in the home directory, circumventing VC's log collection entirely.

gordsport commented 2 years ago

From planning today:

gordsport commented 2 years ago

@jgsbennett @MuthuVega - have we seen data-node crash on the full runs during this sprint?

gordsport commented 2 years ago

The full runs have been passing green, closing this issue for now