txpipe / dolos

Cardano Data Node
https://dolos.txpipe.io
Apache License 2.0

Unable to recover after forceful shutdown #293

Open Quantumplation opened 1 month ago

Quantumplation commented 1 month ago

If the dolos node is killed with kill -9, for example, it is unable to start up and re-establish the socket.

In particular, the socket file still exists, so the node hits an "address already in use" error, but it doesn't print this until it is gracefully shut down.
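To illustrate the failure mode, here is a minimal standalone Rust sketch (the path and program are hypothetical, not part of dolos) showing that a leftover socket file is enough to make a fresh bind fail with `AddrInUse`, even when nothing is listening anymore:

```rust
use std::os::unix::net::UnixListener;

fn main() -> std::io::Result<()> {
    // Hypothetical path for illustration; not dolos's configured socket.
    let path = "/tmp/stale-example.sock";
    let _ = std::fs::remove_file(path); // start from a clean slate

    // The first bind creates the socket file on disk and starts listening.
    let listener = UnixListener::bind(path)?;

    // Dropping the listener closes the file descriptor but leaves the
    // socket file behind, roughly what `kill -9` does to the real node.
    drop(listener);

    // A fresh bind now fails even though nothing is listening, because
    // the path still exists on disk.
    let err = UnixListener::bind(path).unwrap_err();
    assert_eq!(err.kind(), std::io::ErrorKind::AddrInUse);
    println!("second bind failed: {err}");
    Ok(())
}
```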

The logs on startup don't indicate that anything is amiss:

2024-07-13T20:16:03.425860Z  INFO dolos::serve: found gRPC config
2024-07-13T20:16:03.425830Z  WARN serve: dolos::relay: relay not enabled, skipping serve
2024-07-13T20:16:03.426886Z  INFO stage{stage="roll"}: gasket::runtime: stage bootstrap ok
2024-07-13T20:16:03.426904Z  INFO stage{stage="roll"}: gasket::runtime: switching stage phase prev_phase=Bootstrap next_phase=Working
2024-07-13T20:16:03.427317Z  INFO stage{stage="ledger"}:bootstrap: dolos::sync::ledger: cursor found cursor=Some(ChainPoint(54245659, Hash<32>("c488856f1695fbc2a73c8c0809e804ff9a15295f29eed20872460b3a73eabbae")))
2024-07-13T20:16:03.427505Z  INFO stage{stage="ledger"}:bootstrap: dolos::sync::ledger: wal sequence found seq=2285931
2024-07-13T20:16:03.427518Z  INFO stage{stage="ledger"}: gasket::runtime: stage bootstrap ok
2024-07-13T20:16:03.427524Z  INFO stage{stage="ledger"}: gasket::runtime: switching stage phase prev_phase=Bootstrap next_phase=Working
2024-07-13T20:16:03.429317Z  INFO dolos::serve::grpc: serving via gRPC on address: [::]:50051
2024-07-13T20:16:03.431380Z  INFO dolos::serve: found Ouroboros config
2024-07-13T20:16:03.858860Z  INFO stage{stage="pull"}:bootstrap: dolos::sync::pull: connected to peer address="preview-node.world.dev.cardano.org:30002" magic=2
2024-07-13T20:16:03.972779Z  INFO stage{stage="pull"}:bootstrap: dolos::sync::pull: found intersection intersection=(54245659, c488856f1695fbc2a73c8c0809e804ff9a15295f29eed20872460b3a73eabbae)
2024-07-13T20:16:03.972917Z  INFO stage{stage="pull"}: gasket::runtime: stage bootstrap ok
2024-07-13T20:16:03.972949Z  INFO stage{stage="pull"}: gasket::runtime: switching stage phase prev_phase=Bootstrap next_phase=Working
2024-07-13T20:16:03.973013Z  INFO stage{stage="pull"}:execute: dolos::sync::pull: pulling block batch from upstream peer
2024-07-13T20:16:04.041244Z  INFO stage{stage="pull"}:execute: dolos::sync::pull: pulling block batch from upstream peer
2024-07-13T20:16:04.042483Z  INFO stage{stage="roll"}:execute: dolos::sync::roll: rolling back wal point=Specific(54245659, Hash<32>("c488856f1695fbc2a73c8c0809e804ff9a15295f29eed20872460b3a73eabbae"))
2024-07-13T20:16:04.501124Z  INFO stage{stage="pull"}:execute: dolos::sync::pull: block batch pulled from peer len=3
2024-07-13T20:16:04.503961Z  INFO stage{stage="pull"}:execute: dolos::sync::pull: reached tip, waiting for new block
2024-07-13T20:16:04.503976Z  INFO stage{stage="roll"}:execute: dolos::sync::roll: extending wal block.slot=54245712 block.hash=5e55089ba23f129d23bc0092a15cb93880aaaf8fd86e2597a6b6da87c08f397a
2024-07-13T20:16:04.512811Z  INFO stage{stage="ledger"}:execute: dolos::sync::ledger: applying block slot=54245712
2024-07-13T20:16:04.512710Z  INFO stage{stage="roll"}:execute: dolos::sync::roll: extending wal block.slot=54245759 block.hash=39ffd38eb7ffc0ab98290610a74124caf32289fa162fa12783ff9b39849c62d0
2024-07-13T20:16:04.519305Z  INFO stage{stage="roll"}:execute: dolos::sync::roll: extending wal block.slot=54245761 block.hash=80fe04c697f00399bcc60cb77aa1887b4f13d5aca03e2c8c90e24b2d1715c4b8

But once a SIGTERM is delivered, the node reveals what the problem is:

2024-07-13T20:17:17.507180Z  WARN dolos::common: SIGTERM detected
2024-07-13T20:17:17.507242Z  INFO gasket::daemon: dismissing stage stage="pull" state=Alive(Working)
2024-07-13T20:17:17.507257Z  INFO gasket::daemon: dismissing stage stage="roll" state=Alive(Working)
2024-07-13T20:17:17.507264Z  INFO gasket::daemon: dismissing stage stage="ledger" state=Alive(Working)
2024-07-13T20:17:17.507270Z  INFO gasket::daemon: waiting for stages to end
2024-07-13T20:17:47.373730Z  INFO stage{stage="pull"}:execute: dolos::sync::pull: new block sent by upstream peer point=(54245867, 469d14d0625f2185a517fda5d498894e5bc4836bee3e857917f89255004db3f4)
2024-07-13T20:17:47.441494Z  INFO stage{stage="pull"}: gasket::runtime: stage dismissed
2024-07-13T20:17:47.441647Z  INFO stage{stage="pull"}: gasket::runtime: switching stage phase prev_phase=Working next_phase=Teardown
2024-07-13T20:17:47.441692Z  INFO stage{stage="pull"}: gasket::runtime: stage teardown ok
2024-07-13T20:17:47.441706Z  INFO stage{stage="pull"}: gasket::runtime: switching stage phase prev_phase=Teardown next_phase=Ended
2024-07-13T20:17:47.441620Z  INFO stage{stage="roll"}: gasket::runtime: stage dismissed
2024-07-13T20:17:47.441795Z  INFO stage{stage="roll"}: gasket::runtime: switching stage phase prev_phase=Working next_phase=Teardown
2024-07-13T20:17:47.441813Z  INFO stage{stage="roll"}: gasket::runtime: stage teardown ok
2024-07-13T20:17:47.441824Z  INFO stage{stage="roll"}: gasket::runtime: switching stage phase prev_phase=Teardown next_phase=Ended
2024-07-13T20:17:47.442449Z ERROR stage{stage="ledger"}:schedule: gasket::framework: x=error receiving work unit through input port
2024-07-13T20:17:47.442477Z ERROR stage{stage="ledger"}: gasket::runtime: stage should stop retry=Retry(None)
2024-07-13T20:17:47.442491Z  INFO stage{stage="ledger"}: gasket::runtime: switching stage phase prev_phase=Working next_phase=Teardown
2024-07-13T20:17:47.442528Z  INFO stage{stage="ledger"}: gasket::runtime: stage teardown ok
2024-07-13T20:17:47.442541Z  INFO stage{stage="ledger"}: gasket::runtime: switching stage phase prev_phase=Teardown next_phase=Ended
Error:   × serve thread
  ├─▶ serving Ouroboros
  ╰─▶ server error: Address already in use (os error 48)

One way to solve this would be to use a lockfile containing the owning PID to decide whether it is safe to delete the existing node socket and recover.
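A minimal sketch of that idea, assuming a Unix-like OS and the `libc` crate; the function names and paths are illustrative and not part of the dolos codebase:

```rust
use std::fs;
use std::io::{Error, ErrorKind, Write};
use std::path::Path;

/// kill(pid, 0) delivers no signal but reports whether the process exists
/// (permission errors aside).
fn pid_is_alive(pid: i32) -> bool {
    unsafe { libc::kill(pid, 0) == 0 }
}

/// Decide whether it is safe to reclaim the socket, then record our own pid.
fn claim_socket(socket: &Path, lockfile: &Path) -> std::io::Result<()> {
    if let Ok(contents) = fs::read_to_string(lockfile) {
        match contents.trim().parse::<i32>() {
            Ok(pid) if pid_is_alive(pid) => {
                // Another live instance owns the socket; refuse to start.
                return Err(Error::new(
                    ErrorKind::AddrInUse,
                    format!("socket held by running pid {pid}"),
                ));
            }
            _ => {
                // Stale lockfile from a crashed process: safe to remove
                // the leftover socket file and recover.
                let _ = fs::remove_file(socket);
            }
        }
    }
    // Record our own pid so the next startup can make the same decision.
    let mut f = fs::File::create(lockfile)?;
    writeln!(f, "{}", std::process::id())?;
    Ok(())
}
```

An advisory lock (flock) on the lockfile would make this robust against PID reuse, but the plain PID check is enough to recover from the `kill -9` case described above.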

Mercurial commented 1 month ago

I was actually going to report this too; I have to delete the socket file to make it work.

This also blocks the gRPC serve from starting.

[screenshot]