neard can't be stopped during `reset_data_pre_state_sync()` which can take a long time

nikurt commented 2 years ago

neard handles SIGINT during most of the time, but when the client is downloading state it doesn't stop upon receiving Ctrl+C. Repeated Ctrl+C have no effect. Maybe it's not ignored and will be handled when the state sync finishes, haven't waited that long. Though I waited for more than 1 minute, and nothing happened.

Forcing a stop with SIGKILL got me database corruptions.

neard should respond to signals also during state sync.

nikurt commented 2 years ago

Synced a betanet node. Stopped it. In 12 hours continued syncing. After downloading headers, the client started syncing state. The first Ctrl+C was handled, the shutdown was initiated but nothing indicated that the client was actually going to shutdown. After I minute pressed Ctrl+C multiple times - nothing happened. Killed from another terminal by doing killall -9 neard.

$ cargo fmt && cargo build -p neard --release --features nightly_protocol,nightly_protocol_features && RUST_LOG=info ./target/release/neard --home /home/orfest/.near/betanet run
...
...
...
2022-05-12T10:09:43.332151Z  INFO stats: #59558457 Downloading headers 99.49% (3277 left; at 59620402) 13 peers ⬇ 511.3kiB/s ⬆ 0.6kiB/s 0.00 bps 0 gas/s CPU: 41%, Mem: 586.3 MiB
2022-05-12T10:09:43.832230Z  INFO client: Received 512 block headers from ed25519:6DSjZ8mvsRZDvFqFxo8tCKePG96omXW7eVYVSySmDk8e
2022-05-12T10:09:45.069056Z  INFO client: Received 512 block headers from ed25519:9saR916jwNjxPa7RyhesTvmkiYY2iqW6x9Z1eZ3Yc3BB
2022-05-12T10:09:46.748685Z  INFO client: Received 512 block headers from ed25519:AtxyywuiieiDCCH7c893rQSx3hrK8X6qr2kkkHTZWkT1
2022-05-12T10:09:47.677098Z  INFO client: Received 512 block headers from ed25519:ydgzeXHJ5Xyt7M1gXLxqLBW1Ejx6scNV5Nx2pxFM8su
2022-05-12T10:09:48.546437Z  INFO client: Received 512 block headers from ed25519:2fQJ46A2okUhnYG1Zkcgi1kDSPP8fMEmv3c3earyaLL8
2022-05-12T10:09:49.240543Z  INFO client: Received 213 block headers from ed25519:BpfDYpNvNTfUDVXp5q2xmvsjsJQwgkP2T6p12aPXKWsv
2022-05-12T10:09:50.729861Z  INFO network: Error connecting to addr=3.95.251.187:24567 err=Elapsed(())
^C2022-05-12T10:09:55.226876Z  WARN neard: SIGINT, stopping... this may take a few minutes.
2022-05-12T10:09:55.228750Z  INFO neard: Waiting for RocksDB to gracefully shutdown
2022-05-12T10:09:55.228766Z  INFO db: Waiting for the 1 remaining RocksDB instances to gracefully shutdown
^C^C^C^C^C^C^C^Czsh: killed     RUST_LOG=info ./target/release/neard --home /home/user/.near/betanet run

mina86 commented 2 years ago

It looks like there’s some kind of leak in Store objects? After getting SIGINT the node stops doing anything it looks like so it seems like actix::System::current().stop(); successfully stopped everything but we’re stuck on RocksDB::block_until_all_instances_are_dropped(); which means that one of the RocksDB instances didn’t get dropped.

nikurt commented 2 years ago

Reproduced this issue with lots of extra debug output. The issue is this loop https://github.com/near/nearcore/blob/master/chain/chain/src/chain.rs#L1225;L1252 During this loop ClientActor doesn't respond to actix::System::current().stop(). Iterations in that loop take between 0.01 and 20 milli-seconds. Resuming a stopped instance may need updating a lot of blocks, which took multiple minutes in my case.

After more testing, state sync itself doesn't seem to be a problem and SIGINTs are properly handled.

near / nearcore

neard can't be stopped during `reset_data_pre_state_sync()` which can take a long time #6788