Horcrux crashes and then cannot restart because of pid file

Quilamir commented 1 year ago

Version v3.2.0

Here is the log for the crash

E[2023-11-29|17:02:42.517] Cosigner failed to set nonces and sign       chain_id=froopyland_100-1 height=1481301 round=0 type=prevote cosigner=1 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E[2023-11-29|17:02:43.519] Cosigner failed to set nonces and sign       chain_id=froopyland_100-1 height=1481301 round=0 type=prevote cosigner=3 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
panic: runtime error: slice bounds out of range [6:4]
goroutine 4156 [running]:
github.com/strangelove-ventures/horcrux/signer.(*NonceCache).Delete(...)
        /horcrux/signer/cosigner_nonce_cache.go:86
github.com/strangelove-ventures/horcrux/signer.(*CosignerNonceCache).ClearNonces(0xc0001241c0, {0x17f0ac8, 0xc000121710})
        /horcrux/signer/cosigner_nonce_cache.go:368 +0x4a6
github.com/strangelove-ventures/horcrux/signer.(*ThresholdValidator).Sign.func2()
        /horcrux/signer/threshold_validator.go:743 +0x57a
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 4153
        /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:72 +0x96
horcrux.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
horcrux.service: Failed with result 'exit-code'.
horcrux.service: Scheduled restart job, restart counter is at 583.
Stopped MPC Signer node.
Started MPC Signer node.
Error: unclean shutdown detected. PID file exists at /root/.horcrux/horcrux.pid but PID 1173404 is not running.
manual deletion of PID file required

since this is a crash the PID file is not removed, and the service can not restart properly after the crash making the signer stuck until manual deletion of the PID file.

I have not seen this error happen in previous versions so I am assuming its a bug of the latest release, for now the only thing i can think of doing is downgrade to the previous version.

agouin commented 1 year ago

Thank you for the report! This is indeed due to an edge case bug in the new nonce pre-sharing mechanism in v3.2.0. We have a patch in #227 and will cut a v3.2.1 patch release ASAP.

agouin commented 1 year ago

This edge case being triggered suggests that you might have a different issue as well though. You may want to increase your grpcTimeout value to account for the failed grpc requests from the leader to the other cosigners. Additionally the raftTimeout value may need to be increased if you are seeing frequent leader elections.

Quilamir commented 1 year ago

Is there any kind of documentation about these settings i can go over?

also it seems like the PID mechanism needs a change, if the PID in the file dosent exist then a new instance should be started and the PID updated, as my understanding is the PID file is supposed to stop the running of two instances at the same time, this will allow the instance to recover from an un expected restart or an edge case like this one.

agouin commented 1 year ago

Is there any kind of documentation about these settings i can go over?

Documentation for these parameters is here, but a fine-tuning guide would be a useful addition. I'd suggest determining the happy path minimum value for both:

no grpc timeouts occurring in the logs with the configured grpc-timeout
no unwanted leader elections with the configured raft-timeout Then add a buffer to each with the level of expected network latency variance in your network.

this will allow the instance to recover from an un expected restart

I agree this would be a nice feature. The required manual removal of the pid file under unclean shutdown was intentional originally, but it is more painful than it is worth I think in operation

Quilamir commented 1 year ago

The PID mechanism is good, but it just needs to add a check if the mentioned process exists or not, if it does not exist then it means it is safe to start a new process and update the PID in the file, its also possible in order to make sure there is no race condition to have it just delete the PID file and continue to shutdown so the next restart attempt will be successful (assuming this is running with some auto restart mechanism like systemd or docker restarts)

agouin commented 1 year ago

Yes exactly, we don't want to remove the PID mechanism but we can delete the PID file on startup if (and only if) the process no longer exists by the PID within. That has been added to the PR with updated tests.

Quilamir commented 1 year ago

On second thought the PID file is not necessary, if a new instance tries to run while one is already running it will fail to bind the port and crash anyways.

I would remove the PID mechanism or at least provide a flag to disable it.

activenodes commented 1 year ago

On second thought the PID file is not necessary, if a new instance tries to run while one is already running it will fail to bind the port and crash anyways.

I would remove the PID mechanism or at least provide a flag to disable it.

I agree with you, @agouin correct us if we are wrong.

In my setups I had added

[Service]
ExecStart=...
...
ExecStopPost=rm -f /xxx/xxx/.horcrux/xxx/horcrux.pid

and I have never had problems in the past

strangelove-ventures / horcrux

Horcrux crashes and then cannot restart because of pid file #226