Closed Quilamir closed 1 year ago
Thank you for the report! This is indeed due to an edge case bug in the new nonce pre-sharing mechanism in v3.2.0. We have a patch in #227 and will cut a v3.2.1 patch release ASAP.
This edge case being triggered suggests that you might have a different issue as well, though. You may want to increase your grpcTimeout value to account for the failed gRPC requests from the leader to the other cosigners. Additionally, the raftTimeout value may need to be increased if you are seeing frequent leader elections.
Is there any kind of documentation about these settings I can go over?
Also, it seems like the PID mechanism needs a change: if the process with the PID in the file doesn't exist, a new instance should be started and the PID updated. As I understand it, the PID file is supposed to stop two instances from running at the same time. This will allow the instance to recover from an unexpected restart or an edge case like this one.
Is there any kind of documentation about these settings I can go over?
Documentation for these parameters is here, but a fine-tuning guide would be a useful addition. I'd suggest determining the happy path minimum value for both:
grpc-timeout
raft-timeout
Then add a buffer to each that covers the expected latency variance in your network.
this will allow the instance to recover from an unexpected restart
I agree this would be a nice feature. Requiring manual removal of the PID file after an unclean shutdown was originally intentional, but in operation I think it is more painful than it is worth.
The PID mechanism is good, but it needs a check for whether the recorded process still exists. If it does not, it is safe to start a new process and update the PID in the file. To make sure there is no race condition, it could also just delete the PID file and continue shutting down, so that the next restart attempt succeeds (assuming this runs under some auto-restart mechanism such as systemd or Docker restarts).
Yes, exactly. We don't want to remove the PID mechanism, but we can delete the PID file on startup if (and only if) the process with the PID recorded in it no longer exists. That has been added to the PR with updated tests.
On second thought, the PID file is not necessary: if a new instance tries to run while one is already running, it will fail to bind the port and crash anyway.
I would remove the PID mechanism or at least provide a flag to disable it.
I agree with you. @agouin, correct us if we are wrong.
In my setups I added the following to the systemd unit:
[Service]
ExecStart=...
...
ExecStopPost=rm -f /xxx/xxx/.horcrux/xxx/horcrux.pid
and I have never had problems in the past
Version v3.2.0
Here is the log for the crash
Since this is a crash, the PID file is not removed, and the service cannot restart properly afterwards, leaving the signer stuck until the PID file is deleted manually.
I have not seen this error in previous versions, so I am assuming it's a bug in the latest release. For now, the only thing I can think of doing is downgrading to the previous version.