DChalcraft opened 1 year ago
Interesting. Why are you using einhorn without sidekiqswarm?
> Why are you using einhorn without sidekiqswarm?
We only have a single sidekiq process on each server, so I'm not sure sidekiqswarm offers any obvious benefits. The documentation suggests that rolling restarts should work with either, which is why we are using einhorn with plain sidekiq.
Ok. I'm not sure what's wrong. I ask because sidekiqswarm has specific code to implement the systemd watchdog for its children and no one else has had a problem with sidekiqswarm's systemd integration recently. If possible I would set SIDEKIQ_COUNT=1, switch to sidekiqswarm and see if that works. If it doesn't, that's good info for debugging the underlying issue.
Following a bit more investigation:
I think the only way systemd's `Type=notify` is going to work with einhorn is if we've also got `NotifyAccess=all` in the service file, since otherwise systemd only accepts notifications from the main pid (einhorn).
Once `NotifyAccess=all` is added, the initial READY notification works. However, we also have `WatchdogSec=10` in the service file.
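For context, the relevant directives in the unit end up looking roughly like this (a sketch; the `ExecStart` command line and paths are illustrative, not our exact unit):

```ini
# sidekiq.service (fragment; ExecStart/paths are illustrative)
[Service]
Type=notify
# Accept sd_notify datagrams from any process in the unit's cgroup,
# not just the main pid (the einhorn master):
NotifyAccess=all
# Expect a WATCHDOG=1 ping at least every 10 seconds:
WatchdogSec=10
ExecStart=/usr/local/bin/einhorn bundle exec sidekiq -e staging
ExecReload=/usr/local/bin/einhornsh --execute upgrade
```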
sidekiqswarm actually handles this OK: it calls `Sidekiq.start_watchdog` as long as `ENV["NOTIFY_SOCKET"]` is set.
However, regular non-swarm sidekiq only calls `Sidekiq.start_watchdog` if `Sidekiq::SdNotify.watchdog?` returns true. It returns false because `ENV["WATCHDOG_PID"]` isn't the current pid (it's the einhorn pid). So no watchdog notifications are sent, and systemd kills einhorn+sidekiq after the 10-second timeout.
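The check that fails can be sketched roughly like this (simplified from the `SdNotify.watchdog?` logic bundled with sidekiq; not the verbatim implementation):

```ruby
# Simplified sketch of the SdNotify.watchdog? check (not verbatim).
# systemd sets WATCHDOG_USEC/WATCHDOG_PID for the unit's main
# process -- under einhorn that's the master, not sidekiq itself.
def watchdog_enabled?
  usec = ENV["WATCHDOG_USEC"]
  pid  = ENV["WATCHDOG_PID"]
  return false unless usec
  # If WATCHDOG_PID is set, the watchdog is only intended for that
  # exact pid; in the sidekiq child process this comparison fails.
  pid.nil? || pid.to_i == Process.pid
end
```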
For now I've added `Sidekiq.configure_server { Sidekiq.start_watchdog }` to an initializer, which works around that.
I'm still struggling to get reloads working properly (via `ExecReload=/usr/local/bin/einhornsh --execute upgrade`). It does stop the old sidekiq process and start a new one, but then after another `TimeoutStopSec` delay, systemd decides that the service didn't actually restart, prints `sidekiq.service: State 'stop-sigterm' timed out. Killing.` to journalctl, and SIGKILLs the new sidekiq process before relaunching it. I'm not entirely sure what's going on there; I'm going to put it down for now.
When einhorn performs the rolling upgrade, it shuts down the old sidekiq process, which calls `SdNotify.stopping`. As far as I can tell, that tells systemd to expect the service to exit completely, so it forgets about the reload. `systemctl status` shows `Active: deactivating (stop-sigterm) since Thu 2023-03-16 15:37:33 UTC; 25s ago`.
After `TimeoutStopSec`, systemd then kills the einhorn process and restarts it:
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: State 'stop-sigterm' timed out. Killing.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857274 (einhorn) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857275 (bundle) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857327 (agent_thread.r*) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857328 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857331 (Timeout stdlib ) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857333 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857340 (tracer.rb:424) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857341 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857342 (scheduler) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857343 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857344 (default/process) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857346 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857347 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857349 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857351 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857352 (n/a) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 860472 (default/process) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 860857 (default/process) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Main process exited, code=killed, status=9/KILL
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Killing process 857327 (agent_thread.r*) with signal SIGKILL.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Failed with result 'timeout'.
Mar 16 15:36:14 staging1-3 systemd[1]: sidekiq.service: Consumed 9.795s CPU time.
Mar 16 15:36:15 staging1-3 systemd[1]: sidekiq.service: Scheduled restart job, restart counter is at 2.
Mar 16 15:36:15 staging1-3 systemd[1]: Stopped sidekiq.
Mar 16 15:36:15 staging1-3 systemd[1]: sidekiq.service: Consumed 9.795s CPU time.
Mar 16 15:36:15 staging1-3 systemd[1]: Starting sidekiq...
If I comment out sidekiq's `SdNotify.stopping` notification, I can successfully perform a `systemctl reload sidekiq`, but that doesn't seem like a great solution. 😕
Possibly einhorn and sdnotify are a bad mix. I'm considering dropping einhorn and going back to an old-fashioned `kill -TSTP` and a regular `systemctl restart sidekiq`.
`bin/sidekiqswarm` is designed to run under Einhorn within systemd. `bin/sidekiq` assumes it has full control over the process lifecycle. I'd happily accept smarter integration as a PR.
I see the same issue when using sidekiqswarm under einhorn. Here's a reload:
Mar 17 10:38:29 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: Starting smooth upgrade from version 0...
Mar 17 10:38:29 staging1-3 sidekiq[1276884]: Starting smooth upgrade from version 0...
Mar 17 10:38:29 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: ===> Launched 1276887 (index: 1)
Mar 17 10:38:29 staging1-3 sidekiq[1276887]: [WORKER 1276887] INFO: About to exec ["/usr/local/bin/bundle", "exec", "sidekiqswarm", "-e", "staging", "-c", "10"]
Mar 17 10:38:29 staging1-3 sidekiq[1276884]: ===> Launched 1276887 (index: 1)
Mar 17 10:38:29 staging1-3 sidekiq[1276884]: ===> Exited state passing process 1276885
Mar 17 10:38:30 staging1-3 sidekiq[1276887]: 2023-03-17T10:38:30.854Z pid=1276887 tid=ragr INFO: Enabling systemd notification integration
Mar 17 10:38:30 staging1-3 sidekiq[1276887]: [swarm] pid=1276887 Preloading application
Mar 17 10:38:30 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: Worker 1276887 has been up for 1s, so we are considering it alive.
Mar 17 10:38:30 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: Up to 1 / 1 timer ACKs
Mar 17 10:38:30 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: Upgraded successfully to version 1 (Einhorn 1.0.0).
Mar 17 10:38:30 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: Killing off 1 old workers.
Mar 17 10:38:30 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: Sending USR2 to [1276016]
Mar 17 10:38:30 staging1-3 sidekiq[1276884]: Upgraded successfully to version 1 (Einhorn 1.0.0).
Mar 17 10:38:30 staging1-3 sidekiq[1276884]: Upgrade done
Mar 17 10:38:30 staging1-3 sidekiq[1276096]: 2023-03-17T10:38:30.910Z pid=1276096 tid=rajg INFO: Got USR2, starting graceful shutdown
Mar 17 10:38:30 staging1-3 sidekiq[1276096]: 2023-03-17T10:38:30.911Z pid=1276096 tid=rajg INFO: Terminating quiet threads for default capsule
Mar 17 10:38:30 staging1-3 systemd[1]: Reloaded sidekiq.
Mar 17 10:38:30 staging1-3 sidekiq[1276096]: 2023-03-17T10:38:30.911Z pid=1276096 tid=p9z0 INFO: Scheduler exiting...
Mar 17 10:38:31 staging1-3 sidekiq[1276096]: 2023-03-17T10:38:31.918Z pid=1276096 tid=p9w0 INFO: Graceful shutdown complete, bye!
Mar 17 10:38:32 staging1-3 sidekiq[1276016]: [[--notify--]] STOPPING=1 unset:false /run/systemd/notify
Mar 17 10:38:32 staging1-3 sidekiq[1276887]: 2023-03-17T10:38:32.868Z pid=1276887 tid=ragr INFO: Filesystem mounts are ok!
Mar 17 10:38:32 staging1-3 sidekiq[1276014]: [MASTER 1276014] INFO: ===> Exited worker 1276016
Mar 17 10:38:37 staging1-3 sidekiq[1276887]: [[--notify--]] READY=1 unset:false /run/systemd/notify
Mar 17 10:38:37 staging1-3 sidekiq[1276887]: 2023-03-17T10:38:37.018Z pid=1276887 tid=ragr INFO: Pinging systemd watchdog every 30.0 sec
Mar 17 10:38:37 staging1-3 sidekiq[1276928]: 2023-03-17T10:38:37.017Z pid=1276928 tid=ragr INFO: Booted Rails 7.0.4.2 application in staging environment
Mar 17 10:38:37 staging1-3 sidekiq[1276928]: 2023-03-17T10:38:37.018Z pid=1276928 tid=ragr INFO: Running in ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
Mar 17 10:38:37 staging1-3 sidekiq[1276928]: 2023-03-17T10:38:37.018Z pid=1276928 tid=ragr INFO: Sidekiq Pro 7.0.7 / Sidekiq Enterprise 7.0.5, commercially licensed.
(I've manually added the `[[--notify--]]` debug lines into sidekiq's `SdNotify.notify`.)
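For anyone reproducing this: the notify protocol is just a datagram written to the unix socket named by `NOTIFY_SOCKET`, so the same messages can be hand-rolled. A minimal sketch, assuming a filesystem socket path (systemd's abstract-namespace sockets, which start with `@`, aren't handled here), with `notify_systemd` being a hypothetical helper name:

```ruby
require "socket"

# Minimal hand-rolled sd_notify: write one state datagram to the
# unix socket named by NOTIFY_SOCKET. Assumes a filesystem path;
# abstract-namespace sockets ("@...") are not handled here.
def notify_systemd(state, socket_path = ENV["NOTIFY_SOCKET"])
  return false if socket_path.nil? || socket_path.empty?
  Addrinfo.unix(socket_path, :DGRAM).connect do |sock|
    sock.sendmsg(state)
  end
  true
end

# e.g. notify_systemd("READY=1") or notify_systemd("WATCHDOG=1")
```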
So the reload has successfully shut down the old worker 1276016 and started a new one 1276928.
However, `systemctl status` shows:
Active: deactivating (stop-sigterm) since Fri 2023-03-17 10:38:32 UTC; 4min 44s ago
and after `TimeoutStopSec`, systemd forcibly restarts einhorn:
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: State 'stop-sigterm' timed out. Killing.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276014 (einhorn) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276887 (bundle) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276928 (bundle) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276908 (reaper.rb:40) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276910 (agent_thread.r*) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276913 (Timeout stdlib ) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276929 (tracer.rb:424) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276931 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276932 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276933 (tracer.rb:424) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276934 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276935 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276936 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276937 (default/process) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276939 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276940 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276941 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276942 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276943 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276944 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276945 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276951 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1276955 (n/a) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Killing process 1277159 (default/process) with signal SIGKILL.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Main process exited, code=killed, status=9/KILL
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Failed with result 'timeout'.
Mar 17 10:43:32 staging1-3 systemd[1]: sidekiq.service: Consumed 17.020s CPU time.
Mar 17 10:43:33 staging1-3 systemd[1]: sidekiq.service: Scheduled restart job, restart counter is at 2.
Mar 17 10:43:33 staging1-3 systemd[1]: Stopped sidekiq.
Mar 17 10:43:33 staging1-3 systemd[1]: sidekiq.service: Consumed 17.020s CPU time.
Mar 17 10:43:33 staging1-3 systemd[1]: Starting sidekiq...
Mar 17 10:43:34 staging1-3 sidekiq[1279077]: [MASTER 1279077] INFO: Blowing away old Einhorn command socket at /tmp/einhorn.sock. This likely indicates a previous Einhorn master which exited uncleanly.
Mar 17 10:43:34 staging1-3 sidekiq[1279077]: [MASTER 1279077] INFO: Writing PID to /tmp/einhorn.pid
Mar 17 10:43:34 staging1-3 sidekiq[1279077]: [MASTER 1279077] INFO: Launching 1 new workers
Mar 17 10:43:34 staging1-3 sidekiq[1279077]: [MASTER 1279077] INFO: ===> Launched 1279079 (index: 0)
Mar 17 10:43:34 staging1-3 sidekiq[1279079]: [WORKER 1279079] INFO: About to exec ["/usr/local/bin/bundle", "exec", "sidekiqswarm", "-e", "staging", "-c", "10"]
Mar 17 10:43:34 staging1-3 sidekiq[1279079]: 2023-03-17T10:43:34.930Z pid=1279079 tid=rhmz INFO: Enabling systemd notification integration
Mar 17 10:43:34 staging1-3 sidekiq[1279079]: [swarm] pid=1279079 Preloading application
Mar 17 10:43:35 staging1-3 sidekiq[1279077]: [MASTER 1279077] INFO: Worker 1279079 has been up for 1s, so we are considering it alive.
Mar 17 10:43:35 staging1-3 sidekiq[1279077]: [MASTER 1279077] INFO: Up to 1 / 1 timer ACKs
Mar 17 10:43:36 staging1-3 sidekiq[1279079]: 2023-03-17T10:43:36.754Z pid=1279079 tid=rhmz INFO: Filesystem mounts are ok!
Mar 17 10:43:40 staging1-3 sidekiq[1279079]: [[--notify--]] READY=1 unset:false /run/systemd/notify
Mar 17 10:43:40 staging1-3 sidekiq[1279079]: 2023-03-17T10:43:40.502Z pid=1279079 tid=rhmz INFO: Pinging systemd watchdog every 30.0 sec
Mar 17 10:43:40 staging1-3 systemd[1]: Started sidekiq.
Mar 17 10:43:40 staging1-3 sidekiq[1279123]: 2023-03-17T10:43:40.502Z pid=1279123 tid=rhmz INFO: Booted Rails 7.0.4.2 application in staging environment
Mar 17 10:43:40 staging1-3 sidekiq[1279123]: 2023-03-17T10:43:40.503Z pid=1279123 tid=rhmz INFO: Running in ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
Mar 17 10:43:40 staging1-3 sidekiq[1279123]: 2023-03-17T10:43:40.503Z pid=1279123 tid=rhmz INFO: Sidekiq Pro 7.0.7 / Sidekiq Enterprise 7.0.5, commercially licensed.
(As a sidenote, we've been running einhorn+sidekiq with systemd's `Type=simple` for ages, but the other day we hit what appears to be a rare deadlock in rmagick/imagemagick which hung the entire sidekiq process. I started looking into `Type=notify` and the watchdog timer as a way of mitigating that. Unless I'm missing something, though, sidekiqswarm's watchdog timer is only going to report that the top-level sidekiqswarm process is still responding, since it doesn't do anything to check whether its children are locked up. I'm not saying it necessarily ought to; I'm just explaining why we're trying to use sdnotify with a regular single sidekiq instance rather than moving everything to swarm.)
Ruby version: 3.1.2p20
Rails version: 7.0.4.2
Sidekiq / Pro / Enterprise version(s): sidekiq-7.0.6 / sidekiq-pro-7.0.7 / sidekiq-ent-7.0.5
sidekiq.yml
sudo systemctl status sidekiq.service
sidekiq.service
Possibly due to systemd expecting the notify to come from the Einhorn process rather than Sidekiq. Attempted to set `NotifyAccess=all` in sidekiq.service, but this results in systemd killing it every 10 seconds.