Open dejonghb opened 7 years ago
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: State 'stop-sigterm' timed out. Killing.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Main process exited, code=killed, status=9/KILL
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Open vStorage workers.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Unit entered failed state.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Failed with result 'signal'.
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 10:25:53 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 10:28:49 ftcmp02 systemd[1]: Starting Cleanup of Temporary Directories...
Maybe the watcher goes down because one service is missing (as designed) and takes along all framework services (as designed). However, based on the logging, it looks like the workers don't stop fast enough? Maybe that's why they don't start again once the watcher is restarted.
Jul 14 11:45:07 ftcmp02 systemd[1]: Stopped User Manager for UID 1001.
Jul 14 11:45:07 ftcmp02 systemd[1]: Removed slice User Slice of ovs.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Unit entered failed state.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Failed with result 'exit-code'.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopping Open vStorage volumerouter consumer...
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Open vStorage volumerouter consumer.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Service hold-off time over, scheduling restart.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage scheduled tasks.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Open vStorage workers...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage webapp API.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 11:56:53 ftcmp02 systemd[1]: Started Open vStorage workers.
--> trying to restart workers
After discussions with @dejonghb we came up with the following possible solution: before the watcher exits with sys.exit(1), we touch a temp file, e.g. /tmp/watcher. This should make sure that all depending services are down before the watcher gets started again, which is the possible cause of the issue we see here.
Also remove the temp file when startup succeeds ;-) Maybe also check the timestamp on the temp file so leftovers can be ignored?
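To make the idea concrete, here is a minimal sketch of the marker-file handshake, assuming the watcher is Python; the function names and the staleness threshold are hypothetical, only the path /tmp/watcher comes from the proposal above:

```python
import os
import sys
import time

MARKER = '/tmp/watcher'   # temp-file path from the proposal above
MARKER_MAX_AGE = 300      # hypothetical staleness threshold, in seconds

def exit_with_marker():
    # Touch the marker before exiting non-zero, so the next start can tell
    # that the previous run went down and dependents may still be stopping
    with open(MARKER, 'a'):
        pass
    sys.exit(1)

def previous_failure_pending():
    # True while a fresh marker from the previous run exists; a marker older
    # than MARKER_MAX_AGE is treated as a leftover and removed/ignored
    if not os.path.exists(MARKER):
        return False
    if time.time() - os.path.getmtime(MARKER) > MARKER_MAX_AGE:
        os.remove(MARKER)
        return False
    return True

def mark_startup_succeeded():
    # Remove the marker once startup succeeds, as suggested above
    if os.path.exists(MARKER):
        os.remove(MARKER)
```

At startup the watcher would poll previous_failure_pending() until it returns False before proceeding, so a restart only goes through once the previous failure has been fully handled.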
Makes sense, an automatic restart shouldn't take more than a few seconds.
@dejonghb Doesn't this duplicate https://github.com/openvstorage/framework/issues/1568 ?
It could be that the exit code is actually 0, a clean shutdown. We should set Restart=always instead of on-failure so the service is restarted in all cases.
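For reference, that change would look like this in the unit's [Service] section (the RestartSec value is only an example, not taken from the actual unit):

```ini
[Service]
# Restart on every exit, including a clean exit code 0
Restart=always
# Example back-off so a crash loop doesn't restart at full speed
RestartSec=5
```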
While testing I've noticed some behavior that might indicate this issue:
Looking further into a fix.
The testing setup is as follows: test_main.service: a simple overseeing service which others can bind to (mimics the watcher)
[Unit]
Description=Application Control
[Service]
# Execute a dummy program. Fail after 10 seconds and make sure the second component can't start
ExecStart=/bin/sleep infinity
Restart=on-failure
[Install]
# Components of this application should be started at boot time
WantedBy=multi-user.target
test_component_1.service: A simple component which binds onto test_main (simulates either webapp-api or workers). This service is controllable through lock files
[Unit]
Description=Application Component 1
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after the test_main.service start
After=test_main.service
[Service]
# Controllable start: Pretend that the component is running
ExecStart=/bin/bash -c 'while [ ! -f /tmp/test_component_1.start.lock ]; do sleep 2; done; exit 1;'
# Controllable de-activating state
ExecStopPost=/bin/bash -c 'while [ ! -f /tmp/test_component_1.lock ]; do sleep 2; done'
# Do it with python logic
# ExecStart=/usr/bin/python /tmp/test_component_1.py
# Restart the service on a non-zero exit code or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure
[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
test_component_2.service: A simple component which binds onto test_main (simulates either webapp-api or workers). This component will always follow the test_main state
[Unit]
Description=Application Component 2
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service has started
After=test_main.service
[Service]
# Pretend that the component is running
ExecStart=/bin/sleep infinity
# Restart the service on a non-zero exit code or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure
[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
When signal 9 is sent to the watcher, both the workers and the webapp-api go down. When one of these is in the deactivating state and the watcher gets restarted manually, that service won't start anymore.
This is just an observation, nothing to do with this ticket (since the watcher exited with code 1 by itself).
Just like in the ticket: the watcher service should exit with 1. The Restart= setting should then schedule the service to restart.
When all components start without failing, everything seems to be back in order.
The test_main service waits until the deactivating state is resolved before starting itself.
If a component fails to start during the restart, that component will be stuck in the failed state while the test_main service is still running. This seems to be the state described in the ticket.
If this is the case, not much can be done in systemd. I'd suggest fixing the reason why the workers cannot start while the watcher can; the watcher checks whether all services are available for use.
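To make that last suggestion concrete, a rough sketch of such a startup check on dependent services; this is not the actual watcher code, and the unit names and the systemctl-based check are assumptions:

```python
import subprocess

# Illustrative unit names; the real watcher would check its own service list
OVS_UNITS = ['ovs-workers.service', 'ovs-webapp-api.service']

def unit_active(name, runner=subprocess.run):
    # `systemctl is-active --quiet <unit>` exits 0 only when the unit is
    # active; the runner is injectable so the logic can be exercised
    # without a running systemd
    return runner(['systemctl', 'is-active', '--quiet', name]).returncode == 0

def all_units_active(units=OVS_UNITS, runner=subprocess.run):
    # The watcher could refuse to report healthy until every unit is up
    return all(unit_active(u, runner) for u in units)
```

If a unit turns up inactive or failed here, the watcher could try `systemctl restart` on it (or exit non-zero again) instead of silently running on while the workers stay down.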
On several occasions, for some (as yet unknown) reason, ovs-watcher-framework goes down (memcache?), which takes ovs-workers and ovs-webapp-api down with it.
After some retries ovs-watcher-framework gets restarted (ovs-watcher-framework.service: Service hold-off time over, scheduling restart), but ovs-workers and ovs-webapp-api remain inactive.
Filtered syslog attached (hope there's still enough info in there) syslog-filtered.gz