openvstorage / framework

The Framework is a set of components and tools which brings the user an interface (GUI / API) to setup, extend and manage an Open vStorage platform.

ovs-workers and ovs-webapp-api not restarted after being killed via ovs-watcher-framework going down #1709

Open · dejonghb opened this issue 7 years ago

dejonghb commented 7 years ago

On several occasions, for some as yet unknown reason, ovs-watcher-framework goes down (memcache?), taking ovs-workers and ovs-webapp-api down with it.

After some retries, ovs-watcher-framework gets restarted (ovs-watcher-framework.service: Service hold-off time over, scheduling restart), but ovs-workers and ovs-webapp-api remain inactive.

Filtered syslog attached (hope there's still enough info in there) syslog-filtered.gz

dejonghb commented 7 years ago
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: State 'stop-sigterm' timed out. Killing.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Main process exited, code=killed, status=9/KILL
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Open vStorage workers.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Unit entered failed state.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Failed with result 'signal'.
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 10:25:53 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 10:28:49 ftcmp02 systemd[1]: Starting Cleanup of Temporary Directories...
khenderick commented 7 years ago

Maybe the watcher goes down because one service is missing (as designed) and takes all framework services along with it (as designed). However, based on the logging, it looks like the workers don't stop fast enough; maybe that's why they don't start again once the watcher is restarted.

dejonghb commented 7 years ago
Jul 14 11:45:07 ftcmp02 systemd[1]: Stopped User Manager for UID 1001.
Jul 14 11:45:07 ftcmp02 systemd[1]: Removed slice User Slice of ovs.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Unit entered failed state.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Failed with result 'exit-code'.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopping Open vStorage volumerouter consumer...
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Open vStorage volumerouter consumer.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Service hold-off time over, scheduling restart.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage scheduled tasks.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Open vStorage workers...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage webapp API.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 11:56:53 ftcmp02 systemd[1]: Started Open vStorage workers.

--> trying to restart workers

khenderick commented 7 years ago

After discussions with @dejonghb we figured out the following possible solution: let the watcher leave a temp file behind while it takes its dependent services down, and only start again once that marker is gone.

This should make sure that all dependent services are down before the watcher is started again, which is the likely cause of the issue we see here.

dejonghb commented 7 years ago

Also remove the temp file when startup succeeds ;-) Maybe also check the timestamp on the temp file so leftovers can be ignored?

khenderick commented 7 years ago

Makes sense, an automatic restart shouldn't take more than a few seconds.
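
Putting the pieces from the last few comments together, a minimal sketch of such a startup guard in bash (the marker path and the 30-second staleness threshold are assumptions for illustration, not from the ticket):

#!/bin/bash
# Hypothetical guard run at watcher startup. Assumes the watcher touches
# $GUARD when it begins taking its dependent services down.
GUARD=/tmp/ovs-watcher-framework.stopping

if [ -f "$GUARD" ]; then
    age=$(( $(date +%s) - $(stat -c %Y "$GUARD") ))
    # An automatic restart shouldn't take more than a few seconds, so
    # anything older is a leftover and can be ignored.
    if [ "$age" -lt 30 ]; then
        echo "dependent services still stopping; refusing to start" >&2
        exit 1
    fi
fi

# Startup is going ahead: remove the marker so leftovers can't block the
# next restart.
rm -f "$GUARD"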

wimpers commented 7 years ago

@dejonghb Doesn't this duplicate https://github.com/openvstorage/framework/issues/1568 ?

dejonghb commented 7 years ago

#1568 is indeed related, but this issue has more logging and info on what the problem could be, plus a workaround, so I guess #1568 can be closed.

JeffreyDevloo commented 6 years ago

It could be that the exit code is actually 0, i.e. a clean shutdown. We should set Restart= to always instead of on-failure so the services restart in all cases.
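
As a sketch, that change could be applied with a systemd drop-in (the drop-in path is an assumption; the unit name comes from the logs above):

# Restart ovs-workers unconditionally; repeat for ovs-webapp-api if needed
sudo mkdir -p /etc/systemd/system/ovs-workers.service.d
printf '[Service]\nRestart=always\n' | sudo tee /etc/systemd/system/ovs-workers.service.d/restart.conf
sudo systemctl daemon-reload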

JeffreyDevloo commented 6 years ago

While testing I've noticed some behavior that might point to this issue.

Looking further into a fix.

JeffreyDevloo commented 6 years ago

Test report

Setup

The testing setup is as follows:

test_main.service: a simple overseeing service which other units can bind to (mimics the watcher)

[Unit]
Description=Application Control

[Service]
# Execute a dummy long-running program (it is failed manually during the test)
ExecStart=/bin/sleep infinity
Restart=on-failure

[Install]
# Components of this application should be started at boot time
WantedBy=multi-user.target

test_component_1.service: a simple component which binds to test_main (simulates either webapp-api or the workers). This service is controllable through lock files; the shell commands to drive it follow the unit file below.

[Unit]
Description=Application Component 1
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service has started
After=test_main.service

[Service]
# Controllable start: pretend the component is running until the start lock file appears, then exit 1 (simulated failure)
ExecStart=/bin/bash -c 'while [ ! -f /tmp/test_component_1.start.lock ]; do sleep 2; done; exit 1;'
# Controllable de-activating state: block in ExecStopPost until the lock file appears
ExecStopPost=/bin/bash -c 'while [ ! -f /tmp/test_component_1.lock ]; do sleep 2; done'
# Alternative: implement the same logic in Python
# ExecStart=/usr/bin/python /tmp/test_component_1.py
# Restart on a non-zero exit code, or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure

[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
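
For reference, the component above is driven from the shell through its two lock files (paths taken from the unit file):

# Make the ExecStart loop exit with code 1, simulating a component failure
touch /tmp/test_component_1.start.lock

# Release the ExecStopPost loop, ending the 'deactivating' state
touch /tmp/test_component_1.lock

# Clean up between runs so both loops block again
rm -f /tmp/test_component_1.start.lock /tmp/test_component_1.lock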

test_component_2.service: a simple component which binds to test_main (simulates either webapp-api or the workers). This component always follows the test_main state.

[Unit]
Description=Application Component 2
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service has started
After=test_main.service

[Service]
# Pretend that the component is running
ExecStart=/bin/sleep infinity
# Restart on a non-zero exit code, or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure

[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
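
A sketch of how the three units would be installed and wired up (paths are assumptions; enabling is what creates the WantedBy links so the components follow test_main):

sudo cp test_main.service test_component_1.service test_component_2.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable test_main.service test_component_1.service test_component_2.service
sudo systemctl start test_main.service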

Sending signal 9

When signal 9 is sent to the watcher, both the workers and webapp-api go down. When one of these is in the deactivating state and the watcher gets restarted manually, that service won't start anymore.

This is just an observation and has nothing to do with this ticket (in the ticket the watcher exited with code 1 on its own rather than being killed).
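
Roughly how the scenario above can be reproduced with systemctl's kill verb (unit names from the setup above):

# SIGKILL the main (watcher-like) unit; BindsTo= propagates the stop to
# both components
sudo systemctl kill --signal=SIGKILL test_main.service
systemctl status test_main.service test_component_1.service test_component_2.service

# If component 1 is stuck deactivating (its ExecStopPost loop is running),
# restart the main unit manually and check what comes back
sudo systemctl restart test_main.service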

Watcher failure

Just like in the ticket, the watcher service is made to exit with code 1; the Restart= setting should then schedule the service to restart.

All components start without failure

When all components start without failing, everything seems to be back in order.

Single component going into the deactivating state

The test_main service waits until the deactivating state is resolved before starting itself.

Component fails to start

If a component fails to start during the restart, it will be stuck in the failed state while the test_main component keeps running. This seems to be the state the ticket describes.

Hypothesis

If this is the case, not much can be done on the systemd side. I'd suggest fixing the reason why the workers cannot start while the watcher can; the watcher checks whether all services are available for use.