openvstorage / framework

The Framework is a set of components and tools which brings the user an interface (GUI / API) to setup, extend and manage an Open vStorage platform.

ovs-workers and ovs-webapp-api not restarted after being killed via ovs-watcher-framework going down #1709

Open · dejonghb opened this issue 7 years ago

dejonghb commented 7 years ago

On several occasions, for some as yet unknown reason, ovs-watcher-framework goes down (memcache?), taking ovs-workers and ovs-webapp-api down with it.

After some retries, ovs-watcher-framework gets restarted (ovs-watcher-framework.service: Service hold-off time over, scheduling restart), but ovs-workers and ovs-webapp-api remain inactive.

Filtered syslog attached (hope there's still enough info in there) syslog-filtered.gz

dejonghb commented 7 years ago
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: State 'stop-sigterm' timed out. Killing.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Main process exited, code=killed, status=9/KILL
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Open vStorage workers.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Unit entered failed state.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Failed with result 'signal'.
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 10:25:53 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 10:28:49 ftcmp02 systemd[1]: Starting Cleanup of Temporary Directories...
khenderick commented 7 years ago

Maybe the watcher goes down because one service is missing (as designed) and takes all framework services along with it (as designed). However, based on the logging, it looks like the workers don't stop fast enough; maybe that's why they don't start again once the watcher is restarted.

dejonghb commented 7 years ago
Jul 14 11:45:07 ftcmp02 systemd[1]: Stopped User Manager for UID 1001.
Jul 14 11:45:07 ftcmp02 systemd[1]: Removed slice User Slice of ovs.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Unit entered failed state.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Failed with result 'exit-code'.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopping Open vStorage volumerouter consumer...
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Open vStorage volumerouter consumer.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Service hold-off time over, scheduling restart.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage scheduled tasks.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Open vStorage workers...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage webapp API.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 11:56:53 ftcmp02 systemd[1]: Started Open vStorage workers.

--> trying to restart workers

khenderick commented 7 years ago

After discussions with @dejonghb we figured out the following possible solution: let the watcher leave a temp file behind while it takes its dependent services down, and only start again once that marker is gone.

This should make sure that all dependent services are down before the watcher is started again, which is the likely cause of the issue we see here.

dejonghb commented 7 years ago

Also remove the temp file when startup succeeds ;-) Maybe also check the timestamp on the temp file so leftovers can be ignored?

khenderick commented 7 years ago

Makes sense, an automatic restart shouldn't take more than a few seconds.
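
Putting the pieces from the last few comments together, a minimal sketch of such a startup guard in bash (the marker path and the 30-second staleness threshold are assumptions for illustration, not from the ticket):

#!/bin/bash
# Hypothetical guard run at watcher startup. Assumes the watcher touches
# $GUARD when it begins taking its dependent services down.
GUARD=/tmp/ovs-watcher-framework.stopping

if [ -f "$GUARD" ]; then
    age=$(( $(date +%s) - $(stat -c %Y "$GUARD") ))
    # An automatic restart shouldn't take more than a few seconds, so
    # anything older is a leftover and can be ignored.
    if [ "$age" -lt 30 ]; then
        echo "dependent services still stopping; refusing to start" >&2
        exit 1
    fi
fi

# Startup is going ahead: remove the marker so leftovers can't block the
# next restart.
rm -f "$GUARD"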

wimpers commented 7 years ago

@dejonghb Doesn't this duplicate https://github.com/openvstorage/framework/issues/1568 ?

dejonghb commented 7 years ago

#1568 is indeed related, but this issue has more logging and info on what the problem could be, plus a workaround, so I guess #1568 can be closed.

JeffreyDevloo commented 6 years ago

It could be that the exit code is actually 0, i.e. a clean shutdown. We should set Restart= to always instead of on-failure so the services restart in all cases.
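
As a sketch, that change could be applied with a systemd drop-in (the drop-in path is an assumption; the unit name comes from the logs above):

# Restart ovs-workers unconditionally; repeat for ovs-webapp-api if needed
sudo mkdir -p /etc/systemd/system/ovs-workers.service.d
printf '[Service]\nRestart=always\n' | sudo tee /etc/systemd/system/ovs-workers.service.d/restart.conf
sudo systemctl daemon-reload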

JeffreyDevloo commented 6 years ago

While testing I've noticed some behavior that might point to this issue.

Looking further into a fix.

JeffreyDevloo commented 6 years ago

Test report

Setup

The testing setup is as follows:

test_main.service: a simple overseeing service which other units can bind to (mimics the watcher)

[Unit]
Description=Application Control

[Service]
# Execute a dummy long-running program (it is failed manually during the test)
ExecStart=/bin/sleep infinity
Restart=on-failure

[Install]
# Components of this application should be started at boot time
WantedBy=multi-user.target

test_component_1.service: a simple component which binds to test_main (simulates either webapp-api or the workers). This service is controllable through lock files; the shell commands to drive it follow the unit file below.

[Unit]
Description=Application Component 1
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service has started
After=test_main.service

[Service]
# Controllable start: pretend the component is running until the start lock file appears, then exit 1 (simulated failure)
ExecStart=/bin/bash -c 'while [ ! -f /tmp/test_component_1.start.lock ]; do sleep 2; done; exit 1;'
# Controllable de-activating state: block in ExecStopPost until the lock file appears
ExecStopPost=/bin/bash -c 'while [ ! -f /tmp/test_component_1.lock ]; do sleep 2; done'
# Alternative: implement the same logic in Python
# ExecStart=/usr/bin/python /tmp/test_component_1.py
# Restart on a non-zero exit code, or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure

[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
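
For reference, the component above is driven from the shell through its two lock files (paths taken from the unit file):

# Make the ExecStart loop exit with code 1, simulating a component failure
touch /tmp/test_component_1.start.lock

# Release the ExecStopPost loop, ending the 'deactivating' state
touch /tmp/test_component_1.lock

# Clean up between runs so both loops block again
rm -f /tmp/test_component_1.start.lock /tmp/test_component_1.lock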

test_component_2.service: a simple component which binds to test_main (simulates either webapp-api or the workers). This component always follows the test_main state.

[Unit]
Description=Application Component 2
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service has started
After=test_main.service

[Service]
# Pretend that the component is running
ExecStart=/bin/sleep infinity
# Restart on a non-zero exit code, or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure

[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
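
A sketch of how the three units would be installed and wired up (paths are assumptions; enabling is what creates the WantedBy links so the components follow test_main):

sudo cp test_main.service test_component_1.service test_component_2.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable test_main.service test_component_1.service test_component_2.service
sudo systemctl start test_main.service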

Sending signal 9

When signal 9 is sent to the watcher, both the workers and webapp-api go down. When one of these is in the deactivating state and the watcher gets restarted manually, that service won't start anymore.

This is just an observation and has nothing to do with this ticket (in the ticket the watcher exited with code 1 on its own rather than being killed).
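
Roughly how the scenario above can be reproduced with systemctl's kill verb (unit names from the setup above):

# SIGKILL the main (watcher-like) unit; BindsTo= propagates the stop to
# both components
sudo systemctl kill --signal=SIGKILL test_main.service
systemctl status test_main.service test_component_1.service test_component_2.service

# If component 1 is stuck deactivating (its ExecStopPost loop is running),
# restart the main unit manually and check what comes back
sudo systemctl restart test_main.service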

Watcher failure

Just like in the ticket, the watcher service is made to exit with code 1; the Restart= setting should then schedule the service to restart.

All components start without failure

When all components start without failing, everything seems to be back in order.

Single component going into the deactivating state

The test_main service waits until the deactivating state is resolved before starting itself.

Component fails to start

If a component fails to start during the restart, it will be stuck in the failed state while the test_main component keeps running. This seems to be the state the ticket describes.

Hypothesis

If this is the case, not much can be done on the systemd side. I'd suggest fixing the reason why the workers cannot start while the watcher can; the watcher checks whether all services are available for use.