vmware-archive / service-metrics-release

Service Metrics BOSH Release
https://docs.pivotal.io/svc-sdk/service-metrics
Apache License 2.0

Monit should not time out before stop script times out #3

Open jacknewberry opened 6 years ago

jacknewberry commented 6 years ago

Problem

The stop timeout in the service-metrics job (40 seconds) is longer than the monit timeout (30 seconds), so monit gives up before the stop script has a chance to kill a process that is slow to stop.

In the wild

A customer reports that, on two occasions, a service instance upgrade aborted because service-metrics failed to stop. We have confirmed this in the logs they provided. We do not have an explanation for why service-metrics took so long to stop, but had the kill_and_wait timeout been less than 30s, a kill -9 would probably have prevented the upgrade from aborting.

Analysis

The service-metrics job's monit configuration defines its stop command as stop program "/var/vcap/jobs/service-metrics/bin/service_metrics_ctl stop".

service_metrics_ctl invokes kill_and_wait $pidfile 40. kill_and_wait allows the service_metrics process 40 seconds to shut down gracefully before issuing a kill -9. Unfortunately, monit times out after 30 seconds and raises "failed to stop" before that kill -9 is ever sent.
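
To make the timing concrete, here is a minimal sketch of the behaviour described above. It is a hypothetical stand-in, not the actual kill_and_wait helper: the function name kill_and_wait_sketch, the one-second polling interval and the pidfile cleanup are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for kill_and_wait; NOT the real BOSH helper.
# Usage: kill_and_wait_sketch <pidfile> [timeout_seconds]
kill_and_wait_sketch() {
  local pidfile="$1"
  local timeout="${2:-25}"   # 25s default, as mentioned in this issue
  local pid
  pid="$(cat "$pidfile")" || return 1

  # Ask the process to shut down gracefully (SIGTERM).
  kill "$pid" 2>/dev/null

  # Poll once per second until the process exits or the grace period ends.
  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      # Grace period expired: force the process down (SIGKILL).
      kill -9 "$pid" 2>/dev/null
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done

  rm -f "$pidfile"
}
```

With a timeout of 40, the kill -9 branch in a loop like this is only reached after roughly 40 seconds, which is after monit's 30 second limit has already expired. With a value under 30, it would fire while monit is still waiting.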

Suggestions

  1. service-metrics-release/jobs/service-metrics/templates/service_metrics_ctl.erb could pass kill_and_wait a timeout shorter than 30s. The default is 25 seconds, which should work.
  2. or service-metrics-release/jobs/service-metrics/monit could allow longer than 40s for the stop command by appending 'with timeout {n} seconds' to the monit stop command.
  3. or a drain script could be implemented as an alternative to using kill_and_wait in the stop script.
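
For suggestion 2, the changed line in the monit file might look roughly like the fragment below. This is a sketch only: the value of 60 seconds is an arbitrary illustration (anything comfortably above 40 would serve), and it assumes the monit version shipped on the stemcell accepts the "with timeout" clause on stop program.

```
stop program "/var/vcap/jobs/service-metrics/bin/service_metrics_ctl stop" with timeout 60 seconds
```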

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.