Problem
The kill_and_wait timeout in the service-metrics job's stop script is longer than the monit timeout, so the stop script never has a chance to kill a process that is slow to stop.
In the wild
A customer has reported two occasions on which a service instance upgrade aborted because service-metrics failed to stop. We have confirmed this in the logs they provided. We do not have an explanation for why service-metrics took so long to stop, but had the kill_and_wait timeout been less than 30s, a kill -9 would probably have prevented the upgrade from aborting.
Analysis
The monit configuration for the service-metrics job defines its stop command as stop program "/var/vcap/jobs/service-metrics/bin/service_metrics_ctl stop".
service_metrics_ctl invokes kill_and_wait $pidfile 40. kill_and_wait allows the service_metrics process 40 seconds to shut down gracefully before issuing a kill -9. Unfortunately, monit times out after 30 seconds and raises "failed to stop", so the kill -9 is never reached.
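The mismatch comes down to which timer fires first. The following is a minimal sketch of that comparison, using the 40s and 30s values from the analysis; the function name first_to_fire is made up for illustration and is not part of the release.

```bash
#!/usr/bin/env bash
# Illustrative only: compares the two timeouts for a process that never
# exits after SIGTERM, and prints which timer fires first.
first_to_fire() {
  local kill_wait="$1"      # seconds kill_and_wait waits before kill -9
  local monit_timeout="$2"  # seconds monit waits for the stop program

  # A tie is treated as a monit failure, since the outcome would be a race.
  if [ "$kill_wait" -lt "$monit_timeout" ]; then
    echo "kill -9 after ${kill_wait}s"
  else
    echo "monit 'failed to stop' after ${monit_timeout}s"
  fi
}

first_to_fire 40 30   # values described in the analysis
first_to_fire 25 30   # kill_and_wait default
```

With the current values monit gives up first; with any kill_and_wait timeout below 30s the forced kill happens before monit reports a failure.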
Suggestions
service-metrics-release/jobs/service-metrics/templates/service_metrics_ctl.erb could pass a timeout of less than 30s to kill_and_wait. The default is 25 seconds, which should work.
or service-metrics-release/jobs/service-metrics/monit could allow longer than 40s for the stop command by appending 'with timeout {n} seconds' to the stop program line.
or a drain script could be implemented as an alternative to using kill_and_wait in the stop script.
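As a rough illustration of the first suggestion, here is a sketch of a stop helper whose deadline stays below monit's 30 seconds. It is a hypothetical stand-in written for this issue, not the actual kill_and_wait helper; the name stop_with_deadline and its 25-second default are assumptions chosen to mirror the behaviour described above.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for kill_and_wait, for illustration only.
# Sends SIGTERM, polls once per second, then sends SIGKILL at the deadline.
stop_with_deadline() {
  local pidfile="$1"
  local timeout="${2:-25}"   # keep this below monit's 30s stop timeout
  local pid waited=0

  [ -f "$pidfile" ] || return 0                # nothing to stop
  pid="$(cat "$pidfile")"

  kill -TERM "$pid" 2>/dev/null || true        # ask for a graceful shutdown
  while kill -0 "$pid" 2>/dev/null; do         # process still exists?
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null || true    # deadline passed: force it
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  rm -f "$pidfile"
}
```

A stop script could then call something like stop_with_deadline "$pidfile" 25, so that the forced kill happens before monit would report "failed to stop".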
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.
The labels on this GitHub issue will be updated when the story is started.