Monit should not time out before stop script times out

Problem

The stop script in the service-metrics job waits longer (40 seconds) than the monit stop timeout (30 seconds), so monit reports a failure before the stop script has a chance to kill a process that is slow to exit.

In the wild

A customer has reported that, on two occasions, a service instance upgrade aborted because service-metrics failed to stop. We have confirmed this in the logs they provided. We do not have an explanation for why service-metrics took a long time to stop, but had the kill_and_wait timeout been less than 30s, a kill -9 would probably have prevented the upgrade from aborting.

Analysis

The monit configuration for the service-metrics job defines the stop command as stop program "/var/vcap/jobs/service-metrics/bin/service_metrics_ctl stop".

service_metrics_ctl invokes kill_and_wait $pidfile 40. kill_and_wait allows the service_metrics process 40 seconds to shut down gracefully before issuing a kill -9. Unfortunately, monit times out after 30 seconds and raises "failed to stop", so the kill -9 is never reached in time.
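The race between the two timeouts can be sketched in a few lines of shell. This is only an illustration: stop_outcome is a hypothetical helper, not part of service-metrics-release or of the kill_and_wait implementation, and it assumes a process that ignores the graceful signal entirely.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only. Given the grace period passed to
# kill_and_wait and monit's stop timeout (both in seconds), print which
# deadline is reached first for a process that never exits on its own.
stop_outcome() {
  grace="$1"           # seconds kill_and_wait waits before kill -9
  monit_timeout="$2"   # seconds monit waits before raising "failed to stop"
  # Equal values are treated as a monit failure, since the result would be a race.
  if [ "$grace" -lt "$monit_timeout" ]; then
    echo "kill -9 after ${grace}s"
  else
    echo "monit reports failed to stop after ${monit_timeout}s"
  fi
}

stop_outcome 40 30   # values described above: monit gives up first
stop_outcome 25 30   # a grace period under 30s: kill -9 fires first
```

With the values described above (40 and 30), monit's deadline always comes first for a hung process, which matches the "failed to stop" seen in the customer logs.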

Suggestions

service-metrics-release/jobs/service-metrics/templates/service_metrics_ctl.erb could pass kill_and_wait a timeout shorter than 30s. The default is 25 seconds, which should work.
Or service-metrics-release/jobs/service-metrics/monit could allow longer than 40s for the stop command by appending 'with timeout {n} seconds' to the monit stop command.
Or a drain script could be implemented as an alternative to using kill_and_wait in the stop script.
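As a rough sketch of the monit option, the stop line might look something like the following. The 60 second value is only an assumption chosen to exceed the 40 second kill_and_wait grace period. The rest of the check block is omitted because its contents are not shown in this issue.

```
stop program "/var/vcap/jobs/service-metrics/bin/service_metrics_ctl stop" with timeout 60 seconds
```

Whichever value is chosen, it needs to stay comfortably above the grace period passed to kill_and_wait so that the kill -9 can complete before monit gives up.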

vmware-archive / service-metrics-release