The elasticsearch drain script needs improvement. The recurring problem is that it is frequently told job_shutdown in cases where the data on disk won't actually change (e.g. changing disk size, instance type, or IP address), which causes significant delays in what should otherwise be a quick process. I think the drain script should always, and only, block until the cluster is green. The case where a node is being removed from the cluster just adds complications; as long as you have replicas, the simple green check handles that scenario too, just reactively rather than preventively as originally intended.
Additionally, consider implementing and testing the following logic:
curl -X PUT -d '{ ... }' /_cluster/settings
monit unmonitor elasticsearch
curl -X POST /_node/local/_shutdown
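The "block until the cluster is green" behaviour proposed above could be sketched roughly as follows. This is only an illustration, not the existing drain script: the `ES_URL` default, the function names `cluster_status` and `wait_for_green`, and the attempt and interval defaults are all assumptions. It polls the cluster health endpoint and reads the `status` field from the JSON response.

```shell
#!/usr/bin/env bash
# Sketch only: block until the cluster reports green.
# ES_URL, the function names, and the default attempt/interval values
# are assumptions for illustration, not taken from the real drain script.

ES_URL="${ES_URL:-http://localhost:9200}"

# Print the cluster status (green, yellow, or red).
# Prints nothing if the node is unreachable or the response has no status field.
cluster_status() {
  curl -s "${ES_URL}/_cluster/health" \
    | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Poll until the cluster is green.
# $1 = maximum number of attempts (default 60)
# $2 = seconds to sleep between attempts (default 5)
# Returns 0 once green is seen, 1 if the attempts run out first.
wait_for_green() {
  local max_attempts="${1:-60}"
  local interval="${2:-5}"
  local attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    if [ "$(cluster_status)" = "green" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}
```

A drain script built on this would call `wait_for_green` regardless of the reason for the shutdown and decide what to do when it returns non-zero. Whether a timeout should fail the drain or let it proceed is a policy choice this sketch leaves open.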