Open alexjst opened 1 year ago
Hi @alexjst
Milvus actually already supports graceful stop in our latest K8s operator.
Basically, what we do is mark one of the querynodes as stopped, and the balancer is responsible for moving all segments on that querynode to the other available querynodes.
If you are using the latest K8s operator, you should already see the unserviceable period greatly shortened.
But you are right, using SIGTERM should be very straightforward.
Contributions are welcome. The only thing we need to do is change the query/index/data node code so that, when SIGTERM is received, the node marks itself as being in the Stopping state and waits for the coordinator to drain its segments to other querynodes.
@zwd1208 @LoveEachDay feel free to comment since I'm not an expert in K8s operator
/assign @zwd1208
@alexjst Milvus already supports graceful stop via the K8s operator or Helm chart deployment.
This is an example of a querynode graceful stop:
It is worth mentioning that a graceful stop of the Milvus cluster depends on the stop order of all components; following the right order is key to achieving a seamless shutdown.
@jaime0815 The problem is not so much about the querynode itself that is being shut down. It's about the proxynodes that, during this shutdown and rebalance process, are not able to serve traffic normally. We see huge latency increases, connection failures (error messages) from the proxy nodes (which keep shard leader caches), and QPS drop that can last for minutes, although we have replicas=2. Ideally, when we have data replicas, searches can still be performed without interruption. The interruption is more obvious with larger datasets.
Can you upload your proxy logs here? I'd like to verify my guess.
/assign @weiliu1031
Hi, we figured out how to do graceful query node shutdown in HashiCorp Nomad environment now. Two questions though:
- What is the recommended amount of time to allow for a graceful shutdown of a querynode (and of the other types of worker nodes)?
- Do coordinator nodes (rootcoord, datacoord, indexcoord, querycoord) also support rolling updates with zero downtime?
@alexjst
we figured out how to do graceful query node shutdown
Could you please enlighten us on how you did this?
I'm curious too.
I am experimenting with how to reliably perform a "rollout restart". I set the PDB minAvailable for the query nodes to 2/3 of the total replicas, and when I did a rollout restart of the query node deployment, the query node's segments were lost.
Here's what happens:
How can we do a graceful shutdown?
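For reference, a PodDisruptionBudget along the lines described above might look like the sketch below; the labels and namespace are placeholders for whatever your deployment actually uses. Also note that a PDB only limits voluntary evictions (e.g. kubectl drain), while a Deployment rollout restart is governed by the rollout strategy's maxUnavailable/maxSurge, which may be part of why the PDB alone did not prevent the disruption.

```yaml
# Hypothetical PDB for Milvus query nodes; the selector labels are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: milvus-querynode-pdb
  namespace: milvus
spec:
  minAvailable: "66%"   # roughly 2/3 of the querynode replicas, as described above
  selector:
    matchLabels:
      app.kubernetes.io/name: milvus
      component: querynode
```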
@weiliu1031
And one more question.
If I restart the query node without shutting it down gracefully, sometimes the collection can't be reloaded, and the segments are not found even when the collection does reload.
pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: segment=446264053679358703: segment lacks: channel=by-dev-rootcoord-dml_2_446264053675553139v0: channel not available)>
I understand that the segments and metadata are preserved in S3 and etcd, but I'm wondering why the collection load fails and the search query fails.
Is there a trick to recovering data when a graceful shutdown fails?
@Uijeong97 you can see the relevant discussion here : https://discord.com/channels/1160323594396635310/1182918245590777956/1182918245590777956
Contrary to that claim, though, I don't see it working out of the box, i.e. the graceful shutdown is not triggered by the SIGTERM from Kubernetes.
Instead I have added a piece of code in the preStop hook of the query node deployment where I find the process id for milvus and manually send it SIGTERM. I have observed that after this the node starts the graceful shutdown process and moves all of its segments out to other nodes.
Although this is also not seamless: there is still a glitch where queries fail (~3 sec), but after that it recovers.
check this out for more details on prestop logic : https://github.com/milvus-io/milvus/blob/master/scripts/stop_graceful.sh
@weiliu1031 @sunby
Should we offer some scripts to disable auto balance and gracefully move segments/channels out?
This could be really helpful
If I restart the query node without shutting it down gracefully, sometimes the collection can't be reloaded and the segments are not found even after reloading. Is there a trick to recovering data when a graceful shutdown fails?
If you deploy Milvus on K8s, make sure that the K8s setting terminationGracePeriodSeconds for the query node is larger than the query node's graceful stop timeout. This gives the query node enough time to move all of its segments to other nodes; otherwise the query node may be killed by K8s before it finishes its graceful stop.
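A minimal sketch of what that looks like on the Kubernetes side; in practice the operator or Helm chart renders this for you, and the name, labels, image tag, and command below are placeholders. The matching Milvus-side graceful stop timeout lives in milvus.yaml and varies by version.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: milvus-querynode
spec:
  selector:
    matchLabels:
      component: querynode
  template:
    metadata:
      labels:
        component: querynode
    spec:
      # Must be larger than the query node's graceful stop timeout so the
      # node can finish handing its segments off before SIGKILL.
      terminationGracePeriodSeconds: 1800
      containers:
        - name: querynode
          image: milvusdb/milvus:v2.3.3             # placeholder tag
          command: ["milvus", "run", "querynode"]   # typical invocation; adjust to your chart/operator
```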
Some info about the 3s search failure you mentioned above: the proxy keeps a cache of each channel's load location, which is refreshed periodically, every 3s for now. If necessary, you can reduce the interval via proxy.shardLeaderCacheInterval, but a shorter interval gets expensive if you have many collections, so handle it with care.
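A hedged sketch of what such an override could look like in milvus.yaml; the key name and unit follow the comment above, so verify them against your Milvus version before relying on this.

```yaml
proxy:
  # Refresh interval of the proxy's shard-leader cache. Default is 3 (seconds)
  # per the comment above; lowering it trades faster failover for more load
  # when there are many collections.
  shardLeaderCacheInterval: 1
```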
@weiliu1031 @sunby
Should we offer some scripts to disable auto balance and gracefully move segments/channels out?
This could be really helpful
Good idea, I will try it.
@roy-akash
Thanks for the answer, it was a great help. Is it possible to run the stop_graceful.sh script within the query node's lifecycle preStop hook?
@Uijeong97 Yeah, you can do that. Add the below to your deployment:
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          #!/bin/sh
          set -ex
          # redirect this script's output to the container's stdout/stderr
          exec 1<>/proc/1/fd/1;
          exec 2>&1;
          # find the milvus process id
          get_milvus_process() {
            milvus_process_id="$(ps -e | grep milvus | grep -v grep | awk '{print $1}')"
            printf '%s' "$milvus_process_id"
          }
          echo "Stopping milvus..."
          if [ -z "$(get_milvus_process)" ]; then
            echo "No milvus process"
            exit 0
          fi
          # SIGTERM triggers Milvus's graceful stop
          kill -15 "$(get_milvus_process)"
          # wait until the process has actually exited
          while :
          do
            sleep 3
            if [ -z "$(get_milvus_process)" ]; then
              echo "Milvus stopped"
              break
            fi
          done
```
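If you prefer to call the repo's script instead of inlining the shell above, a preStop hook along these lines should also work. The in-container path is an assumption (the script lives in the repo under scripts/ and has to be baked into or mounted in the image), and it is worth checking the script's expected arguments and environment first.

```yaml
lifecycle:
  preStop:
    exec:
      # Hypothetical path; stop_graceful.sh is not necessarily shipped in the stock image.
      command: ["/bin/sh", "-c", "/milvus/scripts/stop_graceful.sh"]
```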
Great!
@roy-akash cc. @xiaofan-luan
Thanks for the code!!!👍
I have one more question.
The process of terminating a pod in K8s is as follows:
- Pod is set to the "Terminating" state and removed from the endpoints list of all Services
- preStop hook is executed
- SIGTERM signal is sent to the pod
- Kubernetes waits for a grace period
- SIGKILL signal is sent to the pod, and the pod is removed
When pod termination starts, the Milvus process is running. When the pod receives the SIGTERM (phase 3), why doesn't Kubernetes wait for the Milvus process to terminate (phase 4)?
Killing the Milvus process in a preStop hook seems like something the K8s termination process should be doing... I'm wondering why the K8s termination process doesn't terminate the Milvus process normally. Is there an expected reason?
When pod termination starts, the Milvus process is running. When the pod receives the SIGTERM (phase 3), why doesn't Kubernetes wait for the Milvus process to terminate (phase 4)?
There is a hard timeout limit in Kubernetes controlled by the deployment config terminationGracePeriodSeconds. Set this to a higher value like ~1800; by default it is 30s in Kubernetes.
Killing the Milvus process in a preStop hook seems like something the K8s termination process should be doing... Is there an expected reason why it doesn't?
@weiliu1031 @xiaofan-luan: Yup, this seems like a miss; there should be auto handling for this, imo.
@roy-akash
Thank you so much, that makes a lot of sense. 🔥
@weiliu1031
Sorry for so many questions. I just wanted to ask one more! 🤣
Is there a way to check if terminationGracePeriodSeconds is greater than the graceful shutdown time of the query node?
Does the graceful shutdown time of the query node depend on each situation? Or is there any generalized standard?
Milvus doesn't have a policy to check the K8s terminationGracePeriodSeconds setting; it's recommended to set the right value and check it by hand.
The query node's graceful shutdown time is related to the data scale on the query node. If you have a very large-scale dataset in the Milvus cluster, such as hundreds of GBs, you may need to change the config; the default value is enough for most normal cases.
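For completeness, a sketch of the Milvus-side knob to compare against terminationGracePeriodSeconds when checking by hand. The key name and location are an assumption from memory and have changed between releases, so verify them against the milvus.yaml shipped with your version.

```yaml
common:
  # Time budget (seconds) a stopping node gets to hand off its segments;
  # keep the pod's terminationGracePeriodSeconds at least this large.
  gracefulStopTimeout: 1800
```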
Is there an existing issue for this?
Is your feature request related to a problem? Please describe.
Both Kubernetes and Nomad incorporate the concept of 'graceful shutdown' through node draining. During regular server maintenance operations, any worker node, including datanodes, indexnodes, and querynodes, may receive a SIGTERM signal to initiate the shutdown process. Following the shutdown, a new node allocation process may be initiated to replace the previous node. However, in the case of Milvus, node draining currently causes service disruptions that can last for several minutes. This feature request aims to address this issue by implementing a smooth and orderly shutdown process for worker nodes in order to achieve zero downtime.
To achieve a seamless shutdown experience, the suggested priority order for graceful shutdown/node draining is as follows: querynodes, datanodes, and indexnodes. By prioritizing the graceful shutdown of querynodes, Milvus can maintain uninterrupted query services, ensuring minimal disruption to the overall system performance.
Describe the solution you'd like.
The solution involves enhancing Milvus to effectively support the graceful shutdown of worker nodes, with query nodes being given higher priority. Taking inspiration from established practices in Kubernetes and Nomad, Milvus should handle the SIGTERM signal in a manner that allows query nodes to finalize ongoing queries and seamlessly redirect new queries to other available query nodes that hold the same data replica. Simultaneously, the shutdown process should be carefully coordinated to ensure a smooth transition for other types of nodes, including datanodes, indexnodes, proxynodes, and coordinate nodes.
By prioritizing the graceful shutdown of query nodes, Milvus can ensure uninterrupted query availability and significantly reduce disruptions to the system during regular server maintenance operations. This enhancement will effectively maintain the system's overall performance and stability, providing a seamless experience for users relying on Milvus for their query operations.
Describe an alternate solution.
An alternative approach to address this issue with query nodes is to implement a retry mechanism within the query client, such as proxy nodes. When a query node being shut down fails, the query client quickly times out and retries the same query on other available query nodes with the same data replicas. Although the search on the shutting-down node would fail, the retries on alternative query nodes should succeed, providing accurate results within an acceptable latency due to timeout settings. However, it's important to note that leveraging the established concept of graceful shutdown through node draining, as seen in Kubernetes and Nomad, aligns Milvus with industry standards and best practices. By implementing a built-in graceful shutdown mechanism, Milvus ensures a reliable and predictable shutdown process for query nodes, minimizing service disruptions and maintaining system stability.
Anything else? (Additional Context)
https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
https://discuss.hashicorp.com/t/whats-the-recommended-way-to-drain-a-node-and-shutdown-all-tasks-on-the-node-gracefully/45039