milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Feature]: Support Graceful Shutdown of Worker Nodes in Milvus #24272

Open alexjst opened 1 year ago

alexjst commented 1 year ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

Both Kubernetes and Nomad incorporate the concept of 'graceful shutdown' through node draining. During regular server maintenance operations, any worker node, including datanodes, indexnodes, and querynodes, may receive a SIGTERM signal to initiate the shutdown process. Following the shutdown, a new node allocation process may be initiated to replace the previous node. However, in the case of Milvus, node draining currently causes service disruptions that can last for several minutes. This feature request aims to address this issue by implementing a smooth and orderly shutdown process for worker nodes in order to achieve zero downtime.

To achieve a seamless shutdown experience, the suggested priority order for graceful shutdown/node draining is as follows: querynodes, datanodes, and indexnodes. By prioritizing the graceful shutdown of querynodes, Milvus can maintain uninterrupted query services, ensuring minimal disruption to the overall system performance.

Describe the solution you'd like.

The solution involves enhancing Milvus to effectively support the graceful shutdown of worker nodes, with query nodes being given higher priority. Taking inspiration from established practices in Kubernetes and Nomad, Milvus should handle the SIGTERM signal in a manner that allows query nodes to finalize ongoing queries and seamlessly redirect new queries to other available query nodes that hold the same data replica. Simultaneously, the shutdown process should be carefully coordinated to ensure a smooth transition for other types of nodes, including datanodes, indexnodes, proxy nodes, and coordinator nodes.

By prioritizing the graceful shutdown of query nodes, Milvus can ensure uninterrupted query availability and significantly reduce disruptions to the system during regular server maintenance operations. This enhancement will effectively maintain the system's overall performance and stability, providing a seamless experience for users relying on Milvus for their query operations.

Describe an alternate solution.

An alternative approach to address this issue with query nodes is to implement a retry mechanism within the query client, such as proxy nodes. When a query node being shut down fails, the query client quickly times out and retries the same query on other available query nodes with the same data replicas. Although the search on the shutting-down node would fail, the retries on alternative query nodes should succeed, providing accurate results within an acceptable latency due to timeout settings. However, it's important to note that leveraging the established concept of graceful shutdown through node draining, as seen in Kubernetes and Nomad, aligns Milvus with industry standards and best practices. By implementing a built-in graceful shutdown mechanism, Milvus ensures a reliable and predictable shutdown process for query nodes, minimizing service disruptions and maintaining system stability.

Anything else? (Additional Context)

https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
https://discuss.hashicorp.com/t/whats-the-recommended-way-to-drain-a-node-and-shutdown-all-tasks-on-the-node-gracefully/45039

xiaofan-luan commented 1 year ago

Hi @alexjst

Milvus actually already supports graceful stop in our latest K8s operator.

Basically, what we do is mark one of the querynodes as stopped, and the balancer is responsible for moving all segments on that querynode to the other available querynodes.

If you are using the latest K8s operator, you should already see the unserviceable period greatly shortened.

But you are right, using SIGTERM should be very straightforward.

Contributions are welcome. The only thing we need to do is change the query/index/data node code: when a SIGTERM is received, the node marks itself as being in the Stopping state and waits for the coordinator to drain its segments to other querynodes.

@zwd1208 @LoveEachDay feel free to comment, since I'm not an expert on the K8s operator.

xiaofan-luan commented 1 year ago

/assign @zwd1208

jaime0815 commented 1 year ago

@alexjst Milvus already supports graceful stop via the K8s operator or Helm chart deployment.

Here is an example of a graceful stop of a querynode:

  1. The pod receives a SIGTERM signal.
  2. The querynode starts stopping within the preStop hook, up to the timeout.
  3. Segments and channels are balanced to the other querynodes.
  4. The old querynode goes into offline status once balancing finishes in the preStop stage, and then the pod is stopped.

It is worth mentioning that a graceful stop of the whole Milvus cluster also depends on the stop order of all components; this is key to achieving a seamless shutdown.
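As a rough sketch of where these steps attach in Kubernetes (not the exact manifest produced by the Milvus operator or Helm chart; the container name, preStop command, and timeout value below are illustrative assumptions):

  spec:
    terminationGracePeriodSeconds: 1800   # must outlast the segment/channel hand-off
    containers:
      - name: querynode
        lifecycle:
          preStop:
            exec:
              command:
                - /bin/sh
                - -c
                - |
                  # Assumed hook: send SIGTERM to the milvus process, then wait for it to exit
                  pid="$(ps -e | grep milvus | grep -v grep | awk '{print $1}')"
                  [ -z "$pid" ] && exit 0
                  kill -15 "$pid"
                  while [ -n "$(ps -e | grep milvus | grep -v grep)" ]; do sleep 3; done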

alexjst commented 1 year ago

@jaime0815 The problem is not so much about the querynode itself that is being shut down. It's about the proxynodes that, during this shutdown and rebalance process, are not able to serve traffic normally. We see huge latency increases, connection failures (error messages) from the proxy nodes (which keep shard leader caches), and QPS drop that can last for minutes, although we have replicas=2. Ideally, when we have data replicas, searches can still be performed without interruption. The interruption is more obvious with larger datasets.

weiliu1031 commented 1 year ago

@jaime0815 The problem is not so much about the querynode itself that is being shut down. It's about the proxynodes that, during this shutdown and rebalance process, are not able to serve traffic normally. We see huge latency increases, connection failures (error messages) from the proxy nodes (which keep shard leader caches), and QPS drop that can last for minutes, although we have replicas=2. Ideally, when we have data replicas, searches can still be performed without interruption. The interruption is more obvious with larger datasets.

Can you upload your proxy logs here? I'd like to verify my guess.

weiliu1031 commented 1 year ago

/assign @weiliu1031

alexjst commented 1 year ago

Hi, we figured out how to do graceful query node shutdown in a HashiCorp Nomad environment now. Two questions though:

  1. What is the recommended time length to allow for the graceful shutdown of a querynode (and also other types of worker nodes)?
  2. Do coordinator nodes (rootcoord, datacoord, indexcoord, querycoord) also support rolling updates with zero downtime?
weiliu1031 commented 1 year ago

Hi, we figured out how to do graceful query node shutdown in a HashiCorp Nomad environment now. Two questions though:

  1. What is the recommended time length to allow for the graceful shutdown of a querynode (and also other types of worker nodes)?
  2. Do coordinator nodes (rootcoord, datacoord, indexcoord, querycoord) also support rolling updates with zero downtime?

  1. For querynode graceful shutdown, we move the segments loaded on that querynode, so the time cost depends on the total segment size. For a single-replica collection, 1h is safe enough; for a multi-replica collection, losing one querynode won't affect access to the collection, so a shorter time can be tolerated.
  2. Coordinators have a standby policy by design: during a rolling upgrade there will be two coordinator components, the new one standing by, and after the old one shuts down it becomes active (see the config sketch below).
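If it helps, recent Milvus versions expose this standby behavior as a config switch. A minimal milvus.yaml sketch is below; the enableActiveStandby key names are written from memory, so treat them as an assumption and verify them against the milvus.yaml shipped with your version:

  rootCoord:
    enableActiveStandby: true    # assumed key: allow a second rootcoord to start as standby
  dataCoord:
    enableActiveStandby: true
  queryCoord:
    enableActiveStandby: true
  indexCoord:
    enableActiveStandby: true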
roy-akash commented 11 months ago

@alexjst

we figured out how to do graceful query node shutdown

Could you please explain how you did this?

Uijeong97 commented 11 months ago

@alexjst

we figured out how to do graceful query node shutdown

Could you please explain how you did this?

I'm curious too.

I am experimenting with how to reliably perform a rollout restart. I set the querynode PDB minAvailable to 2/3 of the total replicas, and when I did a rollout restart of the querynode deployment, the segments on the querynodes were lost.

Here's what happens:

  1. New query nodes were starting up,
  2. they changed to the Running state,
  3. and then the existing query nodes were shut down without their segments being handed off,
  4. so when we reloaded the collection, the segments were lost.

How can we do a graceful shutdown?

Uijeong97 commented 11 months ago

@weiliu1031

And one more question.

If I restart the querynode without killing it gracefully, sometimes the collection can't be reloaded, and the segments are not found even when the collection is reloaded.

(screenshot: the collection that fails to load)

I understand that the segments and metadata are preserved in S3 and etcd, but I'm wondering why the collection load fails and the search query fails.

Is there a trick to recovering data when a graceful shutdown fails?

roy-akash commented 11 months ago

@Uijeong97 you can see the relevant discussion here : https://discord.com/channels/1160323594396635310/1182918245590777956/1182918245590777956

Contrary to the claim, though, I don't see it working out of the box, i.e. graceful shutdown is not triggered by the SIGTERM from Kubernetes.

Instead, I added a piece of code in the preStop hook of the querynode deployment where I find the process ID of milvus and manually send it a SIGTERM. I have observed that after this the node starts the graceful shutdown process and moves all of its segments out to other nodes.

Although this is also not seamless: there is still a glitch where queries fail (~3 sec), but after that it recovers.

Check this out for more details on the preStop logic: https://github.com/milvus-io/milvus/blob/master/scripts/stop_graceful.sh

xiaofan-luan commented 11 months ago

@weiliu1031 @sunby

Should we offer some scripts to disable auto-balance and gracefully move segments/channels out?

This could be really helpful

weiliu1031 commented 11 months ago

@weiliu1031

And one more question.

If I restart the querynode without killing it gracefully, sometimes the collection can't be reloaded, and the segments are not found even when the collection is reloaded.

  • The collection below does not load.
(screenshot: the collection that fails to load)
  • The results of a search query on the loaded collection.
pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: segment=446264053679358703: segment lacks: channel=by-dev-rootcoord-dml_2_446264053675553139v0: channel not available)>

I understand that the segments and metadata are preserved in S3 and etcd, but I'm wondering why the collection load fails and the search query fails.

Is there a trick to recovering data when a graceful shutdown fails?

If you deploy Milvus on K8s, make sure that the K8s setting terminationGracePeriodSeconds for the querynode is larger than the querynode's graceful stop timeout. This gives the querynode enough time to move all of its segments to other nodes; otherwise the querynode may be killed by K8s before it finishes its graceful stop.
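To make the relationship concrete, here is a sketch of the two settings side by side. The milvus.yaml key name (queryNode.gracefulStopTimeout, in seconds) is an assumption to verify against your Milvus version, and 1800 is only an example value:

  # milvus.yaml (assumed key name; unit: seconds)
  queryNode:
    gracefulStopTimeout: 1800

  # querynode pod spec: Kubernetes must wait at least as long before force-killing the pod
  terminationGracePeriodSeconds: 1800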

weiliu1031 commented 11 months ago

@Uijeong97 you can see the relevant discussion here : https://discord.com/channels/1160323594396635310/1182918245590777956/1182918245590777956

Contrary to the claim, though, I don't see it working out of the box, i.e. graceful shutdown is not triggered by the SIGTERM from Kubernetes.

Instead, I added a piece of code in the preStop hook of the querynode deployment where I find the process ID of milvus and manually send it a SIGTERM. I have observed that after this the node starts the graceful shutdown process and moves all of its segments out to other nodes.

Although this is also not seamless: there is still a glitch where queries fail (~3 sec), but after that it recovers.

Check this out for more details on the preStop logic: https://github.com/milvus-io/milvus/blob/master/scripts/stop_graceful.sh

Some info about the ~3s search failures you mentioned above: the proxy keeps a cache of each channel's load location, and it is refreshed periodically, every 3s for now. If necessary, you can try reducing the interval via proxy.shardLeaderCacheInterval, but that gets expensive if you have many collections, so handle it with care.
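For example, an override in milvus.yaml might look like the sketch below; the unit is assumed to be seconds, matching the 3s default mentioned above, and the value 1 is only illustrative:

  proxy:
    # Shard-leader cache refresh interval: smaller values pick up a stopped node's
    # new shard leaders sooner, but cost more when there are many collections.
    shardLeaderCacheInterval: 1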

weiliu1031 commented 11 months ago

@weiliu1031 @sunby

Should we offer some scripts to disable auto-balance and gracefully move segments/channels out?

This could be really helpful

Good idea, I will try it.

Uijeong97 commented 11 months ago

@roy-akash

@Uijeong97 you can see the relevant discussion here : https://discord.com/channels/1160323594396635310/1182918245590777956/1182918245590777956

Contrary to the claim, though, I don't see it working out of the box, i.e. graceful shutdown is not triggered by the SIGTERM from Kubernetes.

Instead, I added a piece of code in the preStop hook of the querynode deployment where I find the process ID of milvus and manually send it a SIGTERM. I have observed that after this the node starts the graceful shutdown process and moves all of its segments out to other nodes.

Although this is also not seamless: there is still a glitch where queries fail (~3 sec), but after that it recovers.

Check this out for more details on the preStop logic: https://github.com/milvus-io/milvus/blob/master/scripts/stop_graceful.sh

Thanks for the answer, it was a great help. Is it possible to run the stop_graceful.sh script in the querynode's lifecycle preStop hook?

roy-akash commented 11 months ago

@Uijeong97 yeah you can do that, add the below to your deployment:

    lifecycle:
      preStop:
        exec:
          command:

xiaofan-luan commented 11 months ago

@Uijeong97 yeah you can do that, add the below to your deployment:

    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              #!/bin/sh
              set -ex
              exec 1<>/proc/1/fd/1; exec 2>&1;

              get_milvus_process() {
                milvus_process_id="$(ps -e | grep milvus | grep -v grep | awk '{print $1}')"
                printf '%s' "$milvus_process_id"
              }

              echo "Stopping milvus..."

              if [ -z "$(get_milvus_process)" ]; then
                echo "No milvus process"
                exit 0
              fi

              kill -15 "$(get_milvus_process)"

              while :
              do
                sleep 3
                if [ -z "$(get_milvus_process)" ]; then
                  echo "Milvus stopped"
                  break
                fi
              done

Great!

Uijeong97 commented 11 months ago

@Uijeong97 yeah you can do that, add the below to your deployment:

    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              #!/bin/sh
              set -ex
              exec 1<>/proc/1/fd/1; exec 2>&1;

              get_milvus_process() {
                milvus_process_id="$(ps -e | grep milvus | grep -v grep | awk '{print $1}')"
                printf '%s' "$milvus_process_id"
              }

              echo "Stopping milvus..."

              if [ -z "$(get_milvus_process)" ]; then
                echo "No milvus process"
                exit 0
              fi

              kill -15 "$(get_milvus_process)"

              while :
              do
                sleep 3
                if [ -z "$(get_milvus_process)" ]; then
                  echo "Milvus stopped"
                  break
                fi
              done

@roy-akash cc. @xiaofan-luan

Thanks for the code!!!👍

I have one more question.

The process of terminating a pod in K8S is as follows:

  1. Pod is set to the “Terminating” State and removed from the endpoints list of all Services
  2. preStop Hook is executed
  3. SIGTERM signal is sent to the pod
  4. Kubernetes waits for a grace period
  5. SIGKILL signal is sent to pod, and the pod is removed

When pod termination starts, the Milvus process is running. When the pod receives the SIGTERM signal (step 3), why doesn't Kubernetes just wait for the Milvus process to terminate during the grace period (step 4)?

Killing the Milvus process in a preStop hook seems like something the K8s termination flow should already be doing... I'm wondering why the K8s termination process doesn't terminate the Milvus process normally. Is there an expected reason?

roy-akash commented 11 months ago

The process of terminating a pod in K8S is as follows:

  1. Pod is set to the “Terminating” State and removed from the endpoints list of all Services
  2. preStop Hook is executed
  3. SIGTERM signal is sent to the pod
  4. Kubernetes waits for a grace period
  5. SIGKILL signal is sent to pod, and the pod is removed

When pod termination starts, the Milvus process is running. When the pod receives the SIGTERM signal (step 3), why doesn't Kubernetes just wait for the Milvus process to terminate during the grace period (step 4)?

There is a hard timeout limit enforced by Kubernetes, controlled by the deployment config terminationGracePeriodSeconds. Set this to a higher value like ~1800; by default it is 30s in Kubernetes.
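In the querynode Deployment's pod template that looks roughly like this (1800 is just the suggested value from above, not a required default):

  spec:
    template:
      spec:
        terminationGracePeriodSeconds: 1800   # Kubernetes default is 30s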

Killing the Milvus process in a preStop hook seems like something the K8s termination flow should already be doing... I'm wondering why the K8s termination process doesn't terminate the Milvus process normally. Is there an expected reason?

@weiliu1031 @xiaofan-luan: Yup, this seems like a miss; there should be automatic handling for this, IMO.

Uijeong97 commented 11 months ago

@roy-akash

There is a hard timeout limit enforced by Kubernetes, controlled by the deployment config terminationGracePeriodSeconds. Set this to a higher value like ~1800; by default it is 30s in Kubernetes.

Thank you so much, that makes a lot of sense. 🔥

@weiliu1031

If you deploy Milvus on K8s, make sure that the K8s setting terminationGracePeriodSeconds for the querynode is larger than the querynode's graceful stop timeout. This gives the querynode enough time to move all of its segments to other nodes; otherwise the querynode may be killed by K8s before it finishes its graceful stop.

Sorry for so many questions. I just wanted to ask one more! 🤣

Is there a way to check if terminationGracePeriodSeconds is greater than the graceful shutdown time of the query node?

Does the graceful shutdown time of the query node depend on each situation? Or is there any generalized standard?

weiliu1031 commented 11 months ago

@roy-akash

There is a hard timeout limit enforced by Kubernetes, controlled by the deployment config terminationGracePeriodSeconds. Set this to a higher value like ~1800; by default it is 30s in Kubernetes.

Thank you so much, that makes a lot of sense. 🔥

@weiliu1031

If you deploy Milvus on K8s, make sure that the K8s setting terminationGracePeriodSeconds for the querynode is larger than the querynode's graceful stop timeout. This gives the querynode enough time to move all of its segments to other nodes; otherwise the querynode may be killed by K8s before it finishes its graceful stop.

Sorry for so many questions. I just wanted to ask one more! 🤣

Is there a way to check if terminationGracePeriodSeconds is greater than the graceful shutdown time of the query node?

Does the graceful shutdown time of the query node depend on each situation? Or is there any generalized standard?

Milvus doesn't have a policy to check the K8s terminationGracePeriodSeconds setting, so it's recommended to set an appropriate value and check it by hand. The querynode's graceful shutdown time depends on the data scale on the querynode; if you have a very large dataset in the Milvus cluster, say hundreds of GBs, you may need to change this config. The default value is enough for most normal cases.
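If you want to check the configured value by hand, one way (assuming a standard Kubernetes deployment; substitute your own pod name and namespace) is:

  # Print the grace period configured on a querynode pod
  kubectl get pod <querynode-pod> -n <namespace> \
    -o jsonpath='{.spec.terminationGracePeriodSeconds}'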