milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Feature]: Enhancing Milvus Stability: Implementing Segment Recovery Checks in Query Node's Readiness Probe. #29392

Open · Uijeong97 opened this issue 8 months ago

Uijeong97 commented 8 months ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Is your feature request related to a problem? Please describe.

I would like to address the issue of service interruption in the Milvus query node.

In a Kubernetes (k8s) environment, operations like node drain and rollout restart are performed frequently. During these operations, interruptions occur in Milvus: queries fail for roughly 30 seconds, after which the service recovers.

These interruptions happen because the readiness probe for the new pod doesn't take the recovery operation into account. The new query node pod shifts to a Running state even though not all segments are fully replicated yet.

The Kubernetes Service then directs traffic to the new pod, but since the segments are not yet fully replicated, partial disruptions occur.

Describe the solution you'd like.

Therefore, I propose the following:

  1. Have the query node's readiness probe check whether segment recovery is in progress.
  2. Mark the pod Ready only after the recovery process is complete.

For this, Milvus would need a way (for example, in the Milvus CLI) to check whether segment recovery is in progress, similar to the recovery tasks in MinIO.

In MinIO, when a pod is lost, a HEAL task is performed to recover the distributed disk.
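To make the proposal concrete, here is a minimal sketch in Go (Milvus's own language) of what a recovery-aware readiness endpoint on the query node could look like. The /readyz path, the port, and the recovering flag are illustrative assumptions, not existing Milvus APIs.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// recovering would be flipped by the (hypothetical) segment-recovery logic:
// set to true when the node starts loading/balancing segments, back to false
// once every assigned segment is loaded and queryable.
var recovering atomic.Bool

func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if recovering.Load() {
		// Report NotReady so Kubernetes keeps this pod out of the Service
		// endpoints until segment recovery has finished.
		http.Error(w, "segment recovery in progress", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/readyz", readyHandler)
	http.ListenAndServe(":9091", nil)
}
```

A readinessProbe pointing at such an endpoint would keep traffic away from the pod until recovery is done.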

Describe an alternate solution.

The idea is to make the readiness probe delay longer. This is simple, but it's hard to establish a baseline for how long the delay should be.
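For comparison, this "wait longer" alternative only stretches the probe timings. A sketch with the Kubernetes Go API types is below; the endpoint and the numbers are placeholders, which is exactly the problem: there is no principled way to pick them.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Readiness probe that simply waits longer before and between checks.
	// 120s initial delay and 10 failures x 15s are arbitrary guesses; actual
	// recovery time depends on segment count and size, so any fixed value
	// can still be too short (or needlessly long).
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",
				Port: intstr.FromInt(9091),
			},
		},
		InitialDelaySeconds: 120,
		PeriodSeconds:       15,
		FailureThreshold:    10,
	}
	fmt.Printf("%+v\n", probe)
}
```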

Anything else? (Additional Context)

My team is considering adopting Milvus for our real-time services. However, we are hesitant to adopt it because of the lack of HA during these operations. If we had a readiness probe that reflects whether a node is still healing, adopting it would be a no-brainer.

xiaofan-luan commented 8 months ago

Hi @Uijeong97, thanks for your feedback. Usually, if a pod crashes or a querynode goes down, Milvus should trigger self-healing automatically.

  1. What is the Milvus version you are currently using?
  2. Do you still have any logs? If you have Prometheus and Loki deployed, we need to figure out the reason for the pod crash: could it be OOM, a panic, or something else?
  3. Can you give more details about your use case? For example: how many physical resources you have, your current cluster config, how many collections are in your cluster, and your request pattern for reads and writes (ideally with some sample code), so we can try to reproduce it in-house.

Uijeong97 commented 8 months ago

@xiaofan-luan

Thanks for the quick response.

This is not an issue with the automatic healing process itself. I am proposing a feature to check the recovery status so that requests are not routed to pods whose segments are not yet fully replicated.

  1. What is the Milvus version you are currently using?

I'm using Milvus version 2.3.3.

  2. Do you still have any logs? If you have Prometheus and Loki deployed, we need to figure out the reason for the pod crash: could it be OOM, a panic, or something else?

This is the result of a rollout restart of the query node while sending requests with Locust. The disruption lasts 30 seconds or less.

[screenshot: query failures during the query node rollout restart]

This is how I would expect the query node's readiness probe behavior to work:

  1. When you perform a rollout restart, a new queryNode pod is started. At this point, all existing pods are in Running or Terminating status.
  2. Segments are moved from the old queryNode pod to the new one. During this process, the new pod's readiness probe reports NotReady.
  3. When the healing process ends, the new queryNode pod changes to Running status, and the old pod transitions to Terminated.

This eliminates the possibility of requests being forwarded to a pod that is still healing.

It means the service can operate without interruption.
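If Milvus exposed such a recovery status, the readiness probe could be an exec-style check like the tiny sketch below; the /readyz endpoint is the hypothetical one sketched earlier in this thread, not something Milvus provides today.

```go
package main

import (
	"net/http"
	"os"
	"time"
)

func main() {
	// Exit 0 (Ready) only once the query node reports that segment recovery
	// has finished; any error or non-200 status keeps the pod NotReady.
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://localhost:9091/readyz")
	if err != nil {
		os.Exit(1)
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		os.Exit(1)
	}
	os.Exit(0)
}
```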

  3. Can you give more details about your use case? For example: how many physical resources you have, your current cluster config, how many collections are in your cluster, and your request pattern for reads and writes (ideally with some sample code), so we can try to reproduce it in-house.

[screenshot: query node resource configuration (in-memory replicas: 2)]

[screenshot: query coord resource configuration]

xiaofan-luan commented 8 months ago

The query coord usually doesn't need that many replicas; 2 replicas should be good enough.

For the querynodes, I would suggest 30 nodes with 16 cores / 32 GB, or 15 nodes with 32 cores / 64 GB. This could help reduce the number of RPCs.

LoveEachDay commented 8 months ago

@Uijeong97 The key question here is how we make sure that all the segments are loaded. When you roll out the querynodes, the new query node is scheduled with no segments loaded, while at the same time Kubernetes kills an old querynode which does have segments loaded.

We need to make sure the segments residing on the old querynode have been balanced out to the other querynodes before the old querynode is killed.

In our recent implementation, the querynode is notified by the SIGTERM signal and begins to balance its segments to other querynodes. We set up a preStop check to make sure the balancing has finished.
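As a rough illustration of that preStop idea, the hook could be declared as below with the Kubernetes Go API types; the wait-for-balance command is a placeholder for whatever check confirms the node no longer owns segments, not an actual Milvus tool.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// preStop hook sketch: block pod termination until segments have been
	// balanced away from this querynode (the command itself is hypothetical).
	lifecycle := &corev1.Lifecycle{
		PreStop: &corev1.LifecycleHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"/milvus/tools/wait-for-balance", "--timeout=600s"},
			},
		},
	}
	fmt.Printf("%+v\n", lifecycle)
}
```

In practice this only helps if terminationGracePeriodSeconds is long enough for the balancing to finish.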

xiaofan-luan commented 8 months ago

First, allow me to explain the rolling upgrade process for Milvus:

When we plan to take a machine offline:

  1. The Kubernetes Operator issues a SIGTERM signal.
  2. The querynode marks itself as entering a stopping state.
  3. The querycoord triggers a data balance across nodes; segments are balanced and transferred to other querynodes.
  4. The query routing cache is updated accordingly.
  5. Once all segments have been switched to another querynode, the original node is taken offline.

Therefore, even during this process, where some nodes may be in a terminating state, queries should not be affected.

In the current version's rolling upgrade process, there might be an issue where segments are not fully offloaded before the node is terminated by Kubernetes. We are actively working to resolve this.

Additionally, we are in the process of further improving stability and rolling upgrades. For more details, please refer to GitHub issue #29409.
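A condensed sketch of that shutdown ordering, with stand-in functions for the real querynode/querycoord calls (an illustration of the sequence described above, not Milvus's actual code):

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

func markStopping()          { /* tell querycoord this node is entering the stopping state */ }
func segmentsRemaining() int { /* ask querycoord how many segments still live here */ return 0 }

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh // Kubernetes (via the operator) starts taking this node offline

	markStopping() // querycoord begins balancing segments to other querynodes

	// Exit only after every segment has been handed over and the query
	// routing cache points at the new owners.
	for segmentsRemaining() > 0 {
		time.Sleep(time.Second)
	}
	os.Exit(0)
}
```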

xiaofan-luan commented 8 months ago

We'd be glad to have a call to learn more about your use case and the problem you are facing. Reach out to me at james.luan@zilliz.com.

Uijeong97 commented 8 months ago

@xiaofan-luan

Thank you for your answer. Sorry for the delay in responding.

In the current version's rolling upgrade process, there might be an issue where segments are not fully offloaded before the node is terminated by Kubernetes. We are actively working to resolve this.

It's understood that when segments are moved to another node, the query routing cache is updated.

Problem:

In such a scenario, to ensure uninterrupted request processing, queries would need to be served by both the terminating pods and the new pods as long as the segments are not fully replicated yet.

However, Kubernetes will not route requests to a pod in the Terminating state alongside a new Running pod: when a pod enters the Terminating state, it is removed from the Service endpoints.

Solution:

In my opinion, when a SIGTERM signal is received, the segments should be replicated rather than simply moved, similar to the approach used by the Horizontal Pod Autoscaler (HPA).

This approach prevents service disruption because the Service endpoints change only after the segments are fully replicated to the new pod.
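To spell out the proposed ordering, here is a small sketch with placeholder functions: each segment keeps a serving copy until routing has switched to the new node, so the endpoint change never races with data movement. This illustrates the proposal, not Milvus's current behavior.

```go
package main

type Segment struct{ ID int64 }

func loadOnNewNode(s Segment)    { /* replicate the segment to the new querynode */ }
func switchRouting(s Segment)    { /* point the query routing cache at the new copy */ }
func releaseOnOldNode(s Segment) { /* only now drop the copy on the terminating node */ }

func handover(segments []Segment) {
	for _, s := range segments {
		loadOnNewNode(s)    // 1. copy first; the old copy keeps serving queries
		switchRouting(s)    // 2. flip routing to the new copy
		releaseOnOldNode(s) // 3. release the old copy; there is never a moment
		//                        without a queryable copy of the segment
	}
}

func main() {
	handover([]Segment{{ID: 1}, {ID: 2}})
}
```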