milvus-io / milvus


[Bug]: Milvus Pulsar Recovery pod ingesting logs indefinitely #35048

Closed: Punit-Solanki closed this issue 2 months ago

Punit-Solanki commented 3 months ago


Environment

- Milvus version: 3.2.1
- Deployment mode (standalone or cluster): cluster (AKS, Kubernetes version 1.28.9)

Current Behavior

When an AKS cluster update takes place, all the pods are recreated. That in itself is fine; however, we have observed some odd behavior.

If the milvus-pulsar-recovery-0 pod reaches the "Running" state before the zookeeper or bookie pods do, the pulsar recovery pod starts emitting an enormous volume of log lines with the error:

"15:16:01.153 [bookkeeper-io-3-4] ERROR org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl - Unable to allocate memory"

During this time frame the application itself has no problems; everything keeps working. The log volume is the real issue: we are being billed heavily for the Log Analytics workspace that ingests these logs.

We have implemented a temporary fix: when we manually delete the milvus-pulsar-recovery-0 pod and let it be recreated, the error is resolved and the log flood stops immediately.

What we have tried so far:

We would also like to point out that memory utilization does not exceed the allocated size. Despite that, we doubled MaxDirectMemorySize, and we still see the same error:

"[bookkeeper-io-3-4] ERROR org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl - Unable to allocate memory"

Expected Behavior

Logs should not be emitted indefinitely when an AKS cluster update takes place. Additionally, if this were an error in our configuration, the errors and log flood should not stop simply because we recreate the pulsar recovery pod, yet that is exactly what happens.

Steps To Reproduce

A cluster update or recreation of all pods can trigger this behavior.

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 3 months ago

/assign @LoveEachDay
Can you help investigate this issue?

LoveEachDay commented 3 months ago

@Punit-Solanki You should add more memory to pulsar recovery.

Try changing the configmap for pulsar recovery, updating the following configuration:

    BOOKIE_MEM: |
      -Xms64m -Xmx64m -XX:MaxDirectMemorySize=1024m

Then restart the pulsar recovery pod.
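
For anyone landing here later, a minimal sketch of what the edited configmap might look like. The configmap name milvus-pulsar-recovery and the namespace milvus are assumptions inferred from the pod name milvus-pulsar-recovery-0; check the real names with kubectl get configmap first, and note that your configmap will usually carry other keys besides BOOKIE_MEM, which should be left untouched.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: milvus-pulsar-recovery    # assumed name, inferred from the pod name
      namespace: milvus               # assumed namespace
    data:
      BOOKIE_MEM: |
        -Xms64m -Xmx64m -XX:MaxDirectMemorySize=1024m

A running pod does not pick up configmap edits that are consumed as environment variables, so (as suggested above) the milvus-pulsar-recovery-0 pod has to be deleted and recreated by its StatefulSet for the change to take effect.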

Punit-Solanki commented 3 months ago

Hi @LoveEachDay Thank you for your response!

I have added the value to our configmap. Please keep the issue open for the next couple of days, as our cluster updates take place over the weekend.

I'll notify here if it works! Thanks again!

Punit-Solanki commented 3 months ago

Thank you so much @LoveEachDay

This worked for me. I no longer see pulsar recovery producing those unnecessary logs.

Just one request: could you let me know the exact reason for this log flood? Also, what did adding MaxDirectMemorySize to the configmap do here?

LoveEachDay commented 3 months ago

@Punit-Solanki Pulsar recovery monitors the ledger replication status periodically. If it detects an under-replicated ledger, it triggers a replication from one bookie to another. During this process, Pulsar recovery uses direct memory to read ledger data from the source bookie and write it to the target bookie.
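
On the second question (what the MaxDirectMemorySize change actually does): -Xms/-Xmx size the JVM heap, while -XX:MaxDirectMemorySize caps the off-heap direct buffer pool that BookKeeper's ByteBufAllocatorImpl, backed by Netty, allocates from for exactly the kind of ledger reads and writes described above. The "Unable to allocate memory" line is typically logged when that direct pool is exhausted, so raising the cap gives the replication traffic more headroom. A commented sketch of the suggested setting (the values are the ones from this thread, not universal defaults):

    # -Xms64m -Xmx64m                 -> JVM heap, initial and maximum size
    # -XX:MaxDirectMemorySize=1024m   -> cap on off-heap direct buffers used
    #                                    for ledger re-replication I/O
    # Comments sit outside the literal block so the flag string itself stays
    # a single clean line passed to the JVM.
    BOOKIE_MEM: |
      -Xms64m -Xmx64m -XX:MaxDirectMemorySize=1024m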

Punit-Solanki commented 2 months ago

You may proceed to close this. Thank you so much @LoveEachDay