milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: milvus-querynode memory leak #34674

Open FmKnight opened 1 month ago

FmKnight commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.4.3
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 25GB
- GPU: No
- Others:

Current Behavior

While running Milvus, we found that the milvus-querynode subservice appears to have a memory leak. We use Milvus for a knowledge-base service in our production environment, so this problem has a major impact on our service; any advice or solution would be greatly appreciated. The relevant details are as follows:

Data volume stored in Milvus: currently four collections with about 3.22 million entities in total; the most frequently used (and largest) collection holds about 3.15 million entities.

We initially placed the milvus-querynode subservice on a node with 32 GB of memory, of which roughly 25 GB is available. After running for a few days, the following error occurs:

The application layer uses Milvus hybrid search; when it fails, the program reports the following stack trace:

```
File "/mnt/ai/usrs/stephen/Langchain-Chatchat-prod-milvus24/server/knowledge_base/milvus_service/milvus_hybrid_search.py", line 684, in similarity_search_with_score_by_vector
    res = self.col.hybrid_search(
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/orm/collection.py", line 943, in hybrid_search
    resp = conn.hybrid_search(
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 122, in handler
    raise e from e
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 87, in handler
    return func(*args, **kwargs)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 850, in hybrid_search
    return self._execute_hybrid_search(
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 761, in _execute_hybrid_search
    raise e from e
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 754, in _execute_hybrid_search
    check_status(response.status)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/utils.py", line 63, in check_status
    raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: segment lacks[segment=450938165783864078]: channel not available[channel=by-dev-rootcoord-dml_12_450544666393159123v0])>
```
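
For context, a minimal sketch of the kind of call that raises this error, assuming a collection with one dense and one sparse vector field (the collection name, field names, and dimensions below are hypothetical, not taken from our code):

```python
from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect(host="localhost", port="19530")
col = Collection("kb_docs")  # hypothetical collection name

dense_vec = [0.1] * 768            # query embedding; dim is an assumption
sparse_vec = {17: 0.4, 532: 0.9}   # sparse query vector; also an assumption

reqs = [
    AnnSearchRequest(data=[dense_vec], anns_field="dense_vector",
                     param={"metric_type": "IP"}, limit=10),
    AnnSearchRequest(data=[sparse_vec], anns_field="sparse_vector",
                     param={"metric_type": "IP"}, limit=10),
]

# The reported 503 ("segment lacks ... channel not available") surfaces
# from this call when the querynode serving a segment has gone away.
res = col.hybrid_search(reqs, rerank=RRFRanker(), limit=10)
```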

When we checked at this point, the milvus-querynode subservice had already hung. Because we were in a hurry to restart the service, we did not keep the Milvus logs from this incident, but we did record milvus-querynode logs for reference; see the attachment: milvus-querynode.log

The config we use to deploy Milvus with Helm is attached: config.txt


Expected Behavior

We want to find a solution to the milvus-querynode problem described above. It looks like a memory problem, because memory usage stays high until the service finally crashes. I suspect it is related to the configuration and to the amount of text stored in Milvus, and possibly to data insertion: we have a background thread for incremental data that inserts one document (roughly a few dozen Milvus records) every 3 seconds, as sketched below.
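
For illustration, a minimal sketch of that background insert pattern (field names, dimensions, and the document source are hypothetical placeholders):

```python
import time

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
col = Collection("kb_docs")  # hypothetical collection name

def document_batches():
    # Placeholder generator: yields a few dozen rows per incremental document.
    while True:
        yield [{"text": f"chunk-{i}", "dense_vector": [0.1] * 768}
               for i in range(40)]

for rows in document_batches():
    col.insert(rows)  # many small inserts produce many small segments,
    time.sleep(3)     # which is why compaction behaviour matters here
```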

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

czs007 commented 1 month ago

@FmKnight it seems that the Milvus version is 2.3.4 instead of 2.4.4? [screenshot]

```
commit 7a192da870bdb1090adea79bb93ada9390c129fd
Author: congqixia <congqi.xia@zilliz.com>
Date:   Fri Dec 29 18:32:45 2023 +0800

    enhance: Bump version 2.3.4 & milvus-proto (#29598)
```
tedxu commented 1 month ago

The uploaded log may not be specific to the client's case. The log itself looks good.

Additionally, I've observed that the loading segments are relatively small. It would be beneficial if you could examine the datanode logs to verify the effectiveness of the compaction process.
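
For example, a quick scan of the datanode log for compaction-related lines (a hedged sketch; the log file name is a placeholder):

```python
# Print compaction-related lines from a datanode log dump so you can see
# whether compactions are actually running and completing.
with open("datanode.log", encoding="utf-8") as log:
    for line in log:
        if "compact" in line.lower():
            print(line.rstrip())
```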

FmKnight commented 1 month ago

> @FmKnight it seems that the Milvus version is 2.3.4 instead of 2.4.4? [screenshot]
>
> commit 7a192da
> Author: congqixia <congqi.xia@zilliz.com>
> Date: Fri Dec 29 18:32:45 2023 +0800
>
> enhance: Bump version 2.3.4 & milvus-proto (#29598)

@czs007 yes, that log is from Milvus 2.3.4; we upgraded to 2.4.4 at the end of June. Is there a memory leak? The milvus-querynode service crashes after a period of use. Can we adjust parameters to avoid this, or is there any other solution?
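
For reference, a quick way to confirm which server version the cluster actually reports, using the pymilvus utility helper (host/port are placeholders):

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # placeholder address
print(utility.get_server_version())  # e.g. "v2.4.4"
```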

FmKnight commented 1 month ago

> The uploaded log may not be specific to the client's case. The log itself looks good.
>
> Additionally, I've observed that the loading segments are relatively small. It would be beneficial if you could examine the datanode logs to verify the effectiveness of the compaction process.

@tedxu I have attached our current Milvus 2.4.4 datanode log below; please see whether it helps to find and solve the problem. Thanks. Attachment: milvus24-datanode-68977b5876-6hk2r.log

yanliang567 commented 1 month ago

/assign @tedxu @czs007
/unassign

tedxu commented 1 month ago

@FmKnight upon reviewing the datanode log, I haven't identified any issues.

It would be helpful if you could upload the complete log file. You can utilize the script found at deployments/export-log/export-milvus-log.sh to extract the most recent logs. Furthermore, please make sure to collect these logs from the right cluster.

yanliang567 commented 1 month ago

/assign @FmKnight

xiaofan-luan commented 1 month ago

You are saying that your querynode is OOM, but the log you offered is from the datanode.

1. Can we collect more detailed logs for all the nodes?
2. Can you run pprof and collect the memory usage info? One way to grab the profiles is sketched below.
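
A hedged sketch of collecting profiles, assuming the querynode's Go pprof endpoints are exposed on the metrics port (9091 by default in many deployments; verify the address, port, and path for your cluster first):

```python
import urllib.request

QUERYNODE = "http://milvus-querynode:9091"  # hypothetical in-cluster address

# Download the heap and goroutine profiles from the pprof HTTP endpoints.
for profile in ("heap", "goroutine"):
    url = f"{QUERYNODE}/debug/pprof/{profile}"
    with urllib.request.urlopen(url) as resp:
        with open(f"querynode-{profile}.prof", "wb") as out:
            out.write(resp.read())

# Analyze offline with: go tool pprof querynode-heap.prof
```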