Open · FmKnight opened this issue 4 months ago
@FmKnight it seems that the milvus version is 2.3.4 instead of 2.4.4 ?
commit 7a192da870bdb1090adea79bb93ada9390c129fd
Author: congqixia <congqi.xia@zilliz.com>
Date:   Fri Dec 29 18:32:45 2023 +0800

    enhance: Bump version 2.3.4 & milvus-proto (#29598)
The uploaded log may not be specific to the client's case. The log itself looks good.
Additionally, I've observed that the loading segments are relatively small. It would be beneficial if you could examine the datanode logs to verify the effectiveness of the compaction process.
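For example, one quick and purely illustrative way to look for compaction activity in a datanode log is to grep for compaction-related lines; the exact message wording varies between Milvus versions, so the pattern below is only a guess:

```bash
# Illustrative only: scan datanode logs for compaction-related entries.
# The file-name glob and the message wording are assumptions; adjust to your deployment.
grep -iE "compaction|compact" milvus-datanode-*.log | tail -n 50
```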
@czs007 yes, that log is from Milvus 2.3.4; we upgraded to 2.4.4 at the end of June. Is there a memory leak? The milvus-querynode service crashes after a period of use. Can we adjust parameters to avoid this, or is there any other way to prevent it?
@tedxu I have pasted our current Milvus 2.4.4 datanode log below; please see whether it helps identify and solve the problem. Thanks. milvus24-datanode-68977b5876-6hk2r.log
/assign @tedxu @czs007 /unassign
@FmKnight upon reviewing the datanode log, I haven't identified any issues.
It would be helpful if you could upload the complete log file. You can utilize the script found at deployments/export-log/export-milvus-log.sh to extract the most recent logs. Furthermore, please make sure to collect these logs from the right cluster.
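A typical invocation looks something like the sketch below; the flag names are from memory and may differ between Milvus versions, so please check the usage text at the top of the script first:

```bash
# Assumed flags: -i Milvus instance/release name, -n Kubernetes namespace,
# -p local directory for the exported logs. Verify against the script itself.
bash deployments/export-log/export-milvus-log.sh -i my-release -n default -p ./milvus-log
```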
/assign @FmKnight
You are saying that your querynode is OOM, but the log you offered is from the datanode. Could you collect more detailed logs for all the nodes?
Is there an existing issue for this?
Environment
Current Behavior
While using Milvus, our company found that the milvus-querynode subservice appears to have a memory leak. Because we have put Milvus behind our knowledge-base service in production, this problem has a serious impact on our service, so I would be grateful for your advice and a solution. The relevant details are as follows:
Data volume stored in Milvus: there are currently four collections with a total of about 3.22 million entities; the most heavily used collection, which is also the largest, holds about 3.15 million entities.
We initially allocated the milvus-querynode subservice to a machine node with 32 GB of memory, of which approximately 25 GB was available. After running for a few days, the following error occurred:
The application-layer program uses Milvus hybrid search; when it fails, it reports the following stack trace:

File "/mnt/ai/usrs/stephen/Langchain-Chatchat-prod-milvus24/server/knowledge_base/milvus_service/milvus_hybrid_search.py", line 684, in similarity_search_with_score_by_vector
    res = self.col.hybrid_search(
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/orm/collection.py", line 943, in hybrid_search
    resp = conn.hybrid_search(
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 122, in handler
    raise e from e
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/decorators.py", line 87, in handler
    return func(*args, **kwargs)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 850, in hybrid_search
    return self._execute_hybrid_search(
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 761, in _execute_hybrid_search
    raise e from e
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 754, in _execute_hybrid_search
    check_status(response.status)
File "/mnt/ai/environment/milvus-2.4/venv/lib/python3.10/site-packages/pymilvus/client/utils.py", line 63, in check_status
    raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: segment lacks[segment=450938165783864078]: channel not available[channel=by-dev-rootcoord-dml_12_450544666393159123v0])>
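For reference, the failing call corresponds roughly to a pymilvus 2.4 hybrid search like the minimal sketch below; the connection details, field names, vectors, and search parameters are placeholders rather than the values actually used by the application:

```python
# Minimal pymilvus 2.4 hybrid-search sketch; all names and parameters are placeholders.
from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect(host="127.0.0.1", port="19530")
col = Collection("knowledge_base")  # placeholder collection name

dense_req = AnnSearchRequest(
    data=[[0.1] * 768],             # placeholder dense query vector
    anns_field="dense_vector",      # placeholder vector field name
    param={"metric_type": "IP", "params": {"nprobe": 10}},
    limit=10,
)
sparse_req = AnnSearchRequest(
    data=[{1: 0.5, 42: 0.3}],       # placeholder sparse query vector
    anns_field="sparse_vector",     # placeholder vector field name
    param={"metric_type": "IP"},
    limit=10,
)

# The MilvusException above ("segment lacks ... channel not available") is raised
# from this call when the querynode serving those segments has gone down.
res = col.hybrid_search([dense_req, sparse_req], RRFRanker(), limit=10)
```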
When we checked the milvus-querynode subservice at that point, it had already hung. Because we were in a hurry to restart the service, we did not keep the Milvus logs from that incident, but we have recorded milvus-querynode logs for reference; see the attachment: milvus-querynode.log
The config we use to deploy Milvus with Helm is in the attachment: config.txt
Attachments: milvus-querynode.log, config.txt
Expected Behavior
We want to find a solution to this problem: how can the milvus-querynode issue described above be resolved? It looks like a memory problem, because memory usage stays high and the service eventually crashes. I suspect it is related to the configuration and to the amount of text stored in Milvus, and possibly to data insertion: we have a background thread for incremental data that inserts one document, about a few dozen Milvus records, every 3 seconds (a rough sketch of this ingest pattern is shown below).
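For clarity, here is a minimal, purely illustrative sketch of that ingest pattern (one small batch roughly every 3 seconds); the connection details, collection name, and the fetch_next_document_rows() helper are hypothetical placeholders, not the real application code:

```python
# Illustrative sketch only: one small batch of rows inserted roughly every 3 seconds.
import time
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")
col = Collection("knowledge_base")  # placeholder collection name


def fetch_next_document_rows():
    # Placeholder producer: in the real application this would return a few dozen
    # rows (dicts matching the collection schema) for the next document.
    return []


while True:
    rows = fetch_next_document_rows()
    if rows:
        # Frequent tiny inserts create many small segments, which increases
        # compaction work and querynode load; batching larger inserts and
        # relying on auto-flush is usually gentler on memory.
        col.insert(rows)
    time.sleep(3)
```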
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response