milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: When BulkInsert performs large data import tasks, there is a chance of encountering the error: "connection reset by peer: importing data failed." #34975

Open lentitude2tk opened 4 months ago

lentitude2tk commented 4 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.6
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): xxx
- SDK version(e.g. pymilvus v2.0.0rc2): xxx
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. The user imports data; each request contains 80 GB of data in multiple NumPy files.
  2. While the import reads from a Tencent COS bucket, it occasionally fails with the message: "read: connection reset by peer: importing data failed."

Expected Behavior

When a network error occurs, BulkInsert should first retry internally. Currently, even when the import is more than halfway complete, a single "connection reset by peer" failure forces a rollback, and the whole BulkInsert has to be restarted.
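To illustrate the kind of internal retry being requested, here is a minimal sketch in Python. It is not Milvus code: `with_retry`, `TRANSIENT_ERRORS`, and the default attempt count and delay are all hypothetical names and values chosen for illustration. It retries an operation on transient connection errors with exponential backoff and only surfaces the error after the last attempt.

```python
import time

# Hypothetical set of errors treated as transient (assumption for this sketch).
TRANSIENT_ERRORS = (ConnectionResetError, ConnectionAbortedError, TimeoutError)


def with_retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient network error, back off and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                # Out of attempts: let the caller fail the import task.
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With something like this wrapped around each object read, a single reset would cost one extra request instead of a full rollback. The `sleep` parameter is injectable only so the backoff can be exercised without waiting.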

Steps To Reproduce

The error occurs only occasionally. The getObject-related operations within BulkInsert could retry when a network request fails.

failed to read utf32 bytes from numpy file, error: read tcp 10.140.0.112:58782->169.254.0.47:443: read: connection reset by peer: importing data failed
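Since the failure above happens in the middle of reading a file, one possible approach is to resume the read from the last consumed offset rather than restart the task. The sketch below is only an illustration under assumptions: `ResumableReader` and the `open_at(offset)` callback are hypothetical and stand in for whatever ranged read the storage client offers; nothing here reflects how Milvus actually reads NumPy files.

```python
class ResumableReader:
    """Sequential reader that reopens the stream at the current offset on reset."""

    def __init__(self, open_at, max_retries=3):
        # open_at(offset) is a hypothetical callback that returns a file-like
        # object positioned at `offset` (for example, backed by a ranged read).
        self._open_at = open_at
        self._max_retries = max_retries
        self._offset = 0
        self._stream = open_at(0)

    def read(self, size):
        retries = 0
        while True:
            try:
                chunk = self._stream.read(size)
                # Only advance the offset for bytes actually delivered.
                self._offset += len(chunk)
                return chunk
            except ConnectionResetError:
                if retries == self._max_retries:
                    raise
                retries += 1
                # Reopen from the last byte that was successfully consumed.
                self._stream = self._open_at(self._offset)
```

The design choice here is to track the offset in the reader itself, so that callers parsing the file never see the reset and already-parsed data is kept.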

Milvus Log

https://grafana.op.zilliz.com.cn/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22milvus-tc-ap-shanghai-1%5C%22,namespace%3D%5C%22milvus-in01-f5958ab56f80a01%5C%22,pod%3D~%5C%22in01-f5958ab56f80a01-milvus-.*%5C%22%7D%7C%3D%5C%22connection%20reset%20by%5C%22%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D

Anything else?

No response

yanliang567 commented 4 months ago

/assign @czs007
/unassign

bigsheeper commented 4 months ago

/assign