milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: failed to search: segment lacks[segment=450251867839771751]: channel not available[channel=by-dev-rootcoord-dml_5_450251867852668093v0]", #33882

Open ReganWz opened 5 months ago

ReganWz commented 5 months ago

Is there an existing issue for this?

Environment

- Milvus version: GPU 2.4.1
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2): go 2.3.x
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: A10
- Others:

Current Behavior

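A single Milvus GPU machine with 6 A10 cards. The single Milvus node holds 40 million entities in one collection, using the default partition. It crashes at a concurrency of 100-150: after sustaining 300-400 QPS for a while (about 10 minutes), errors start to appear.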

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

milvus_error.log

Anything else?

none

yanliang567 commented 5 months ago

=> failed to search: raft inner error: CUDA error encountered at: file=/go/src/github.com/milvus-io/milvus/cmakebuild/3rdpartydownload/raft-src/cpp/include/raft/core/resource/devicememoryresource.hpp line=143: call='cudaMemGetInfo(&freesize, &totalsize)', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered

Sounds like a CUDA memory issue. @liliu-z, please help take a look.

/assign @liliu-z /unassign