milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: "Run Milvus with GPU Support Using Docker Compose" fails to start #36355

Open ligh2012 opened 1 month ago

ligh2012 commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.11-gpu
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.6
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory:
- GPU: 4060 Ti
- Others:
nvidia-driver: 555
cuda-toolkit: 12.4

Current Behavior

After running docker-compose up -d, the milvus-standalone container exits:

| CONTAINER ID | IMAGE | COMMAND | CREATED | STATUS | PORTS | NAMES |
| --- | --- | --- | --- | --- | --- | --- |
| 618abe86ae40 | milvusdb/milvus:v2.4.11-gpu | "/tini -- milvus run…" | 14 minutes ago | Exited (134) 17 seconds ago | | milvus-standalone |
| 736c52bdc4a6 | minio/minio:RELEASE.2023-03-20T20-16-18Z | "/usr/bin/docker-ent…" | 14 minutes ago | Up 14 minutes (healthy) | 0.0.0.0:9000-9001->9000-9001/tcp, :::9000-9001->9000-9001/tcp | milvus-minio |
| 920059ac3e97 | quay.io/coreos/etcd:v3.5.5 | "etcd -advertise-cli…" | 14 minutes ago | Up 14 minutes (healthy) | 2379-2380/tcp | milvus-etcd |
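The `Exited (134)` status can be decoded with the usual convention that an exit status above 128 means the process was killed by signal number (status - 128). The following is a minimal sketch of that arithmetic; the function name `describe_exit_status` is hypothetical and not part of Docker or Milvus.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the platform's symbolic name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(134))
```

On Linux, signal 6 is SIGABRT, so a status of 134 is consistent with the `SIGABRT: abort` line in the Milvus log.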

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/common/raft/integration/raft_initialization.cc line=53: call='cudaGetDeviceCount(&result)', Reason=cudaErrorSymbolNotFound:named symbol not found
Obtained 6 stack frames
#1 in /milvus/lib/libknowhere.so: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0xd5 [0x7fa72136dcc5]
#2 in /milvus/lib/libknowhere.so(+0x811a7a) [0x7fa7211a1a7a]
#3 in /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fa7478ceee8]
#4 in /milvus/lib/libknowhere.so: raft_knowhere::initialize_raft(raft_knowhere::raft_configuration const&) +0x6d [0x7fa7217de6fd]
#5 in /milvus/lib/libknowhere.so: knowhere::KnowhereConfig::SetRaftMemPool() +0x49 [0x7fa721342d79]
#6 in milvus: runtime.asmcgocall.abi0 +0x68 [0x1e9f0a8]

SIGABRT: abort
PC=0x7fa7478cb9fc m=19 sigcode=18446744073709551610
signal arrived during cgo execution
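The large `sigcode` value in the log is easier to read as a signed number. The sketch below assumes the runtime printed a signed 64-bit `si_code` as unsigned; the helper name `as_signed_64` is hypothetical.

```python
def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


print(as_signed_64(18446744073709551610))  # prints -6
```

On Linux, an `si_code` of -6 is `SI_TKILL`, meaning the signal was sent by the process itself via `tkill`/`tgkill`. That fits a C++ `std::terminate` calling `abort()` after the uncaught `raft::cuda_error`, rather than a signal delivered from outside the container.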

Anything else?

No response

yanliang567 commented 1 month ago

/assign @Presburger /unassign

xiaofan-luan commented 1 month ago

Could you tell us which GPU you are running and which CUDA version you are on?

Presburger commented 1 month ago

@ligh2012 Check whether the GPU is accessible from inside the container, and verify that the installation described in this NVIDIA guide is complete.
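One way to check GPU visibility from inside a container is to ask the CUDA driver library directly. The following is a rough sketch, assuming a Python interpreter is available in the container (the Milvus image may not ship one) and that the driver library is exposed as `libcuda.so.1`. It uses the driver API calls `cuInit` and `cuDeviceGetCount`, not the runtime call `cudaGetDeviceCount` that failed in the log, so it only confirms that the driver is reachable. The function name `cuda_device_count` is hypothetical.

```python
import ctypes


def cuda_device_count():
    """Return the GPU count seen by the CUDA driver, or None if it is unusable."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        # Driver library not visible, e.g. the GPU was not passed into the container.
        return None
    if libcuda.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        return None
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value


print(cuda_device_count())
```

A result of `None` or `0` would suggest the container runtime is not exposing the GPU, which points at the NVIDIA container setup rather than at Milvus itself.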

Presburger commented 1 month ago

Hi @ligh2012, first try this command on the host to see whether it produces the expected output.

docker run --runtime=nvidia --rm -it nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi