milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: "Run Milvus with GPU Support Using Docker Compose" failed to start. #36355

Open ligh2012 opened 2 months ago

ligh2012 commented 2 months ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.4.11-gpu
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.6
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory:
- GPU: 4060ti
- Others:
nvidia-driver: 555
cuda-toolkit: 12.4

Current Behavior

After docker-compose up -d, the milvus-standalone container exits with code 134 while minio and etcd stay healthy:

CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS                        PORTS                                                           NAMES
618abe86ae40   milvusdb/milvus:v2.4.11-gpu                "/tini -- milvus run…"   14 minutes ago   Exited (134) 17 seconds ago                                                                   milvus-standalone
736c52bdc4a6   minio/minio:RELEASE.2023-03-20T20-16-18Z   "/usr/bin/docker-ent…"   14 minutes ago   Up 14 minutes (healthy)       0.0.0.0:9000-9001->9000-9001/tcp, :::9000-9001->9000-9001/tcp   milvus-minio
920059ac3e97   quay.io/coreos/etcd:v3.5.5                 "etcd -advertise-cli…"   14 minutes ago   Up 14 minutes (healthy)       2379-2380/tcp                                                   milvus-etcd

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/common/raft/integration/raft_initialization.cc line=53: call='cudaGetDeviceCount(&result)', Reason=cudaErrorSymbolNotFound:named symbol not found
Obtained 6 stack frames
#1 in /milvus/lib/libknowhere.so: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0xd5 [0x7fa72136dcc5]
#2 in /milvus/lib/libknowhere.so(+0x811a7a) [0x7fa7211a1a7a]
#3 in /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fa7478ceee8]
#4 in /milvus/lib/libknowhere.so: raft_knowhere::initialize_raft(raft_knowhere::raft_configuration const&) +0x6d [0x7fa7217de6fd]
#5 in /milvus/lib/libknowhere.so: knowhere::KnowhereConfig::SetRaftMemPool() +0x49 [0x7fa721342d79]
#6 in milvus: runtime.asmcgocall.abi0 +0x68 [0x1e9f0a8]

SIGABRT: abort
PC=0x7fa7478cb9fc m=19 sigcode=18446744073709551610
signal arrived during cgo execution
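
Two numbers in this report are easier to read once decoded: the container status "Exited (134)" and the sigcode printed in the log. The sketch below is only an illustration and not part of the original report; the helper names are hypothetical. It assumes the usual POSIX convention that an exit status of 128 + N means the process was killed by signal N, and that the log prints sigcode as an unsigned 64-bit value.

```python
import signal


def signal_from_exit_code(exit_code: int) -> str:
    """Map a 128 + N exit status to the name of signal N (POSIX convention)."""
    if exit_code <= 128:
        raise ValueError("exit code does not encode a signal")
    return signal.Signals(exit_code - 128).name


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Values copied from the report above.
print(signal_from_exit_code(134))            # signal behind "Exited (134)"
print(as_signed_64(18446744073709551610))    # sigcode as a signed number
```

Under these assumptions, 134 corresponds to signal 6 (SIGABRT), which matches the "SIGABRT: abort" line, and the sigcode is -6, which on Linux is SI_TKILL, the code used when a process raises a signal on itself (as abort() does). Both point at the uncaught raft::cuda_error rather than at an external kill.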

Anything else?

No response

yanliang567 commented 2 months ago

/assign @Presburger
/unassign

xiaofan-luan commented 2 months ago

Could you give us some clues about which GPU you are running on and which CUDA version you are using?

Presburger commented 2 months ago

@ligh2012 Check whether the GPU can be accessed properly from inside the container, and verify that the installation described in this NVIDIA guide is complete.
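
One thing worth ruling out alongside the runtime installation is whether the compose service actually requests a GPU. The sketch below assumes the compose file follows the Compose specification's GPU reservation layout (deploy.resources.reservations.devices with driver "nvidia" and the "gpu" capability) and that it has already been loaded into a plain dict. The function name and the sample service are hypothetical, and the reporter's real compose file is not shown in this issue.

```python
def has_nvidia_gpu_reservation(service: dict) -> bool:
    """Return True if a compose service dict reserves at least one NVIDIA GPU."""
    devices = (
        service.get("deploy", {})
        .get("resources", {})
        .get("reservations", {})
        .get("devices", [])
    )
    return any(
        dev.get("driver") == "nvidia" and "gpu" in dev.get("capabilities", [])
        for dev in devices
    )


# Hypothetical 'standalone' service, shaped like a Compose GPU reservation.
standalone = {
    "image": "milvusdb/milvus:v2.4.11-gpu",
    "deploy": {
        "resources": {
            "reservations": {
                "devices": [
                    {"driver": "nvidia", "capabilities": ["gpu"], "device_ids": ["0"]}
                ]
            }
        }
    },
}

print(has_nvidia_gpu_reservation(standalone))
print(has_nvidia_gpu_reservation({"image": "minio/minio"}))
```

A reservation being present only shows that the request is made; it does not prove that the NVIDIA runtime on the host can satisfy it.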

Presburger commented 2 months ago

Hi @ligh2012, you can first run this command on the host to see whether it produces the expected output.

docker run --runtime=nvidia --rm -it nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi
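
If the same check needs to run from a script, it could be wrapped roughly as follows. This is a sketch only, with hypothetical helper names. It reuses the image and the --runtime=nvidia flag from the command above, drops -it because a script usually has no TTY attached, and assumes that a zero exit status from nvidia-smi means the GPU is visible inside the container.

```python
import subprocess

# Image taken from the command suggested in this thread.
IMAGE = "nvidia/cuda:11.8.0-runtime-ubuntu22.04"


def build_probe_command(image: str = IMAGE) -> list[str]:
    """Build the docker invocation that runs nvidia-smi inside a throwaway container."""
    return ["docker", "run", "--runtime=nvidia", "--rm", image, "nvidia-smi"]


def gpu_visible_in_container(image: str = IMAGE, timeout: float = 300.0) -> bool:
    """Run the probe; True only if nvidia-smi exits with status 0."""
    try:
        result = subprocess.run(
            build_probe_command(image),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # docker is not installed, or the image pull / container run hung.
        return False
    return result.returncode == 0
```

Calling gpu_visible_in_container() on the host returns False when docker is missing, the NVIDIA runtime is not registered, or nvidia-smi fails inside the container; in those cases the stderr of the manual command above is the more informative thing to post here.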

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.