Open ligh2012 opened 1 month ago
/assign @Presburger /unassign
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: 2.4.11-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.6
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
- GPU: 4060ti
- Others: nvidia-driver:555 cuda-toolkit:12.4
Current Behavior
After docker-compose up -d, the milvus-standalone container exits with status 134:
CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS                        PORTS                                                           NAMES
618abe86ae40   milvusdb/milvus:v2.4.11-gpu                "/tini -- milvus run…"   14 minutes ago   Exited (134) 17 seconds ago                                                                   milvus-standalone
736c52bdc4a6   minio/minio:RELEASE.2023-03-20T20-16-18Z   "/usr/bin/docker-ent…"   14 minutes ago   Up 14 minutes (healthy)       0.0.0.0:9000-9001->9000-9001/tcp, :::9000-9001->9000-9001/tcp   milvus-minio
920059ac3e97   quay.io/coreos/etcd:v3.5.5                 "etcd -advertise-cli…"   14 minutes ago   Up 14 minutes (healthy)       2379-2380/tcp                                                   milvus-etcd
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/common/raft/integration/raft_initialization.cc line=53: call='cudaGetDeviceCount(&result)', Reason=cudaErrorSymbolNotFound:named symbol not found
Obtained 6 stack frames
#1 in /milvus/lib/libknowhere.so: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0xd5 [0x7fa72136dcc5]
#2 in /milvus/lib/libknowhere.so(+0x811a7a) [0x7fa7211a1a7a]
#3 in /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fa7478ceee8]
#4 in /milvus/lib/libknowhere.so: raft_knowhere::initialize_raft(raft_knowhere::raft_configuration const&) +0x6d [0x7fa7217de6fd]
#5 in /milvus/lib/libknowhere.so: knowhere::KnowhereConfig::SetRaftMemPool() +0x49 [0x7fa721342d79]
#6 in milvus: runtime.asmcgocall.abi0 +0x68 [0x1e9f0a8]
SIGABRT: abort
PC=0x7fa7478cb9fc m=19 sigcode=18446744073709551610 signal arrived during cgo execution
Anything else?
No response
Could you tell us which GPU you are running on and which CUDA version you are using?
@ligh2012 Check whether the GPU can be accessed properly from inside the container, and verify that the installation described in this NVIDIA guide is complete.
Hi @ligh2012, you can first try this command on the host to see whether it produces the expected output.
docker run --runtime=nvidia --rm -it nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi
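As a side note for reading the report, the two numbers in it (the "Exited (134)" status and the sigcode in the Go runtime dump) can be decoded with a few lines of Python. This is only a sketch: the helper names decode_exit_status and as_signed64 are made up for illustration, and it assumes the usual convention that an exit status above 128 means "killed by signal (status - 128)" and that the Go runtime prints the signal code as an unsigned 64-bit value.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Map an exit status to a signal name, assuming the 128+N convention."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return "unknown signal"
    return f"exit code {status}"


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Status from the container listing, sigcode from the runtime dump
print(decode_exit_status(134))
print(as_signed64(18446744073709551610))
```

On Linux this yields SIGABRT for status 134 and -6 for the sigcode. As far as I know, -6 is SI_TKILL on Linux, which is consistent with the process aborting itself after the uncaught raft::cuda_error rather than being killed from outside.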