milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Stop GPU index on CPU machine more user friendly instead of milvus crash #27589

Open binbinlv opened 11 months ago

binbinlv commented 11 months ago

Is there an existing issue for this?

Environment

- Milvus version: master-20231007-80eb5434-gpu
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka):    all
- SDK version(e.g. pymilvus v2.0.0rc2): 
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

A GPU index can be created successfully on a CPU machine, and can even be searched:

>>> index_param = {"index_type": "GPU_IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}
>>> collection.create_index("float_vector", index_param, index_name="index_name_1")
Status(code=0, message=)
>>> default_search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
>>> limit = 10
>>> nq = 1
>>> collection.load()
>>> res = collection.search(vectors[:nq], "float_vector", default_search_params, limit, "int64 >= 0")
>>>
>>> res[0].ids
[0, 1372, 114, 900, 5283, 5652, 8776, 6182, 3621, 4557]

Expected Behavior

A GPU index should not be created successfully on a CPU machine; the request should fail with an error.
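
The expected guard could take the form of a pre-flight check that rejects GPU index types when no usable GPU exists. A minimal sketch of that idea (the function name and the set of GPU index types here are illustrative, not Milvus API):

```python
# Hypothetical pre-flight check: reject GPU-only index types when no
# usable CUDA device exists, instead of accepting the request and
# crashing later during the build.
GPU_INDEX_TYPES = {"GPU_IVF_FLAT", "GPU_IVF_PQ"}  # illustrative subset

def check_index_request(index_type: str, gpu_available: bool) -> None:
    """Raise early if a GPU-only index is requested without a GPU."""
    if index_type in GPU_INDEX_TYPES and not gpu_available:
        raise ValueError(
            f"index type {index_type!r} requires a GPU, "
            "but no usable CUDA device was found"
        )
```

With such a check in the create-index path, the client would get an immediate error response rather than a deferred crash.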

Steps To Reproduce

  1. Deploy Milvus (GPU image) on a CPU machine
  2. Create a collection and build an index:
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)
import numpy as np
import random

connections.connect()

dim = 128
int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
bool_field = FieldSchema(name="bool", dtype=DataType.BOOL)
# Defined but not included in the schema below:
string_field = FieldSchema(name="string", dtype=DataType.VARCHAR, max_length=65535)
json_field = FieldSchema(name="json_field", dtype=DataType.JSON)
float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[int64_field, float_field, bool_field, float_vector])
collection = Collection("test_search_collection_binbin_tmp_0", schema=schema)

# Insert 10k rows of random data.
nb = 10000
vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
res = collection.insert([
    [i for i in range(nb)],
    [np.float32(i) for i in range(nb)],
    [np.bool_(i) for i in range(nb)],
    vectors,
])

# Request a GPU index -- on a CPU machine this still succeeds.
index_param = {"index_type": "GPU_IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}
collection.create_index("float_vector", index_param, index_name="index_name_1")

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22gpu-cpu-machine-wssbe.*%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

No response

yanliang567 commented 11 months ago

/assign @liliu-z @Presburger /unassign

liliu-z commented 11 months ago

This is a GPU image, so we can create a GPU index. And the case only involved 10K 128-dim vectors, which didn't trigger index building at all, so this is as expected.

Presburger commented 11 months ago

The data size is too small to trigger the GPU build stage.

binbinlv commented 11 months ago

Will try a bigger data size.

yanliang567 commented 11 months ago

> This is a GPU image, so we can create a GPU index. And the case only involved 10K 128-dim vectors, which didn't trigger index building at all, so this is as expected.

What is the size/rule to trigger the GPU index build? @liliu-z

Presburger commented 11 months ago

@yanliang567 small data can also trigger an index build, but you should flush, create the index, then load; the data will then be sealed.

binbinlv commented 11 months ago

When inserting 5M rows and then creating a GPU_IVF_FLAT index on a CPU machine, Milvus crashed with the following error in the log:

[2023/10/13 02:51:59.738 +00:00] [DEBUG] [config/etcd_source.go:141] ["etcd refreshConfigurations"] [prefix=by-dev/config] [endpoints="[gpu-cpu-machine-qfljr-etcd:2379]"]
F20231013 02:51:59.877173    94 raft_utils.cc:24] [KNOWHERE][gpu_device_manager][milvus] CUDA error encountered at: file=/go/src/github.com/milvus-io/milvus/cmake_build/thirdparty/knowhere/knowhere-src/src/common/raft/raft_utils.cc line=22: call='cudaGetDeviceCount(&device_counts)', Reason=cudaErrorInsufficientDriver:CUDA driver version is insufficient for CUDA runtime version
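
The fatal call in the log is `cudaGetDeviceCount` aborting inside knowhere when the driver is insufficient. A non-fatal probe for the same information can be sketched in Python via `ctypes` (the library names tried are assumptions; the point is to return 0 instead of aborting when the CUDA runtime or driver is missing or broken):

```python
import ctypes

def cuda_device_count() -> int:
    """Return the number of usable CUDA devices, or 0 when the CUDA
    runtime/driver is missing or broken -- instead of a fatal abort."""
    for name in ("libcudart.so", "libcudart.so.12", "libcudart.so.11.0"):
        try:
            libcudart = ctypes.CDLL(name)
            break
        except OSError:
            continue
    else:
        return 0  # no CUDA runtime on this machine
    count = ctypes.c_int(0)
    # cudaGetDeviceCount returns cudaSuccess (0) only on success;
    # errors such as cudaErrorInsufficientDriver are non-zero.
    if libcudart.cudaGetDeviceCount(ctypes.byref(count)) != 0:
        return 0
    return count.value
```

A probe like this, run before accepting a GPU index request, would distinguish "no usable GPU" from a hard failure at build time.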

binbinlv commented 11 months ago

Could we stop GPU index creation on a CPU machine in a more user-friendly way, e.g. report an error up front instead of crashing Milvus?

liliu-z commented 11 months ago

> What is the size/rule to trigger the GPU index build? @liliu-z

No size/rule; it's just that the data is still in a growing segment.

liliu-z commented 11 months ago

> Could we stop GPU index creation on a CPU machine in a more user-friendly way, e.g. report an error up front instead of crashing Milvus?

Makes sense to catch the exception and throw it out so that indexCoord can retry. @Presburger can you help take a look?
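
The suggested behavior (catch the failure, surface an error, and let indexCoord retry) could look roughly like this; `build_fn`, the result shape, and the retry count are hypothetical illustrations, not Milvus internals:

```python
def build_index_with_retry(build_fn, max_retries=3):
    """Run an index build, turning exceptions into a retriable
    error result instead of a process crash."""
    last_err = None
    for _ in range(max_retries):
        try:
            return {"status": "ok", "index": build_fn()}
        except RuntimeError as err:  # e.g. a surfaced CUDA failure
            last_err = err
    return {"status": "failed", "reason": str(last_err)}
```

The coordinator could then inspect the failure reason and either retry on another node or report the error back to the client.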

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

binbinlv commented 9 months ago

keep open, remove stale

RakeshRaj97 commented 3 months ago

Can we load a collection whose index was built with GPU_IVF_FLAT onto a CPU node that has more DRAM?
