milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Index type GPU_CAGRA in Milvus v2.4.0-rc.1 is not working #32100

Closed · xionghuaidong closed this 4 months ago

xionghuaidong commented 7 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.4.0-rc.1-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.0
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: 64 CPUs, 251 GiB Memory
- GPU: Two Tesla P100 GPUs, with 16 GiB GPU Memory each
- Others:

Current Behavior

The build_milvus_index.py script failed with the exception: RuntimeError: test vector search 10 vectors, returned 0 vectors

Expected Behavior

The build_milvus_index.py script should succeed with the message: test vector search 10 vectors, returned 10 vectors

Steps To Reproduce

1. Follow the instructions in [Install Milvus Cluster with Docker Compose](https://milvus.io/docs/install_standalone-docker-compose-gpu.md) to launch a Milvus instance.

2. Run the [generate_vectors.py](https://gist.github.com/xionghuaidong/8b1482c32daaddbd14302ec0e558f579) script to generate 10,000 vectors in the ``task/`` directory and upload the files to MinIO at ``myminio/a-bucket/tasks/``.

3. Run the [build_milvus_index.py](https://gist.github.com/xionghuaidong/93ea09f424866c658050d99446139e6e) script to build and search the Milvus index (see the sketch below).
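
For reference, a minimal sketch of how a GPU_CAGRA index can be created with pymilvus v2.4 (the field names, dimensionality, and parameter values below are illustrative assumptions, not the actual settings in the gist):

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

# Hypothetical schema: an int64 primary key plus a float vector field.
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="entity_id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=768)

# GPU_CAGRA index; build_algo is the parameter discussed later in this thread.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="GPU_CAGRA",
    metric_type="L2",
    params={
        "intermediate_graph_degree": 64,
        "graph_degree": 32,
        "build_algo": "IVF_PQ",
    },
)

client.create_collection(
    collection_name="demo_collection",
    schema=schema,
    index_params=index_params,
)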

Milvus Log

milvus.log

Anything else?

No response

xiaofan-luan commented 7 months ago

/assign @Presburger

Presburger commented 7 months ago

@Presburger

Presburger commented 7 months ago

Can you share the output of the nvidia-smi command? @xionghuaidong

xionghuaidong commented 7 months ago

@Presburger Here is the nvidia-smi output.

$ nvidia-smi
Wed Apr 10 15:31:14 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   29C    P0    27W / 250W |    124MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:84:00.0 Off |                    0 |
| N/A   25C    P0    24W / 250W |    128MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Presburger commented 7 months ago

@xionghuaidong, hi, is it possible to upgrade the NVIDIA driver to 535 or newer?

xionghuaidong commented 7 months ago

> @xionghuaidong, hi, is it possible to upgrade the NVIDIA driver to 535 or newer?

@Presburger It's a bit complicated, since the development machine is shared among a few developers.

I tested the CPU HNSW index and the other GPU indexes (GPU_IVF_PQ, GPU_IVF_FLAT, GPU_BRUTE_FORCE); they all work. For GPU_CAGRA, index building seems OK according to the log, but searching does not work.

Presburger commented 7 months ago

> @xionghuaidong, hi, is it possible to upgrade the NVIDIA driver to 535 or newer?
>
> @Presburger It's a bit complicated, since the development machine is shared among a few developers.
>
> I tested the CPU HNSW index and the other GPU indexes (GPU_IVF_PQ, GPU_IVF_FLAT, GPU_BRUTE_FORCE); they all work. For GPU_CAGRA, index building seems OK according to the log, but searching does not work.

Hi, can you share the search params you used when trying GPU_CAGRA?

xionghuaidong commented 7 months ago

> @xionghuaidong, hi, is it possible to upgrade the NVIDIA driver to 535 or newer?
>
> @Presburger It's a bit complicated, since the development machine is shared among a few developers. I tested the CPU HNSW index and the other GPU indexes (GPU_IVF_PQ, GPU_IVF_FLAT, GPU_BRUTE_FORCE); they all work. For GPU_CAGRA, index building seems OK according to the log, but searching does not work.
>
> Hi, can you share the search params you used when trying GPU_CAGRA?

I'm using the default search params for GPU_CAGRA:

query_vectors = [
    [0.041732933] * self._vector_dimensions,
]
result = self._milvus_client.search(
    collection_name=self._milvus_collection,
    data=query_vectors,
    limit=self._test_vector_search_limit,    # set to 10
    output_fields=[self._entity_id_field_name],
)

See the _test_vector_search method in build_milvus_index.py.
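
For completeness, this is roughly what passing explicit GPU_CAGRA search params would look like instead of relying on the defaults (a sketch only; the parameter values are illustrative, not what the script actually uses):

search_params = {
    "params": {
        "itopk_size": 128,   # size of the intermediate result buffer kept during search
        "search_width": 4,   # number of parent nodes expanded per iteration
        "team_size": 0,      # 0 lets CAGRA pick the team size automatically
    }
}
result = self._milvus_client.search(
    collection_name=self._milvus_collection,
    data=query_vectors,
    limit=self._test_vector_search_limit,
    output_fields=[self._entity_id_field_name],
    search_params=search_params,
)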

Presburger commented 7 months ago

@xionghuaidong hi, try changing this line

"build_algo": "IVF_PQ",

to

"build_algo": "NN_DESCENT",

On some GPUs, building the CAGRA graph with IVF_PQ can be very slow.
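
A minimal sketch of the suggested change, assuming the index is (re)created through MilvusClient (the collection and field names and the other parameter values are illustrative, not taken from the gist):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="GPU_CAGRA",
    metric_type="L2",
    params={
        "intermediate_graph_degree": 64,
        "graph_degree": 32,
        "build_algo": "NN_DESCENT",  # was "IVF_PQ"
    },
)
client.create_index(collection_name="demo_collection", index_params=index_params)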

xionghuaidong commented 7 months ago

> @xionghuaidong hi, try changing this line
>
> "build_algo": "IVF_PQ",
>
> to
>
> "build_algo": "NN_DESCENT",
>
> On some GPUs, building the CAGRA graph with IVF_PQ can be very slow.

@Presburger Hi, thanks for your response.

I ran my build_milvus_index.py script with "build_algo": "NN_DESCENT", and the Milvus server crashed with a core dump.

Here is the log. milvus.log

xionghuaidong commented 6 months ago

@Presburger Hi, is there any progress?

xionghuaidong commented 6 months ago

@Presburger Hi, I used docker-compose up to launch the official milvusdb/milvus:v2.4.1-gpu docker image and encountered the following error.


milvus-standalone  | container_linux.go:251: starting container process caused "process_linux.go:346: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=11.8 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=53085 /home/docker/overlay2/c01840fc568052fef2908a29b27d7119e275e1c1c500fd35aeca54705d8961f7/merged]\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = titanrtx\\n\""
Error response from daemon: invalid header field value "oci runtime error: container_linux.go:251: starting container process caused \"process_linux.go:346: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=11.8 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=53085 /home/docker/overlay2/c01840fc568052fef2908a29b27d7119e275e1c1c500fd35aeca54705d8961f7/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = titanrtx\\\\n\\\"\"\n"

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.