milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark] milvus hang when create diskann index #27663

Closed: elstic closed this issue 1 year ago

elstic commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20231011-be980fbc
- Deployment mode(standalone or cluster): all
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task : fouramf-stable-1697054400 , id : 1

client log:

[2023-10-11 20:35:21,709 -  INFO - fouram]: [Base] Start inserting, ids: 0 - 49999, data size: 100,000 (base.py:323)
[2023-10-11 20:35:23,675 -  INFO - fouram]: [Time] Collection.insert run in 1.9651s (api_request.py:45)
[2023-10-11 20:35:23,679 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_UjJOzRos): 0 (base.py:485)
[2023-10-11 20:35:23,934 -  INFO - fouram]: [Base] Start inserting, ids: 50000 - 99999, data size: 100,000 (base.py:323)
[2023-10-11 20:35:25,840 -  INFO - fouram]: [Time] Collection.insert run in 1.9055s (api_request.py:45)
[2023-10-11 20:35:25,843 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_UjJOzRos): 0 (base.py:485)
[2023-10-11 20:35:25,930 -  INFO - fouram]: [Base] Total time of insert: 3.8706s, average number of vector bars inserted per second: 25835.7877, average time to insert 50000 vectors per time: 1.9353s (base.py:396)
[2023-10-11 20:35:25,931 -  INFO - fouram]: [Base] Start flush collection fouram_UjJOzRos (base.py:292)
[2023-10-11 20:35:27,951 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:458)
[2023-10-11 20:35:27,951 -  INFO - fouram]: [Base] Start release collection fouram_UjJOzRos (base.py:303)
[2023-10-11 20:35:27,953 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_UjJOzRos, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:444)

The index build has been stuck for more than 7 hours.

server:

fouramf-stable-54400-1-22-6938-etcd-0                             1/1     Running            1 (7h24m ago)    7h27m   10.104.5.118    4am-node12   <none>           <none>
fouramf-stable-54400-1-22-6938-etcd-1                             1/1     Running            1 (7h24m ago)    7h27m   10.104.18.229   4am-node25   <none>           <none>
fouramf-stable-54400-1-22-6938-etcd-2                             1/1     Running            0                7h27m   10.104.9.249    4am-node14   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-datacoord-78fc9bddbbmcckw   1/1     Running            1 (7h23m ago)    7h27m   10.104.6.80     4am-node13   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-datanode-587974d5c-hx966    1/1     Running            4 (7h9m ago)     7h27m   10.104.14.137   4am-node18   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-indexcoord-6ddb9b5d8szmtd   1/1     Running            0                7h27m   10.104.18.223   4am-node25   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-indexnode-6d64c55fbcp6kwx   1/1     Running            1 (7h23m ago)    7h27m   10.104.1.125    4am-node10   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-proxy-95cd6d769-h2r7h       1/1     Running            4 (7h11m ago)    7h27m   10.104.4.31     4am-node11   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-querycoord-f989bdc7dsdsdw   1/1     Running            4 (7h11m ago)    7h27m   10.104.17.78    4am-node23   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-querynode-787486b969zdc95   1/1     Running            1 (7h23m ago)    7h27m   10.104.6.81     4am-node13   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-rootcoord-669dbd54c4q9nb5   1/1     Running            3 (7h12m ago)    7h27m   10.104.6.78     4am-node13   <none>           <none>
fouramf-stable-54400-1-22-6938-minio-0                            1/1     Running            0                7h27m   10.104.5.117    4am-node12   <none>           <none>
fouramf-stable-54400-1-22-6938-minio-1                            1/1     Running            0                7h27m   10.104.18.232   4am-node25   <none>           <none>
fouramf-stable-54400-1-22-6938-minio-2                            1/1     Running            0                7h27m   10.104.24.98    4am-node29   <none>           <none>
fouramf-stable-54400-1-22-6938-minio-3                            1/1     Running            0                7h27m   10.104.9.248    4am-node14   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-bookie-0                    1/1     Running            0                7h27m   10.104.9.247    4am-node14   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-bookie-1                    1/1     Running            0                7h27m   10.104.13.42    4am-node16   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-bookie-2                    1/1     Running            0                7h27m   10.104.5.125    4am-node12   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-bookie-init-678cx           0/1     Completed          0                7h27m   10.104.6.79     4am-node13   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-broker-0                    1/1     Running            0                7h27m   10.104.4.30     4am-node11   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-proxy-0                     1/1     Running            0                7h27m   10.104.17.79    4am-node23   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-pulsar-init-mv5qm           0/1     Completed          0                7h27m   10.104.17.77    4am-node23   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-recovery-0                  1/1     Running            0                7h27m   10.104.17.81    4am-node23   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-zookeeper-0                 1/1     Running            0                7h27m   10.104.18.230   4am-node25   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-zookeeper-1                 1/1     Running            0                7h25m   10.104.4.44     4am-node11   <none>           <none>
fouramf-stable-54400-1-22-6938-pulsar-zookeeper-2                 1/1     Running            0                7h21m   10.104.6.101    4am-node13   <none>           <none>
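Several Milvus components in the listing above show non-zero restart counts. Whether those restarts are related to the hang is not established in this report, but they are easy to pull out of the listing. A minimal sketch, assuming the text comes from `kubectl get pods -o wide` with the usual NAME / READY / STATUS / RESTARTS column order; the helper name `restart_counts` is hypothetical, and the sample rows are copied from the listing:

```python
def restart_counts(listing):
    """Map pod name -> restart count from `kubectl get pods -o wide` rows."""
    counts = {}
    for line in listing.splitlines():
        fields = line.split()
        # Expect NAME READY STATUS RESTARTS ...; header and blank lines are skipped
        # because their fourth field is not a plain integer.
        if len(fields) >= 4 and fields[3].isdigit():
            counts[fields[0]] = int(fields[3])
    return counts


# Three rows copied from the pod listing in this report.
sample = """\
fouramf-stable-54400-1-22-6938-milvus-datanode-587974d5c-hx966    1/1     Running            4 (7h9m ago)     7h27m   10.104.14.137   4am-node18   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-indexcoord-6ddb9b5d8szmtd   1/1     Running            0                7h27m   10.104.18.223   4am-node25   <none>           <none>
fouramf-stable-54400-1-22-6938-milvus-indexnode-6d64c55fbcp6kwx   1/1     Running            1 (7h23m ago)    7h27m   10.104.1.125    4am-node10   <none>           <none>
"""

# Keep only pods that restarted at least once, most restarts first.
restarted = {pod: n for pod, n in restart_counts(sample).items() if n > 0}
for pod, n in sorted(restarted.items(), key=lambda kv: -kv[1]):
    print(f"{n} restarts  {pod}")
```

The indexnode restarted once about 7h23m before the listing was taken, so its earlier logs may need `kubectl logs --previous` to retrieve.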

index task : image

Expected Behavior

No response

Steps To Reproduce

1. Create a collection or use an existing collection.
2. Build a DISKANN index on the vector column.
3. Insert 100k vectors.
4. Flush the collection.
5. Build the index on the vector column again with the same parameters ==> fails (the build hangs)
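The steps above can be sketched with the pymilvus ORM API. This is not the fouram benchmark code: the collection name, host, port, vector dimension, and the 600-second timeout are all assumptions chosen for illustration. Only the field name `float_vector`, the index parameters, and the two 50,000-row batches come from the client log.

```python
import random

DIM = 128        # assumed; the report does not state the vector dimension
BATCH = 50_000   # matches the two 50,000-row inserts in the client log
INDEX_PARAMS = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}


def make_batch(start_id, count, dim=DIM):
    """Return column-ordered insert data: [ids, vectors]."""
    ids = list(range(start_id, start_id + count))
    vectors = [[random.random() for _ in range(dim)] for _ in range(count)]
    return [ids, vectors]


def reproduce(host="localhost", port="19530", name="diskann_repro", timeout=600):
    """Run the five steps against a live Milvus (hypothetical names and defaults)."""
    # Imported here so the helpers above can be used without pymilvus installed.
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(host=host, port=port)
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=DIM),
    ])
    collection = Collection(name, schema)                    # step 1
    collection.create_index("float_vector", INDEX_PARAMS)    # step 2
    for start in (0, BATCH):                                 # step 3
        collection.insert(make_batch(start, BATCH))
    collection.flush()                                       # step 4
    # Step 5: the timeout bounds the call so that a hang surfaces as an
    # exception instead of blocking for hours.
    collection.create_index("float_vector", INDEX_PARAMS, timeout=timeout)
```

Calling `reproduce()` requires a reachable Milvus deployment. On an affected image, step 5 would be expected to time out rather than return.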

Milvus Log

No response

Anything else?

Last night, all the DISKANN stability cases got stuck at the index build step.

Image master-20231011-07809880 ran successfully, so the problem was likely introduced between these two images: master-20231011-07809880 and master-20231011-be980fbc.

test env: 4am cluster , qa-milvus namespace

yanliang567 commented 1 year ago

/assign @xige-16 /unassign

xige-16 commented 1 year ago

[ERROR] [indexnode/task.go:356] ["failed to build index"] [error="failed to create index, C Runtime Exception: bad_function_call\n: internal code=2001: segcore error"]
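To find the same failure in other indexnode logs, the line above can be split into its bracketed fields. A rough sketch, assuming the line shape exactly as quoted here (raw Milvus logs normally carry a timestamp prefix, which this pattern does not handle); `LOG_RE` and `parse_line` are hypothetical names:

```python
import re

# Matches: [LEVEL] [caller] ["message"] [error="..."]  (error field optional)
LOG_RE = re.compile(
    r'^\[(?P<level>[A-Z]+)\] \[(?P<caller>[^\]]+)\] '
    r'\["(?P<msg>[^"]*)"\](?: \[error="(?P<error>.*)"\])?$'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


# The line quoted in this comment, kept as a raw string so that the literal
# backslash-n inside the error text is preserved.
line = r'[ERROR] [indexnode/task.go:356] ["failed to build index"] [error="failed to create index, C Runtime Exception: bad_function_call\n: internal code=2001: segcore error"]'

rec = parse_line(line)
code = re.search(r'internal code=(\d+)', rec["error"])
print(rec["level"], rec["caller"], rec["msg"])
print("internal code:", code.group(1) if code else "n/a")
```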

xige-16 commented 1 year ago

Testing showed that this image version does not support the DISKANN index; there is a problem with the default compilation parameters.

xige-16 commented 1 year ago

Caused by PR #27622.

elstic commented 1 year ago

Verified with image master-20231012-bf46ffd6: the issue has been fixed.