milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: It takes long time to build index for Float16Vector #34330

Open yhmo opened 2 months ago

yhmo commented 2 months ago

Environment

- Milvus version: v2.4.4
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Building an index on a Float16Vector field is very slow. For one million 768-dimensional vectors with HNSW {"M": 8, "efConstruction": 200}, the index build takes 1.5 hours, which is much slower than for FloatVector.

Expected Behavior

Building an index for Float16Vector should take no longer than building one for FloatVector.

Steps To Reproduce

1. Create a collection with a Float16Vector field
2. Insert 1,000,000 vectors of 768 dimensions into the collection
3. Create an HNSW index with {"M": 8, "efConstruction": 200}
4. Wait for the index build to finish

Milvus Log

No response

Anything else?

No response

yhmo commented 2 months ago

Test script:

import random
import time

import numpy as np

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

HOST = 'localhost'
PORT = '19530'

connections.connect(host=HOST, port=PORT)

F16_COLLECTION = "f16_col"

DIM = 768
METRIC_TYPE = "L2"

ID_FIELD = "id"
VECTOR_FIELD = "vector"

def gen_fp16_vectors(num):
    """Generate `num` random vectors; return the raw float lists and their float16 copies."""
    raw_vectors = []
    fp16_vectors = []
    for _ in range(num):
        raw_vector = [random.random() for _ in range(DIM)]
        raw_vectors.append(raw_vector)
        # Downcast to float16, the element type of a FLOAT16_VECTOR field
        fp16_vector = np.array(raw_vector, dtype=np.float16)
        fp16_vectors.append(fp16_vector)
    return raw_vectors, fp16_vectors

def create_collection():
    # Start from a clean collection on every run
    if utility.has_collection(F16_COLLECTION):
        utility.drop_collection(F16_COLLECTION)

    fields = [
        FieldSchema(name=ID_FIELD, dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name=VECTOR_FIELD, dtype=DataType.FLOAT16_VECTOR, dim=DIM),
    ]
    schema = CollectionSchema(fields=fields)
    # Constructing a Collection with a schema creates it on the server
    Collection(name=F16_COLLECTION, schema=schema)
    print(f"Collection '{F16_COLLECTION}' created")

def prepare_data():
    collection_16 = Collection(name=F16_COLLECTION)
    # Insert 100 batches of 10,000 vectors: 1,000,000 vectors in total
    count = 10000
    for i in range(100):
        # Only the float16 copies are inserted; the raw floats are not needed
        _, fp16_vectors = gen_fp16_vectors(count)
        collection_16.insert(data=[
            fp16_vectors,
        ])
        print(f"insert batch {i}")
    print("insert done")

    time.sleep(5)
    collection_16.flush()
    print("flush done")

    # Time only the index build: from create_index until the build completes
    start = time.time()
    index_params = {
        'metric_type': METRIC_TYPE,
        'index_type': "HNSW",
        'params': {"M": 8, "efConstruction": 200},
    }
    collection_16.create_index(field_name=VECTOR_FIELD, index_params=index_params)
    utility.wait_for_index_building_complete(collection_name=F16_COLLECTION)
    print("index done")

    end = time.time()
    print(f"fp16 index time cost: {end - start:.1f} seconds")

if __name__ == '__main__':
    create_collection()
    prepare_data()
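
Side note on data preparation: the script builds every vector element by element in a Python loop, which is likely slow for a million 768-dimensional vectors. This does not affect the measured index build time, but it lengthens each reproduction run. Below is a minimal sketch of a vectorized alternative, assuming only NumPy. `gen_fp16_batch` is a hypothetical helper name, not part of the script above or of pymilvus. It returns a list of per-row float16 arrays, the same layout the script passes to insert.

```python
import numpy as np

DIM = 768  # same dimensionality as the test script

def gen_fp16_batch(num, dim=DIM, seed=None):
    """Return `num` random vectors as a list of 1-D float16 arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Draw the whole batch at once in float32, then downcast to float16.
    batch = rng.random((num, dim), dtype=np.float32).astype(np.float16)
    # Split into one array per row, matching the list-of-arrays layout
    # produced by gen_fp16_vectors in the script.
    return list(batch)

vectors = gen_fp16_batch(4, seed=0)
print(len(vectors), vectors[0].dtype, vectors[0].shape)
```

In the script, the call `gen_fp16_vectors(count)` could then be swapped for `gen_fp16_batch(count)`, keeping in mind that this variant returns only the float16 vectors.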

yanliang567 commented 2 months ago

/assign @cqy123456
/unassign

cqy123456 commented 1 month ago

/assign @yhmo
Please try the latest 2.4 build. SIMD support for fp16 and bf16 was added in the latest 2.4.
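
For context on why SIMD support could matter here: a distance computation on fp16 data has to handle half-precision values for every vector pair, and a common pattern is to widen them to fp32 before doing the arithmetic. The sketch below illustrates that convert-then-compute pattern with NumPy only. It is an assumption-laden illustration, not the Milvus or Knowhere implementation, and `l2_sq_fp16` is a hypothetical name.

```python
import numpy as np

def l2_sq_fp16(a, b):
    """Squared L2 distance between two float16 vectors, accumulated in float32."""
    # Assumption: widen to float32 first, then subtract and accumulate.
    diff = a.astype(np.float32) - b.astype(np.float32)
    return float(np.dot(diff, diff))

rng = np.random.default_rng(0)
x32 = rng.random(768, dtype=np.float32)
y32 = rng.random(768, dtype=np.float32)
# float16 copies of the same vectors, as they would be stored in a Float16Vector field
x16, y16 = x32.astype(np.float16), y32.astype(np.float16)

d32 = float(np.dot(x32 - y32, x32 - y32))  # reference distance on the fp32 originals
d16 = l2_sq_fp16(x16, y16)                 # distance on the fp16 copies
print(f"fp32: {d32:.4f}  fp16: {d16:.4f}")
```

The two distances should agree closely, since the only difference is the rounding introduced when the inputs are stored as fp16.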