milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: It takes long time to build index for Float16Vector #34330

Closed by yhmo 2 months ago

yhmo commented 4 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.4.4
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Index building for Float16Vector is very slow. For one million 768-dimensional vectors with HNSW {"M": 8, "efConstruction": 200}, building the index takes 1.5 hours, which is much slower than for FloatVector.

Expected Behavior

The time to build an index for Float16Vector should be less than or equal to the time for FloatVector.
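
As a rough illustration of why Float16Vector would be expected to be no slower: the raw vector data in this test is half the size of the FloatVector equivalent. The sketch below only computes raw storage for the vectors. It ignores HNSW graph overhead and says nothing about build time itself, and the helper name is made up for illustration.

```python
import struct

NUM_VECTORS = 1_000_000  # from the report
DIM = 768                # from the report

def raw_size_bytes(fmt: str) -> int:
    """Raw storage for NUM_VECTORS x DIM values of a struct format code.

    'e' is IEEE 754 half precision (2 bytes), 'f' is single precision (4 bytes).
    """
    return NUM_VECTORS * DIM * struct.calcsize(fmt)

fp16_bytes = raw_size_bytes("e")
fp32_bytes = raw_size_bytes("f")
print(f"fp16: {fp16_bytes / 1e9:.3f} GB, fp32: {fp32_bytes / 1e9:.3f} GB")
```

Smaller input alone does not guarantee a faster build, since distance computation on fp16 values may need extra conversion work, but it is the basis for the expectation above.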

Steps To Reproduce

1. Create a collection with a Float16Vector field
2. Insert 1,000,000 768-dimensional vectors into the collection
3. Create an HNSW index with {"M": 8, "efConstruction": 200}
4. Wait for the index build to finish

Milvus Log

No response

Anything else?

No response

yhmo commented 4 months ago

Test script:

import random
import time

import numpy as np

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

HOST = 'localhost'
PORT = '19530'

connections.connect(host=HOST, port=PORT)

F16_COLLECTION = "f16_col"

DIM = 768
METRIC_TYPE = "L2"

ID_FIELD = "id"
VECTOR_FIELD = "vector"

def gen_fp16_vectors(num):
    raw_vectors = []
    fp16_vectors = []
    for _ in range(num):
        raw_vector = [random.random() for _ in range(DIM)]
        raw_vectors.append(raw_vector)
        fp16_vector = np.array(raw_vector, dtype=np.float16)
        fp16_vectors.append(fp16_vector)
    return raw_vectors, fp16_vectors

def create_collection():
    if utility.has_collection(F16_COLLECTION):
        utility.drop_collection(F16_COLLECTION)

    fields = [
        # Auto-generated INT64 primary key, so only the vector column is inserted.
        FieldSchema(name=ID_FIELD, dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name=VECTOR_FIELD, dtype=DataType.FLOAT16_VECTOR, dim=DIM),
    ]
    schema = CollectionSchema(fields=fields)
    Collection(name=F16_COLLECTION, schema=schema)
    print(f"Collection '{F16_COLLECTION}' created")

def prepare_data():
    collection_16 = Collection(name=F16_COLLECTION)
    count = 10000
    for i in range(100):
        raw_vectors, fp16_vectors = gen_fp16_vectors(count)
        collection_16.insert(data=[
            fp16_vectors,
        ])
        print(f"insert batch {i}")
    print("insert done")

    time.sleep(5)
    collection_16.flush()
    print("flush done")

    start = time.time()
    index_params = {
        'metric_type': METRIC_TYPE,
        'index_type': "HNSW",
        'params': {"M": 8, "efConstruction": 200},
    }
    collection_16.create_index(field_name=VECTOR_FIELD, index_params=index_params)
    utility.wait_for_index_building_complete(collection_name=F16_COLLECTION)
    print("index done")

    end = time.time()
    print(f"fp16 index time cost: {end-start} seconds")

if __name__ == '__main__':
    create_collection()
    prepare_data()

yanliang567 commented 4 months ago

/assign @cqy123456
/unassign

cqy123456 commented 3 months ago

/assign @yhmo
Please try the latest 2.4 release. SIMD support for fp16 and bf16 was added in the latest 2.4.

yhmo commented 2 months ago

Tested on Milvus v2.4.9 with the same script: the fp16 index build time drops to 446 seconds, about 12 times faster than the roughly 1.5 hours it took on v2.4.4.

fp16 index time cost: 446.4446921348572 seconds
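
For reference, the speedup can be checked with a little arithmetic. This is only a sketch: the v2.4.4 figure is the approximate "1.5 hours" from the original report, not a precise measurement, and the variable names are illustrative.

```python
# Approximate build time on v2.4.4, taken from the "1.5 hours" in the report.
V244_SECONDS = 1.5 * 3600
# Build time printed by the test script on v2.4.9.
V249_SECONDS = 446.4446921348572

speedup = V244_SECONDS / V249_SECONDS
print(f"speedup: {speedup:.1f}x")
```

Because the v2.4.4 number is rounded, the ratio should be read as "roughly an order of magnitude" rather than an exact factor.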