milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: some results have topK=35 when topK>35 #25095

Closed AlexeyIvanov8 closed 1 year ago

AlexeyIvanov8 commented 1 year ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.2.9
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): Kafka
- SDK version(e.g. pymilvus v2.0.0rc2): java sdk 2.2.5
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others: 

milvuscluster:
  image: milvusdb/milvus:v2.2.9
  components:
    dataCoord:
      replicas: 2
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 4
          memory: 4Gi
    dataNode:
      replicas: 4
      resources:
        limits:
          cpu: 16
          memory: 64Gi
        requests:
          cpu: 2
          memory: 4Gi
    indexCoord:
      replicas: 1
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 1
          memory: 1Gi
    indexNode:
      replicas: 4
      resources:
        limits:
          cpu: 32
          memory: 64Gi
        requests:
          cpu: 1
          memory: 4Gi
    queryCoord:
      replicas: 1
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 2
          memory: 2Gi
    queryNode:
      replicas: 14
      resources:
        limits:
          cpu: 8
          memory: 64Gi
        requests:
          cpu: 8
          memory: 4Gi
    proxy:
      replicas: 4
      resources:
        limits:
          cpu: 10
          memory: 16Gi
        requests:
          cpu: 4
          memory: 8Gi
      serviceType: LoadBalancer
    rootCoord:
      replicas: 1
      resources:
        limits:
          cpu: 8
          memory: 8Gi
        requests:
          cpu: 1
          memory: 1Gi
  config:
    minio:
      bucketName: larva
      rootPath: ""
      useSSL: false
    proxy:
      maxTaskNum: 8192
    queryNode:
      dataSync:
        flowGraph:
          maxQueueLength: 1024 # Maximum length of task queue in flowgraph
          maxParallelism: 1024 # Maximum number of tasks executed in parallel in the flowgraph
    scheduler:
      #receiveChanSize: 10240
      #unsolvedQueueSize: 10240
      # maxReadConcurrentRatio is the concurrency ratio of read task (search task and query task).
      # Max read concurrency would be the value of `runtime.NumCPU * maxReadConcurrentRatio`.
      # It defaults to 2.0, which means max read concurrency would be the value of runtime.NumCPU * 2.
      # Max read concurrency must be greater than or equal to 1, and less than or equal to runtime.NumCPU * 100.
      maxReadConcurrentRatio: 50.0 # (0, 100]
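      # Illustrative arithmetic (assumption, not from the original config): with the 8-CPU queryNode limit above,
      # and assuming the Go runtime sees those 8 CPUs, max read concurrency would be roughly 8 * 50 = 400,
      # well within the runtime.NumCPU * 100 cap.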
      cpuRatio: 120.0 # ratio used to estimate read task cpu usage.
    dataCoord:
      segment:
        maxSize: 512
  dependencies:
    kafka:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: true
        values:
          defaultReplicationFactor: 2
          replicaCount: 5
          numPartitions: 3
          resources:
            requests:
              cpu: 100m
              memory: 1G
            limits:
              cpu: 4
              memory: 8G
          persistence:
            accessMode: ReadWriteOnce
            enabled: true
            size: 30Gi
            storageClass: linstor-hdd
    etcd:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: false
        values:
          persistence:
            storageClass: linstor-hdd
            accessMode: ReadWriteOnce
            enabled: true
            size: 30Gi
          resources:
            limits:
              cpu: '4'
              memory: 16Gi
            requests:
              cpu: 200m
              memory: 1Gi
    storage:
      external: true
      endpoint: "minio:80"
      secretRef: "minio-secret"
      type: S3

Current Behavior

When I run multiple queries in parallel with topK > 35, and each query contains a batch of target vectors, some of the results come back with only 35 hits. Most queries return the full topK requested in the query.

Example: with 10 target vectors and topK=50, all results except one contain 50 vectors, and one result contains only 35: topKList = [50, 50, 50, 50, 35, 50, 50, 50, 50, 50]

The collection contains 200k vectors, loaded with 12 replicas, dim=512, index=HNSW;IP;M=16;efConstruction=8, ef=topK.

Query:

SearchParam{collectionName='collection', partitionNames='[]', metricType=IP, target vectors count=80, vectorFieldName='Vector', topK=50, nq=80, expr='', params='{"ef": 50}', consistencyLevel='EVENTUALLY', ignoreGrowing='false'}

Expected Behavior

All results contain exactly topK hits, matching the topK from the query.

Steps To Reproduce

1. Create a collection with 200k vectors, dim=512, and index=HNSW;IP;M=16;efConstruction=8, ef=limit
2. Load the collection with 12 replicas
3. Run 300 queries in parallel, each with 10 target vectors and topK=50 (see the sketch below)
4. In some of the query results, every target returns topK=50 except one, which returns only 35 hits. Most queries return topK=50 for every target.
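
For reference, a minimal sketch of such a parallel run with the Java SDK from Scala. This is a hedged sketch, not the original reproduction script: the connection details, collection and field names, and random target vectors are placeholders, and SearchResultsWrapper is assumed from io.milvus.response in java sdk 2.2.x (import paths may differ between SDK versions).

    import io.milvus.client.MilvusServiceClient
    import io.milvus.param.{ConnectParam, MetricType}
    import io.milvus.param.dml.SearchParam
    import io.milvus.common.clientenum.ConsistencyLevelEnum
    import io.milvus.response.SearchResultsWrapper

    import java.lang.{Float => JFloat}
    import scala.collection.JavaConverters._
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import scala.util.Random

    // Hypothetical reproduction sketch: fire many batched searches in parallel
    // and print any run whose per-target hit counts fall below the requested topK.
    object TopKRepro extends App {
      val dim  = 512
      val nq   = 10   // target vectors per request
      val topK = 50
      val runs = 300  // parallel requests

      // connection details are placeholders
      val client = new MilvusServiceClient(
        ConnectParam.newBuilder().withHost("localhost").withPort(19530).build())

      def randomVector(): java.util.List[JFloat] =
        Seq.fill(dim)(JFloat.valueOf(Random.nextFloat())).asJava

      def searchOnce(): Seq[Int] = {
        val params = SearchParam.newBuilder()
          .withCollectionName("collection")          // placeholder collection name
          .withMetricType(MetricType.IP)
          .withTopK(topK)
          .withParams("""{"ef": 50}""")
          .withVectors(Seq.fill(nq)(randomVector()).asJava)
          .withVectorFieldName("vector")             // placeholder field name
          .withConsistencyLevel(ConsistencyLevelEnum.EVENTUALLY)
          .build()
        // Count the hits returned for each target vector in the batch.
        val wrapper = new SearchResultsWrapper(client.search(params).getData.getResults)
        (0 until nq).map(i => wrapper.getIDScore(i).size())
      }

      val all = Future.sequence(Seq.fill(runs)(Future(searchOnce())))
      Await.result(all, 10.minutes).zipWithIndex.foreach { case (counts, run) =>
        if (counts.exists(_ < topK)) println(s"run $run hit counts: $counts")
      }
      client.close()
    }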

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 1 year ago

Did you try to run this request again? Can it be reproduced stably?

AlexeyIvanov8 commented 1 year ago

Yes, it reproduces consistently when I run multiple parallel queries. I tried to build a minimal example for reproduction, but the problem does not appear on small collections (about 5k vectors).

xiaofan-luan commented 1 year ago

So let's try to narrow this down to a minimal reproduction:

  1. Does it happen with 1 replica?
  2. Does it also happen with only one target vector?
  3. There is no filtering expression, right?
  4. Can you share the search code?

AlexeyIvanov8 commented 1 year ago
  1. I retested: yes, it also reproduces with 1 replica
  2. Yes, and for that vector topK is always 35
  3. Correct, there is no filtering
  4. It's Scala code; if needed, I can convert it to Java and share the other parts (connection, etc.):
    
    val params = SearchParam
          .newBuilder()
          .withCollectionName(collection)
          .withPartitionNames(List().asJava)
          .withExpr("")
          .withMetricType(MetricType.IP)
          .withTopK(50)
          .withParams(s"{\"ef\": 50}")
          .withVectors(targetVectors.map(_.asJava).asJava) // List<List<Float>>
          .withVectorFieldName("vector")
          .withConsistencyLevel(ConsistencyLevelEnum.EVENTUALLY)
          .build()

milvusClient.search(params)

xiaofan-luan commented 1 year ago

How large is the index file? Is there a chance we can get the binlog and index file? My guess is that this is related to the data and is a very rare scenario. We would need the vectors if they are not too large.

xiaofan-luan commented 1 year ago

@cydrain is there a possibility that knowhere returns fewer than topk results even though there is enough data?

xiaofan-luan commented 1 year ago

If you want to investigate yourself, I would recommend adding some logs here:

SearchOnSealedIndex and SearchOnGrowing. You should at least find out which part returned fewer results: the engine itself, or the Milvus side.

yanliang567 commented 1 year ago

/assign @cydrain /unassign

xiaofan-luan commented 1 year ago

Is there a possibility that you have duplicated primary keys in your database?

cydrain commented 1 year ago

Hi @AlexeyIvanov8, can you share your script to reproduce this issue ?

cydrain commented 1 year ago

@xiaofan-luan if no data is deleted in this collection, knowhere HNSW will always return 50 results when the search param sets topk=50. And in this issue it happens with 12 replicas. 12 replicas would generally return 12 * 50 = 600 results, and the final 50 results are returned after the reduce.

I can't imagine what would make Milvus return only 35 results for one query vector. I need the script to reproduce this issue on my machine and dig into it.
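
For illustration only, a toy Scala sketch of the duplicated-primary-key hypothesis raised above; this is not Milvus's actual reduce code. It shows how merging candidates and keeping only one hit per primary key can leave fewer than topK unique results when many candidates share IDs.

    // Toy reduce: merge candidates by score, keep one hit per primary key.
    case class Hit(id: Long, score: Float)

    def reduceTopK(candidates: Seq[Hit], topK: Int): Seq[Hit] = {
      val sorted = candidates.sortBy(-_.score)              // IP: higher score is better
      val seen = scala.collection.mutable.LinkedHashSet.empty[Long]
      sorted.filter(h => seen.add(h.id)).take(topK)         // drop duplicate primary keys
    }

    // 600 candidates but only 35 distinct IDs -> at most 35 hits survive the reduce.
    val candidates = Seq.tabulate(600)(i => Hit(i % 35, scala.util.Random.nextFloat()))
    println(reduceTopK(candidates, topK = 50).size)         // prints 35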

xiaofan-luan commented 1 year ago

Each search goes to only one replica. Could it be duplicate primary keys?

cydrain commented 1 year ago

It's possible: if most of the returned results' IDs are duplicates, only 35 of them may be unique.
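
If it helps to check that hypothesis, a hedged sketch of verifying whether the collection contains duplicate primary keys, using the Java SDK from Scala. Assumptions: the primary-key field name "id" and the expression are placeholders for the real schema, QueryParam and QueryResultsWrapper are assumed from io.milvus.param.dml and io.milvus.response, and for a 200k-row collection the query may need to be paged or split by expression.

    import io.milvus.param.dml.QueryParam
    import io.milvus.response.QueryResultsWrapper
    import scala.collection.JavaConverters._

    // Pull the primary keys back and compare the total count with the distinct count.
    val queryParam = QueryParam.newBuilder()
      .withCollectionName("collection")   // placeholder collection name
      .withExpr("id >= 0")                // placeholder expr over the primary-key field
      .withOutFields(List("id").asJava)
      .build()

    val resp    = milvusClient.query(queryParam)
    val wrapper = new QueryResultsWrapper(resp.getData)
    val ids     = wrapper.getFieldWrapper("id").getFieldData.asScala

    println(s"rows returned: ${ids.size}, distinct primary keys: ${ids.distinct.size}")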

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
