Closed AlexeyIvanov8 closed 1 year ago
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: 2.2.9
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): Kafka
- SDK version(e.g. pymilvus v2.0.0rc2): java sdk 2.2.5
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

milvuscluster:
  image: milvusdb/milvus:v2.2.9
  components:
    dataCoord:
      replicas: 2
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 4
          memory: 4Gi
    dataNode:
      replicas: 4
      resources:
        limits:
          cpu: 16
          memory: 64Gi
        requests:
          cpu: 2
          memory: 4Gi
    indexCoord:
      replicas: 1
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 1
          memory: 1Gi
    indexNode:
      replicas: 4
      resources:
        limits:
          cpu: 32
          memory: 64Gi
        requests:
          cpu: 1
          memory: 4Gi
    queryCoord:
      replicas: 1
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 2
          memory: 2Gi
    queryNode:
      replicas: 14
      resources:
        limits:
          cpu: 8
          memory: 64Gi
        requests:
          cpu: 8
          memory: 4Gi
    proxy:
      replicas: 4
      resources:
        limits:
          cpu: 10
          memory: 16Gi
        requests:
          cpu: 4
          memory: 8Gi
      serviceType: LoadBalancer
    rootCoord:
      replicas: 1
      resources:
        limits:
          cpu: 8
          memory: 8Gi
        requests:
          cpu: 1
          memory: 1Gi
  config:
    minio:
      bucketName: larva
      rootPath: ""
      useSSL: false
    proxy:
      maxTaskNum: 8192
    queryNode:
      dataSync:
        flowGraph:
          maxQueueLength: 1024 # Maximum length of task queue in flowgraph
          maxParallelism: 1024 # Maximum number of tasks executed in parallel in the flowgraph
      scheduler:
        #receiveChanSize: 10240
        #unsolvedQueueSize: 10240
        # maxReadConcurrentRatio is the concurrency ratio of read task (search task and query task).
        # Max read concurrency would be the value of `runtime.NumCPU * maxReadConcurrentRatio`.
        # It defaults to 2.0, which means max read concurrency would be the value of runtime.NumCPU * 2.
        # Max read concurrency must greater than or equal to 1, and less than or equal to runtime.NumCPU * 100.
        maxReadConcurrentRatio: 50.0 # (0, 100]
        cpuRatio: 120.0 # ratio used to estimate read task cpu usage.
    dataCoord:
      segment:
        maxSize: 512
  dependencies:
    kafka:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: true
        values:
          defaultReplicationFactor: 2
          replicaCount: 5
          numPartitions: 3
          resources:
            requests:
              cpu: 100m
              memory: 1G
            limits:
              cpu: 4
              memory: 8G
          persistence:
            accessMode: ReadWriteOnce
            enabled: true
            size: 30Gi
            storageClass: linstor-hdd
    etcd:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: false
        values:
          persistence:
            storageClass: linstor-hdd
            accessMode: ReadWriteOnce
            enabled: true
            size: 30Gi
          resources:
            limits:
              cpu: '4'
              memory: 16Gi
            requests:
              cpu: 200m
              memory: 1Gi
    storage:
      external: true
      endpoint: "minio:80"
      secretRef: "minio-secret"
      type: S3
Current Behavior
When I run multiple queries in parallel with topK > 35, and each query contains a batch of target vectors, some of the results contain only 35 hits. In most queries, every result contains as many hits as the requested topK.
Example: with 10 target vectors and topK=50, all results except one have 50 vectors and one result has 35 vectors: topKList = [50, 50, 50, 50, 35, 50, 50, 50, 50, 50]
The collection contains 200k vectors, is loaded with 12 replicas, and uses dim=512, index=HNSW;IP;M=16;efConstruction=8, ef=topK.
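For anyone trying to confirm the symptom on their own data, here is a minimal sketch of how a topKList like the one above could be computed and checked. It assumes the hits for each target vector have already been extracted from the search response into a plain `Seq[Seq[Long]]` of IDs. The `TopKCheck` object and its method names are hypothetical helpers, not part of the Milvus SDK.

```scala
object TopKCheck {
  /** Number of hits returned for each target vector, in request order. */
  def topKList(hitsPerVector: Seq[Seq[Long]]): Seq[Int] =
    hitsPerVector.map(_.size)

  /** Indices of the target vectors whose result has fewer hits than requested. */
  def shortResults(hitsPerVector: Seq[Seq[Long]], topK: Int): Seq[Int] =
    topKList(hitsPerVector).zipWithIndex.collect {
      case (count, idx) if count < topK => idx
    }
}
```

With the example from this report, `shortResults(hits, 50)` would point at the single target vector that came back with 35 hits.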
Query:
SearchParam{collectionName='collection', partitionNames='[]', metricType=IP, target vectors count=80, vectorFieldName='Vector', topK=50, nq=80, expr='', params='{"ef": 50}', consistencyLevel='EVENTUALLY', ignoreGrowing='false'}
Expected Behavior
Every result should contain topK hits, matching the topK requested in the query.
Steps To Reproduce
1. Create a collection with 200k vectors, dim=512 and index=HNSW;IP;M=16;efConstruction=8, ef=limit
2. Load the collection with 12 replicas
3. Run 300 queries in parallel with topK=50 and 10 vectors in each
4. In some queries, all results except one contain 50 hits and one result contains 35. In most queries, every result contains 50 hits.
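The parallel part of these steps could be driven by a small harness like the sketch below. It is only an illustration: the `Search` function type is a hypothetical stand-in for a wrapper around `milvusClient.search` that returns one list of hit IDs per target vector, and the 10-minute timeout is an arbitrary choice.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelRepro {
  /** Hypothetical search wrapper: one list of hit IDs per target vector. */
  type Search = Seq[Seq[Float]] => Seq[Seq[Long]]

  /** Runs every batch concurrently and returns (batch index, vector index, hit count)
    * for each result that came back with fewer than topK hits. */
  def findShort(batches: Seq[Seq[Seq[Float]]], topK: Int, search: Search): Seq[(Int, Int, Int)] = {
    val futures = batches.zipWithIndex.map { case (batch, b) =>
      Future {
        search(batch).zipWithIndex.collect {
          case (hits, v) if hits.size < topK => (b, v, hits.size)
        }
      }
    }
    // Future.sequence keeps the batch order, so the output is ordered by batch index.
    Await.result(Future.sequence(futures), 10.minutes).flatten
  }
}
```

Logging the offending target vectors from the returned indices would make it possible to replay them one at a time.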
Milvus Log
No response
Anything else?
No response
Did you try running this request again, and can it be reproduced reliably?
Yes, it reproduces reliably when I run multiple queries in parallel. I tried to build a standalone reproduction, but the problem did not appear on small collections (about 5k vectors).
So let's try to reproduce this with a minimal setup:
val params = SearchParam
.newBuilder()
.withCollectionName(collection)
.withPartitionNames(List().asJava)
.withExpr("")
.withMetricType(MetricType.IP)
.withTopK(50)
.withParams(s"{\"ef\": 50}")
.withVectors(targetVectors.map(_.asJava).asJava) // List<List<Float>>
.withVectorFieldName("vector")
.withConsistencyLevel(ConsistencyLevelEnum.EVENTUALLY)
.build()
milvusClient.search(params)
- I retested: yes, it also reproduces with 1 replica
- Yes, and for that vector the returned topK is always 35
- Yes
- It's Scala code. If needed, I can convert it to Java and share the other parts (connection, etc.):
val params = SearchParam
  .newBuilder()
  .withCollectionName(collection)
  .withPartitionNames(List().asJava)
  .withExpr("")
  .withMetricType(MetricType.IP)
  .withTopK(50)
  .withParams(s"{\"ef\": 50}")
  .withVectors(targetVectors.map(_.asJava).asJava) // List<List<Float>>
  .withVectorFieldName("vector")
  .withConsistencyLevel(ConsistencyLevelEnum.EVENTUALLY)
  .build()
milvusClient.search(params)
How large is the index file? Is there a chance we can get the binlog and index file? My guess is that this is related to the data and is a very rare scenario. We will need the vectors if they are not too large.
@cydrain is there a possibility that knowhere returns fewer than topk results even though there is enough data?
If you want to investigate this yourself, I recommend adding some logs in SearchOnSealedIndex and SearchOnGrowing. You should at least find out which part returned fewer results: the engine itself or the Milvus side.
/assign @cydrain /unassign
Is there a possibility that you have duplicated primary keys in your database?
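One way to answer this from the client side is to scan the primary keys that were inserted. The sketch below assumes the IDs are available as a plain `Seq[Long]` (for example, exported from the ingestion source). `DuplicateKeys` is a hypothetical helper, not a Milvus API.

```scala
object DuplicateKeys {
  /** Primary keys that occur more than once, mapped to their occurrence count. */
  def duplicates(ids: Seq[Long]): Map[Long, Int] =
    ids.groupBy(identity).collect {
      case (id, occurrences) if occurrences.size > 1 => id -> occurrences.size
    }
}
```

An empty map would suggest that duplicated keys are not the cause, at least for the IDs that were checked.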
Hi @AlexeyIvanov8, can you share your script to reproduce this issue?
@cydrain is there a possibility that knowhere returns fewer than topk results even though there is enough data?
@xiaofan-luan if no data has been deleted in this collection, knowhere HNSW will always return 50 results when the search param sets topk=50. In this issue, it happens with 12 replicas. 12 replicas will generally return 12 * 50 = 600 results, and the final 50 results are returned after the reduce.
I can't imagine what would make Milvus return only 35 results for one query vector. I need the script to reproduce this issue on my machine and dig into it.
Each search goes to only 1 replica. Could it be duplicated primary keys?
It's possible. If most of the returned results' IDs are duplicated, only 35 of them are unique.
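To illustrate this hypothesis, here is a simplified model of a reduce step that keeps one hit per primary key. It is a sketch of the idea discussed here, not Milvus's actual reduce code. The `Hit` type and `dedupe` function are hypothetical, and a higher score is assumed to be better, as with the IP metric.

```scala
object DedupReduce {
  final case class Hit(id: Long, score: Float)

  /** Keeps only the best-scoring hit for each primary key, best score first. */
  def dedupe(hits: Seq[Hit]): Seq[Hit] = {
    val seen = scala.collection.mutable.HashSet.empty[Long]
    hits
      .sortWith(_.score > _.score)  // stable sort, highest score first
      .filter(h => seen.add(h.id))  // add returns false for an ID already kept
  }
}
```

Under this model, 50 candidate hits in which 15 IDs repeat would collapse to 35 unique hits, which matches the count seen in this issue.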
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.