Closed mustfkeskin closed 3 years ago
Hi @mustfkeskin, if I understand your question correctly, you want to know how many documents share the same vector, right?
You can use an exists query to find all documents that have a value in the vector field.
Example:
{
  "query": {
    "exists": {
      "field": "imageVector"
    }
  }
}
Note: This feature is available only from ODFE 1.11.0.0 onward.
Let me explain my question with an example. One product can have multiple images, but some images of the same product can be duplicates.
productId=1, image_vector=[1,1]
productId=1, image_vector=[1,1]
productId=1, image_vector=[3,3]
productId=2, image_vector=[1,9]
productId=2, image_vector=[2,9]
I want to see which products contain a duplicate vector:
productId=1 --> True
productId=2 --> False
@mustfkeskin Since the knn_vector type doesn't support sorting, it is not possible to do a bucket selection aggregation on it directly. I am looking at other possible ways to get the expected result and will update here in a couple of days with suggestions. Hope that helps.
@mustfkeskin you can try the following approach. I only tested it for limited use cases, so please verify it before deploying to production. Since knn_vector cannot be used directly inside an aggregation, we can create a new field on the fly, "imageVectorHashCode", which concatenates all dimension values and hashes the result, and use that field for the aggregation.
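To make the idea concrete, here is a minimal client-side sketch in Python of the same fingerprinting logic, run on the five documents from the question. It is only an illustration, not the plugin's behavior: the function names `vector_key` and `products_with_duplicates` are hypothetical, and unlike the Painless script it uses the joined string itself as the key instead of its hash code, so distinct vectors cannot collide.

```python
from collections import Counter, defaultdict


def vector_key(vector):
    # Join every dimension with "#", mirroring the concatenation step of the
    # script in this thread. Assumption: the joined string is used directly
    # as the key (no hashCode()), which avoids hash collisions.
    return "#".join(str(float(x)) for x in vector)


def products_with_duplicates(docs):
    # Count how often each vector key occurs within each product.
    counts = defaultdict(Counter)
    for doc in docs:
        counts[doc["product_id"]][vector_key(doc["image_vector"])] += 1
    # A product has a duplicate if any of its vectors occurs at least twice.
    return {pid: any(n >= 2 for n in c.values()) for pid, c in counts.items()}


docs = [
    {"product_id": 1, "image_vector": [1, 1]},
    {"product_id": 1, "image_vector": [1, 1]},
    {"product_id": 1, "image_vector": [3, 3]},
    {"product_id": 2, "image_vector": [1, 9]},
    {"product_id": 2, "image_vector": [2, 9]},
]

print(products_with_duplicates(docs))  # {1: True, 2: False}
```

This only works when all documents fit in client memory; the aggregation does the same grouping on the cluster side.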
Example: insert the following documents
productId=1, image_vector=[1,1]
productId=1, image_vector=[1,1]
productId=1, image_vector=[3,3]
productId=2, image_vector=[1,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,10]
POST "myindex/_doc"
{
"image_vector": [1,1],
"product_id": 1
}
POST "myindex/_doc"
{
"image_vector": [1,1],
"product_id": 1
}
POST "myindex/_doc"
{
"image_vector": [3,3],
"product_id": 1
}
POST "myindex/_doc"
{
"image_vector": [1,9],
"product_id": 2
}
POST "myindex/_doc"
{
"image_vector": [2,9],
"product_id": 2
}
POST "myindex/_doc"
{
"image_vector": [2,9],
"product_id": 3
}
POST "myindex/_doc"
{
"image_vector": [2,9],
"product_id": 3
}
POST "myindex/_doc"
{
"image_vector": [2,9],
"product_id": 3
}
POST "myindex/_doc"
{
"image_vector": [2,9],
"product_id": 3
}
POST "myindex/_doc"
{
"image_vector": [2,10],
"product_id": 3
}
Expected output:
productId=1 --> True
productId=2 --> False
productId=3 --> True
I haven't spent time getting the final output in exactly the form you expect, but I could get this information using the aggregation below:
curl -X GET "localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "myindex": {
      "terms": {
        "field": "product_id"
      },
      "aggs": {
        "imageVectorHashCode": {
          "terms": {
            "script": "StringBuilder builder = new StringBuilder();for (float i : params._source[\"image_vector\"]){ builder.append(i);builder.append(\"#\");} return builder.toString().hashCode();",
            "min_doc_count": 2
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "imageVectorHashCode._bucket_count"
            },
            "script": {
              "inline": "params.count != 0"
            }
          }
        }
      }
    }
  }
}'
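If escaping the script inside a curl command gets awkward, the same request body can be assembled in a script instead. The following is a rough Python sketch that builds the body above as a dictionary; `build_duplicate_query` and its parameters are hypothetical names introduced here, and the aggregation names and script text are copied from the request above. It only constructs and prints the JSON and does not contact a cluster.

```python
import json


def build_duplicate_query(product_field="product_id", vector_field="image_vector"):
    # Painless script from this thread: concatenate every dimension with "#"
    # and return the hash code of the resulting string.
    script = (
        "StringBuilder builder = new StringBuilder();"
        'for (float i : params._source["' + vector_field + '"])'
        '{ builder.append(i);builder.append("#");} '
        "return builder.toString().hashCode();"
    )
    return {
        "query": {"match_all": {}},
        "aggs": {
            "myindex": {
                "terms": {"field": product_field},
                "aggs": {
                    # Keep only vector hashes shared by at least two documents.
                    "imageVectorHashCode": {
                        "terms": {"script": script, "min_doc_count": 2}
                    },
                    # Drop products that have no such hash bucket.
                    "min_bucket_selector": {
                        "bucket_selector": {
                            "buckets_path": {
                                "count": "imageVectorHashCode._bucket_count"
                            },
                            "script": {"inline": "params.count != 0"},
                        }
                    },
                },
            }
        },
    }


body = build_duplicate_query()
print(json.dumps(body, indent=2))
```

Letting `json.dumps` handle the quoting avoids hand-escaping the double quotes inside the script string.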
Output (aggregations only):
"aggregations" : {
  "myindex" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 3,
        "doc_count" : 4,
        "imageVectorHashCode" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "524933495",
              "doc_count" : 4
            }
          ]
        }
      },
      {
        "key" : 1,
        "doc_count" : 3,
        "imageVectorHashCode" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "-1218115168",
              "doc_count" : 2
            }
          ]
        }
      }
    ]
  }
}
You can modify the above logic based on your requirements. Hope this gives you some ideas and helps you solve your problem.
Note: I used a bucket selector to remove product IDs that don't have duplicates. You can remove it if you want a result for every product ID, and then determine true/false from whether the bucket count is > 0.
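To get from the aggregation response to the True/False form asked for in the question, a small post-processing step on the client is enough. Below is a hedged Python sketch; `duplicate_flags` is a hypothetical helper, and the sample input is a trimmed copy of the response shown above. Because the bucket selector drops products without duplicates, the optional `product_ids` argument (an assumption, supplied by the caller) is used to report those as False.

```python
def duplicate_flags(aggregations, product_ids=()):
    # Start with False for every product the caller knows about; products
    # filtered out by the bucket selector never appear in the response.
    flags = {pid: False for pid in product_ids}
    for bucket in aggregations["myindex"]["buckets"]:
        inner = bucket["imageVectorHashCode"]["buckets"]
        # Any inner bucket means some vector hash occurs at least twice.
        flags[bucket["key"]] = len(inner) > 0
    return flags


# Trimmed copy of the response from this thread.
aggregations = {
    "myindex": {
        "buckets": [
            {
                "key": 3,
                "doc_count": 4,
                "imageVectorHashCode": {
                    "buckets": [{"key": "524933495", "doc_count": 4}]
                },
            },
            {
                "key": 1,
                "doc_count": 3,
                "imageVectorHashCode": {
                    "buckets": [{"key": "-1218115168", "doc_count": 2}]
                },
            },
        ]
    }
}

print(duplicate_flags(aggregations, product_ids=[1, 2, 3]))
# {1: True, 2: False, 3: True}
```

The same function also works when the bucket selector is removed from the request, since products without duplicates then come back with an empty inner bucket list.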
This solution fits my case. I have one more question about the response JSON.
"key" : 1 means productId=1, and "doc_count" : 3 means this product has 3 image vectors.
"buckets" : [
  {
    "key" : "-1218115168",
    "doc_count" : 2
  }
]
Here "-1218115168" is the hash of the vector [1,1], and "doc_count" : 2 means this vector occurs 2 times.
Am I correct?
@mustfkeskin That's correct.
Thank you @VijayanB and @vamshin
Hello, I want to find how many productIds contain duplicate image vectors. How can I find this? My mappings are as follows