opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

Find Duplicate Vectors #335

Closed mustfkeskin closed 3 years ago

mustfkeskin commented 3 years ago

Hello, I want to find how many productIds contain duplicate image vectors. How can I find this? My mappings are as follows:

'mappings': {
            'properties': {
                'image_vector': {
                    'type': 'knn_vector',
                    'dimension': 1024
                },
                'url': {
                    'type': 'keyword'
                },
                'brandId': {
                    'type': 'long'
                },
                'categoryId': {
                    'type': 'long'
                },
                'productId': {
                    'type': 'long'
                }
            }
        }
vamshin commented 3 years ago

Hi @mustfkeskin, if I understand your question correctly, you want to know how many documents share the same vector, right?

You can use an exists query to find all documents that have a value in the vector field.

Example:

{
  "query": {
    "exists": {
      "field": "image_vector"
    }
  }
}

Note: This feature is available only from ODFE 1.11.0.0 onward.

mustfkeskin commented 3 years ago

Let me explain my question with an example. One product can have multiple images, but some images of the same product can be duplicated.

productId=1, image_vector=[1,1]
productId=1, image_vector=[1,1]
productId=1, image_vector=[3,3]
productId=2, image_vector=[1,9]
productId=2, image_vector=[2,9]

I want to see which products contain duplicate vectors:
productId=1 --> True
productId=2 --> False

VijayanB commented 3 years ago

@mustfkeskin Since the knn_vector type doesn't support sorting, it is not possible to do a bucket selection aggregation on it. I am looking at other possible ways to get the expected result and will update here in a couple of days with suggestions. Hope that helps.

VijayanB commented 3 years ago

@mustfkeskin You can try the following approach. I only tested it for limited use cases, so please verify it before deploying it to production. Since knn_vector cannot be used directly inside an aggregation, we can create a new field on the fly, "imageVectorHashCode", which concatenates all dimension values into a string and takes its hash code, and then use that field for the aggregation.

Example: Insert documents
productId=1, image_vector=[1,1]
productId=1, image_vector=[1,1]
productId=1, image_vector=[3,3]
productId=2, image_vector=[1,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,9]
productId=3, image_vector=[2,10]

POST "myindex/_doc"
{
    "image_vector": [1,1],
    "product_id": 1
}

POST "myindex/_doc"
{
    "image_vector": [1,1],
    "product_id": 1
}

POST "myindex/_doc"
{
    "image_vector": [3,3],
    "product_id": 1
}
POST "myindex/_doc"
{
    "image_vector": [1,9],
    "product_id": 2
}
POST "myindex/_doc"
{
    "image_vector": [2,9],
    "product_id": 2
}
POST "myindex/_doc"
{
    "image_vector": [2,9],
    "product_id": 3
}
POST "myindex/_doc"
{
    "image_vector": [2,9],
    "product_id": 3
}
POST "myindex/_doc"
{
    "image_vector": [2,9],
    "product_id": 3
}
POST "myindex/_doc"
{
    "image_vector": [2,9],
    "product_id": 3
}
POST "myindex/_doc"
{
    "image_vector": [2,10],
    "product_id": 3
}

Expected output:
productId=1 --> True
productId=2 --> False
productId=3 --> True
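As a cross-check on the expected output, here is a minimal client-side sketch in Python, assuming the ten documents posted above have been loaded into a plain list. The helper name `products_with_duplicates` is hypothetical and not part of Elasticsearch or the k-NN plugin; it only illustrates the grouping logic.

```python
from collections import Counter, defaultdict

# The ten documents from the POST requests above (assumed to be loaded client-side).
docs = [
    {"product_id": 1, "image_vector": [1, 1]},
    {"product_id": 1, "image_vector": [1, 1]},
    {"product_id": 1, "image_vector": [3, 3]},
    {"product_id": 2, "image_vector": [1, 9]},
    {"product_id": 2, "image_vector": [2, 9]},
    {"product_id": 3, "image_vector": [2, 9]},
    {"product_id": 3, "image_vector": [2, 9]},
    {"product_id": 3, "image_vector": [2, 9]},
    {"product_id": 3, "image_vector": [2, 9]},
    {"product_id": 3, "image_vector": [2, 10]},
]


def products_with_duplicates(documents):
    """Map each product_id to True if any of its vectors occurs at least twice."""
    counts = defaultdict(Counter)
    for doc in documents:
        # Lists are not hashable, so use a tuple as the per-vector key.
        counts[doc["product_id"]][tuple(doc["image_vector"])] += 1
    return {pid: any(n >= 2 for n in c.values()) for pid, c in counts.items()}


print(products_with_duplicates(docs))  # {1: True, 2: False, 3: True}
```

This only works when all documents fit in client memory; the aggregation is the server-side way to get the same grouping.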

I haven't spent time on getting the final output exactly as you expect, but I could get this information using an aggregation, as below:

curl -X GET "localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "myindex": {
      "terms": {
        "field": "product_id"
      },
      "aggs": {
        "imageVectorHashCode": {
          "terms": {
            "script": "StringBuilder builder = new StringBuilder();for (float i : params._source[\"image_vector\"]){ builder.append(i);builder.append(\"#\");} return builder.toString().hashCode();",
            "min_doc_count": 2
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "imageVectorHashCode._bucket_count"
            },
            "script": {
              "source": "params.count != 0"
            }
          }
        }
      }
    }
  }
}'

Output showing only aggs

 "aggregations" : {
    "myindex" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 3,
          "doc_count" : 4,
          "imageVectorHashCode" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "524933495",
                "doc_count" : 4
              }
            ]
          }
        },
        {
          "key" : 1,
          "doc_count" : 3,
          "imageVectorHashCode" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "-1218115168",
                "doc_count" : 2
              }
            ]
          }
        }
      ]
    }
  }

You can modify the above logic based on your requirements. I hope this gives you some ideas and helps you solve your problem.

Note: I used the bucket selector to remove product IDs that don't have duplicates. You can remove it if you want a result for every product ID and then determine true/false by checking whether the count is greater than 0.
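If a true/false flag per product is the desired final shape, the aggregation response can be post-processed on the client. Below is a rough sketch in Python, assuming the response body has already been parsed into a dict (trimmed here to the fields that matter) and that the caller supplies the full list of product IDs. `duplicate_flags` and `all_product_ids` are hypothetical names; the list is needed because products filtered out by the bucket selector do not appear in the response at all.

```python
# Trimmed copy of the "aggregations" section shown above.
aggregations = {
    "myindex": {
        "buckets": [
            {
                "key": 3,
                "doc_count": 4,
                "imageVectorHashCode": {
                    "buckets": [{"key": "524933495", "doc_count": 4}]
                },
            },
            {
                "key": 1,
                "doc_count": 3,
                "imageVectorHashCode": {
                    "buckets": [{"key": "-1218115168", "doc_count": 2}]
                },
            },
        ]
    }
}


def duplicate_flags(aggs, all_product_ids):
    """Return {product_id: True/False} from the nested terms aggregation."""
    with_duplicates = {
        bucket["key"]
        for bucket in aggs["myindex"]["buckets"]
        # A non-empty inner bucket list means some vector hash occurred >= 2 times.
        if bucket["imageVectorHashCode"]["buckets"]
    }
    return {pid: pid in with_duplicates for pid in all_product_ids}


print(duplicate_flags(aggregations, [1, 2, 3]))  # {1: True, 2: False, 3: True}
```

The check on the inner bucket list means the same function also works if the bucket selector is removed from the query.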

mustfkeskin commented 3 years ago

This solution fits my case. I have one more question about the response JSON.

"key" : 1 means productId=1, and "doc_count" : 3 means this product has 3 image vectors.

"buckets" : [
              {
                "key" : "-1218115168",
                "doc_count" : 2
              }
            ]

The vector [1,1] occurs 2 times, and "-1218115168" is the vector hash.

Am I correct?

VijayanB commented 3 years ago

@mustfkeskin That's correct.
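For anyone who wants to check a key offline, the hash can be reproduced outside Elasticsearch. This is a sketch in Python that emulates Java's String.hashCode() over the same "value#value#" string that the Painless script builds. It assumes whole-number components, where Python's float formatting ("1.0") matches what the script appends; other values may format differently. The function names are illustrative only.

```python
def java_string_hash(text):
    """Emulate Java's String.hashCode(): h = 31*h + ch, wrapped to signed 32 bits."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - (1 << 32) if h >= (1 << 31) else h


def vector_hash(vector):
    """Build "1.0#1.0#"-style text as the script does, then hash it.

    Assumption: components are whole numbers, so str(float(v)) matches
    the text Java produces when appending a float.
    """
    text = "".join(f"{float(v)}#" for v in vector)
    return java_string_hash(text)


print(vector_hash([1, 1]))  # -1218115168
print(vector_hash([2, 9]))  # 524933495
```

Because the key is a 32-bit hash, two different vectors could in principle collide, so it may be worth comparing the actual vectors before treating a bucket as a confirmed duplicate.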

mustfkeskin commented 3 years ago

Thank you @VijayanB and @vamshin