opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[HELP] Min hashes generated by OS 2.11 is less than the hashes generated by OS 1.3.7 #12578

Open uttsap opened 7 months ago

uttsap commented 7 months ago

Describe the bug

We have a process to deduplicate documents. We convert each document's text into a set of minhashes, then count the minhashes that two documents have in common to measure their similarity. We use the Elasticsearch minhashes and Elasticsearch term queries to calculate the similarity instead of an LSH approach. This process worked fine until we upgraded OS from 1.3.7 to 2.11.

For the same document, the number of minhashes generated differs after the OS upgrade. We get fewer minhashes in OS 2.11, so the comparison no longer works.
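To make the comparison step concrete, here is a minimal sketch of counting common minhashes between two documents. It assumes each document's minhash tokens are already available as plain strings; the function names and the Jaccard normalization are illustrative assumptions, not the exact logic of our term queries.

```python
def common_minhash_count(hashes_a, hashes_b):
    """Number of distinct minhash tokens shared by two documents."""
    return len(set(hashes_a) & set(hashes_b))


def jaccard_similarity(hashes_a, hashes_b):
    """Shared tokens divided by all distinct tokens (assumed normalization)."""
    a, b = set(hashes_a), set(hashes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical token values, for illustration only.
doc_a = ["h1", "h2", "h3"]
doc_b = ["h2", "h3", "h4"]
print(common_minhash_count(doc_a, doc_b))
print(jaccard_similarity(doc_a, doc_b))
```

Because both measures depend on the size of each token set, a version that emits fewer tokens for the same text shifts the score even when the text itself is unchanged.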

Example Document:

{
        "_index": "document-deduplication",
        "_id": "c",
        "_score": 1,
        "_source": {
          "uri": "c",
          "publishedDate": 3,
          "text": """British army fights fake news with propagandists and hackers in one unit. Cyber and intelligence experts unite to battle disinformation as character of warfare changes. Computer hackers and propaganda specialists working in the British army are to be organized in a single division, as part of an effort to reflect a belief that the lines between peace and war has become increasingly blurred.  The cyber and intelligence experts will be organized into a reborn 6th Division – containing ground troops who can be used in secret, special forces-type operations.
 British forces to get new mission to counter state actors. One of the early tasks for the rebranded unit is to better tackle disinformation and fake news emerging from Russia and elsewhere . Army are more likely to be engaged in peacekeeping and security operations, or engaged in exercises in parts of the world where a visible UK presence is deemed politically desirable.
 But there is an increasing view that conflict has moved to the electronic and information arenas – particularly with Russia , but also with countries such as China and Iran – in which a key question is whether the UK can play a role to ensure that countries in eastern Europe remained allied with the west.""",
          "createdOn": 3,
          "insertedOn": 1710155879376
        }
      },

Index Settings:

{
  "document-deduplication": {
    "settings": {
      "index": {
        "refresh_interval": "30s",
        "number_of_shards": "1",
        "provided_name": "document-deduplication",
        "creation_date": "1696933018754",
        "analysis": {
          "filter": {
            "minHashFilter": {
              "hash_count": "1",
              "type": "min_hash",
              "with_rotation": "true",
              "hash_set_size": "3",
              "bucket_count": "250"
            },
            "shingleFilter": {
              "max_shingle_size": "3",
              "min_shingle_size": "3",
              "output_unigrams_if_no_shingles": "true",
              "output_unigrams": "false",
              "type": "shingle"
            }
          },
          "analyzer": {
            "min_hash_analyzer": {
              "filter": [
                "shingleFilter",
                "minHashFilter"
              ],
              "tokenizer": "standard"
            },
            "rich_text_analyzer": {
              "filter": [
                "lowercase"
              ],
              "char_filter": [
                "remove_tags_filter"
              ],
              "tokenizer": "standard"
            }
          },
          "char_filter": {
            "remove_tags_filter": {
              "pattern": "<.*?>",
              "type": "pattern_replace",
              "replacement": ""
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "G9Kq5ZjYRp6Ld1Ej4712VA",
        "version": {
          "created": "135248027"
        }
      }
    }
  }
}
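As a rough illustration of what these settings ask for, the sketch below builds 3-word shingles and computes the largest number of tokens the min_hash filter could emit, assuming each of the `hash_count` × `bucket_count` buckets keeps at most `hash_set_size` hashes. It uses whitespace splitting as a stand-in for the standard tokenizer, so it is an approximation and not the actual Lucene implementation; both function names are hypothetical.

```python
def word_shingles(tokens, size=3, unigrams_if_no_shingles=True):
    """Fixed-size word shingles (min_shingle_size == max_shingle_size == size)."""
    if len(tokens) < size:
        # Mirrors output_unigrams_if_no_shingles: fall back to single tokens.
        return list(tokens) if unigrams_if_no_shingles else []
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def max_minhash_tokens(hash_count, bucket_count, hash_set_size):
    """Upper bound on emitted tokens under the assumption stated above."""
    return hash_count * bucket_count * hash_set_size


# Simplified tokenization; the real analyzer uses the standard tokenizer.
tokens = "British army fights fake news".split()
print(word_shingles(tokens))
print(max_minhash_tokens(hash_count=1, bucket_count=250, hash_set_size=3))
```

With `hash_count` set to 1, each distinct shingle yields at most one hash, so for a short document the token count is limited by the number of distinct shingles well before it reaches the 750 bound.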

Index Mapping:

{
  "document-deduplication": {
    "mappings": {
      "properties": {
        "createdOn": {
          "type": "date",
          "format": "epoch_millis"
        },
        "insertedOn": {
          "type": "date",
          "format": "epoch_millis"
        },
        "publishedDate": {
          "type": "date",
          "format": "epoch_millis"
        },
        "text": {
          "type": "text",
          "store": true,
          "fields": {
            "raw": {
              "type": "text",
              "analyzer": "rich_text_analyzer"
            }
          },
          "term_vector": "yes",
          "analyzer": "min_hash_analyzer"
        },
        "uri": {
          "type": "keyword"
        }
      }
    }
  }
}

Hashes generated before the upgrade (OS 1.3.7): 197
Hashes generated after the upgrade to 2.11: 165
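One way to reproduce such counts on each cluster is to send the document text to the `_analyze` API of the index (`POST /document-deduplication/_analyze`) with `min_hash_analyzer` and count the returned tokens. The sketch below only builds the request body and counts tokens in a response that has already been parsed into a dict; the helper names are hypothetical, and sending the request is left to whichever client is in use.

```python
import json


def build_analyze_request(text, analyzer="min_hash_analyzer"):
    """Request body for POST /<index>/_analyze."""
    return {"analyzer": analyzer, "text": text}


def count_tokens(analyze_response):
    """Return (total tokens, distinct tokens) from an _analyze response dict."""
    tokens = [t["token"] for t in analyze_response.get("tokens", [])]
    return len(tokens), len(set(tokens))


body = build_analyze_request("British army fights fake news with propagandists")
print(json.dumps(body))

# Hypothetical response shape with placeholder token values.
sample_response = {"tokens": [{"token": "x"}, {"token": "y"}, {"token": "x"}]}
print(count_tokens(sample_response))
```

Running the same text through both versions and comparing the two token lists would show exactly which hashes are missing in 2.11.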

Do we need to fine tune the index settings? Or are we missing something else? Help appreciated.

Thanks,

Related component

Other

Expected behavior

The same number of minhashes for the same document before and after the upgrade.

andrross commented 7 months ago

[Triage - attendees 1 2 3] @uttsap Thanks for filing this issue.