opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Using synonym filter after hunspell. #16530

Open aswad1 opened 5 days ago

aswad1 commented 5 days ago

Describe the bug

When a synonym filter is placed after the hunspell filter, the expected plural synonyms do not appear in the output. In the configuration below, I have added these synonyms:

PUT /test-index3
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_synonym_graph-replacement_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "stationary, stationery, stationaries, stationeries"
          ]
        },
        "custom_hunspell_stemmer": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_hunspell_stemmer",
            "custom_synonym_graph-replacement_filter"
          ]
        }
      }
    }
  }
}

While testing, I don't see stationaries and stationeries in the output.

POST /test-index3/_analyze
{
  "analyzer": "test_analyzer",
  "text": "stationary"
}

--
{
  "tokens": [
    {
      "token": "stationery",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationary",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationary",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    }
  ]
}
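
To make the gap explicit, the response above can be compared against the synonym rule. The snippet below is a minimal sketch: `missing_synonyms` is a hypothetical helper name, and the response is a trimmed copy of the output shown above rather than a live call to the cluster.

```python
import json

# Trimmed copy of the _analyze response shown above (offsets omitted).
response_text = """
{"tokens": [
  {"token": "stationery", "type": "SYNONYM", "position": 0},
  {"token": "stationary", "type": "SYNONYM", "position": 0},
  {"token": "stationary", "type": "word", "position": 0}
]}
"""

# The four terms from the synonym rule in the index settings.
expected = ["stationary", "stationery", "stationaries", "stationeries"]


def missing_synonyms(response_json: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that never appear as a token, in rule order."""
    emitted = {t["token"] for t in json.loads(response_json)["tokens"]}
    return [term for term in expected_terms if term not in emitted]


missing = missing_synonyms(response_text, expected)
print(missing)
```

For the response in this report, the two plural forms are the ones reported as missing.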

Here is the detailed analysis from OpenSearch:

POST /test-index3/_analyze
{
  "analyzer": "test_analyzer",
  "text": "stationary",
   "explain": true
}

------------------
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "whitespace",
      "tokens": [
        {
          "token": "stationary",
          "start_offset": 0,
          "end_offset": 10,
          "type": "word",
          "position": 0,
          "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
          "positionLength": 1,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "positionLength": 1,
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "custom_hunspell_stemmer",
        "tokens": [
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "custom_synonym_graph-replacement_filter",
        "tokens": [
          {
            "token": "stationery",
            "start_offset": 0,
            "end_offset": 10,
            "type": "SYNONYM",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 65 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          },
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "SYNONYM",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          },
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          }
        ]
      }
    ]
  }
}
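
The `bytes` fields in the explain output are hex dumps of each term, which makes it easy to confirm that hunspell passed `stationary` through unchanged and that the only new term is `stationery`. A small sketch for decoding them (`decode_explain_bytes` is a hypothetical helper, not part of any OpenSearch client):

```python
def decode_explain_bytes(field: str) -> str:
    """Turn an explain-style "[73 74 ...]" hex dump into the UTF-8 term it encodes."""
    return bytes.fromhex(field.strip("[]")).decode("utf-8")


# Values copied from the explain output above.
hunspell_out = decode_explain_bytes("[73 74 61 74 69 6f 6e 61 72 79]")
synonym_out = decode_explain_bytes("[73 74 61 74 69 6f 6e 65 72 79]")
print(hunspell_out, synonym_out)
```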

The hunspell rules and dictionary files are attached. en-US.aff.txt en-US.dic.txt

Related component

Other

To Reproduce

N/A

Expected behavior

The attached screenshot of the Solr analysis screen, with the synonym graph filter (SGF) highlighted, shows the expected behavior: all of the synonyms are displayed under SGF.

Solr-screenshot


prudhvigodithi commented 4 days ago

[Triage]

This comes from https://github.com/opensearch-project/OpenSearch/issues/16263. The fix proposed there, adding synonym_analyzer for the synonym_graph filter (PR https://github.com/opensearch-project/OpenSearch/pull/16488), should solve this bug as well.
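
One plausible reading of why the plurals vanish, consistent with the linked issue, is that the synonym rule itself is run through the filters that precede the synonym filter, so the plural entries are stemmed before the synonym map is built. The toy model below only illustrates that idea. It is not OpenSearch code: `toy_stem` is a made-up stand-in for the hunspell filter, and `build_synonym_set` is a hypothetical helper.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for the hunspell filter: rewrite a trailing 'ies' to 'y'."""
    return token[:-3] + "y" if token.endswith("ies") else token


def build_synonym_set(rule: str, pre_filters) -> list[str]:
    """Analyze each term of a comma-separated rule with pre_filters, dropping duplicates."""
    terms = []
    for raw in rule.split(","):
        token = raw.strip().lower()
        for apply_filter in pre_filters:
            token = apply_filter(token)
        if token not in terms:
            terms.append(token)
    return terms


rule = "stationary, stationery, stationaries, stationeries"
with_stemmer = build_synonym_set(rule, [toy_stem])  # rule analyzed after stemming
without_stemmer = build_synonym_set(rule, [])       # rule analyzed on its own
print(with_stemmer)
print(without_stemmer)
```

Under this assumption, the stemmed rule collapses to the two singular forms, which matches the tokens in the report, while analyzing the rule separately from the stemmer keeps all four terms.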

SFX Z Y 8
SFX Z   0     rs         e
SFX Z   y     iers       [^aeiou]y
SFX Z   0     ers        [aeiou]y
SFX Z   0     ers        [^ey]
SFX Z   0     ners         [aiu]n
SFX Z   0     ers          [^e]an
SFX Z   e     ners         [aiu]ne
SFX Z   0     rly         e
SFX Z   y     ierly       [^aeiou]y
SFX Z   0     erly        [aeiou]y
SFX Z   0     erly        [^ey]
SFX Z   0     nerly       [aiu]n
SFX Z   0     erly        [^e]an
SFX Z   e     nerly       [aiu]ne
curl -X POST "localhost:9200/test-index5/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "test_analyzer",
  "text": "stationary"
}'

Thank you @msfroh @getsaurabh02 @nupurjaiswal @dblock @aswad1