opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.62k stars 1.76k forks source link

[BUG] Multimatch on fields with distincts synonym analyzers does not work properly #7691

Open pierre-mallet opened 1 year ago

pierre-mallet commented 1 year ago

Describe the bug

I stumbled on an edge case when mixing different fields with search-time synonyms analyzers in a multimatch query. In the case described below we can see in the profiling information that the multimatch query is rewrite into an incorrect DisjunctionMaxQuery that targets the same field multiple times.

This bug is reproductible in 2.6.0 and 2.7.0

To Reproduce

Here is a minimal scenario to reproduce the bug

curl -X PUT localhost:9200/synonyms -d '{
  "aliases": {},
  "mappings": {
    "dynamic": false,
    "properties": {
      "text": {
        "search_analyzer": "synonym_1",
        "type": "text"
      },
      "title": {
        "search_analyzer": "synonym_2",
        "type": "text"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_1": {
          "filter": [
            "synonyms"
          ],
          "tokenizer": "standard"
        },
        "synonym_2": {
          "filter": [
            "synonyms"
          ],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "synonyms": {
          "lenient": "true",
          "synonyms": [
            "term1, term2"
          ],
          "type": "synonym_graph"
        }
      }
    }
  }
}'

curl -X POST localhost:9200/synonyms/_doc/test_doc -d '{
  "title": "term1"
}'

curl -X POST localhost:9200/synonyms/_search -d '{
  "profile": "true",
  "from": 0,
  "query": {
    "multi_match": {
      "boost": 1,
      "fields": [
        "text^1.0",
        "title^1.0"
      ],
      "query": "term1",
      "type": "best_fields"
    }
  }
}'

This request does not match our document and in the profiling we can see this :

"profile": {
    "shards": [
      {
        "id": "[GaBuyLhQSyalJSCmyrTaIA][synonyms][0]",
        "inbound_network_time_in_millis": 0,
        "outbound_network_time_in_millis": 0,
        "searches": [
          {
            "query": [
              {
                "type": "DisjunctionMaxQuery",
                "description": **"(Synonym(text:term1 text:term2) | Synonym(text:term1 text:term2))",**

The textfield is targeted twice and the titlefield is not used in the query.

Note that in version 2.5.0 this works correctly and the query generated is :

 "type" : "DisjunctionMaxQuery",
 "description" : "(Synonym(title:term1 title:term2) | Synonym(text:term1 text:term2))",

Also note that if we have more terms in the query it works fine :

curl -X POST localhost:9200/synonyms/_search -d '{
  "profile": "true",
  "from": 0,
  "query": {
    "multi_match": {
      "boost": 1,
      "fields": [
        "text^1.0",
        "title^1.0"
      ],
      "query": "term1 term3 ",
      "type": "best_fields"
    }
  }
}'

=>

"type": "DisjunctionMaxQuery",
"description": "((Synonym(title:term1 title:term2) title:term3) | (Synonym(text:term1 text:term2) text:term3))",

Expected behavior

The multimatch query should targets both fields

Plugins

No plugins

Host/Environment (please complete the following information):

harshavamsi commented 1 year ago

I've added a draft PR in #10148. It matches the exact request that is being made here, although I am not a 100% sure yet if this issue is with multi_match or with the synonym analyzer.

I'm having issues getting the test case to pass, I keep encountering this:

 2> java.lang.IllegalArgumentException: Unknown filter type [synonym] for [synonyms]
        at __randomizedtesting.SeedInfo.seed([91D695E83F6A921B:84D55249764E76E3]:0)
        at org.opensearch.index.analysis.AnalysisRegistry.getAnalysisProvider(AnalysisRegistry.java:565)
        at org.opensearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:516)
        at org.opensearch.index.analysis.AnalysisRegistry.buildTokenFilterFactories(AnalysisRegistry.java:336)
        at org.opensearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:239)
        at org.opensearch.index.IndexModule.newIndexService(IndexModule.java:628)
        at org.opensearch.indices.IndicesService.createIndexService(IndicesService.java:839)
        at org.opensearch.indices.IndicesService.withTempIndexService(IndicesService.java:782)
        at org.opensearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexWithTemporaryService(MetadataCreateIndexService.java:479)
        at org.opensearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequestWithV1Templates(MetadataCreateIndexService.java:584)
        at org.opensearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequest(MetadataCreateIndexService.java:441)
        at org.opensearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequest(MetadataCreateIndexService.java:448)
        at org.opensearch.cluster.metadata.MetadataCreateIndexService$1.execute(MetadataCreateIndexService.java:354)
        at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:65)
        at org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:874)
        at org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:424)
        at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:295)
        at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:206)
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:204)
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:242)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.lang.Thread.run(Thread.java:1623)

Looking at other tests that make use of search analyzers in settings, looks like I'm not doing anything too different but I can't seem to get out of this issue -- https://github.com/opensearch-project/OpenSearch/blob/0c4e95095e4d44540ea8a90aa5ca55fea1248366/server/src/internalClusterTest/java/org/opensearch/search/suggest/CompletionSuggestSearchIT.java#L1090

I've also verified that this issue does not occur in 2.5 and shows up in 2.6. Looking at the diff, I don't see any changes to the MultiMatchQuery or the SynonymAnalyzer, although I'm not entirely sure I'm looking at the right places -- https://github.com/opensearch-project/OpenSearch/blame/750901a2d1304961070563a66313fa34c03ec779/server/src/main/java/org/opensearch/index/search/MultiMatchQuery.java#L221

There was a lucene version upgrade that could've caused this issue. Any thoughts @msfroh @rishabhmaurya

pierre-mallet commented 11 months ago

For information, this issue is no longer reproductible in version 2.10 of opensearch ( I did not tried the 2.9 version ).

msfroh commented 11 months ago

I suspect that this may have been fixed in 2.8 (since I was taking with @mingshl about a very similar bug report on Friday).

OpenSearch 2.6 brought in an upgrade from Lucene 9.4.2 to 9.5.0, and OpenSearch 2.8 upgraded to Lucene 9.6.0.

I took a look at the "Bug Fixes" section of the Lucene 9.6.0 change log for anything having to do with synonym queries. I suspect that the fix in 9.6.0 was https://github.com/apache/lucene/pull/12260, and the bug in 9.5.0 was https://github.com/apache/lucene/pull/11941.

Note that the fix doesn't show up in the published 9.6.0 change log, but it does show up in the 9.6.0 section of later change logs.

mingshl commented 11 months ago

Thanks @msfroh for pointing the right place!!! It's fixed by Lucene at 9.6 here, I can upgrade the lucene version at 2.6 and 2.7 to fix the issue for the multi-match query not working property, shall I ?

msfroh commented 11 months ago

I can upgrade the lucene version at 2.6 and 2.7 to fix the issue for the multi-match query not working property, shall I ?

We don't upgrade the Lucene version on past releases.

In this case, I believe the correct course is just to say "This is a known bug in 2.6 and 2.7, but we know it is fixed in 2.8 and later".

dblock commented 11 months ago

Let's get https://github.com/opensearch-project/OpenSearch/pull/10148 to pass and close this with it? @harshavamsi