opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0
57 stars 58 forks

[BUG] neural query explain not showing details for nested field #698

Open yuye-aws opened 2 months ago

yuye-aws commented 2 months ago

What is the bug?

Searching on a nested field works well. However, GET {indexname}/_search?explain=true does not return a detailed explanation for the search results.

How can one reproduce the bug?

First, create an index with a nested embedding field. A sample document may look like this:

{
    "text": "A Hybrid EP and SQP for Dynamic Economic Dispatch with Nonsmooth Fuel Cost Function Dynamic economic dispatch (DED) is one of the main functions of power generation operation and control. It determines the optimal settings of generator units with predicted load demand over a certain period of time. The objective is to operate an electric power system most economically while the system is operating within its security limits. This paper proposes a new hybrid methodology for solving DED. The proposed method is developed in such a way that a simple evolutionary programming (EP) is applied as a based level search, which can give a good direction to the optimal global region, and a local search sequential quadratic programming (SQP) is used as a fine tuning to determine the optimal solution at the final. Ten units test system with nonsmooth fuel cost function is used to illustrate the effectiveness of the proposed method compared with those obtained from EP and SQP alone.",
    "text_chunk_embedding": [
      {
        "knn": [...]
      },
      {
        "knn": [...]
      }
    ],
    "text_chunk": [
      "[CLS] a hybrid ep and sqp for dynamic economic dispatch with nonsmooth fuel cost function dynamic economic dispatch ( ded ) is one of the main functions of power generation operation and control. it determines the optimal settings of generator units with predicted load demand over a certain period of time. the objective is to operate an electric power system most economically while the system is operating within its security limits. this paper proposes a new hybrid methodology for solving ded. the proposed method is developed in such a way that a simple evolutionary programming ( ep ) is applied as a based level search, which can give a good direction to the optimal global region, and",
      "a local search sequential quadratic programming ( sqp ) is used as a fine tuning to determine the optimal solution at the final. ten units test system with nonsmooth fuel cost function is used to illustrate the effectiveness of the proposed method compared with those obtained from ep and sqp alone. [SEP]"
    ]
}

Then, use the explain query to search the document:

GET {indexname}/_search?explain=true
{
  "size": 1,
  "_source": {
    "excludes": "text_chunk_embedding"
  },
  "query": {
    "nested": {
      "score_mode": "avg",
      "path": "text_chunk_embedding",
      "query": {
        "neural": {
          "text_chunk_embedding.knn": {
            "model_id": "PDx55Y4BxByNDM4P0mdQ",
            "query_text": "Global-Locally Self-Attentive Dialogue State Tracker"
          }
        }
      }
    }
  }
}
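
For anyone scripting the reproduction, the request above can also be assembled programmatically. The following is a minimal Python sketch, not part of the plugin: the helper name `build_nested_neural_explain_request` and the index name `my-index` are hypothetical, and the code only builds the request path and JSON body without contacting a cluster.

```python
import json


def build_nested_neural_explain_request(index, nested_path, vector_field,
                                        model_id, query_text, score_mode="avg"):
    """Return (path, body) for a nested neural search with explain enabled.

    Hypothetical helper for illustration only. It mirrors the request shown
    in this issue and performs no network I/O.
    """
    path = f"/{index}/_search?explain=true"
    body = {
        "size": 1,
        # Keep the (large) embedding vectors out of the returned _source.
        "_source": {"excludes": nested_path},
        "query": {
            "nested": {
                "score_mode": score_mode,
                "path": nested_path,
                "query": {
                    "neural": {
                        f"{nested_path}.{vector_field}": {
                            "model_id": model_id,
                            "query_text": query_text,
                        }
                    }
                },
            }
        },
    }
    return path, body


path, body = build_nested_neural_explain_request(
    index="my-index",  # assumed index name, replace with your own
    nested_path="text_chunk_embedding",
    vector_field="knn",
    model_id="PDx55Y4BxByNDM4P0mdQ",
    query_text="Global-Locally Self-Attentive Dialogue State Tracker",
)
print("GET", path)
print(json.dumps(body, indent=2))
```

The printed body can be pasted into Dev Tools or sent with any HTTP client.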

Currently, the explanation returned for the search result is:

"_explanation": {
  "value": 0.021672908,
  "description": "Score based on 3 child docs in range from 6364 to 6366, using score mode Avg",
  "details": [
    {
      "value": 0.021672908,
      "description": "sum of:",
      "details": [
        {
          "value": 1,
          "description": "No Explanation",
          "details": []
        },
        {
          "value": 0,
          "description": "match on required clause, product of:",
          "details": [
            {
              "value": 0,
              "description": "# clause",
              "details": []
            },
            {
              "value": 1,
              "description": "_nested_path:text_chunk_embedding",
              "details": []
            }
          ]
        }
      ]
    }
  ]
}

What is the expected behavior?

The explain output should at least show the score of each nested document, as the BM25 query does.

yuye-aws commented 2 months ago

If I search with a BM25 query:

GET {indexname}/_search?explain=true
{
  "size": 1,
  "query": {
    "match": {
      "text_chunk": "Global-Locally Self-Attentive Dialogue State Tracker"
    }
  }
}

the explanation is very detailed, for example:

{
    "value": 18.182425,
    "description": "sum of:",
    "details": [
      {
        "value": 4.982006,
        "description": "weight(text_chunk:self in 20446) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 4.982006,
            "description": "score(freq=2.0), computed as boost * idf * tf from:",
            "details": [
              {
                "value": 2.2,
                "description": "boost",
                "details": []
              },
              {
                "value": 3.1877272,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 351,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 8517,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.7103958,
                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details": [
                  {
                    "value": 2,
                    "description": "freq, occurrences of term within document",
                    "details": []
                  },
                  {
                    "value": 1.2,
                    "description": "k1, term saturation parameter",
                    "details": []
                  },
                  {
                    "value": 0.75,
                    "description": "b, length normalization parameter",
                    "details": []
                  },
                  {
                    "value": 104,
                    "description": "dl, length of field (approximate)",
                    "details": []
                  },
                  {
                    "value": 181.63051,
                    "description": "avgdl, average length of field",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value": 10.799234,
        "description": "weight(text_chunk:attentive in 20446) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 10.799234,
            "description": "score(freq=2.0), computed as boost * idf * tf from:",
            "details": [
              {
                "value": 2.2,
                "description": "boost",
                "details": []
              },
              {
                "value": 6.9098706,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 8,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 8517,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.7103958,
                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details": [
                  {
                    "value": 2,
                    "description": "freq, occurrences of term within document",
                    "details": []
                  },
                  {
                    "value": 1.2,
                    "description": "k1, term saturation parameter",
                    "details": []
                  },
                  {
                    "value": 0.75,
                    "description": "b, length normalization parameter",
                    "details": []
                  },
                  {
                    "value": 104,
                    "description": "dl, length of field (approximate)",
                    "details": []
                  },
                  {
                    "value": 181.63051,
                    "description": "avgdl, average length of field",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value": 2.401184,
        "description": "weight(text_chunk:state in 20446) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 2.401184,
            "description": "score(freq=1.0), computed as boost * idf * tf from:",
            "details": [
              {
                "value": 2.2,
                "description": "boost",
                "details": []
              },
              {
                "value": 1.9813391,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 1174,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 8517,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.55086344,
                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details": [
                  {
                    "value": 1,
                    "description": "freq, occurrences of term within document",
                    "details": []
                  },
                  {
                    "value": 1.2,
                    "description": "k1, term saturation parameter",
                    "details": []
                  },
                  {
                    "value": 0.75,
                    "description": "b, length normalization parameter",
                    "details": []
                  },
                  {
                    "value": 104,
                    "description": "dl, length of field (approximate)",
                    "details": []
                  },
                  {
                    "value": 181.63051,
                    "description": "avgdl, average length of field",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
}
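
This level of detail is what makes the BM25 score auditable: every number can be recomputed from the leaves of the tree. As a rough check, the Python sketch below plugs the values reported above into the idf and tf formulas quoted in the descriptions. The function name `bm25_term_score` is hypothetical, and the arithmetic uses double precision, so results should agree with the reported values only to within rounding.

```python
import math


def bm25_term_score(boost, n, N, freq, k1, b, dl, avgdl):
    """Recompute one term's score as boost * idf * tf.

    Uses the formulas quoted in the explanation output:
      idf = log(1 + (N - n + 0.5) / (n + 0.5))
      tf  = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    """
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


# Values shared by all three terms, read from the explanation above.
common = dict(boost=2.2, N=8517, k1=1.2, b=0.75, dl=104, avgdl=181.63051)

# Per-term values: n (documents containing the term) and freq (occurrences).
terms = {
    "self": dict(n=351, freq=2),
    "attentive": dict(n=8, freq=2),
    "state": dict(n=1174, freq=1),
}

scores = {term: bm25_term_score(**common, **params)
          for term, params in terms.items()}
total = sum(scores.values())

for term, score in scores.items():
    print(f"{term}: {score:.6f}")
print(f"total: {total:.6f}")
```

The per-term results should land close to the reported 4.982006, 10.799234, and 2.401184, and their sum close to the top-level 18.182425. The neural query explanation currently offers no comparable way to verify a score.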

martin-gaievski commented 2 months ago

@yuye-aws neural search will not return a detailed explain response because it uses the knn query under the hood, and the knn query doesn't support explain. Here is the corresponding GitHub issue: https://github.com/opensearch-project/k-NN/issues/875

yuye-aws commented 1 month ago

> @yuye-aws neural search will not have detailed response for explain as it uses knn query under the hood, and knn query doesn't support explain. Here is the corresponding GH issue for this matter: opensearch-project/k-NN#875

Sorry for taking so long to respond. It seems quite likely that this issue will be resolved automatically once https://github.com/opensearch-project/k-NN/issues/875/ is resolved. Just out of curiosity, is there an ongoing plan to resolve the k-NN issue?