opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Sorting on integer_range field types can fail with out of bounds exceptions #12263

Open · Scrambles56 opened 7 months ago

Scrambles56 commented 7 months ago

Describe the bug

Attempting to sort a search on an integer_range field produces inconsistent results: some searches succeed, but if a specific document is in the result set, the search fails with an exception like the following:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "",
    "phase": "fetch",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "array_index_out_of_bounds_exception",
      "reason": "Index 5 out of bounds for length 5"
    }
  },
  "status": 500
}
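For completeness, the full server-side stack trace behind the array_index_out_of_bounds_exception can usually be retrieved by re-running the failing search with the standard error_trace=true query parameter (request shape only, sketched below with the query omitted; the trace itself is not reproduced here):

    POST /listings/_search?typed_keys=true&error_trace=true
    {
      "sort": [
        { "price_range": { "order": "asc" } }
      ]
    }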

Related component

Search

To Reproduce

  1. Spin up a clean OpenSearch instance (v2.11.0)

  2. Create an index as follows:

    PUT /listings
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "properties": {
          "listing_id": {
            "type": "keyword"
          },
          "version": {
            "type": "integer"
          },
          "title": {
            "type": "text"
          },
          "description": {
            "type": "text"
          },
          "supplier_name": {
            "type": "text"
          },
          "categories": {
            "properties": {
              "id": {
                "type": "keyword"
              },
              "name": {
                "type": "keyword"
              }
            }
          },
          "variant_types": {
            "type": "keyword"
          },
          "price_range": {
            "type": "integer_range"
          },
          "compare_at_price": {
            "type": "integer"
          },
          "location_types": {
            "type": "keyword"
          },
          "location": {
            "type": "geo_point"
          }
        }
      }
    }
  3. Index a document with the following details:

    POST /listings/_doc
    {
      "listing_id": "lst_clscbqyvf00060104t1339786",
      "version": 1,
      "title": "Breathwork Session",
      "description": "asd",
      "supplier_name": "Impact Life Coach and Trauma Therapy",
      "categories": [
        {
          "name": "Breathwork",
          "id": "cat_clm4cjukp000a3b6h7qi9k5i5"
        },
        {
          "name": "Mind-Body",
          "id": "cat_clm4chyb600023b6hv8zhwty5"
        }
      ],
      "variant_types": [
        "Pass"
      ],
      "price_range": {
        "gte": 300,
        "lte": 500
      },
      "token_currency": "nztoken",
      "primary_image_url": "https://google.com/",
      "location_types": []
    }
  4. Attempt a search:

    POST /listings/_search?typed_keys=true
    {
      "from": 0,
      "query": {
        "bool": {
          "must": [
            {
              "terms": {
                "categories.id": [
                  "cat_clm4cjukp000a3b6h7qi9k5i5"
                ]
              }
            }
          ]
        }
      },
      "size": 10,
      "sort": [
        {
          "price_range": {
            "order": "asc"
          }
        }
      ]
    }

Observe: The search fails with the error detailed above.
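One note on reproducing: the document indexed in step 3 only becomes visible to search after a refresh (which happens automatically on the index's refresh interval). If step 4 unexpectedly returns zero hits instead of the error, forcing an explicit refresh first should surface the failure:

    POST /listings/_refresh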

  1. Reset your environment by running steps 1 & 2 again.

  2. Index a document as follows (note the different price_range.lte):

    POST /listings/_doc
    {
      "listing_id": "lst_clscbqyvf00060104t1339786",
      "version": 1,
      "title": "Breathwork Session",
      "description": "asd",
      "supplier_name": "Impact Life Coach and Trauma Therapy",
      "categories": [
        {
          "name": "Breathwork",
          "id": "cat_clm4cjukp000a3b6h7qi9k5i5"
        },
        {
          "name": "Mind-Body",
          "id": "cat_clm4chyb600023b6hv8zhwty5"
        }
      ],
      "variant_types": [
        "Pass"
      ],
      "price_range": {
        "gte": 300,
        "lte": 490
      },
      "token_currency": "nztoken",
      "primary_image_url": "https://google.com/",
      "location_types": []
    }
  3. Perform the search again:

    POST /listings/_search?typed_keys=true
    {
      "from": 0,
      "query": {
        "bool": {
          "must": [
            {
              "terms": {
                "categories.id": [
                  "cat_clm4cjukp000a3b6h7qi9k5i5"
                ]
              }
            }
          ]
        }
      },
      "size": 10,
      "sort": [
        {
          "price_range": {
            "order": "asc"
          }
        }
      ]
    }

Observe: The search succeeds.

Expected behavior

Option 1: Sorting on integer_range should be unsupported, and all searches attempting to do so should fail with a clear error message.

Option 2: Sorting on integer_range should allow you to specify an anchor point to sort on (e.g. min, max, median).
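In the meantime, a possible workaround is to index the range boundaries as plain integer sub-fields alongside the integer_range and sort on those instead. A sketch only, using hypothetical price_min / price_max fields that are not part of the mapping above:

    PUT /listings/_mapping
    {
      "properties": {
        "price_min": { "type": "integer" },
        "price_max": { "type": "integer" }
      }
    }

    POST /listings/_search
    {
      "query": { "match_all": {} },
      "sort": [
        { "price_min": { "order": "asc" } }
      ]
    }

Documents would then need to populate these fields at index time (e.g. mirroring price_range.gte and price_range.lte).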

Additional Details

Plugins N/A

Screenshots N/A

Host/Environment (please complete the following information): OpenSearch v2.11.0 (see reproduction step 1)

nknize commented 7 months ago

From the Slack discussion (which will be purged after 90 days), so including the explanation below:

oye! tldr; sorting by range fields is unexpected behavior. Looks like Elastic mucked that one up pretty good. I should've explicitly stated this when I wrote the blog post years ago.

I initially removed doc value support for RangeFields when I first added the field to Elasticsearch, only because we didn't have any aggregation support for range fields. They were added back not long after in order to boost query performance using IndexOrDocValuesQuery, but the nasty side effect is that Sort also uses doc values, and no guard rails were included in the commit.

So what's happening is the integer range encoding of the value to doc value is variable (to save space on disk since S3 is expensive :slightly_smiling_face: ). So when the DocValueFormat instance is pulled from RangeFieldType.docValueFormat, it's just using the default RAW formatter, which doesn't take the RangeType into consideration and thus tries to blindly decode the encoded range to a nonsensical string using BytesRef.utf8ToString. Welp, as expected, the values aren't UTF8, so UnicodeUtil#UTF8ToUTF16 trips a byte boundary assertion (if you're running with assertions enabled) and nasty unexpected behaviors ensue :confused:
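To illustrate what those doc values are actually there for: the query path (the IndexOrDocValuesQuery case above) doesn't go through the RAW formatter, so a plain range query against price_range on the repro index should still work; it's only the sort path that tries to decode the binary doc value as UTF8. A sketch of such a query (the relation parameter defaults to intersects):

    POST /listings/_search
    {
      "query": {
        "range": {
          "price_range": {
            "gte": 350,
            "lte": 450,
            "relation": "intersects"
          }
        }
      }
    }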

peternied commented 7 months ago

[Triage - attendees 1 2 3 4 5 6 7 8] @Scrambles56 Thanks for filing this issue; we look forward to a pull request to address it.