opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.43k stars 1.72k forks source link

[BUG] In flat_object fields `null` values indexed as "null" strings #13880

Closed akramarev closed 1 week ago

akramarev commented 3 months ago

Describe the bug

Flat_object documentation doesn't define null values indexing rules, though it looks unexpected that documents containing fields with null values can be searched with exists or even prefix: "nu" queries.

Related component

Indexing

To Reproduce

  1. create a new index with flat_object field mapping
PUT demo-flattened
{
  "mappings": {
    "properties": {
      "record": {
        "type": "flat_object",
      }
    }
  }
}
  1. index a document with record.name: null
PUT demo-flattened/_doc/1
{
  "record": {
    "name": null
  }
}
  1. search the document using prefix query:
POST demo-flattened/_search
{
  "query": {
    "prefix": {
      "record.name": {
        "value": "nu",
      }
    }
  }
}
  1. OR search the document using exists query:
POST demo-flattened/_search
{
  "query": {
    "exists": {
      "field": "record.name"
    }
  }
}

Expected behavior

Both queries 3 and 4 should return no documents, but they do return the ingested document. Tested using latest opensearchproject/opensearch:2.14.0 image.

Additional Details

Host/Environment:

Additional context

kkewwei commented 3 months ago

@akramarev it seems that query 4 return document is expected. I tested on the pr #7137, it doesn't return document in query 3.

akramarev commented 3 months ago

Thanks for the rely @kkewwei. It's great that your PR fixes my main concern - test case 3 (prefix query by "nu" for null field). I should spin it up and check locally as well.

Regarding test case 4 (Exists query on the null field) though, I'm not sure the behavior is expected. My expectations are based on the documentation:

1. The flat object field type supports the following queries: ..., Exists 2. An indexed value will not exist for a document field in any of the following cases: ..., The field in the source JSON is null or [].

So the Exists behavior that I expect from text field for example, I equally expect from flat_object fields, at least from root level json properties (e.g. record.name). Also as I mentioned elasticsearch:7.15.0 confirms my expectations.

Text field example just illustrate the point above:

PUT demo-text
{
  "mappings": {
    "properties": {
      "record": {
        "type": "text",
      }
    }
  }
}

PUT demo-text/_doc/1
{
  "record": null
}

# expected/actual hits.total.value=0
POST demo-text/_search
{
  "query": {
    "exists": {
      "field": "record"
    }
  }
}
kkewwei commented 3 months ago

@akramarev I update my statement. In query case 3, it doesn't return document too. We don't build any index for the null value, you can see it from the unit test: https://github.com/opensearch-project/OpenSearch/pull/13853/files#diff-b8bb41a040735cadecdd9200b835224f11a6a213e6c61bffc8ef4a74180f58e2R201

In query case 3, it will return doc when we set the null_value for the field.

akramarev commented 3 months ago

@kkewwei I'm confused with case 3 and case 3 in the latest reply, could you clarify? I think you meant "In query case 3, it shouldn't return the document too".

In the meantime, I pulled your branch and verified that Prefix query doesn't return the document as expected (thanks for the fix!). While Exists query still returns it, I believe it's a bug. The quickest repro with curl commands, same as before, but just a bit simpler to copy&paste in a terminal to validate.

❯ git status
On branch flat_object_para
Your branch is up to date with 'origin/flat_object_para'.

nothing to commit, working tree clean

❯ curl -X PUT "localhost:9200/demo-flattened" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "record": {
        "type": "flat_object"
      }
    }
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"demo-flattened"}%

❯ curl -X PUT "localhost:9200/demo-flattened/_doc/1" -H 'Content-Type: application/json' -d'
{
  "record": {
    "name": null
  }
}'
{"_index":"demo-flattened","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}%

❯ curl -X POST "localhost:9200/demo-flattened/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "exists": {
      "field": "record.name"
    }
  }
}'
{"took":30,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"demo-flattened","_id":"1","_score":1.0,"_source":
{
  "record": {
    "name": null
  }
}}]}}%
peternied commented 3 months ago

[Triage - attendees 1 2 3 4 5 6 7] @akramarev Thanks for creating this issue, could you make a pull request to address?

kkewwei commented 3 months ago

@peternied I'm working on it, and almost done, please assign it to me, @akramarev, it looks more complicated than expected, there are too many cases about null value for flat_object type, I will pull separate pr to fix the null value.

kkewwei commented 3 months ago

@akramarev, please try the new branch #14069 when you are free. I test two of cases ok now.

akramarev commented 2 months ago

Thanks, @kkewwei, I can confirm - the test cases mentioned above work properly in your PR.