@lukas-vlcek agreed. Here is the branch flat-object updated.
Thanks @lukas-vlcek for the initial PR, whose layout follows a typical plugin module. It's marked with @opensearch.experimental. @nknize @dblock what are your thoughts on how to release this feature: the mapper-extras module, sandbox, experimental, or long-term support (LTS)?
The implementation looks simple. The only reason to do experimental is if you think something may be removed in the future, or the API will change. So my preference is for LTS unless there's a good reason not to. @nknize WDYT?
Could we please get some 👀 from other maintainers on the PRs/code for this? Maybe @gbbafna or @Bukhtawar?
> The implementation looks simple. The only reason to do experimental is if you think something may be removed in the future, or the API will change. So my preference is for LTS unless there's a good reason not to. @nknize WDYT?
I was thinking that if we can release with a stable API for whatever version is coming when this is done then great. If we’re not comfortable with the API, but it’s good enough to test when we get to a release, go experimental then and LTS for a future release. Just my thoughts on how we can release frequently and get feedback. How does that sound?
To implement flat objects, we currently have three technical approaches. The sample index "catalog" and the two sample docs below are used to explain them:
```
curl -XPUT localhost:9200/test-index002 -H "Content-Type: application/json" --data '{
  "mappings": {
    "properties": {
      "ISBN13": {
        "type": "keyword"
      },
      "catalog": {
        "type": "flat-object"
      }
    }
  }
}'
```

```
curl -XPUT localhost:9200/test-index002/_doc/1 -H "Content-Type: application/json" -d '{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "author1": {
      "surname": "McCandless",
      "given": "Mike"
    }
  }
}'
```

```
curl -XPUT localhost:9200/test-index002/_doc/0 -H "Content-Type: application/json" -d '{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "author": {
      "surname": "Mock",
      "given": "Mike"
    }
  }
}'
```
Approach 1: Two String Fields
catalog.value:
```
doc 1: "{ "title" : "Lucene in Action", "author" : { "surname" : "McCandless", "given" : "Mike" } }",
doc 0: "{ "title" : "Test in Action", "author" : { "surname" : "Mock", "given" : "Mike" } }"
```
The "value" field is a text field, so it will be tokenized and transformed into a list of individual terms. The terms dictionary for the "value" field would be:
catalog.value:
```
doc 1: ["title","Lucene in Action", "author", "surname" ,"McCandless", "given", "Mike"],
doc 0: ["title","Test in Action","author","surname", "Mock","given", "Mike"]
```
The second string field, the "content_and_path" keyword field, would contain the nested fields as a string, with each field's path prepended:
catalog.content_and_path:
```
doc 1: "catalog.title:Lucene in Action, catalog.author.surname:McCandless, catalog.author.given:Mike"
doc 0: "catalog.title:Test in Action, catalog.author.surname:Mock, catalog.author.given:Mike"
```
The keyword field is treated as exact values. The terms dictionary for the "content_and_path" field would be:
catalog.content_and_path:
```
doc 1: ["catalog.title=Lucene in Action", "catalog.author.surname=McCandless", "catalog.author.given=Mike"]
doc 0: ["catalog.title=Test in Action", "catalog.author.surname=Mock", "catalog.author.given=Mike"]
```
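To illustrate Approach 1, here is a small Python sketch (helper names are assumptions for illustration, not the plugin's implementation) showing how the "value" leaf terms and the "content_and_path" terms above could be derived from a JSON object:

```python
# Hypothetical sketch of the Approach 1 flattening: walk a JSON object
# and collect the leaf values (for the "value" field) plus
# "path=value" terms (for the "content_and_path" field).

def flatten(obj, prefix="catalog"):
    """Yield (dotted_path, value) pairs for every leaf in a JSON-like dict."""
    for key, val in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(val, dict):
            yield from flatten(val, path)
        else:
            yield path, val

doc = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}

pairs = list(flatten(doc))
values = [v for _, v in pairs]
content_and_path = [f"{p}={v}" for p, v in pairs]
# values           -> ['Lucene in Action', 'McCandless', 'Mike']
# content_and_path -> ['catalog.title=Lucene in Action',
#                      'catalog.author.surname=McCandless',
#                      'catalog.author.given=Mike']
```

The key property is that both fields live on the root document, so no sub-documents are created.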
Approach 2: Two Fields Per Subfield
```
{ "catalog.title.value" : "Test in Action",
  "catalog.title.path_and_content" : "catalog.title=Test in Action" }
{ "catalog.author.surname.value" : "Mock",
  "catalog.author.surname.path_and_content" : "catalog.author.surname=Mock" }
{ "catalog.author.given.value" : "Mike",
  "catalog.author.given.path_and_content" : "catalog.author.given=Mike" }
```
The three term dictionaries for the "value" fields would be:
"catalog.title.value" contains the terms {"Test", "In", "Action"}
"catalog.author.surname.value" contains the terms {"Mock"}
"catalog.author.given.value" contains the terms {"Mike"}
The three term dictionaries for the "path_and_content" fields would be:
"catalog.title.path_and_content" contains the terms {"catalog.title", "Test", "In", "Action"}
"catalog.author.surname.path_and_content" contains the terms {"catalog.author.surname", "Mock"}
"catalog.author.given.path_and_content" contains the terms {"catalog.author.given", "Mike"}
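To make the per-subfield idea concrete, here is a rough Python sketch (the function and field names are illustrative assumptions, not the plugin's actual code) of how Approach 2 would generate a `<path>.value` and a `<path>.path_and_content` field for every leaf of the JSON object:

```python
# Hypothetical sketch of Approach 2: every leaf of the JSON object
# gets its own pair of index fields, named <path>.value and
# <path>.path_and_content.

def per_subfield_fields(obj, prefix="catalog"):
    """Walk a JSON-like dict and build one field pair per leaf."""
    fields = {}
    for key, val in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(val, dict):
            fields.update(per_subfield_fields(val, path))
        else:
            fields[f"{path}.value"] = val
            fields[f"{path}.path_and_content"] = f"{path}={val}"
    return fields

doc = {"title": "Test in Action",
       "author": {"surname": "Mock", "given": "Mike"}}
fields = per_subfield_fields(doc)
# fields["catalog.title.value"] == "Test in Action"
# fields["catalog.author.surname.path_and_content"] == "catalog.author.surname=Mock"
```

Note how the number of distinct index fields grows with the number of leaves, which is exactly the mapping-growth concern raised later in this thread.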
Approach 3: Nested Doc Per Subfield
```
"properties": {
  "key": {
    "type": "keyword"
  },
  "key_text": {
    "type": "text",
    "analyzer": "whitespace",
    "norms": false
  },
  "value": {
    "type": "keyword"
  },
  "value_text": {
    "type": "text",
    "analyzer": "whitespace",
    "norms": false
  }
}
```
With the sample example, the catalog field will create three sub-docs:
```
catalog: [
  { "key": "title", "key_text": "title", "value": "Lucene in Action", "value_text": "Lucene in Action" },
  { "key": "author.surname", "key_text": "author.surname", "value": "McCandless", "value_text": "McCandless" },
  { "key": "author.given", "key_text": "author.given", "value": "Mike", "value_text": "Mike" }
]
```
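A minimal Python sketch of this sub-doc generation (the function name and exact field layout are assumptions for illustration, not the plugin's code):

```python
# Hypothetical sketch of Approach 3: derive one nested sub-doc
# (key/key_text/value/value_text) per leaf of the flat-object field.

def to_subdocs(obj, prefix=""):
    """Return a list of sub-docs, one per leaf of a JSON-like dict."""
    docs = []
    for key, val in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            docs.extend(to_subdocs(val, path))
        else:
            docs.append({
                "key": path, "key_text": path,
                "value": val, "value_text": val,
            })
    return docs

catalog = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}
subdocs = to_subdocs(catalog)
# -> 3 sub-docs, with keys "title", "author.surname", "author.given"
```

Each leaf becomes its own nested Lucene document, which is what makes this approach flexible but also what creates the sub-doc volume discussed later in this thread.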
@Mingshl
Thanks a lot for putting this together. The following are my thoughts about it. In a nutshell, I think we should also explore options that do not rely on nested objects, and I am proposing some ideas below. Perhaps your approach 1 goes in the same direction as well. Let's see...
Let me start with the definition of index mapping:
```
{
  "mappings": {
    "properties": {
      "ISBN13": {
        "type": "keyword"
      },
      "catalog": {
        "type": "flat-object"
      }
    }
  }
}
```
What I think is very important (and I have been missing it in the discussion so far) is that the `catalog` field can accept ANY JSON object, including objects with variable/changing structure. This means we can think of the following three example documents as all representing a valid `catalog` object:
```
// --- Document 0
// The "catalog.author" is a simple text field.
{
  "ISBN13": "V9781933988175",
  "catalog": {
    "title": "Java in Action",
    "author": "John Doe"
  }
}

// --- Document 1
// The "catalog.author" is an object.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "author": {
      "surname": "McCandless",
      "given": "Mike"
    }
  }
}

// --- Document 2
// The "catalog.author" is an array with objects.
// And each object can be either a simple value field or another object with variable "schema".
{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "author": [
      "John Doe",
      { "surname": "Smith", "given": "Peter" },
      { "surename": "Green", "first_name": "Billy" }
    ]
  }
}
```
(A small notice: some examples used in the past were using `author` and `author1` fields, which in fact already means that we assume variable structure, but it seems to me that this naming difference was so subtle that it sometimes went unnoticed. So let's make it clearly obvious.)
Now, if we agree that this is the use case that we want to support then we can revisit suggested approaches.
If I understand correctly I think we can exclude Approach 2.
It is because individual fields (like `catalog.author`) can be either a simple value field (like a string) or an object, and this is impossible to model using today's index mapping. That would lead to mapping conflicts (depending on which document comes first for indexing).
Also I believe we do not want to introduce any new mappings for individual fields dynamically. This could lead to heavy metadata synchronisation across the cluster (imagine indexing a lot of documents with unique fields, the metadata will need to be synced all over the cluster nodes) and can also lead to mapping explosion.
I like Approach 3; let's explore how it could be used to store and query Document 2:
Firstly, I am not sure why we need to distinguish between `key` and `key_text`, and similarly between `value` and `value_text`. So for the sake of simplicity let's simplify this model for now into two fields only – `key` and `value`:
| key | value |
|---|---|
| catalog.author.surename | [ Smith, Green ] |
| catalog.author.given | [ Peter ] |
| catalog.author.first_name | [ Billy ] |
| catalog.author | [ John, Doe ] |
| catalog.title | [ Test, in, Action ] |
This means we get the following nested objects for indexing:
```
[
  { "key": "catalog.author.surename", "value": "Smith Green" },
  { "key": "catalog.author.given", "value": "Peter" },
  { "key": "catalog.author.first_name", "value": "Billy" },
  { "key": "catalog.author", "value": "John Doe" },
  { "key": "catalog.title", "value": "Test in Action" }
]
```
The con is that we lose the option to correctly search across child objects, meaning that we can search for `catalog.author.surename:Green AND catalog.author.given:Peter` and get a hit while there is no such combination for any single author. But I think it is an accepted fact that we are not able to model nested objects using an abstraction on top of nested objects (:smile:). Generally, I think it applies to all approaches that we will not be able to simulate nested objects.
Of course, internally, such query would need to be translated to a more complicated bool query of nested queries. That is something that would have to be carefully implemented.
Not sure if I got this query perfectly correct, but something like this...
```
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "catalog",
            "query": {
              "bool": {
                "must": [
                  { "term": { "catalog.key": "catalog.author.surename" } },
                  { "match": { "catalog.value": "Green" } }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "catalog",
            "query": {
              "bool": {
                "must": [
                  { "term": { "catalog.key": "catalog.author.given" } },
                  { "match": { "catalog.value": "Peter" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
```
As a bonus – I can imagine that it might be useful/possible to further specify text analyzers using some kind of `key` pattern matching. For example all `catalog.author.*name` fields could use a different analyzer (but that would probably require adding an extra `value_analyzed` field). But that is not important at this point.
I feel a bit lost on this one. I think the text formatting is not helping much. But on a high level it feels to me that it is going in a similar direction as my last proposal (see below).
I was thinking about approaches without nested documents. It is a known fact that nested documents are expensive. So can we do without them?
I think part of the answer is what specifically we want to use the `flat-object` data for during querying (I will get to this point later). The following are two different approaches that I would like to discuss:
Let's start with this question: What would a JSON data look like when transformed into a single line form? Basically it is a Map with key-value pairs. For each key I get a value. For example the Document 0 could be represented like this:
```
ISBN13: V9781933988175 catalog.title: Java in Action catalog.author: John Doe
```
To make it clearer and more readable, let's add formatting and arbitrary boundaries around each key-value pair:
```
| ISBN13: V9781933988175 | catalog.title: Java in Action | catalog.author: John Doe |
```
What I see here is a sequence of tokens organized into groups defined by some boundaries. The first token in each group is called a "key", followed by one or more "value" tokens.
From here the `flat-object` search could be seen as a search for specific value token(s) inside the group identified by its key.
(Hint: I can imagine that every key can be processed using a "synonym filter technique" to yield all possible path expansions, i.e. "foo.bar.clazz" can yield "foo" and "foo.bar" at the same token position; this is to enable more flexible search. But let's leave this discussion for later...)
While this is still a very vague definition, I found the following article from Mike very inspiring: https://blog.mikemccandless.com/2014/08/a-new-proximity-query-for-lucene-using.html
Lucene can actually search for tokens in a specific order having some kind of "closeness". So the proximity search task is actually a task of searching for a "key" token followed by the "value" token(s) until the border of the group is reached.
Of course the devil is in details. Both the Slop and TermAutomaton Queries seem to require explicit definition of the distance between tokens. Which may be challenging to fix or workaround (but maybe is doable). It might be worth investigating this approach a bit further.
In general I think such an approach would not perform worse than the performance of TermAutomaton queries today.
The second approach is based on very simple idea: Let's prefix each value token with its key.
Taking the Document 0 again we would get the following (for clarity each token on a new row):
```
ISBN13:V9781933988175
catalog.title:Java
catalog.title:in
catalog.title:Action
catalog.author:John
catalog.author:Doe
```
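The "prefix each value token with its key" idea can be sketched in a few lines of Python (a crude whitespace tokenizer standing in for a real analyzer; names are illustrative assumptions):

```python
# Hypothetical sketch of the key-prefixed token stream: for each
# key-value pair, split the value into words and prefix each word
# with its dotted key.

def prefixed_tokens(pairs):
    tokens = []
    for key, value in pairs:
        for word in str(value).split():
            tokens.append(f"{key}:{word}")
    return tokens

doc0 = [
    ("ISBN13", "V9781933988175"),
    ("catalog.title", "Java in Action"),
    ("catalog.author", "John Doe"),
]
print(prefixed_tokens(doc0))
# ['ISBN13:V9781933988175', 'catalog.title:Java', 'catalog.title:in',
#  'catalog.title:Action', 'catalog.author:John', 'catalog.author:Doe']
```

A term query for `catalog.title:Java` then filters on the path and the word in one lookup, which is why this approach is attractive for the filtering use case described next.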
This idea is based on my observation of what I would most likely be using an arbitrary JSON payload for in an indexed document: filtering. Example: I want to search within K8s logs associated with a custom key-value annotation, so the arbitrary JSON payload would be the annotations metadata. In this case I am primarily interested in filtering, and an approach like this would probably work fine and perform well.
Thanks @lukas-vlcek for your comments on all three approaches!
I think I have to clarify two points in the three approaches, please correct me if you think there is anything wrong but AFAIK:
The `text` type and the `keyword` type: the "text" type is used to index values such as full text, and it is analyzed, meaning it is tokenized and transformed into a list of individual terms, known as tokens, before being indexed. This allows for full-text search capabilities, such as matching phrases, searching for individual words within the text, and more. However, the "text" type is not suitable for sorting, aggregating, and filtering because it has been transformed into individual terms.
The "keyword" type is used to index values such as numbers, dates, and strings, which are treated as exact values, or single terms. The "keyword" type is not analyzed, meaning it is not tokenized or transformed in any way. It is useful for sorting, aggregating, and filtering data, but it is not searchable by full-text search.
This is also the reason why approach 1 and approach 3 are a combination of two kinds of fields, a `keyword` field and a `text` field: they support different functionalities. For example, the text type is useful for full-text search, while the keyword type is helpful for synonym or stop-word filters and for aggregating.
Let's use the example from approach 3 above to explain:
| key | key_text | value | value_text |
|---|---|---|---|
| catalog.author.surname | [ catalog, author, surname ] | [ Smith, Green ] | [ Smith, Green ] |
| catalog.author.given | [ catalog, author, given ] | [ Peter ] | [ Peter ] |
| catalog.author.first_name | [ catalog, author, first_name ] | [ Billy ] | [ Billy ] |
| catalog.author | [ catalog, author ] | [ John Doe ] | [ John, Doe ] |
| catalog.title | [ catalog, title ] | [ Test in Action ] | [ Test, in, Action ] |
With value (in a keyword field) and value_text (in a text field), it would support:
With key (in a keyword field) and key_text (in a text field), it would support:
The other point I want to clarify is that:
Approach 1 takes the JSON object as a string and stores it in two string fields, one as a text field and another as a keyword field. The dot-path notation idea for content_and_path is very similar to your proximity-search idea, but the proximity approach repeats the key before each value token parsed into the smallest terms, like `catalog.title:Java`, `catalog.title:in`, `catalog.title:Action`, while approach 1 takes `catalog.title:Java in Action` as a whole. Approach 1 is very efficient in storing key-value pairs as a string, and the parsing in mapping helps with standardized formats. But the biggest con is that it can only support exact match, so there is a trade-off in supporting partial search and full-text search. With the proximity-search idea it can be a problem to query for `catalog.title = Java in Action`, and with approach 1 it can be a problem to search for `catalog.title = Java`.
So I am leaning more towards approaches 1 and 3, but I also want to hear more from you and the community.
I agree that approach 2 will be less favorable, because when there is an enormous number of subfields, it can be expensive to flatten out all the subfields in mappings. But it's NOT using the dynamic mapping approach, because all the subfields will be of text type, so it shouldn't have different types of metadata.
Just commenting that I'm changing this to a 2.7 release. We still have some thinking, design, security considerations, and coding to do on this and I think we should get this one right rather than rushing it. We aren't pausing on this at all. Of course, the label is just a label, so if there are any major objections or thoughts about how to release in 2.7 then we can change it back, I think in the next week or so.
Just a few thoughts
The approaches which model the flattened type as nested documents should be ruled out (it is good to mention those are being considered, but due to the way nested documents are implemented in OpenSearch, it is not viable).
The approach with proximity search looks too restrictive in what could be searched (simple terms are probably fine but phrases are not). Also, I think there are limits to that: say the JSON has a text field with 64kb of text inside, how would the proximity search deal with that?
The 1st approach is probably the optimal solution for how the flattened type could be implemented. One thought to keep in mind - this is the "lazy man" approach to store and search over data, so the presence of the limitations is OKish (the non-lazy man would properly design the data model).
@nknize @mikemccand maybe you could share your expertise? Thank you!
This bug is related: https://github.com/opensearch-project/OpenSearch/issues/3733
Based on the problem statement in the issue description, why are we talking about indexing anything (approach 1 above)? The way I read the requirements, it is not about search, only about post-processing of results for retrieving and/or aggregating values within a json blob.
If searching the text is needed, we could handle that with a JSON text analyzer, right?
I also wonder why we wouldn't want to support arrays and/or numeric types, but that could presumably be a later extension.
@msokolov in approach 1, it takes the entire JSON object as a string in one index, but there are fields that can be indexed.
Could you please explain more about what you mean by JSON text analyzer?
Regarding support for numeric types: the approaches above currently use StringField, so all fields are treated as strings. But it can be a good enhancement that we can further develop in the future.
@lukas-vlcek's approach 3 would have the chance to support in future other data-types (e.g. numbers, dates etc.) by introducing typed value fields. With approach 1 I don't see any straightforward way to support other data-types besides strings in future.
@josefschiefer27 for this feature, our goals are to allow for fields in arbitrary complex documents to be stored as keywords, but not indexed to avoid dynamic mapping explosions where potentially thousands of fields are indexed by default which impacts performance of the cluster manager as it continues to keep track of state. If you still want to index a subdocument for searching, I believe that can be done either using a separate explicit mapping or allowing dynamic mapping to do its thing.
@macohen - the introduction of typed value fields for approach 3 would introduce only a new field for each new data-type. In other words, instead of having a generic string value field for all values, why not having in addition a generic numeric field for all numeric values etc. There wouldn't be a mapping explosion since there are only additional fields per data-type. I wanted to mention this as an advantage for approach 3 which could enable future extensions.
@josefschiefer27 There are pros and cons in approach 1 and approach 3, and I have been thinking that they are pretty good but different implementations that cannot be combined.
You are right that approach 3's biggest pro is that it can support multiple types, for example numerics and dates. But the biggest con is that it creates sub-docs. If the JSON object is very complicated, it can go up to an exponential number of sub-docs in the worst case (if each leaf has n leaves, summing n^k for k from 0 to n gives on the order of n^n sub-docs). That would be a lot of sub-docs to consider in the worst case.
Approach 1 treats everything as a string (for example, a number as a string). For a very complicated JSON it can be a long string field to parse in mapping, but it doesn't create anything extra in the docs; it works the same way as uninterpreted keywords and sorts like strings. Nested fields cannot be used as dates or numbers. It's efficient in storing and mapping.
Thinking of what use case the flat-object should fit: it's the so-called lazy-man approach. We define a field as flat-object to store it as a string to avoid mapping explosion; it can be retrieved by exact match at the global field level and with dot path notation. If someone wants to use numeric data in a fairly simple JSON object, they can always go with dynamic mapping to give each subfield its specific field type.
Can you please give a sample use case for approach 3? We would like to see how it fits in different ways.
Approach 1 works well for most search operations (though e.g. range queries with numbers/dates get tricky), but it does fall short with most aggregations (e.g. aggregations on numeric fields).
In the current proposal for Approach 3, I agree there would be lots of sub-docs. However, do we really have to create nested sub-docs? Couldn't we just map the fields as proposed without nested docs? Query capabilities would still be better and more flexible than approach 1.
Maybe I am missing something - let me give an example. Let's say we have this JSON (note that I added the numeric fields 'reviews' and 'price'):
```
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "reviews": 1,
    "price": 11.5,
    "authors": [
      { "surname": "McCandless", "given": "Mike" },
      { "surname": "Hatcher", "given": "Erik" },
      ...
    ]
  }
}
```
If we want to flatten the field 'catalog' we could map it similar as suggested in approach 3 with typed values as follows:
```
[
  { "key": "catalog.title", "value_string": "Lucene in Action" },
  { "key": "catalog.authors.surname", "value_string": ["McCandless", "Hatcher"] },
  { "key": "catalog.authors.given", "value_string": ["Mike", "Erik"] },
  { "key": "catalog.reviews", "value_num": 1 },
  { "key": "catalog.price", "value_num": 11.5 }
]
```
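A Python sketch of this typed flattening (the function and field names follow the example above but are assumptions for illustration, not an actual OpenSearch API):

```python
# Hypothetical sketch of typed-value flattening: route each leaf into
# value_string or value_num by its type, merging array values that
# land under the same dotted key.
from collections import defaultdict

def flatten_typed(obj, prefix=""):
    leaves = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, f"{path}.{k}" if path else k)
        elif isinstance(node, list):
            for item in node:
                walk(item, path)  # arrays keep the parent path
        else:
            leaves[path].append(node)

    walk(obj, prefix)
    docs = []
    for key, vals in leaves.items():
        field = "value_num" if all(
            isinstance(v, (int, float)) for v in vals) else "value_string"
        docs.append({"key": key, field: vals if len(vals) > 1 else vals[0]})
    return docs

catalog = {
    "title": "Lucene in Action",
    "reviews": 1,
    "price": 11.5,
    "authors": [
        {"surname": "McCandless", "given": "Mike"},
        {"surname": "Hatcher", "given": "Erik"},
    ],
}
for d in flatten_typed(catalog, "catalog"):
    print(d)
```

Only a fixed, small set of Lucene fields ('key', 'value_string', 'value_num') is needed regardless of how many leaves the JSON has, which is the mapping-explosion advantage argued here.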
With this index structure for flattened fields we used 3 fields in total ('key', 'value_string', 'value_num') and writing search and aggregation queries are fairly simple.
Let's say we want to get the average price for all JSONs; we could write the query as follows:
```
{
  "query": {
    "term": {
      "key": { "value": "catalog.price" }
    }
  },
  "aggs": {
    "average_price": {
      "avg": { "field": "value_num" }
    }
  }
}
```
Note that queries and aggs do have to be rewritten to work with this index structure. However, most search and aggregation functions would work as intended. There would be fewer limitations than with approach 1.
@josefschiefer27 trying to understand what the `catalog` field data type would be – an object?
@reta - catalog is flattened. With my proposed index structure, it can be an object or an array; both would work as expected (same behavior as for other fields in OpenSearch). With 'as expected' I mean the same result as I would get without flattening.
@josefschiefer27 sorry, I should have been more precise: `catalog` is flattened, right. I am trying to understand what the underlying representation of this data structure is in terms of Apache Lucene supported types (so we could apply term queries, etc). OpenSearch does not support arrays natively but only `object` or `nested` types, which are mapped to Apache Lucene documents.
@reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene.
If we go with approach 1, we do something very similar to what Elasticsearch does today with the 'flattened' data type. And the number one limitation and complaint from users today is the lack of support for data types besides strings. In my opinion it's very challenging to get around that limitation. It would be nice if we don't put ourselves into the same corner.
Here are some Elasticsearch limitations for reference: https://github.com/elastic/elasticsearch/issues/61550 - there are many hearts on this issue ;-) See also https://github.com/elastic/elasticsearch/issues/43805 for possible limitations.
I tried to create a bigger example based on @lukas-vlcek's JSON examples from above to illustrate how the mapping would work. I added date and numeric fields as well as mixed field values to make it more interesting. Below are my learnings when going through the example...
```
// --- Document 0
// The "catalog.author" is a simple text field. One numeric field.
{
  "ISBN13": "V9781933988175",
  "catalog": {
    "title": "Java in Action",
    "author": "John Doe",
    "publication_score": 1023
  }
}
```
```
// --- Document 1
// The "catalog.author" is an object. New date field.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "price": 12.5,
    "publication_date": "2010-10-10T10:10:10",
    "author": {
      "surname": "McCandless",
      "given": "Mike",
      "publication_score": 1033
    }
  }
}
```
```
// --- Document 2
// The "catalog.author" is an array with objects.
// Each object can be either a simple value field or another object with variable "schema".
// The date is an invalid date value.
{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "publication_date": "none",
    "price": 14,
    "author": [
      "John Doe",
      { "surname": "Smith", "given": "Peter" },
      { "surname": "Smith2", "given": "Peter2" },
      { "surename": "Green", "first_name": "Billy" }
    ]
  }
}
```
These documents would be mapped (under the hood) into the following index structure.
```
// --- Mapped Document 0
// The "catalog.author" is a simple text field. One numeric field.
[
  { "key": "ISBN13", "value_string": "V9781933988175" },
  { "key": "catalog.title", "value_string": "Java in Action" },
  { "key": "catalog.author", "value_string": "John Doe" },
  { "key": "catalog.publication_score", "value_num": 1023 }
]
```
```
// --- Mapped Document 1
// The "catalog.author" is an object. New date field.
[
  { "key": "ISBN13", "value_string": "V9781933988177" },
  { "key": "catalog.title", "value_string": "Lucene in Action" },
  { "key": "catalog.price", "value_num": 12.5 },
  { "key": "catalog.publication_date", "value_date": "2010-10-10T10:10:10" },
  { "key": "catalog.author.surname", "value_string": "McCandless" },
  { "key": "catalog.author.given", "value_string": "Mike" },
  { "key": "catalog.author.publication_score", "value_num": 1033 }
]
```
```
// --- Mapped Document 2
// The "catalog.author" is an array with objects.
// Each object can be either a simple value field or another object with variable "schema".
// The date is an invalid date value.
[
  { "key": "ISBN13", "value_string": "V9781933988176" },
  { "key": "catalog.title", "value_string": "Test in Action" },
  { "key": "catalog.publication_date", "value_string": "none" },
  { "key": "catalog.price", "value_num": 14 },
  { "key": "catalog.author", "value_string": "John Doe" },
  { "key": "catalog.author.surname", "value_string": "Smith" },
  { "key": "catalog.author.given", "value_string": ["Peter", "Peter2"] },
  { "key": "catalog.author.surname", "value_string": ["Smith", "Smith2"] },
  { "key": "catalog.author.surename", "value_string": "Green" },
  { "key": "catalog.author.first_name", "value_string": "Billy" }
]
```
For this index mapping we used 4 Lucene fields ('key', 'value_string', 'value_num', 'value_date') to map all fields into Lucene. You can see that we can map also 'weird' json data which wouldn't be supported by OpenSearch without flattening.
Queries and aggregations using flattened fields need to be rewritten - any query clause and aggregation needs to use generic value fields and requires an additional filter for the key.
Let's try a query example. Let's assume we want to find all docs with the word 'Action' in catalog.title.
Without flattening the query would be:
```
{
  "query": {
    "wildcard": {
      "catalog.title": "*Action*"
    }
  }
}
```
To get the same result, we could try to rewrite this query as follows:
```
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "key": "catalog.title" } },
        { "wildcard": { "value_string": "*Action*" } }
      ]
    }
  }
}
```
However, there is a big problem with this query - since we don't use nested docs/queries, it wouldn't always deliver the correct result (e.g. if there is a 'catalog.title' field and *Action* matches in some other field, we would still get a hit). I could possibly use a scripted query to validate the match - however, this wouldn't be an elegant solution anymore... It might work as discussed above by using nested docs/queries, however that might lead to a 'nested-doc' explosion.
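The false-positive problem can be shown with a tiny Python illustration (assumed toy data, not real OpenSearch behavior): the key filter and the value match can each be satisfied by a *different* flattened entry of the same parent document.

```python
# Flattened entries of one parent document (illustrative data).
doc_entries = [
    {"key": "catalog.title", "value_string": "Java in Depth"},
    {"key": "catalog.subtitle", "value_string": "Action Guide"},
]

# A non-nested bool filter evaluates the two clauses independently:
has_key = any(e["key"] == "catalog.title" for e in doc_entries)
has_value = any("Action" in e["value_string"] for e in doc_entries)
print(has_key and has_value)  # True, although "Action" is not in catalog.title
```

A nested query would instead require a single entry to satisfy both clauses, which no entry here does; that is exactly what the non-nested rewrite loses.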
The example was a helpful exercise for me to understand better the problem. It would be nice if we could find some way to support data-types beyond just strings.
An idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define in the mapping a parameter to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs supports for selected fields of the flattened object.
> @reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene.

"There is no dedicated array field type in OpenSearch. Instead, you can pass an array of values into any field. All values in the array must have the same field type." - taken from the docs
@josefschiefer27 Approach 3 does create a lot of sub-docs, but not nested docs at multiple levels. To be clear, there will be a root level and level one - two levels in total. But level one might have n^n sub-docs in the worst case. Yes, it will support numeric operations; that is an important point for users, but it's not a minimum requirement addressed in this issue.
It seems that you have a clear idea of implementing approach 3 - would you like to raise a PR, or a draft PR, for approach 3?
> An idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define in the mapping a parameter to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs support for selected fields of the flattened object.
I thought about dynamically adding subfields to identify typed fields, but if we enable adding unlimited subfields, for example millions of date and number subfields, we risk a mapping explosion.
And there might be a workaround to help with the numeric subfields: if a user wants to use one raw field as flat-object to ingest the entire JSON as a string, and then finds a numeric subfield and a date subfield within the JSON object, the user can cherry-pick those subfields and add new fields to update the documents with numeric or date fields. In this example, it can be three fields:
```
{
  "raw field": {
    "type": "flat-object"
  },
  "date field": {
    "type": "Dates"
  },
  "number field": {
    "type": "numbers"
  }
}
```
It might need some work, but this can be a workaround to help with typed fields and avoid mapping explosion.
@mingshl - I think creating lots of sub-docs for flattened objects is sub-optimal and likely creates other problems. There might be flattened objects where the number of sub-docs becomes huge, and nested queries can be expensive. In my attempt at approach 3 I tried to avoid nested docs/queries.
Meanwhile, I do believe that approach 1 with smart string encoding is probably the most promising approach. In your description of approach 1 you are using two fields ('value' and 'content_and_path'). Wouldn't the 'content_and_path' field be sufficient? You mentioned as an example catalog = 'Mike' - not sure when this would be needed in an OpenSearch query.
Edit: Found the answer to my question - such query is currently supported by 'flattened' data type.
@lukas-vlcek Reaching out since this is marked as a part of the v2.7.0 roadmap. Please let me know if this isn't going to be a part of the release.
Hi @kotwanikunal, the flat-object is going into the v2.7.0 release. We are planning to merge this PR later today: https://github.com/opensearch-project/OpenSearch/pull/6507
Hi @dblock Is the issue ready to be closed, since #6507 is merged?
We can close this issue now. flat_object is going into 2.7 and future enhancement issues are here: https://github.com/opensearch-project/OpenSearch/issues/7138 https://github.com/opensearch-project/OpenSearch/issues/7137 https://github.com/opensearch-project/OpenSearch/issues/7136
(updated from https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805 below, @macrakis)
[Design Proposal] The flat data type in OpenSearch
Summary
JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
Motivation
Demand
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
Specification
Mapping and ingestion
Searching and retrieving
Example
This declares catalog as being of type flattened:
Consider the ingestion of the following document:
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
Performance
Limitations
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
Possible implementation
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
Security
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
Possible enhancements
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible: