@lukas-vlcek agreed. Here is the branch flat-object updated.
Thanks @lukas-vlcek for the initial PR, whose layout follows a typical plugin module. It's marked with @opensearch.experimental. @nknize @dblock what are your thoughts on how to release this feature: the mapper-extras module, sandbox, experimental, or long-term support (LTS)?
The implementation looks simple. The only reason to do experimental is if you think something may be removed in the future, or the API will change. So my preference is for LTS unless there's a good reason not to. @nknize WDYT?
Could we please get some 👀 from other maintainers on the PRs/code for this? Maybe @gbbafna or @Bukhtawar?
> The implementation looks simple. The only reason to do experimental is if you think something may be removed in the future, or the API will change. So my preference is for LTS unless there's a good reason not to. @nknize WDYT?
I was thinking that if we can release with a stable API for whatever version is coming when this is done then great. If we’re not comfortable with the API, but it’s good enough to test when we get to a release, go experimental then and LTS for a future release. Just my thoughts on how we can release frequently and get feedback. How does that sound?
To implement flat objects, we currently have three technical approaches. The sample index "catalog" and the two sample docs below are used to explain them:
```
curl -XPUT localhost:9200/test-index002 -H "Content-Type: application/json" --data '{
  "mappings": {
    "properties": {
      "ISBN13": {
        "type": "keyword"
      },
      "catalog": {
        "type": "flat-object"
      }
    }
  }
}'
```

```
curl -XPUT localhost:9200/test-index002/_doc/1 -H "Content-Type: application/json" -d '{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "author1": {
      "surname": "McCandless",
      "given": "Mike"
    }
  }
}'
```

```
curl -XPUT localhost:9200/test-index002/_doc/0 -H "Content-Type: application/json" -d '{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "author": {
      "surname": "Mock",
      "given": "Mike"
    }
  }
}'
```
Approach 1: Two String Fields
catalog.value:
```
doc 1: "{ "title" : "Lucene in Action", "author" : { "surname" : "McCandless", "given" : "Mike" } }",
doc 0: "{ "title" : "Test in Action", "author" : { "surname" : "Mock", "given" : "Mike" } }"
```
The "value" field is a text field, so it will be tokenized and transformed into a list of individual terms. The terms dictionary for the "value" field would be:
catalog.value:
```
doc 1: ["title","Lucene in Action", "author", "surname" ,"McCandless", "given", "Mike"],
doc 0: ["title","Test in Action","author","surname", "Mock","given", "Mike"]
```
The second string field, the "content_and_path" keyword field, would contain the nested fields as a string, with each field's path prepended:
catalog.content_and_path:
```
doc 1: "catalog.title:Lucene in Action, catalog.author.surname:McCandless, catalog.author.given:Mike"
doc 0: "catalog.title:Test in Action, catalog.author.surname:Mock, catalog.author.given:Mike"
```
The keyword field is treated as exact values. The terms dictionary for the "content_and_path" field would be:
catalog.content_and_path:
```
doc 1: ["catalog.title=Lucene in Action", "catalog.author.surname=McCandless", "catalog.author.given=Mike"]
doc 0: ["catalog.title=Test in Action", "catalog.author.surname=Mock", "catalog.author.given=Mike"]
```
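To illustrate Approach 1, here is a small Python sketch (helper names are assumptions for illustration, not the plugin's implementation) showing how the "value" leaf terms and the "content_and_path" terms above could be derived from a JSON object:

```python
# Hypothetical sketch of the Approach 1 flattening: walk a JSON object
# and collect the leaf values (for the "value" field) plus
# "path=value" terms (for the "content_and_path" field).

def flatten(obj, prefix="catalog"):
    """Yield (dotted_path, value) pairs for every leaf in a JSON-like dict."""
    for key, val in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(val, dict):
            yield from flatten(val, path)
        else:
            yield path, val

doc = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}

pairs = list(flatten(doc))
values = [v for _, v in pairs]
content_and_path = [f"{p}={v}" for p, v in pairs]
# values           -> ['Lucene in Action', 'McCandless', 'Mike']
# content_and_path -> ['catalog.title=Lucene in Action',
#                      'catalog.author.surname=McCandless',
#                      'catalog.author.given=Mike']
```

The key property is that both fields live on the root document, so no sub-documents are created.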
Approach 2: Two Fields Per Subfield
```
{ "catalog.title.value" : "Test in Action",
  "catalog.title.path_and_content" : "catalog.title=Test in Action" }
{ "catalog.author.surname.value" : "Mock",
  "catalog.author.surname.path_and_content" : "catalog.author.surname=Mock" }
{ "catalog.author.given.value" : "Mike",
  "catalog.author.given.path_and_content" : "catalog.author.given=Mike" }
```
The three term dictionaries for the "value" fields would be:
"catalog.title.value" contains the terms {"Test", "In", "Action"}
"catalog.author.surname.value" contains the terms {"Mock"}
"catalog.author.given.value" contains the terms {"Mike"}
The three term dictionaries for the "path_and_content" fields would be:
"catalog.title.path_and_content" contains the terms {"catalog.title", "Test", "In", "Action"}
"catalog.author.surname.path_and_content" contains the terms {"catalog.author.surname", "Mock"}
"catalog.author.given.path_and_content" contains the terms {"catalog.author.given", "Mike"}
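To make the per-subfield idea concrete, here is a rough Python sketch (the function and field names are illustrative assumptions, not the plugin's actual code) of how Approach 2 would generate a `<path>.value` and a `<path>.path_and_content` field for every leaf of the JSON object:

```python
# Hypothetical sketch of Approach 2: every leaf of the JSON object
# gets its own pair of index fields, named <path>.value and
# <path>.path_and_content.

def per_subfield_fields(obj, prefix="catalog"):
    """Walk a JSON-like dict and build one field pair per leaf."""
    fields = {}
    for key, val in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(val, dict):
            fields.update(per_subfield_fields(val, path))
        else:
            fields[f"{path}.value"] = val
            fields[f"{path}.path_and_content"] = f"{path}={val}"
    return fields

doc = {"title": "Test in Action",
       "author": {"surname": "Mock", "given": "Mike"}}
fields = per_subfield_fields(doc)
# fields["catalog.title.value"] == "Test in Action"
# fields["catalog.author.surname.path_and_content"] == "catalog.author.surname=Mock"
```

Note how the number of distinct index fields grows with the number of leaves, which is exactly the mapping-growth concern raised later in this thread.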
Approach 3: Nested Doc Per Subfield
```
"properties": {
  "key": {
    "type": "keyword"
  },
  "key_text": {
    "type": "text",
    "analyzer": "whitespace",
    "norms": false
  },
  "value": {
    "type": "keyword"
  },
  "value_text": {
    "type": "text",
    "analyzer": "whitespace",
    "norms": false
  }
}
```
With the sample example, the catalog field will create three sub-docs:
```
catalog: [
  { "key": "title", "key_text": "title", "value": "Lucene in Action", "value_text": "Lucene in Action" },
  { "key": "author.surname", "key_text": "author.surname", "value": "McCandless", "value_text": "McCandless" },
  { "key": "author.given", "key_text": "author.given", "value": "Mike", "value_text": "Mike" }
]
```
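A minimal Python sketch of this sub-doc generation (the function name and exact field layout are assumptions for illustration, not the plugin's code):

```python
# Hypothetical sketch of Approach 3: derive one nested sub-doc
# (key/key_text/value/value_text) per leaf of the flat-object field.

def to_subdocs(obj, prefix=""):
    """Return a list of sub-docs, one per leaf of a JSON-like dict."""
    docs = []
    for key, val in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            docs.extend(to_subdocs(val, path))
        else:
            docs.append({
                "key": path, "key_text": path,
                "value": val, "value_text": val,
            })
    return docs

catalog = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}
subdocs = to_subdocs(catalog)
# -> 3 sub-docs, with keys "title", "author.surname", "author.given"
```

Each leaf becomes its own nested Lucene document, which is what makes this approach flexible but also what creates the sub-doc volume discussed later in this thread.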
@Mingshl
Thanks a lot for putting this together. The following are my thoughts about it. In a nutshell, I think we should also explore options that do not rely on nested objects, and I am proposing some ideas below. Perhaps your approach 1 goes in the same direction as well. Let's see...
Let me start with the definition of index mapping:
```
{
  "mappings": {
    "properties": {
      "ISBN13": {
        "type": "keyword"
      },
      "catalog": {
        "type": "flat-object"
      }
    }
  }
}
```
What I think is very important (and I have been missing it in the discussion so far) is that the `catalog` field can accept ANY JSON object, including objects with variable/changing structure. This means we can think of the following three example documents as all representing a valid `catalog` object:
```
// --- Document 0
// The "catalog.author" is a simple text field.
{
  "ISBN13": "V9781933988175",
  "catalog": {
    "title": "Java in Action",
    "author": "John Doe"
  }
}

// --- Document 1
// The "catalog.author" is an object.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "author": {
      "surname": "McCandless",
      "given": "Mike"
    }
  }
}

// --- Document 2
// The "catalog.author" is an array with objects.
// And each object can be either a simple value field or another object with variable "schema".
{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "author": [
      "John Doe",
      { "surname": "Smith", "given": "Peter" },
      { "surename": "Green", "first_name": "Billy" }
    ]
  }
}
```
(A small notice: some examples used in the past were using `author` and `author1` fields, which in fact already means that we assume variable structure, but it seems to me that this naming difference was so subtle that it sometimes went unnoticed. So let's make it clearly obvious.)
Now, if we agree that this is the use case that we want to support then we can revisit suggested approaches.
If I understand correctly I think we can exclude Approach 2.
It is because individual fields (like `catalog.author`) can be either a simple value field (like a string) or an object, and this is impossible to model using today's index mapping. That would lead to mapping conflicts (depending on which document comes first for indexing).
Also I believe we do not want to introduce any new mappings for individual fields dynamically. This could lead to heavy metadata synchronisation across the cluster (imagine indexing a lot of documents with unique fields, the metadata will need to be synced all over the cluster nodes) and can also lead to mapping explosion.
I like Approach 3; let's explore how it could be used to store and query Document 2:
Firstly, I am not sure why we need to distinguish between `key` and `key_text`, and similarly between `value` and `value_text`. So for the sake of simplicity let's simplify this model for now into two fields only – `key` and `value`:
| key | value |
|---|---|
| catalog.author.surename | [ Smith, Green ] |
| catalog.author.given | [ Peter ] |
| catalog.author.first_name | [ Billy ] |
| catalog.author | [ John, Doe ] |
| catalog.title | [ Test, in, Action ] |
This means we get the following nested objects for indexing:
```
[
  { "key": "catalog.author.surename", "value": "Smith Green" },
  { "key": "catalog.author.given", "value": "Peter" },
  { "key": "catalog.author.first_name", "value": "Billy" },
  { "key": "catalog.author", "value": "John Doe" },
  { "key": "catalog.title", "value": "Test in Action" }
]
```
The con is that we lose the option to correctly search across child objects, meaning that we can search for `catalog.author.surename:Green AND catalog.author.given:Peter` and get a hit while there is no such combination for any single author. But I think it is an accepted fact that we are not able to model nested objects using an abstraction on top of nested objects (:smile:). Generally, I think it applies to all approaches that we will not be able to simulate nested objects.
Of course, internally, such query would need to be translated to a more complicated bool query of nested queries. That is something that would have to be carefully implemented.
Not sure if I got this query perfectly correct, but something like this...
```
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "catalog",
            "query": {
              "bool": {
                "must": [
                  { "term": { "catalog.key": "catalog.author.surename" } },
                  { "match": { "catalog.value": "Green" } }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "catalog",
            "query": {
              "bool": {
                "must": [
                  { "term": { "catalog.key": "catalog.author.given" } },
                  { "match": { "catalog.value": "Peter" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
```
As a bonus – I can imagine that it might be useful/possible to further specify text analyzers using some kind of `key` pattern matching. For example all `catalog.author.*name` fields could use a different analyzer (but that would probably require adding an extra `value_analyzed` field). But that is not important at this point.
I feel a bit lost on this one. I think the text formatting is not helping much. But on a high level it feels to me that it is going in a similar direction as my last proposal (see below).
I was thinking about approaches without nested documents. It is a known fact that nested documents are expensive. So can we do without them?
I think part of the answer is what specifically we want to use the `flat-object` data for during querying (I will get to this point later). The following are two different approaches that I would like to discuss:
Let's start with this question: What would a JSON data look like when transformed into a single line form? Basically it is a Map with key-value pairs. For each key I get a value. For example the Document 0 could be represented like this:
```
ISBN13: V9781933988175 catalog.title: Java in Action catalog.author: John Doe
```
To make it clearer and more readable, let's add formatting and arbitrary boundaries around each key-value pair:
```
| ISBN13: V9781933988175 | catalog.title: Java in Action | catalog.author: John Doe |
```
What I see here is a sequence of tokens organized into groups defined by some boundaries. The first token in each group is called a "key", followed by one or more "value" tokens.
From here the `flat-object` search could be seen as a search for specific value token(s) inside the group identified by its key.
(Hint: I can imagine that every key can be processed using a "synonym filter technique" to yield all possible path expansions, i.e. "foo.bar.clazz" can yield "foo" and "foo.bar" at the same token position; this is to enable more flexible search. But let's leave this discussion for later...)
While this is still a very vague definition, I found the following article from Mike very inspiring: https://blog.mikemccandless.com/2014/08/a-new-proximity-query-for-lucene-using.html
Lucene can actually search for tokens in a specific order having some kind of "closeness". So the proximity search task is actually a task of searching for a "key" token followed by the "value" token(s) until the border of the group is reached.
Of course the devil is in details. Both the Slop and TermAutomaton Queries seem to require explicit definition of the distance between tokens. Which may be challenging to fix or workaround (but maybe is doable). It might be worth investigating this approach a bit further.
In general I think such an approach would not perform worse than the performance of TermAutomaton queries today.
The second approach is based on very simple idea: Let's prefix each value token with its key.
Taking the Document 0 again we would get the following (for clarity each token on a new row):
```
ISBN13:V9781933988175
catalog.title:Java
catalog.title:in
catalog.title:Action
catalog.author:John
catalog.author:Doe
```
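The "prefix each value token with its key" idea can be sketched in a few lines of Python (a crude whitespace tokenizer standing in for a real analyzer; names are illustrative assumptions):

```python
# Hypothetical sketch of the key-prefixed token stream: for each
# key-value pair, split the value into words and prefix each word
# with its dotted key.

def prefixed_tokens(pairs):
    tokens = []
    for key, value in pairs:
        for word in str(value).split():
            tokens.append(f"{key}:{word}")
    return tokens

doc0 = [
    ("ISBN13", "V9781933988175"),
    ("catalog.title", "Java in Action"),
    ("catalog.author", "John Doe"),
]
print(prefixed_tokens(doc0))
# ['ISBN13:V9781933988175', 'catalog.title:Java', 'catalog.title:in',
#  'catalog.title:Action', 'catalog.author:John', 'catalog.author:Doe']
```

A term query for `catalog.title:Java` then filters on the path and the word in one lookup, which is why this approach is attractive for the filtering use case described next.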
This idea is based on my observation of what I would most likely be using an arbitrary JSON payload for in an indexed document: filtering. Example: I want to search within K8s logs associated with a custom key-value annotation, so the arbitrary JSON payload would be the annotations metadata. In this case I am primarily interested in filtering, and an approach like this would probably work fine and perform well.
Thanks @lukas-vlcek for your comments on all three approaches!
I think I have to clarify two points in the three approaches, please correct me if you think there is anything wrong but AFAIK:
The `text` type and the `keyword` type: the "text" type is used to index values such as full text, and it is analyzed, meaning it is tokenized and transformed into a list of individual terms, known as tokens, before being indexed. This allows for full-text search capabilities, such as matching phrases, searching for individual words within the text, and more. However, the "text" type is not suitable for sorting, aggregating, and filtering because it has been transformed into individual terms.
The "keyword" type is used to index values such as numbers, dates, and strings, which are treated as exact values, or single terms. The "keyword" type is not analyzed, meaning it is not tokenized or transformed in any way. It is useful for sorting, aggregating, and filtering data, but it is not searchable by full-text search.
This is also the reason why approach 1 and approach 3 are a combination of two kinds of fields, a `keyword` field and a `text` field: they support different functionalities. For example, the text type is useful for full-text search, while the keyword type is helpful for synonym or stop-word filters and for aggregating.
Let's use the example from approach 3 above to explain:
| key | key_text | value | value_text |
|---|---|---|---|
| catalog.author.surname | [ catalog, author, surname ] | [ Smith, Green ] | [ Smith, Green ] |
| catalog.author.given | [ catalog, author, given ] | [ Peter ] | [ Peter ] |
| catalog.author.first_name | [ catalog, author, first_name ] | [ Billy ] | [ Billy ] |
| catalog.author | [ catalog, author ] | [ John Doe ] | [ John, Doe ] |
| catalog.title | [ catalog, title ] | [ Test in Action ] | [ Test, in, Action ] |
With value (in a keyword field) and value_text (in a text field), it would support:
With key (in a keyword field) and key_text (in a text field), it would support:
The other point I want to clarify is that:
Approach 1 takes the JSON object as a string and stores it in two string fields, one as a text field and another as a keyword field. The dot-path notation idea for content_and_path is very similar to your proximity-search idea, but the proximity approach repeats the key before each value token parsed into the smallest terms, like `catalog.title:Java`, `catalog.title:in`, `catalog.title:Action`, while approach 1 takes `catalog.title:Java in Action` as a whole. Approach 1 is very efficient in storing key-value pairs as a string, and the parsing in mapping helps with standardized formats. But the biggest con is that it can only support exact match, so there is a trade-off in supporting partial search and full-text search. With the proximity-search idea it can be a problem to query for `catalog.title = Java in Action`, and with approach 1 it can be a problem to search for `catalog.title = Java`.
So I am leaning more towards approaches 1 and 3, but I also want to hear more from you and the community.
I agree that approach 2 will be less favorable, because when there is an enormous number of subfields, it can be expensive to flatten out all the subfields in mappings. But it's NOT using the dynamic mapping approach, because all the subfields will be of text type, so it shouldn't have different types of metadata.
Just commenting that I'm changing this to a 2.7 release. We still have some thinking, design, security considerations, and coding to do on this and I think we should get this one right rather than rushing it. We aren't pausing on this at all. Of course, the label is just a label, so if there are any major objections or thoughts about how to release in 2.7 then we can change it back, I think in the next week or so.
Just a few thoughts
The approaches which model the flattened type as nested documents should be ruled out (it is good to mention those are being considered, but due to the way nested documents are implemented in OpenSearch, it is not viable).
The approach with proximity search looks too restrictive in what could be searched (simple terms are probably fine but phrases are not). Also, I think there are limits to that: say the JSON has a text field with 64kb of text inside, how would the proximity search deal with that?
The 1st approach is probably the optimal solution for how the flattened type could be implemented. One thought to keep in mind - this is the "lazy man" approach to store and search over data, so the presence of the limitations is OKish (the non-lazy man would properly design the data model).
@nknize @mikemccand maybe you could share your expertise? Thank you!
This bug is related: https://github.com/opensearch-project/OpenSearch/issues/3733
Based on the problem statement in the issue description, why are we talking about indexing anything (approach 1 above)? The way I read the requirements, it is not about search, only about post-processing of results for retrieving and/or aggregating values within a json blob.
If searching the text is needed, we could handle that with a JSON text analyzer, right?
I also wonder why we wouldn't want to support arrays and/or numeric types, but that could presumably be a later extension.
@msokolov in approach 1, it takes the entire JSON object as a string in one index, but there are fields that can be indexed.
Could you please explain more about what you mean by JSON text analyzer?
Regarding support for numeric types: the approaches above currently use StringField, so all fields are treated as strings. But it can be a good enhancement that we can further develop in the future.
@lukas-vlcek's approach 3 would have the chance to support in future other data-types (e.g. numbers, dates etc.) by introducing typed value fields. With approach 1 I don't see any straightforward way to support other data-types besides strings in future.
@josefschiefer27 for this feature, our goals are to allow for fields in arbitrary complex documents to be stored as keywords, but not indexed to avoid dynamic mapping explosions where potentially thousands of fields are indexed by default which impacts performance of the cluster manager as it continues to keep track of state. If you still want to index a subdocument for searching, I believe that can be done either using a separate explicit mapping or allowing dynamic mapping to do its thing.
@macohen - the introduction of typed value fields for approach 3 would introduce only a new field for each new data-type. In other words, instead of having a generic string value field for all values, why not having in addition a generic numeric field for all numeric values etc. There wouldn't be a mapping explosion since there are only additional fields per data-type. I wanted to mention this as an advantage for approach 3 which could enable future extensions.
@josefschiefer27 There are pros and cons in approach 1 and approach 3, and I have been thinking that they are pretty good but different implementations that cannot be combined.
You are right that approach 3's biggest pro is that it can support multiple types, for example numerics and dates. But the biggest con is that it creates sub-docs. If the JSON object is very complicated, it can go up to an exponential number of sub-docs in the worst case (if each leaf has n leaves, summing n^k for k from 0 to n gives on the order of n^n sub-docs). That would be a lot of sub-docs to consider in the worst case.
Approach 1 treats everything as a string (for example, a number as a string). For a very complicated JSON it can be a long string field to parse in mapping, but it doesn't create anything extra in the docs; it works the same way as uninterpreted keywords and sorts like strings. Nested fields cannot be used as dates or numbers. It's efficient in storing and mapping.
Thinking of what use case the flat-object should fit: it's the so-called lazy-man approach. We define a field as flat-object to store it as a string to avoid mapping explosion; it can be retrieved by exact match at the global field level and with dot path notation. If someone wants to use numeric data in a fairly simple JSON object, they can always go with dynamic mapping to give each subfield its specific field type.
Can you please give a sample use case for approach 3? We would like to see how it fits in different ways.
Approach 1 works well for most search operations (though e.g. range queries with numbers/dates get tricky), but it does fall short with most aggregations (e.g. aggregations on numeric fields).
In the current proposal for Approach 3, I agree there would be lots of sub-docs. However, do we really have to create nested sub-docs? Couldn't we just map the fields as proposed without nested docs? Query capabilities would still be better and more flexible than approach 1.
Maybe I am missing something - let me give an example. Let's say we have this JSON (note that I added the numeric fields 'reviews' and 'price'):
```
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "reviews": 1,
    "price": 11.5,
    "authors": [
      { "surname": "McCandless", "given": "Mike" },
      { "surname": "Hatcher", "given": "Erik" },
      ...
    ]
  }
}
```
If we want to flatten the field 'catalog' we could map it similar as suggested in approach 3 with typed values as follows:
```
[
  { "key": "catalog.title", "value_string": "Lucene in Action" },
  { "key": "catalog.authors.surname", "value_string": ["McCandless", "Hatcher"] },
  { "key": "catalog.authors.given", "value_string": ["Mike", "Erik"] },
  { "key": "catalog.reviews", "value_num": 1 },
  { "key": "catalog.price", "value_num": 11.5 }
]
```
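A Python sketch of this typed flattening (the function and field names follow the example above but are assumptions for illustration, not an actual OpenSearch API):

```python
# Hypothetical sketch of typed-value flattening: route each leaf into
# value_string or value_num by its type, merging array values that
# land under the same dotted key.
from collections import defaultdict

def flatten_typed(obj, prefix=""):
    leaves = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, f"{path}.{k}" if path else k)
        elif isinstance(node, list):
            for item in node:
                walk(item, path)  # arrays keep the parent path
        else:
            leaves[path].append(node)

    walk(obj, prefix)
    docs = []
    for key, vals in leaves.items():
        field = "value_num" if all(
            isinstance(v, (int, float)) for v in vals) else "value_string"
        docs.append({"key": key, field: vals if len(vals) > 1 else vals[0]})
    return docs

catalog = {
    "title": "Lucene in Action",
    "reviews": 1,
    "price": 11.5,
    "authors": [
        {"surname": "McCandless", "given": "Mike"},
        {"surname": "Hatcher", "given": "Erik"},
    ],
}
for d in flatten_typed(catalog, "catalog"):
    print(d)
```

Only a fixed, small set of Lucene fields ('key', 'value_string', 'value_num') is needed regardless of how many leaves the JSON has, which is the mapping-explosion advantage argued here.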
With this index structure for flattened fields we used 3 fields in total ('key', 'value_string', 'value_num') and writing search and aggregation queries are fairly simple.
Let's say we want to get the average price for all JSONs; we could write the query as follows:
```
{
  "query": {
    "term": {
      "key": { "value": "catalog.price" }
    }
  },
  "aggs": {
    "average_price": {
      "avg": { "field": "value_num" }
    }
  }
}
```
Note that queries and aggs do have to be rewritten to work with this index structure. However, most search and aggregation functions would work as intended. There would be fewer limitations than with approach 1.
@josefschiefer27 trying to understand what the `catalog` field data type would be – an object?
@reta - catalog is flattened. With my proposed index structure, it can be an object or an array; both would work as expected (same behavior as for other fields in OpenSearch). With 'as expected' I mean the same result as I would get without flattening.
@josefschiefer27 sorry, I should have been more precise: `catalog` is flattened, right. I am trying to understand what the underlying representation of this data structure is in terms of Apache Lucene supported types (so we could apply term queries, etc). OpenSearch does not support arrays natively but only `object` or `nested` types, which are mapped to Apache Lucene documents.
@reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene.
If we go with approach 1, we do something very similar to what Elasticsearch does today with the 'flattened' data type. And the number one limitation and complaint from users today is the lack of support for data types besides strings. In my opinion it's very challenging to get around that limitation. It would be nice if we don't put ourselves into the same corner.
Here are some Elasticsearch limitations for reference: https://github.com/elastic/elasticsearch/issues/61550 - there are many hearts on this issue ;-) See also https://github.com/elastic/elasticsearch/issues/43805 for possible limitations.
I tried to create a bigger example based on @lukas-vlcek's JSON examples from above to illustrate how the mapping would work. I added date and numeric fields as well as mixed field values to make it more interesting. Below are my learnings when going through the example...
```
// --- Document 0
// The "catalog.author" is a simple text field. One numeric field.
{
  "ISBN13": "V9781933988175",
  "catalog": {
    "title": "Java in Action",
    "author": "John Doe",
    "publication_score": 1023
  }
}
```
```
// --- Document 1
// The "catalog.author" is an object. New date field.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "price": 12.5,
    "publication_date": "2010-10-10T10:10:10",
    "author": {
      "surname": "McCandless",
      "given": "Mike",
      "publication_score": 1033
    }
  }
}
```
```
// --- Document 2
// The "catalog.author" is an array with objects.
// Each object can be either a simple value field or another object with variable "schema".
// The date is an invalid date value.
{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "publication_date": "none",
    "price": 14,
    "author": [
      "John Doe",
      { "surname": "Smith", "given": "Peter" },
      { "surname": "Smith2", "given": "Peter2" },
      { "surename": "Green", "first_name": "Billy" }
    ]
  }
}
```
These documents would be mapped (under the hood) into the following index structure.
```
// --- Mapped Document 0
// The "catalog.author" is a simple text field. One numeric field.
[
  { "key": "ISBN13", "value_string": "V9781933988175" },
  { "key": "catalog.title", "value_string": "Java in Action" },
  { "key": "catalog.author", "value_string": "John Doe" },
  { "key": "catalog.publication_score", "value_num": 1023 }
]
```
```
// --- Mapped Document 1
// The "catalog.author" is an object. New date field.
[
  { "key": "ISBN13", "value_string": "V9781933988177" },
  { "key": "catalog.title", "value_string": "Lucene in Action" },
  { "key": "catalog.price", "value_num": 12.5 },
  { "key": "catalog.publication_date", "value_date": "2010-10-10T10:10:10" },
  { "key": "catalog.author.surname", "value_string": "McCandless" },
  { "key": "catalog.author.given", "value_string": "Mike" },
  { "key": "catalog.author.publication_score", "value_num": 1033 }
]
```
```
// --- Mapped Document 2
// The "catalog.author" is an array with objects.
// Each object can be either a simple value field or another object with variable "schema".
// The date is an invalid date value.
[
  { "key": "ISBN13", "value_string": "V9781933988176" },
  { "key": "catalog.title", "value_string": "Test in Action" },
  { "key": "catalog.publication_date", "value_string": "none" },
  { "key": "catalog.price", "value_num": 14 },
  { "key": "catalog.author", "value_string": "John Doe" },
  { "key": "catalog.author.surname", "value_string": "Smith" },
  { "key": "catalog.author.given", "value_string": ["Peter", "Peter2"] },
  { "key": "catalog.author.surname", "value_string": ["Smith", "Smith2"] },
  { "key": "catalog.author.surename", "value_string": "Green" },
  { "key": "catalog.author.first_name", "value_string": "Billy" }
]
```
For this index mapping we used 4 Lucene fields ('key', 'value_string', 'value_num', 'value_date') to map all fields into Lucene. You can see that we can map also 'weird' json data which wouldn't be supported by OpenSearch without flattening.
Queries and aggregations using flattened fields need to be rewritten - any query clause and aggregation needs to use generic value fields and requires an additional filter for the key.
Let's try a query example. Let's assume we want to find all docs with the word 'Action' in catalog.title.
Without flattening the query would be:
```
{
  "query": {
    "wildcard": {
      "catalog.title": "*Action*"
    }
  }
}
```
To get the same result, we could try to rewrite this query as follows:
```
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "key": "catalog.title" } },
        { "wildcard": { "value_string": "*Action*" } }
      ]
    }
  }
}
```
However, there is a big problem with this query - since we don't use nested docs/queries, it wouldn't always deliver the correct result (e.g. if there is a 'catalog.title' field and *Action* matches in some other field, we would still get a hit). I could possibly use a scripted query to validate the match - however, this wouldn't be an elegant solution anymore... It might work as discussed above by using nested docs/queries, however that might lead to a 'nested-doc' explosion.
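The false-positive problem can be shown with a tiny Python illustration (assumed toy data, not real OpenSearch behavior): the key filter and the value match can each be satisfied by a *different* flattened entry of the same parent document.

```python
# Flattened entries of one parent document (illustrative data).
doc_entries = [
    {"key": "catalog.title", "value_string": "Java in Depth"},
    {"key": "catalog.subtitle", "value_string": "Action Guide"},
]

# A non-nested bool filter evaluates the two clauses independently:
has_key = any(e["key"] == "catalog.title" for e in doc_entries)
has_value = any("Action" in e["value_string"] for e in doc_entries)
print(has_key and has_value)  # True, although "Action" is not in catalog.title
```

A nested query would instead require a single entry to satisfy both clauses, which no entry here does; that is exactly what the non-nested rewrite loses.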
The example was a helpful exercise for me to understand better the problem. It would be nice if we could find some way to support data-types beyond just strings.
An idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define in the mapping a parameter to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs supports for selected fields of the flattened object.
> @reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene.

"There is no dedicated array field type in OpenSearch. Instead, you can pass an array of values into any field. All values in the array must have the same field type." - taken from the docs
@josefschiefer27 Approach 3 does create a lot of sub-docs, but not nested docs at multiple levels. To be clear, there will be a root level and level one - two levels in total. But level one might have n^n sub-docs in the worst case. Yes, it will support numeric operations; that is an important point for users, but it's not a minimum requirement addressed in this issue.
It seems that you have a clear idea of implementing approach 3 - would you like to raise a PR, or a draft PR, for approach 3?
> An idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define in the mapping a parameter to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs support for selected fields of the flattened object.
I thought about dynamically adding subfields to identify typed fields, but if we enable adding unlimited subfields, for example millions of date and number subfields, we risk a mapping explosion.
And there might be a workaround to help with the numeric subfields: if a user wants to use one raw field as flat-object to ingest the entire JSON as a string, and then finds a numeric subfield and a date subfield within the JSON object, the user can cherry-pick those subfields and add new fields to update the documents with numeric or date fields. In this example, it can be three fields:
```
{
  "raw field": {
    "type": "flat-object"
  },
  "date field": {
    "type": "Dates"
  },
  "number field": {
    "type": "numbers"
  }
}
```
It might need some work, but this can be a workaround to help with typed fields and avoid mapping explosion.
@mingshl - I think creating lots of sub-docs for flattened objects is sub-optimal and likely creates other problems. There might be flattened objects where the number of sub-docs becomes huge, and nested queries can be expensive. In my attempt at approach 3 I tried to avoid nested docs/queries.
Meanwhile, I do believe that approach 1 with smart string encoding is probably the most promising approach. In your description of approach 1 you are using two fields ('value' and 'content_and_path'). Wouldn't the 'content_and_path' field be sufficient? You mentioned as an example catalog = 'Mike' - not sure when this would be needed in an OpenSearch query.
Edit: Found the answer to my question - such query is currently supported by 'flattened' data type.
@lukas-vlcek Reaching out since this is marked as a part of the v2.7.0 roadmap. Please let me know if this isn't going to be a part of the release.
Hi @kotwanikunal, the flat-object is going into the v2.7.0 release. We are planning to merge this PR later today: https://github.com/opensearch-project/OpenSearch/pull/6507
Hi @dblock Is the issue ready to be closed, since #6507 is merged?
We can close this issue now. flat_object is going into 2.7 and future enhancement issues are here: https://github.com/opensearch-project/OpenSearch/issues/7138 https://github.com/opensearch-project/OpenSearch/issues/7137 https://github.com/opensearch-project/OpenSearch/issues/7136
(updated from https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805 below, @macrakis)
[Design Proposal] The flat data type in OpenSearch
Summary
JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
Motivation
Demand
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
Specification
Mapping and ingestion
Searching and retrieving
Example
This declares catalog as being of type flattened:
Consider the ingestion of the following document:
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
Performance
Limitations
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
Possible implementation
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
Security
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
Possible enhancements
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible: