opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Add flattened field type #1018

Closed dblock closed 1 year ago

dblock commented 3 years ago

(updated from https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805 below, @macrakis)

[Design Proposal] The flat data type in OpenSearch

Summary

JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.

Flat subfields support exact match queries and textual sorting.

Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.

Motivation

Demand

OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)

OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)

Specification

Mapping and ingestion

Searching and retrieving

Example

This declares catalog as being of type flattened:

{ "mappings" :
  { "properties" :
     { "ISBN13"  : { "type" : "keyword" },
       "catalog" : { "type" : "flattened" } }
}}

Consider the ingestion of the following document:

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "author1" :
          { "surname" : "McCandless",
            "given"   : "Mike" }
}}

Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
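The dot path access described above can be pictured as a walk through the stored JSON object. The following is only an illustrative sketch in Python, not how OpenSearch resolves paths internally; the helper name `get_by_path` is made up for this example.

```python
from typing import Any


def get_by_path(doc: dict, path: str) -> Any:
    """Walk a nested dict along a dot-separated path.

    Returns None when any step of the path is missing.
    (Hypothetical helper, for illustration only.)
    """
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


doc = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "author1": {"surname": "McCandless", "given": "Mike"},
    },
}

print(get_by_path(doc, "catalog.author1.surname"))
print(get_by_path(doc, "catalog.editor"))
```

Because nothing below catalog is indexed, a lookup like this has to read the stored object for each candidate document, which is why the proposal describes subfields as accessible but not indexed for fast lookup.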

Performance

Limitations

Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.

Possible implementation

These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.

Security

Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.

Possible enhancements

The current specification is minimal. It intentionally does not include many options offered by other vendors.

Depending on the user feedback we receive after the initial release, various enhancements are possible:

mrkamel commented 2 years ago

I'd like to underline the need for an official, feature-complete, and performant alternative to the flattened datatype. It has already been requested many times on several sites: https://github.com/opendistro-for-elasticsearch/opendistro-build/issues/523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ...

elfisher commented 2 years ago

@dblock and @nknize is this in Lucene already? I couldn't tell from the Lucene docs. If it isn't should it be contributed there and then pulled into OpenSearch? Seems like it could add value in Lucene too.

chipzzz commented 2 years ago

@aparo this link is no longer there: https://github.com/aparo/opensearch-flattened-mapper-plugin :(

abhishek-v commented 2 years ago

Is there any plan to support this functionality in the near future?

dblock commented 2 years ago

Is there any plan to support this functionality in the near future?

I don't think anyone is working on it, cc: @anasalkouz?

reta commented 2 years ago

I was recently involved in a discussion on the subject [1]. It would be really beneficial to have the flattened type, but I believe [2] was the subject of IP / copyright claims (@aparo would be great to hear the reasons, many people are asking, thank you :pray:). To keep it short: we probably could add something similar to the flattened type but with a different name (and obviously a different implementation), but migrating existing Elasticsearch indices using snapshot / restore would be problematic (unless we internally supported type aliases etc.) since the type won't match.

[1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7
[2] https://github.com/aparo/opensearch-flattened-mapper-plugin

andreaAlkalay commented 2 years ago

Hi, do you know when we can expect to have the new flattened type implemented? It is crucial for our business scenario. Thanks, Andrea.

dblock commented 2 years ago

@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR

tristandostaler commented 2 years ago

+1!

macrakis commented 2 years ago

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

amalgamm commented 2 years ago

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

Let me jump in :) We have Kubernetes pods with built-in labels like app=foo, and we also have a bunch of services that use labels like app.kubernetes.io/managed-by: Helm. In the first case the field app is just a string. In the other case it is a nested object. When the log shipper sends such an entry to OpenSearch, OpenSearch throws back a mapper_parsing_exception and drops the document.

Flattened type solves such problems for fields that you don't want to use as nested
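To make the conflict concrete, here is a small Python sketch. It assumes that dotted label names are expanded into object paths on ingestion, as happens with dotted field names in a regular object mapping; the helpers `expand_dots` and `shape` are hypothetical and only illustrate the idea.

```python
def expand_dots(flat: dict) -> dict:
    """Expand dotted keys into nested dicts, mimicking how dotted
    field names are interpreted as object paths. (Illustrative helper.)"""
    out: dict = {}
    for key, value in flat.items():
        node = out
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out


def shape(value) -> str:
    """Classify a value the way a mapping would see it."""
    return "object" if isinstance(value, dict) else "scalar"


# Labels from two different pods, as a log shipper might send them.
pod_a = expand_dots({"app": "foo"})
pod_b = expand_dots({"app.kubernetes.io/managed-by": "Helm"})

print(pod_a, shape(pod_a["app"]))
print(pod_b, shape(pod_b["app"]))
```

The same field name, app, is a scalar in one document and an object in the other, which a single dynamic mapping cannot represent. If the whole labels object were mapped as one flattened field, neither shape would have to be declared up front.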

CEHENKLE commented 2 years ago

@anasalkouz heya Anas -- what's the latest on this?

CEHENKLE commented 2 years ago

(question is also for @macrakis) :)

macrakis commented 2 years ago

[Design Proposal] The flat data type in OpenSearch

Summary

JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.

Flat subfields support exact match queries and textual sorting.

Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.

Motivation

Demand

OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)

OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)

Specification

Mapping and ingestion

Searching and retrieving

Example

This declares catalog as being of type flattened:

{ "mappings" :
  { "properties" :
     { "ISBN13"  : { "type" : "keyword" },
       "catalog" : { "type" : "flattened" } }
}}

Consider the ingestion of the following document:

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "author1" :
          { "surname" : "McCandless",
            "given"   : "Mike" }
}}

Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.

Performance

Limitations

Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.

Possible implementation

These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.

Security

Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.

Possible enhancements

The current specification is minimal. It intentionally does not include many options offered by other vendors.

Depending on the user feedback we receive after the initial release, various enhancements are possible:

elfisher commented 2 years ago

@dblock I think we should move the design proposal @macrakis pasted here into the issue summary as it is a pretty comprehensive proposal for the feature. Do you have any issue with that? I can make the change.

dblock commented 2 years ago

No issues with that! Sounds great.

CEHENKLE commented 2 years ago

@dblock @macrakis @elfisher Done.

elfisher commented 2 years ago

Thanks @CEHENKLE!

dblock commented 2 years ago

I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)).

CEHENKLE commented 2 years ago

Good point, @dblock . Will do it that way going forward.

@aabukhalil Can you pick this up?

aabukhalil commented 2 years ago

@CEHENKLE yes I will be working on this

aabukhalil commented 2 years ago

Open questions:

Checklist of things to do:

reta commented 2 years ago

@aabukhalil I agree, going with flattened would be ideal and would significantly ease the migration, but it does indeed pose legal risks. We actually had a very similar discussion regarding the dense_vector field type [1]; maybe a sign-off from one of the PMs (@CEHENKLE ?) would help here.

[1] https://github.com/opensearch-project/OpenSearch/issues/3545#issuecomment-1164749810

aabukhalil commented 2 years ago

@reta yes I'm asking for help regarding legal implications.

aabukhalil commented 2 years ago

When designing how to store the new field type, we should take into consideration forward compatibility and future extensibility for this field, because depending on how we store the data, some features can be added easily by us and activated easily by customers. Otherwise, if we don’t account for future features, a full revamp of how the field is stored might be needed and we will lose compatibility between versions. What do you think? @dblock @nknize @macrakis @reta I need your opinion.

aabukhalil commented 2 years ago

@chipzzz thanks for your feedback. I'm sorry, but I didn't get what you mean by lag. Which lag? And what do you mean by event here? Can you please elaborate so we can help? It would also help if you could provide samples.

lukas-vlcek commented 2 years ago

Just my 2 cents regarding the naming,

given that the implementation will not guarantee exactly the same functionality as the flattened field provides in Elasticsearch, we should intentionally not try to use similar naming. Going with different naming will make it clear that the functionality can differ (and is also a clear signal that there are no legal concerns).

When migrating, I think it is better for the user to cope with the fact that the mapping field naming is not exactly the same than to learn later that, despite the naming being the same, the functionality is actually not.

dblock commented 2 years ago

@lukas-vlcek Do you have an alternate naming proposal? Let's only discuss technical (vs. legal) merits of our options?

macrakis commented 2 years ago

@mrkamel @abhishek-v @reta @andreaAlkalay @amalgamm @lukas-vicek The current Design Proposal is intentionally minimal. It covers the core functionality of the flat data type, namely ingesting nested objects as a single object and not indexing individual subfields. This has good performance characteristics and avoids mapping explosion. However, it does not implement the many options available on other systems. Notably it does not index the subfields. It also does not support snapshot restore from Elastic indexes. If I'm not mistaken, snapshots aren't even guaranteed compatible between different versions of Elastic.

What we'd like to know is whether that meets your needs. If not, which additional options are useful to you, and why?

The goal is to base feature development on your needs so as to keep the feature simple and performant.

mrkamel commented 2 years ago

@macrakis thanks for the notification. Not sure what not indexing individual subfields means performance wise for queries. Our use cases cover mostly i) querying leaf key/value pairs like with keyword fields, ii) aggregating leaf keys iii) dot retrieval and iv) textual sorting. Regarding querying we mostly use term/terms queries, but range/exists queries would be nice also. Querying without specifying a concrete leaf key is not important for us.

Regarding querying performance, it was stated that "Filtering by subfields is supported, but may be inefficient (full scan)" and "Finding a document with a specific value of a nested field (e.g., given = ‘Mike’) is not efficient: it may require a full scan of the index, an expensive operation."

Does all that mean we can expect much worse query performance compared to the elasticsearch flattened type for a query like:

POST bug_reports/_search
{
  "query": {
    "term": {"labels.release": "v1.3.0"}
  }
}

where labels is of type flattened and release is a leaf key? Comparable performance for those queries is very important to us.

elfisher commented 2 years ago

Hi everyone, I’ve discussed this topic previously with the AWS Legal team and got their approval to proceed. In the future if there are legal questions or concerns, please reach out to opensearch@amazon.com. As @dblock mentioned, let’s focus on the technical discussion here.

macrakis commented 2 years ago

@mrkamel That's very useful feedback, thanks.

aabukhalil commented 2 years ago

The solution we have been implementing so far will support all P0s and most of the P1s in here. The new field type is called "flat".

E.g. This is how to define field test_flat_object as flat field type.

{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
   "properties": {
        "test_flat_object": {
            "type": "flat"
        }
     }
  }
}

and this is a sample of flat field document

{
    "test_flat_object": {
        "level0fieldname": "level0fieldvalue",
        "level0intfieldname": 1230,
        "level0doublefieldname": 1230.123,
        "level0arrayname": [
            "level0arrayvalue0",
            "level0arrayvalue1",
            "level0arrayvalue2",
            "level0arrayvalue3"
        ],
        "level0nullvalue": null,
        "level0objectname": {
            "level1fieldname": "level1fieldvalue",
            "level1intfieldname": 1231,
            "level1doublefieldname": 1231.123,
            "level1arrayname": [
                "level1arrayvalue0",
                "level1arrayvalue1",
                "level1arrayvalue2",
                "level1arrayvalue3"
            ],
            "level1nullvalue": null,
            "level2objectname": {
                "level2fieldname": "level2fieldvalue",
                "level2arrayname": [
                    "level2arrayvalue0",
                    "level2arrayvalue1",
                    "level2arrayvalue2",
                    "level2arrayvalue3"
                ],
                "level2nullvalue": null,
                "level2fieldname1": "level2fieldvalue1",
                "level2fieldname2": "level2fieldvalue2"
            }
        }
    }
}

When indexing is enabled for the flat field type (index=true), exactly two internal Lucene StringFields are created, no matter how many nested fields the flat field has. All leaf values are treated as strings.

When doc values are enabled, two more internal Lucene SortedSetDocValuesFields are created. One contains only the concrete leaf values, without the paths to them, and corresponds to the test_flat_object terms dict. The other one corresponds to the test_flat_object.__content_and_path terms dict.

So no matter how many nested subfields the flat object has, at most 4 Lucene fields are created.
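As a rough picture of what ends up in those two term sets, the Python sketch below walks a flat object and collects (a) the bare leaf values and (b) path-plus-value strings. This is an assumption-laden illustration, not the actual mapper: the ": " separator is inferred from the rewritten term query shown later in this comment, and skipping nulls and joining nested keys with dots are guesses.

```python
import json
from typing import Any, Iterator, Set, Tuple

SEPARATOR = ": "  # assumption, inferred from the rewritten query example


def leaves(value: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, value-as-string) for every non-null leaf."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        # Array items share the path of the array itself.
        for item in value:
            yield from leaves(item, path)
    elif value is not None:
        # All leaf values are treated as strings.
        yield path, value if isinstance(value, str) else json.dumps(value)


def flat_terms(obj: dict) -> Tuple[Set[str], Set[str]]:
    """Return (leaf values only, path + value) term sets."""
    values: Set[str] = set()
    path_values: Set[str] = set()
    for path, text in leaves(obj):
        values.add(text)
        path_values.add(f"{path}{SEPARATOR}{text}")
    return values, path_values


sample = {
    "level0fieldname": "level0fieldvalue",
    "level0intfieldname": 1230,
    "level0arrayname": ["level0arrayvalue0", "level0arrayvalue1"],
    "level0nullvalue": None,
    "level0objectname": {"level1fieldname": "level1fieldvalue"},
}

values, path_values = flat_terms(sample)
print(sorted(values))
print(sorted(path_values))
```

However deep the object is, the output is always just two sets of strings, which is the point of the design: the number of Lucene fields does not grow with the number of subfields.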

All keyword-based queries (exists, fuzzy, prefix, range, regexp, term, terms, wildcard) are supported both on the root name (without specifying the path to a leaf) and when specifying the path to a leaf. E.g.

{
  "query": {
    "fuzzy": {
      "test_flat_object": {
        "value": "level9fieldvalue7"
      }
    }
  }
}

{
  "query": {
    "prefix": {
      "test_flat_object": {
        "value": "level0arrayvalue0"
      }
    }
  }
}

{
  "query": {
    "regexp": {
      "test_flat_object": {
        "value": "l.*e",
        "flags": "ALL",
        "case_insensitive": true,
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}

{
  "query": {
    "term": {
      "test_flat_object": {
        "value": 1230.123,
        "boost": 1.0
      }
    }
  }
}

{
    "query": {
        "exists": {
            "field": "test_flat_object.level0fieldname"
        }
    }
}

{
    "query": {
        "exists": {
            "field": "test_flat_object.level0objectname.level2objectname.only_existing_field_here_2"
        }
    }
}

{
    "query": {
        "fuzzy": {
            "test_flat_object.level0fieldname": {
                "value": "level0fieldvalue"
            }
        }
    }
}

{
    "query": {
        "term": {
            "test_flat_object.level0doublefieldname": {
                "value": 1230.123,
                "boost": 1.0
            }
        }
    }
}

Aggregation, sorting, and scripting, which depend on doc values, are also supported with or without specifying a leaf path. E.g.

{
    "query": {
        "ids" : {
        "values" : ["zvqYzIIBPsCiAUBXycbp"]
        }
    },
    "aggs": {
        "uniq_values": {
            "terms": {
                "field": "test_flat_object",
                "size": 100
            }
        }
    }
}

{
    "sort": [
        {
            "test_flat_object": {
                "order": "asc",
                "mode": "max"
            }
        }
    ]
}

{
     "sort": {
      "_script": {
         "script": "doc[\"test_flat_object\"][0]",
         "type": "string",   
         "order": "asc"
      }
   }
}

{
     "sort": {
      "_script": {
         "script": "doc[\"test_flat_object\"].value",
         "type": "string",   
         "order": "asc"
      }
   }
}

{
     "sort": {
      "_script": {
         "script": "params._source.test_flat_object.level0fieldname",
         "type": "string",   
         "order": "asc"
      }
   }
}

{
    "size": 1,
    "aggs": {
        "uniq_values": {
            "terms": {
                "field": "test_flat_object.level0fieldname",
                "size": 100
            }
        }
    }
}

This solution is implemented as a new field mapper based on DynamicKeyFieldMapper. When only the root field name (test_flat_object) is used in a query or agg, the concrete leaf values are used for searching and aggregation, just as with any normal field, because that field holds the plain values.

When a subfield path (test_flat_object.level0fieldname) is used in a query or agg, some transformation of the query might be needed: for example, an exists query is rewritten to a prefix query, and the path to the leaf (level0fieldname) is prepended to the value to match. E.g.

{
    "query": {
        "term": {
            "test_flat_object.level0doublefieldname": {
                "value": 1230.123
            }
        }
    }
}

will be rewritten to

{
    "query": {
        "term": {
            "test_flat_object._content_and_path": {
                "value": "level0doublefieldname: 1230.123"
            }
        }
    }
}

Accessing field data of the root field test_flat_object works out of the box as a multi-valued set, because the values were indexed without modification. Accessing field data of a subfield such as test_flat_object.level0doublefieldname requires additional filtering and value extraction, which will be done automatically for users.
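The query rewriting described above can be sketched in Python as a transformation on the query DSL. This is a hypothetical illustration, not the mapper's code: the function name `rewrite`, the handling of only term and exists queries, and the ": " separator are assumptions taken from the example in this comment, and only leaf paths are handled.

```python
CONTENT_AND_PATH = "_content_and_path"  # internal subfield name used in the example above
SEPARATOR = ": "                        # assumption, taken from the rewritten value above


def rewrite(query: dict, root: str) -> dict:
    """Rewrite term/exists queries on a subfield of `root` so that they
    target the combined path-and-value field. Other queries pass through."""
    (kind, body), = query.items()
    prefix = root + "."
    target = f"{root}.{CONTENT_AND_PATH}"

    if kind == "exists":
        field = body["field"]
        if not field.startswith(prefix):
            return query
        leaf_path = field[len(prefix):]
        # "Does this leaf exist?" becomes "is there a term starting with its path?"
        return {"prefix": {target: {"value": f"{leaf_path}{SEPARATOR}"}}}

    if kind == "term":
        (field, params), = body.items()
        if not field.startswith(prefix):
            return query
        leaf_path = field[len(prefix):]
        new_params = dict(params, value=f"{leaf_path}{SEPARATOR}{params['value']}")
        return {"term": {target: new_params}}

    return query


term_query = {"term": {"test_flat_object.level0doublefieldname": {"value": 1230.123}}}
exists_query = {"exists": {"field": "test_flat_object.level0fieldname"}}

print(rewrite(term_query, "test_flat_object"))
print(rewrite(exists_query, "test_flat_object"))
```

Queries on the root field itself are returned unchanged, matching the statement that the root field works like any normal field.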

anasalkouz commented 2 years ago

The other one, will contain the path to leaf value and the value of leaf itself so it could support efficient searching when query has path to leaf.

For better storage efficiency, can we consider hashing the path to the leaf?
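One way to picture this suggestion is to replace the path with a fixed-length digest before combining it with the value. The sketch below is only an illustration of the idea under assumed choices (BLAKE2b, 8-byte digest, ":" separator, the name `hashed_term`); nothing in this thread specifies them.

```python
import hashlib


def hashed_term(path: str, value: str, digest_bytes: int = 8) -> str:
    """Build a term whose path part is a fixed-length hex digest.

    Hypothetical scheme: the digest size and separator are arbitrary choices.
    """
    digest = hashlib.blake2b(path.encode("utf-8"), digest_size=digest_bytes).hexdigest()
    return f"{digest}:{value}"


long_path = "level0objectname.level2objectname.level2fieldname"
plain = f"{long_path}: level2fieldvalue"
hashed = hashed_term(long_path, "level2fieldvalue")

print(len(plain), len(hashed))
```

The trade-off is that deep paths shrink to a constant-size prefix, and an exists check can still be a prefix query on the digest, but hash collisions become possible and the original path can no longer be read back from the term.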

rcwsr commented 2 years ago

Sorry if I missed it but will OpenSearch's flattened data type support queries like Elasticsearch's? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html#supported-operations

getorca commented 1 year ago

curious, any eta on this? Wanting to store schema.org markup, so this would be a big help.

dblock commented 1 year ago

@aabukhalil are you still working on this?

ahopp commented 1 year ago

Hey @aabukhalil wanted to follow-up here as well. Any progress/is this still in-flight? Trying to understand where we're at.

aabukhalil commented 1 year ago

no I'm not working on this anymore

lukas-vlcek commented 1 year ago

Hi,

I noticed that @aabukhalil is no longer working on this. Would it be possible to get access to the code? Was it published anywhere? It would be great if this work can continue. I would be interested in looking at it.

Regards, Lukáš

anasalkouz commented 1 year ago

I think @macohen is going to follow up on this. Do you know if someone is going to pick this up soon?

macohen commented 1 year ago

Not very soon, unfortunately. Probably looking at early in Jan at the soonest. @lukas-vlcek, if you're interested in picking this up please go for it. There's no code for the feature yet of which I am aware.

lukas-vlcek commented 1 year ago

@macohen I am interested, feel free to assign me.

macohen commented 1 year ago

Done. Thanks @lukas-vlcek!

macohen commented 1 year ago

@lukas-vlcek How is this going? I may be able to see if someone from the team can work more closely with you on this or take it on in the next few weeks.

lukas-vlcek commented 1 year ago

@macohen I was finishing https://forum.opensearch.org/t/opensearch-mixin-1-0-0-rc-1-released/11717 but now I am back on this task. I am definitely open to collaboration here, that would be great.

The general plan on our side is to implement a high-level API first (mocking on low level) to see if we can come up with an API that is acceptable, understandable and well designed (in simple words it should answer the question: "Is this what we want to implement?"). Once we get a basic agreement about the API then we can start drilling down and filling missing pieces of the implementation.

I am currently working on the high-level API part (as a new core plugin, i.e. ./plugins/flat-object) and I would be more than happy to share what I have soon. If you then want to join (or take over), that would be really welcome. Let me know how this sounds to you.

macohen commented 1 year ago

@lukas-vlcek Sounds great. It probably won't be me personally taking over here, but either @mingshl or @noCharger may jump in. Can you say why this should be a new core plugin as opposed to an addition to server/src/main/java/org/opensearch/search? I believe that's where most of the query language definition lives, but I'm curious to understand the choice.

mingshl commented 1 year ago

@lukas-vlcek The feature branch flattened-field has been created; we will try collaborating on a feature branch. Please publish a draft PR against the feature branch. We are aligned on taking a top-down approach and will get into more technical details and planning after we go through the PR together.

mingshl commented 1 year ago

@lukas-vlcek If we use a feature branch in the opensearch production repo, every time you need to merge a PR, sync the feature branch with upstream, or rerun tests, you will need to ask one of the maintainers to initiate the action.

So to improve productivity, I am proposing using a fork repo until we are ready to merge to the opensearch\main repo. I just opened my fork of the opensearch production repo, created the branch flattened-field, and invited you as a collaborator; you should get an email notification about the invite.

I hope in this case you will gain the proper access to develop and manage the branch in a more convenient way. You can feel free to create a PR toward (https://github.com/mingshl/OpenSearch-Mingshl/tree/flattened-field) and merge freely. Please let me know if that works for you.

lukas-vlcek commented 1 year ago

@mingshl Thanks, I think the branch in your fork will work fine for me. It is a good start.

The only question/concern I have at this moment is the naming. Are you sure we want to go with flattened-field? One of the concerns is that right now there is no guarantee that the functionality will correspond perfectly to the identically named enterprise feature from the product preceding the OpenSearch fork. I think that the naming should not lead to inaccurate expectations on the user side.

I like flat-object more TBH.