Closed: dblock closed this issue 1 year ago
I'd like to underline the need for an official, feature-complete and performant alternative to the flattened datatype. It has already been requested many times on several sites: https://github.com/opendistro-for-elasticsearch/opendistro-build/issues/523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ...
@dblock and @nknize is this in Lucene already? I couldn't tell from the Lucene docs. If it isn't should it be contributed there and then pulled into OpenSearch? Seems like it could add value in Lucene too.
@aparo this link is no longer there: https://github.com/aparo/opensearch-flattened-mapper-plugin :(
Is there any plan to support this functionality in the near future?
I don't think anyone is working on it, cc: @anasalkouz?
I was involved in a recent discussion on the subject [1]. It would be really beneficial to have the flattened type, but I believe [2] was the subject of IP / copyright claims (@aparo it would be great to hear the reasons, many people are asking, thank you :pray:). To keep it short: we probably could add something similar to the flattened type but with a different name (and obviously a different implementation), but migrating existing Elasticsearch indices using snapshot / restore would be problematic (unless we internally support type aliases etc.) since the type won't match.
[1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7 [2] https://github.com/aparo/opensearch-flattened-mapper-plugin
Hi, Do you know when can we expect to have the new flattened type implemented? It is very crucial for our business scenario. Thanks, Andrea.
@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR
+1!
@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.
Let me jump in :)
We have a case where a Kubernetes pod has built-in labels like app=foo, and we also have a bunch of services which use labels like app.kubernetes.io/managed-by: Helm. In the first case the field app is just a string. In the other case it is a nested object. When the log shipper sends such an entry to OpenSearch, it throws back a mapper_parsing_exception and drops the documents.
Flattened type solves such problems for fields that you don't want to use as nested
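The label conflict described above can be reproduced outside OpenSearch. The following is a minimal Python sketch, not OpenSearch code: it assumes dotted keys are expanded into nested objects and that the first document seen fixes the shape of each field. The helper names `expand_dots`, `shape` and `index_labels` are hypothetical, and the error text only imitates the exception name mentioned in the comment.

```python
def expand_dots(labels):
    """Expand dotted keys into nested dicts, e.g. {"a.b": 1} -> {"a": {"b": 1}}."""
    out = {}
    for key, value in labels.items():
        node = out
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out


def shape(value):
    """Classify a value the way a dynamic mapping would: object or plain string."""
    return "object" if isinstance(value, dict) else "string"


def index_labels(mapping, labels):
    """Record each top-level label's shape; raise if it contradicts an earlier document."""
    for name, value in expand_dots(labels).items():
        seen = mapping.setdefault(name, shape(value))
        if seen != shape(value):
            # Simulated error message, not the real OpenSearch response body.
            raise ValueError(
                f"mapper_parsing_exception: [{name}] is mapped as {seen}, got {shape(value)}"
            )


mapping = {}
index_labels(mapping, {"app": "foo"})  # first pod: app is a string
try:
    index_labels(mapping, {"app.kubernetes.io/managed-by": "Helm"})  # app becomes an object
    conflict = None
except ValueError as err:
    conflict = str(err)
```

With a flat field the whole label object would be stored as one field, so the shape of `app` would no longer matter to the mapping.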
@anasalkouz heya Anas -- what's the latest on this?
(question is also for @macrakis) :)
JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
This declares catalog as being of type flattened:
{ "mappings":
{ "book" :
{ "properties" :
{ "ISBN13" : "keyword",
"catalog" : "flattened" }
}}}
Consider the ingestion of the following document:
{ "ISBN13" : "V9781933988177",
"catalog" :
{ "title" : "Lucene in Action",
"author1" :
{ "surname" : "McCandless",
"given" : "Mike" }
}}
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
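As an illustration of the dot path notation, here is a small Python sketch that resolves catalog.author1.surname against the example document. It only mimics the lookup on a plain dictionary; the function name `get_path` is hypothetical and this is not how OpenSearch resolves paths internally.

```python
def get_path(doc, path):
    """Follow a dot-separated path through nested dicts; return None if any step is missing."""
    node = doc
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


doc = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "author1": {"surname": "McCandless", "given": "Mike"},
    },
}

surname = get_path(doc, "catalog.author1.surname")
missing = get_path(doc, "catalog.author2.surname")
```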
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
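To make the numerical limitation concrete, this short Python sketch compares textual and numerical ordering of the same values. It assumes, as the proposal states, that leaf values are handled as text.

```python
values = [9, 10, 100, 25]

# Textual sorting compares character by character, so "10" sorts before "9".
as_text = sorted(str(v) for v in values)

# Numerical sorting is what a typed numeric field would give.
as_numbers = sorted(values)

# A textual range check gives a different answer than the numeric one.
textual_nine_gt_ten = "9" > "10"
numeric_nine_gt_ten = 9 > 10
```

A range filter over a flat subfield would behave like the textual comparison, which is why numerical ranges are listed as unsupported.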
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible:
@dblock I think we should move the design proposal @macrakis pasted here into the issue summary as it is a pretty comprehensive proposal for the feature. Do you have any issue with that? I can make the change.
No issues with that! Sounds great.
@dblock @macrakis @elfisher Done.
Thanks @CEHENKLE!
I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)).
Good point, @dblock . Will do it that way going forward.
@aabukhalil Can you pick this up?
@CEHENKLE yes I will be working on this
flattened or flat field type? We need to close on what name to use.
flattened as name? Not using a matching name will make migration harder. Should we introduce field type aliasing?
@aabukhalil I agree, going with flattened would be ideal and would significantly ease the migration, but it indeed poses legal risks. We actually had a very similar discussion regarding the dense_vector field type [1]; maybe a sign-off from one of the PMs (@CEHENKLE ?) would help here.
[1] https://github.com/opensearch-project/OpenSearch/issues/3545#issuecomment-1164749810
@reta yes I'm asking for help regarding legal implications.
When designing how to store the new field type, we should take into consideration forward compatibility and future extensibility for this field. Depending on how we store the data, some features can be added easily by us and activated easily by customers. Otherwise, if we don't account for future features, a full revamp of how the field is stored might be needed and we will lose compatibility between versions. What do you think? @dblock @nknize @macrakis @reta I need your opinion
@chipzzz thanks for your feedback. I'm sorry but I didn't get what do you mean by lag ? which lag ? and what do you mean by event here. Can you please elaborate more so we can help ? even if you can provide samples that would help
Just my 2 cents regarding the naming,
given that the implementation will not guarantee exactly the same functionality as the flattened field provides in Elasticsearch, we should intentionally not try to use similar naming. Going with different naming will make it clear that the functionality can differ (and also be a clear signal that there are no legal concerns).
When migrating, I think it is better for users to cope with the fact that the mapping field naming is not exactly the same than to learn later that, although the naming was the same, the functionality actually is not.
@lukas-vlcek Do you have an alternate naming proposal? Let's only discuss technical (vs. legal) merits of our options?
@mrkamel @abhishek-v @reta @andreaAlkalay @amalgamm @lukas-vicek The current Design Proposal is intentionally minimal. It covers the core functionality of the flat data type, namely ingesting nested objects as a single object and not indexing individual subfields. This has good performance characteristics and avoids mapping explosion. However, it does not implement the many options available on other systems. Notably it does not index the subfields. It also does not support snapshot restore from Elastic indexes. If I'm not mistaken, snapshots aren't even guaranteed compatible between different versions of Elastic.
What we'd like to know is whether that meets your needs. If not, which additional options are useful to you, and why?
The goal is to base feature development on your needs so as to keep the feature simple and performant.
@macrakis thanks for the notification. Not sure what not indexing individual subfields means performance wise for queries. Our use cases cover mostly i) querying leaf key/value pairs like with keyword fields, ii) aggregating leaf keys iii) dot retrieval and iv) textual sorting. Regarding querying we mostly use term/terms queries, but range/exists queries would be nice also. Querying without specifying a concrete leaf key is not important for us.
Regarding querying performance, it was stated that "Filtering by subfields is supported, but may be inefficient (full scan)" and "Finding a document with a specific value of a nested field (e.g., given = ‘Mike’) is not efficient: it may require a full scan of the index, an expensive operation."
Does all that mean we can expect much worse query performance compared to the elasticsearch flattened type for a query like:
POST bug_reports/_search
{
"query": {
"term": {"labels.release": "v1.3.0"}
}
}
where labels is of type flattened and release is a leaf key? Comparable performance for those queries is very important to us.
Hi everyone, I’ve discussed this topic previously with the AWS Legal team and got their approval to proceed. In the future if there are legal questions or concerns, please reach out to opensearch@amazon.com. As @dblock mentioned, let’s focus on the technical discussion here.
@mrkamel That's very useful feedback, thanks.
The solution we have been implementing so far will support all P0s and most of the P1s in here. The new field type is called "flat".
E.g. this is how to define the field test_flat_object as a flat field type:
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"test_flat_object": {
"type": "flat"
}
}
}
}
and this is a sample of flat field document
{
"test_flat_object": {
"level0fieldname": "level0fieldvalue",
"level0intfieldname": 1230,
"level0doublefieldname": 1230.123,
"level0arrayname": [
"level0arrayvalue0",
"level0arrayvalue1",
"level0arrayvalue2",
"level0arrayvalue3"
],
"level0nullvalue": null,
"level0objectname": {
"level1fieldname": "level1fieldvalue",
"level1intfieldname": 1231,
"level1doublefieldname": 1231.123,
"level1arrayname": [
"level1arrayvalue0",
"level1arrayvalue1",
"level1arrayvalue2",
"level1arrayvalue3"
],
"level1nullvalue": null,
"level2objectname": {
"level2fieldname": "level2fieldvalue",
"level2arrayname": [
"level2arrayvalue0",
"level2arrayvalue1",
"level2arrayvalue2",
"level2arrayvalue3"
],
"level2nullvalue": null,
"level2fieldname1": "level2fieldvalue1",
"level2fieldname2": "level2fieldvalue2"
}
}
}
}
When indexing is enabled for the flat field type (index=true), exactly two internal Lucene StringFields are created, no matter how many nested fields the flat field has. All leaf values are treated as strings.
field: terms dictionary
test_flat_object: ["level0fieldvalue", "1230", "1230.123", "level0arrayvalue0", "level0arrayvalue1", ...., "level1fieldvalue", "level1arrayvalue0", "level2fieldvalue1", ...]
field: terms dictionary
test_flat_object.__content_and_path: ["level0fieldname: level0fieldvalue", "level0intfieldname: 1230", "level0doublefieldname: 1230.123", "level0arrayname: level0arrayvalue0", "level0arrayname: level0arrayvalue1", ...., "level0objectname.level1fieldname: level1fieldvalue", "level0objectname.level1arrayname: level1arrayvalue0", "level0objectname.level2objectname.level2fieldname1: level2fieldvalue1", ...]
And when doc values are enabled, another two internal Lucene SortedSetDocValuesFields are created. One contains only the concrete leaf values, without the path to each value, and corresponds to the test_flat_object terms dict. The other one corresponds to the test_flat_object.__content_and_path terms dict.
So no matter how many nested subfields the flat object has, at most 4 Lucene fields are created.
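The two term dictionaries described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual mapper code. The "path: value" format follows the examples in this comment; the function names `leaf_pairs` and `flat_terms` are hypothetical, and skipping null leaves is an assumption, since the comment does not say how nulls are handled.

```python
def leaf_pairs(obj, prefix=""):
    """Yield (path, value_as_string) for every non-null leaf; array items share the array's path."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaf_pairs(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from leaf_pairs(item, prefix)
    elif obj is not None:  # assumption: null leaves produce no term
        yield prefix, str(obj)


def flat_terms(obj):
    """Return the two term sets: bare leaf values, and 'path: value' strings."""
    pairs = list(leaf_pairs(obj))
    values = sorted({value for _, value in pairs})
    values_with_path = sorted({f"{path}: {value}" for path, value in pairs})
    return values, values_with_path


sample = {
    "level0fieldname": "level0fieldvalue",
    "level0intfieldname": 1230,
    "level0doublefieldname": 1230.123,
    "level0arrayname": ["level0arrayvalue0", "level0arrayvalue1"],
    "level0nullvalue": None,
    "level0objectname": {"level1fieldname": "level1fieldvalue"},
}

values, values_with_path = flat_terms(sample)
```

However deep the object is nested, the output is always the same two collections, which is what keeps the number of Lucene fields fixed.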
All keyword-based queries (exists, fuzzy, prefix, range, regexp, term, terms, wildcard) are supported both on the root name (without specifying the path to a leaf) and when specifying the path to a leaf. E.g.
{
"query": {
"fuzzy": {
"test_flat_object": {
"value": "level9fieldvalue7"
}
}
}
}
{
"query": {
"prefix": {
"test_flat_object": {
"value": "level0arrayvalue0"
}
}
}
}
{
"query": {
"regexp": {
"test_flat_object": {
"value": "l.*e",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 10000,
"rewrite": "constant_score"
}
}
}
}
{
"query": {
"term": {
"test_flat_object": {
"value": 1230.123,
"boost": 1.0
}
}
}
}
{
"query": {
"exists": {
"field": "test_flat_object.level0fieldname"
}
}
}
{
"query": {
"exists": {
"field": "test_flat_object.level0objectname.level2objectname.only_existing_field_here_2"
}
}
}
{
"query": {
"fuzzy": {
"test_flat_object.level0fieldname": {
"value": "level0fieldvalue"
}
}
}
}
{
"query": {
"term": {
"test_flat_object.level0doublefieldname": {
"value": 1230.123,
"boost": 1.0
}
}
}
}
Aggregation, sorting and scripting, which depend on doc values, are also supported with or without specifying the leaf path. E.g.
{
"query": {
"ids" : {
"values" : ["zvqYzIIBPsCiAUBXycbp"]
}
},
"aggs": {
"uniq_values": {
"terms": {
"field": "test_flat_object",
"size": 100
}
}
}
}
{
"sort" : [
{
"test_flat_object" : {
"order" : "asc",
"mode": "max"
}
}
]
}
{
"sort": {
"_script": {
"script": "doc[\"test_flat_object\"][0]",
"type": "string",
"order": "asc"
}
}
}
{
"sort": {
"_script": {
"script": "doc[\"test_flat_object\"].value",
"type": "string",
"order": "asc"
}
}
}
{
"sort": {
"_script": {
"script": "params._source.test_flat_object.level0fieldname",
"type": "string",
"order": "asc"
}
}
}
{
"size": 1,
"aggs": {
"uniq_values": {
"terms": {
"field": "test_flat_object.level0fieldname",
"size": 100
}
}
}
}
This solution is implemented as a new field mapper extending DynamicKeyFieldMapper. When only the root field name (test_flat_object) is selected in a query or aggregation, the concrete leaf values are used for searching and aggregation, just as for any normal field, because the field holds the plain values.
When a subfield path is selected in a query or aggregation (test_flat_object.level0fieldname), some transformation of the query might be needed (an exists query will be rewritten to a prefix query, the passed path to the leaf (level0fieldname) will become a prefix of the passed value to match, ...)
E.g.
{
"query": {
"term": {
"test_flat_object.level0doublefieldname": {
"value": 1230.123
}
}
}
}
will be rewritten to
{
"query": {
"term": {
"test_flat_object.__content_and_path": {
"value": "level0doublefieldname: 1230.123"
}
}
}
}
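The rewrite shown above can be sketched in Python like this. It is a simplified illustration operating on the inner query clause (the part under "query"), not the actual mapper logic. The function name `rewrite` is hypothetical, only term and exists are handled, and the exists case assumes the path points at a leaf rather than at an intermediate object.

```python
CONTENT_AND_PATH = "__content_and_path"  # internal subfield name used in this comment


def rewrite(clause, flat_field):
    """Rewrite a term or exists clause on a subfield of `flat_field`; return others unchanged."""
    kind, body = next(iter(clause.items()))
    target = f"{flat_field}.{CONTENT_AND_PATH}"
    if kind == "term":
        field, spec = next(iter(body.items()))
        if field.startswith(flat_field + "."):
            path = field[len(flat_field) + 1:]
            value = spec["value"] if isinstance(spec, dict) else spec
            # The path to the leaf becomes a prefix of the value to match.
            return {"term": {target: {"value": f"{path}: {value}"}}}
    elif kind == "exists":
        field = body["field"]
        if field.startswith(flat_field + "."):
            path = field[len(flat_field) + 1:]
            # Any term starting with "path: " proves that the leaf exists.
            return {"prefix": {target: {"value": f"{path}: "}}}
    return clause


term_in = {"term": {"test_flat_object.level0doublefieldname": {"value": 1230.123}}}
term_out = rewrite(term_in, "test_flat_object")

exists_in = {"exists": {"field": "test_flat_object.level0fieldname"}}
exists_out = rewrite(exists_in, "test_flat_object")
```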
When accessing the field data of the root field test_flat_object, it will work out of the box as a multi-value set because the values were indexed without modification. But when the field data of a subfield of test_flat_object is accessed, like test_flat_object.level0doublefieldname, then additional filtering and value extraction are needed and will be done automatically for users.
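The filtering and value extraction for subfield field data could look roughly like the following Python sketch. It assumes the per-document values are stored as "path: value" strings, as in the terms dictionary shown earlier; the function name `subfield_values` is hypothetical.

```python
def subfield_values(path_and_value_terms, path):
    """Keep only the terms stored under `path` and strip the 'path: ' prefix from each."""
    prefix = f"{path}: "
    return [term[len(prefix):] for term in path_and_value_terms if term.startswith(prefix)]


# Hypothetical per-document values of the path-and-value doc values field.
doc_values = [
    "level0arrayname: level0arrayvalue0",
    "level0arrayname: level0arrayvalue1",
    "level0doublefieldname: 1230.123",
    "level0fieldname: level0fieldvalue",
]

array_values = subfield_values(doc_values, "level0arrayname")
double_values = subfield_values(doc_values, "level0doublefieldname")
```

The returned values are still strings, so sorting or aggregating on them stays textual.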
"The other one will contain the path to the leaf value and the value of the leaf itself, so it could support efficient searching when the query has a path to a leaf."
For more storage efficiency, can we consider hashing the path to the leaf?
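One way to picture the hashing idea is the Python sketch below. It is purely illustrative and not part of any proposed implementation: the choice of BLAKE2b, the 8-byte digest size, and the names `path_key` and `hashed_term` are all assumptions.

```python
import hashlib


def path_key(path):
    """Fixed-length stand-in for a leaf path (16 hex characters from an 8-byte BLAKE2b digest)."""
    return hashlib.blake2b(path.encode("utf-8"), digest_size=8).hexdigest()


def hashed_term(path, value):
    """Build the stored term from the hashed path instead of the full path."""
    return f"{path_key(path)}: {value}"


long_path = "level0objectname.level2objectname.level2fieldname1"
plain = f"{long_path}: level2fieldvalue1"
hashed = hashed_term(long_path, "level2fieldvalue1")
```

The term gets shorter for long paths, but there are trade-offs: distinct paths can collide on the same hash, and a prefix match on a partial path (for example, everything under an intermediate object) is no longer possible because the hash hides the path structure.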
Sorry if I missed it but will OpenSearch's flattened data type support queries like Elasticsearch's? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html#supported-operations
curious, any eta on this? Wanting to store schema.org markup, so this would be a big help.
@aabukhalil are you still working on this?
Hey @aabukhalil wanted to follow-up here as well. Any progress/is this still in-flight? Trying to understand where we're at.
no I'm not working on this anymore
Hi,
I noticed that @aabukhalil is no longer working on this. Would it be possible to get access to the code? Was it published anywhere? It would be great if this work can continue. I would be interested in looking at it.
Regards, Lukáš
I think @macohen is going to follow up on this. Do you know if someone is going to pick this soon?
Not very soon, unfortunately. Probably looking at early in Jan at the soonest. @lukas-vlcek, if you're interested in picking this up please go for it. There's no code for the feature yet of which I am aware.
@macohen I am interested, feel free to assign me.
Done. Thanks @lukas-vlcek!
@lukas-vlcek How is this going? I may be able to see if someone from the team can work more closely with you on this or take it on in the next few weeks.
@macohen I was finishing https://forum.opensearch.org/t/opensearch-mixin-1-0-0-rc-1-released/11717 but now I am back on this task. I am definitely open to collaboration here, that would be great.
The general plan on our side is to implement a high-level API first (mocking on low level) to see if we can come up with an API that is acceptable, understandable and well designed (in simple words it should answer the question: "Is this what we want to implement?"). Once we get a basic agreement about the API then we can start drilling down and filling missing pieces of the implementation.
I am currently working on the high-level API part (as a new core plugin, i.e. ./plugins/flat-object) and I would be more than happy to share what I have soon. If you then want to join (or take over), that would be really welcome. Let me know how this sounds to you.
@lukas-vlcek Sounds great. It probably won't be me personally taking over here, but either @mingshl or @noCharger may jump in. Can you say why this should be a new core plugin as opposed to an addition to server/src/main/java/org/opensearch/search? I believe that's where most of the query language definition lives, but I'm curious to understand the choice.
@lukas-vlcek The feature branch flattened-field has been created; we will try collaborating on a feature branch. Please publish a draft PR against the feature branch. We are aligned on a top-down approach and will get to more technical details and planning after we go through the PR together.
@lukas-vlcek If we use a feature branch in the OpenSearch production repo, every time you need to merge a PR, sync the feature branch with upstream, or rerun tests, you will need to ask one of the maintainers to initiate the action.
So to improve productivity, I am proposing using a fork until we are ready to merge to the OpenSearch main repo. I just opened my fork of the OpenSearch production repo, created the branch flattened-field and invited you as a collaborator; you should get an email notification about the invite.
I hope in this case you will gain the proper access to develop and manage the branch in a more convenient way. You can feel free to create a PR toward (https://github.com/mingshl/OpenSearch-Mingshl/tree/flattened-field) and merge freely. Please let me know if that works for you.
@mingshl Thanks, I think the branch in your fork will work fine for me. It is a good start.
The only question/concern I have at this moment is the naming. Are you sure we want to go with flattened-field? One of the concerns is that right now there is no guarantee that the functionality will perfectly correspond to the identically named enterprise feature from the product preceding the OpenSearch fork. I think that the naming should not lead to inaccurate expectations on the user side.
I like flat-object more TBH.
(updated from https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805 below, @macrakis)
[Design Proposal] The flat data type in OpenSearch
Summary
JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
Motivation
Demand
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
Specification
Mapping and ingestion
Searching and retrieving
Example
This declares catalog as being of type flattened:
Consider the ingestion of the following document:
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
Performance
Limitations
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
Possible implementation
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
Security
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
Possible enhancements
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible: