opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Add mapping information for single-/multi-valued fields #16420

Open msfroh opened 1 month ago

msfroh commented 1 month ago

Is your feature request related to a problem? Please describe

Fields in an OpenSearch index are all allowed to be multivalued. Any keyword field will also accept an array of keyword values. This all works, because under the hood, Lucene doesn't really make a distinction between adding a field once and adding it multiple times.

Unfortunately, for cases that try to project a fixed schema (like the SQL plugin or the proposed join support in core), it's useful to make a distinction between a field that represents a single keyword and one that represents an array of keywords. We could treat every field as an array, but a lot of fields would come out as arrays of length 1 (since, at least in my experience, the majority of fields are single-valued).

Describe the solution you'd like

It would be great if we could add a property in a mapping that conveys whether a field is single- or multi-valued. Unfortunately, from a backwards compatibility standpoint, we can't just add a new required property, since we would break all existing index mappings.

My suggestion is that we add an optional multivalued property for field mappings. Essentially, this property would have three possible values:

  1. true, meaning the field should be treated as an array,
  2. false, meaning that the field only has a single value -- a document with multiple values for the field will be rejected -- or
  3. null, meaning that we don't know. This means the field was dynamically added to the mapping or the field was specified in a mapping without a value for the multivalued property.

I would also suggest that if a document specifies multiple values for a field where multivalued is null, we should update the mapping to set multivalued to true. (Maybe we can't do that if dynamic mapping changes are disabled.)
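To make the three-state semantics concrete, here is a minimal Python sketch of how an indexing-time check might behave. It is not OpenSearch code: `apply_multivalued`, `MultiValueRejected`, and the `dynamic_updates_allowed` flag are hypothetical names, and treating a one-element array as a single value is an assumption the proposal leaves open.

```python
from typing import Any, Optional


class MultiValueRejected(ValueError):
    """Hypothetical error: a single-valued field received several values."""


def count_values(value: Any) -> int:
    """Number of values a document supplies for one field."""
    if isinstance(value, (list, tuple)):
        return len(value)
    return 1


def apply_multivalued(
    multivalued: Optional[bool],
    value: Any,
    dynamic_updates_allowed: bool = True,
) -> Optional[bool]:
    """Return the mapping's multivalued setting after seeing `value`.

    multivalued is True (array), False (single value), or None (unknown).
    """
    n = count_values(value)
    if multivalued is False and n > 1:
        # Case 2: a single-valued field rejects documents with multiple values.
        raise MultiValueRejected(f"expected one value, got {n}")
    if multivalued is None and n > 1 and dynamic_updates_allowed:
        # Case 3 plus the suggested upgrade: unknown becomes true once a
        # document proves the field can hold several values.
        return True
    # Otherwise the mapping is left unchanged.
    return multivalued
```

With dynamic mapping updates disabled, the sketch leaves an unknown field as `None` even after it sees multiple values, which matches the caveat above.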

Going forward, if we add this property in OpenSearch 2.x, maybe we can make it mandatory for new indices created in OpenSearch 3.0. (Of course, we would still need to support the OpenSearch 2.x null behavior, at least until OpenSearch 4.0 is released.) Starting in OpenSearch 3.0, we could dynamically infer the property from the first document containing a given field (which would require a bit of work, since we would need to distinguish between "fieldA":"foo" and "fieldA":["foo"], where the former would be single-valued and the latter would be multivalued).
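As a rough illustration of the inference step, the sketch below decides arity from the JSON shape of the first document. `infer_multivalued` is a hypothetical helper, it only looks at top-level fields, and it assumes the parsed source still distinguishes a scalar from a one-element array.

```python
import json
from typing import Any, Dict


def infer_multivalued(first_doc: Dict[str, Any]) -> Dict[str, bool]:
    """Infer the multivalued setting per field from the first document.

    A JSON array (even with one element) marks the field as multivalued;
    any other JSON value marks it as single-valued.
    """
    return {field: isinstance(value, list) for field, value in first_doc.items()}


# The two shapes from the issue text parse to different Python types.
single = infer_multivalued(json.loads('{"fieldA": "foo"}'))
array = infer_multivalued(json.loads('{"fieldA": ["foo"]}'))
```

Here `single` maps `fieldA` to `False` and `array` maps it to `True`, so the distinction survives only as long as the indexing path keeps the original JSON shape.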

Related component

Indexing

Describe alternatives you've considered

I was chatting with @anirudha today about the idea of making this a search-time problem instead: knowing the schema is only useful at search time, and indexing "just works" right now. Essentially, a search request could carry a hint that forces an interpretation for a field.

You could also make a best effort to guess whether a field has multiple values by inspecting a sample of documents (the first 500?). Since you may want the coordinator to get a response from each shard with the same interpretation, you could do a preliminary search phase (kind of like can_match) to ask each shard to vote on the arity of each field. If any shard says a field is multivalued, we would interpret it as multivalued.
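The voting idea could look roughly like the following Python sketch. Everything here is illustrative: `shard_vote` and `combine_votes` are hypothetical names, documents are plain dicts rather than Lucene documents, and the sample size of 500 is just the number floated above.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def shard_vote(
    docs: Iterable[Dict[str, Any]],
    fields: List[str],
    sample_size: int = 500,
) -> Dict[str, bool]:
    """One shard's guess: is each field multivalued in the sampled documents?"""
    votes = {field: False for field in fields}
    for doc in islice(docs, sample_size):
        for field in fields:
            value = doc.get(field)
            if isinstance(value, list) and len(value) > 1:
                votes[field] = True
    return votes


def combine_votes(shard_votes: Iterable[Dict[str, bool]]) -> Dict[str, bool]:
    """Coordinator step: a field is multivalued if any shard says so."""
    combined: Dict[str, bool] = {}
    for votes in shard_votes:
        for field, is_multi in votes.items():
            combined[field] = combined.get(field, False) or is_multi
    return combined
```

Because each shard only samples, the result is a best-effort guess: a multivalued document outside every shard's sample would still be interpreted as single-valued.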

Additional context

I'm categorizing this as "Indexing", but the property is mostly useful at search time. I think I'll add the "Search:Query Capabilities" label too.

RS146BIJAY commented 1 month ago

@msfroh We evaluated this feature request during the triage meeting, and it seems like a nice feature to add to OpenSearch. Looking forward to more discussion on this.

normanj-bitquill commented 4 weeks ago

This would be useful for the SQL plugin.