opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.51k stars 1.75k forks source link

Data projection with views #6181

Open peternied opened 1 year ago

peternied commented 1 year ago

Is your feature request related to a problem? Please describe.

Sometimes there are clear relationships between indices, e.g. http-logs-2023-01-20 http-logs-2023-01-21. As data gets reshaped or physically moved there is a desire to preserve how the data is referenced. OpenSearch Dashboards has a feature around this called index patterns that doesn't exist in the backend.

If there was a way to create a logical grouping of these physical storage mediums the responsibilities between data usage and ingestion could be separated. I think this would be a big win for lower maintenance of OpenSearch clusters over time.

Describe the solution you'd like

In SQL there are tables and views, views offer flexibility and centralized management, see great answers on this stack overflow question What is a good reason to use SQL views? Pulling from the great answer by user210748 I'd suggest this system does the following:

Describe alternatives you've considered

Aliases

OpenSearch already has aliases that represent a virtualized view, maybe they could be built up to offer these additional features. Alternatively, there are some quirks like the is_write_index that we might want to be careful around.

Data streams

Data streams are a virtualized view focused on managing the physical storage, maybe they could be built up to handle data projections filtering.

Additional context

Coming from the security plugin, there are features for document level security (DLS), field level security, and field masking. These features are built into index permissions and they are kind of clunky where a query to apply DLS has to be double-encoded in the json body. Views could easily encompass these scenarios. Modeling view creation and management as a separately from managing permissions to the views is a cleaner separation compared to what is available in the security plugin.

anasalkouz commented 1 year ago

@peternied Thanks for the proposal. Could you elaborate more on the use-case? I don't related the connection between rotated indexes and the materialized views. Can you elaborate more about the security feature use-cases?

nandi-github commented 1 year ago

@peternied Reading through your description, it does look like an access control /security use-case. You can help us understand the permission better

peternied commented 11 months ago

Attended the Search Relevance - Triage & Backlog Review Triage meeting today and had an opportunity to bring up this issue - thanks @macohen

I'll (@peternied) will continue to iterate on this issue as I get more information. Thanks!

msfroh commented 10 months ago

I really like this idea and I think it can combine with search pipelines to open some really exciting possibilities.

I'm imagining a scenario where:

  1. A view has a simple name so they're easy to resolve (just a lookup, no pattern matching) -- which would hopefully help with access-control resolution.
  2. A view cannot be used as the target of an index / bulk / update request. (I could see some value in letting a view be the "source" of a reindex request, as a way of creating a materialized view. Personally, I would delay that to a later release, though.)
  3. A view can front an arbitrary index pattern.
  4. A view can have an attached search pipeline, which is not overridable. We could reasonably replace DLS with a FilterQueryRequestProcessor. If you want to use a different search pipeline or no search pipeline, use a different view. (This would address @besha100's comment on https://github.com/opensearch-project/OpenSearch/issues/10938 about disallowing disabling of a search pipeline.)
peternied commented 9 months ago
peternied commented 9 months ago

Wanted to provide an update before the holidays arrived - I've got a functional POC in OpenSearch [1] and the Security Plugin [2] alongside a breakdown design and implement for an experimental release [3]. After returning from the break we will see about a demo and more feedback.

peternied commented 7 months ago
sequenceDiagram
    participant Client
    participant HTTP_Request as ActionHandler
    participant Cluster_Metadata as Cluster Metadata Store
    participant Data_Store as Indices

    Client->>HTTP_Request: View List/Get/Update/Create/Delete<BR>/views or /views/{view_id}
    HTTP_Request->>Cluster_Metadata: Query Views
    alt Update/Create/Delete
        Cluster_Metadata->>Cluster_Metadata: Refresh Cluster
    end
    Cluster_Metadata-->>HTTP_Request: Return
    HTTP_Request-->>Client: Return

    Client->>HTTP_Request: Search View<br>/views/{view_id}/search
    HTTP_Request->>Cluster_Metadata: Query Views
    Cluster_Metadata-->>HTTP_Request: Return
    HTTP_Request->>HTTP_Request: Rewrite Search Request
    HTTP_Request->>HTTP_Request: Validate Search Request
    HTTP_Request->>Data_Store: Search indices
    Data_Store-->>HTTP_Request: Return
    HTTP_Request-->>Client: Return
jainankitk commented 7 months ago

@peternied - Thank you for proposing this idea. While some of the aspects around access control / security make sense to me, I am unable to reason other benefits of views compared to alias/index-pattern for Opensearch. Does it make more sense to extend alias for this purpose instead of introducing another concept views as first class citizen of Opensearch. Can you help me understand how are we designing functionalities as suggested by user210748:

Views can join and simplify multiple indices into a single virtual index.

SQL literally joins multiple indices for querying related information across multiple indices using the single virtual index. I don't see any such correlation across indices or shards in opensearch. The only join supported by Opensearch is using parent/child relation which is limited to single shard, not even index.

Views can act as aggregated tables, where the database engine aggregates data (sum, average etc) and presents the calculated results as part of the data

The views in SQL are nothing but named query and schema-on-write makes easily translates the queries on views into bigger query on original datasets. I am unable to understand how are we planning to aggregate the information across potentially unrelated indices.

Views can hide the complexity of data; for example, a view could appear as Sales2000 or Sales2001, transparently partitioning the actual underlying indices Views take very little space to store; the database contains only the definition of a view, not a copy of all the data it presents Views can limit the degree of exposure of an index or indices to the outer world

We should be able to achieve this using alias!?

Depending on the SQL engine used, views can provide extra security

Can you expand more on this?

The above diagram gives some idea about the request/response flow for CRUD API, but I am really interested in how we are planning to compose the result together from potentially completely different indices, without tying them together to specific schema.

rishabhmaurya commented 7 months ago

If there was a way to create a logical grouping of these physical storage mediums the responsibilities between data usage and ingestion could be separated.

@peternied it would be nice if we can list down use cases/users looking for such logical separation across indexes (both sharing and not sharing common fields) over index patterns.

Coming from performance standpoint, I don't think we have done any changes in the indexing/search architecture (correct me if I'm wrong) for Views which justifies advertising it as a feature which lets - "logical grouping of these physical storage mediums the responsibilities between data usage and ingestion could be separated". It only increases chances of its abuse and creating performance issues in the cluster with the increasing scale.

The thing which I slightly like about index patterns over Views is that it adds a soft constraint on the index names which can be collectively searched for. This has an advantage as there is a high chance that these indexes are similar in terms of common fields they share (most of the time rolled over indices), and that's what I have seen majority of users use it for. It can be abused too if someone just do a * and search across all indexes but they were never advertised like creating logical grouping over physical storage mediums not sharing anything in common.

Coming to the security point of view, I agree we need some sort of entity on which permission management is easier across indices. So should we introduce it a security feature where a security resource can be specified when defining security. This resource can contain multiple indexes and permission can be managed on this entity? Now question is if we want to provide explicit search apis over these resources? - we need to check from users standpoint if its really needed? and it can always be introduced if in demand? Thanks for putting this together, let me know your thoughts?

peternied commented 7 months ago

@jainankitk & @rishabhmaurya thanks for your thoughts.

Performance

Coming from performance standpoint, I don't think we have done any changes in the indexing/search architecture (correct me if I'm wrong) for Views which justifies advertising it as a feature which lets - "logical grouping of these physical storage mediums the responsibilities between data usage and ingestion could be separated". It only increases chances of its abuse and creating performance issues in the cluster with the increasing scale.

There is great work in search performance - and views impacts reducing overhead for privilege evaluation. When a search request is created, such as GET indexes-*/_search the permissions evaluation system had to transform the wildcard pattern into a list of concrete indices, then validate that the user making the request has permission to those indices. This result cannot be trivially* be cached because 1) the wildcard expression resolves can change over time and 2) permissions of a user to an index can change.

When considering API design and the kinds of information that needs to be available to a permissions system, ensuring that all reasoning on a request is visible in the request itself amplifies that cacheability. By building an API which insures that for an identity the permission evaluation will always be the same it can save considerablity on cpu/memory overhead. These benefits are only viable with a new access model as alias and index patterns as designed work in opposition.

~Joins~

~In my experience OpenSearch has a add-fields-as-you-go data model, where getting a schema 'correct' isn't as needed as putting all the data in one place. This often runs in opposition to the small refined tables I'm used to seeing in sql design. SQL encourages use of design patterns and has many tools to make there use easier. Consider how often join is used in sql - very much the opposite in my experience with OpenSearch.~

~I'm going to say something strange - OpenSearch uses joins all the time.~

~OpenSearch uses joins even thought conceptually we don't think of them of them in the SQL architecture sense. In OpenSearch each index is a 'table', and when we make a query GET reports-2024-*/_search we are joining all of those 'tables' together in a projected output. Often these indexes' have mapping properties that are aligned, but they don't have to. When we use index patterns with a wildcard - we are implicitly performing a joins on these queries.~

~If we used an abstraction - a view - to handle these groupings cluster operators have the ability to change them without forcing the client to make a change. When there is a centralized place to control rolling forward or rolling-backward there are huge advantages for our largest clusters to better control query behavior.~

This section was not correct, these are not joins, but unions

jainankitk commented 7 months ago

There is great work in search performance - and views impacts reducing overhead for privilege evaluation. When a search request is created, such as GET indexes-*/_search the permissions evaluation system had to transform the wildcard pattern into a list of concrete indices, then validate that the user making the request has permission to those indices. This result cannot be trivially* be cached because 1) the wildcard expression resolves can change over time and 2) permissions of a user to an index can change.

This is where the alias fits perfectly. We can restrict the permissions to an alias for specific users/groups without worrying about the underlying indices the alias is getting mapped to.

When considering API design and the kinds of information that needs to be available to a permissions system, ensuring that all reasoning on a request is visible in the request itself amplifies that cacheability. By building an API which insures that for an identity the permission evaluation will always be the same it can save considerablity on cpu/memory overhead. These benefits are only viable with a new access model as alias and index patterns as designed work in opposition.

* [*] The union of this information _could_ be cached - but it hasn't been pursued deeply because on A indexes * (B identities * C role mappings * D roles) varies greatly in different cluster configurations making this cache could be considerable in size - and often see cache flush to add/removing an index.

Can you please expand this to help me understand the viability only with a new access model compared to alias or index patterns. Also even if there are minor limitations with using alias, we should be looking to augment that instead of introducing completely new concept called "views".

I'm going to say something strange - OpenSearch uses joins all the time.

OpenSearch uses joins even thought conceptually we don't think of them of them in the SQL architecture sense. In OpenSearch each index is a 'table', and when we make a query GET reports-2024-*/_search we are joining all of those 'tables' together in a projected output. Often these indexes' have mapping properties that are aligned, but they don't have to. When we use index patterns with a wildcard - we are implicitly performing a joins on these queries.

I believe you're confusing unions with joins, especially if you consider common log analytics use case of Opensearch. If I am looking for monthly aggregation of 4xx/5xx http status codes within log* index, it is nothing but unioning the results from different indices. Whereas SQL joins are used to run operations on related albeit very different data sets.

peternied commented 7 months ago

I believe you're confusing unions with joins

Yup - I did! I'll correct that in the previous comment

msfroh commented 7 months ago

This is where the alias fits perfectly. We can restrict the permissions to an alias for specific users/groups without worrying about the underlying indices the alias is getting mapped to.

@peternied -- This is a good point. Can we manage permissions on aliases?

I feel like there was some other reason why aliases are not a good fit, but I'm struggling to remember.

peternied commented 7 months ago

@msfroh maybe this sparks something; in the Security Plugin - aliases don't have permissions concepts around them. When you use an alias, or an index pattern foo-* the security plugin resolves these to the concrete indexes and then checks the user permissions on that concrete list.

So a user could run a query GET my-alias/_search could return 200 and then an admin changes the underlying pointer and it starts to return 403 - not because you don't have access to the alias, but because the alias -> index mapping changed.

In my mind, there is an existing conceptual model that are users are aware of, GET BLANK/_search it works in a very specific way and if we change that it will create confusion around security controls. If there is a new thing GET _views/BLANK/_search it creates that opportunity to change the permissions model and the implications of that model in a clear opt-in pattern. Query authors won't mistake a view for an alias, or index, or index pattern - even though they can perform many of the same operations on them.

I can see the argument that an opt-in model is not a feature, but a bug. There are other manageability issues and historical features that we might not want to support, but I think those concerns can be built up and mitigated.

jainankitk commented 7 months ago

So a user could run a query GET my-alias/_search could return 200 and then an admin changes the underlying pointer and it starts to return 403 - not because you don't have access to the alias, but because the alias -> index mapping changed.

IMO, that is the correct behavior. Even in SQL world, Permissions need to be granted to the person executing the query for every object referenced by the view. Except if the referenced object is owned by the view owner. In which case, the authorization decision is made using ownership chains. Should we introduce the concept of ownership to views and indices in OpenSearch?

reta commented 6 months ago

IMO, that is the correct behavior. Even in SQL world, Permissions need to be granted to the person executing the query for every object referenced by the view.

:+1, the views are "moving targets" and not designated users may gain unexpected permissions

Should we introduce the concept of ownership to views and indices in OpenSearch?

I think this may not be applicable to OpenSearch at large, it may change int the future but the identity is optional. And it still opens up the hole in a system since the view could be created with * pattern.

jainankitk commented 6 months ago

In my mind, there is an existing conceptual model that are users are aware of, GET BLANK/_search it works in a very specific way and if we change that it will create confusion around security controls.

I have been thinking of security as new feature for alias, instead of changing existing model.

If there is a new thing GET _views/BLANK/_search it creates that opportunity to change the permissions model and the implications of that model in a clear opt-in pattern.

I am wondering if there should be explicit _views qualifier while querying them. Probably, the end user should be agnostic of whether they are querying view or an index?

jainankitk commented 6 months ago

So a user could run a query GET my-alias/_search could return 200 and then an admin changes the underlying pointer and it starts to return 403 - not because you don't have access to the alias, but because the alias -> index mapping changed. +1, the views are "moving targets" and not designated users may gain unexpected permissions

Does that mean views can run into similar scenario as alias if the view -> index mapping changed?

reta commented 6 months ago

Does that mean views can run into similar scenario as alias if the view -> index mapping changed?

The alias model does not bundle permissions - the individual indices behind the alias are checked

msfroh commented 6 months ago

IIRC, the specific point of a view in the context of this issue is that the permissions are on the view.

If the view is updated to point to a different index (or indices), then, yes, the user would be able to query that different index (through the view).

reta commented 6 months ago

If the view is updated to point to a different index (or indices), then, yes, the user would be able to query that different index (through the view).

I believe view could use index patterns, right? If yes - no updates are needed

jainankitk commented 6 months ago

If the view is updated to point to a different index (or indices), then, yes, the user would be able to query that different index (through the view).

I am wondering, how are we enforcing permissions on the index in this case. Can this result in some escalation of privilege? User might have been explicitly denied for index A, but might have access to view pointing to index A. IAM resolves this by having pass role permission, I guess Ownership Chains does similar in SQL world. But I might be wrong.

peternied commented 6 months ago

@msfroh @reta @jainankitk This is really good discussion. I've created an RFC [1] to discuss the problem space - I think that will lead to better alignment before I jump into low level implementation details.

jainankitk commented 6 months ago

@msfroh @reta @jainankitk This is really good discussion. I've created an RFC [1] to discuss the problem space - I think that will lead to better alignment before I jump into low level implementation details.

* [1] [[RFC] Aligning Access and Visibility in OpenSearch security#4069](https://github.com/opensearch-project/security/issues/4069)

Thanks @peternied for getting this started. Can we also add below question to the above or separate issue?

_I am wondering if there should be explicit views qualifier while querying them. Probably, the end user should be agnostic of whether they are querying view or an index?

peternied commented 6 months ago

if there should be explicit views qualifier while querying them.

Until we are aligned on the problem being solved I don't think we can reason over this implementation detail, lets circle back around to this one. I think doing a broader API/name review will be required and this topic will come up during those discussions.