opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

Optimized Privilege Evaluation #3870

Open nibix opened 11 months ago

nibix commented 11 months ago

Background

Performance tests indicate that the OpenSearch security layer adds a noticeable overhead to the indexing throughput of an OpenSearch cluster (cf https://github.com/opensearch-project/OpenSearch/issues/2461, https://github.com/opensearch-project/security/issues/3104). The overhead may vary depending on the number of indices, the use of aliases, the number of roles and the size of the user object.

As the latency depends on the number of indices, fresh clusters with only a few indices perform well; however, as time goes on and the number of indices grows, users may experience a continuous performance degradation. Such creeping changes are especially undesirable, as they are hard to analyze and debug.

Analysis

High computational complexity of privilege evaluation

The privileges evaluation code uses a number of nested loops for determining if a user has privileges for an index (see for example https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/securityconf/ConfigModelV7.java#L1103 and https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/securityconf/ConfigModelV7.java#L488 ).

In human language, the privilege resolution algorithm is similar to this:
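As a purely illustrative sketch of that loop structure (all class, method, and variable names here are invented; this is not the actual ConfigModelV7 code), the shape is roughly:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the nested-loop privilege check described above.
public class NaiveEvaluationSketch {

    /** Returns true if any role resolves to a superset of the requested indices. */
    static boolean isAllowed(List<Set<String>> rolesIndexPatterns,
                             Collection<String> allClusterIndices,
                             Collection<String> requestedIndices) {
        for (Set<String> patterns : rolesIndexPatterns) {   // loop over the user's roles
            Set<String> resolved = new HashSet<>();
            for (String pattern : patterns) {               // loop over the role's index patterns
                for (String index : allClusterIndices) {    // resolve against ALL cluster indices
                    if (matches(pattern, index)) {
                        resolved.add(index);
                    }
                }
            }
            // compare the requested indices against the resolved indices;
            // for broad patterns on clusters with many indices, this amounts
            // to a cross product of two subsets of all existing indices
            if (resolved.containsAll(requestedIndices)) {
                return true;
            }
        }
        return false;
    }

    // simplified wildcard matching, supporting only "*" and trailing-* prefixes
    static boolean matches(String pattern, String index) {
        return pattern.equals("*")
            || pattern.equals(index)
            || (pattern.endsWith("*") && index.startsWith(pattern.substring(0, pattern.length() - 1)));
    }
}
```

The inner resolution loop visits every index on the cluster for every pattern of every role, which is where the quadratic behaviour comes from.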

Especially the last loop works on a cross product of two subsets of all existing indices. This is O(n²) in the number of indices. On a cluster with many indices, and if very broad index patterns are used, this can require hundreds of thousands of iterations.

A number of additional things can be observed here:

Blow up of thread context data for DLS/FLS privilege evaluation

The privilege evaluation code for DLS/FLS works independently from the normal privilege evaluation code.

The resolution algorithm looks similar to this (compare https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/securityconf/ConfigModelV7.java#L355 and https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/configuration/DlsFlsValveImpl.java#L151):

This yields a resolved/expanded view of the DLS/FLS config which maps concrete index names to rules from the roles

It is important to note that this view is not restricted to the indices requested by the operation. It covers all indices on the cluster.

This data structure is then put into the thread context and sent in serialized form with each further sub-request of the operation (https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/configuration/DlsFlsValveImpl.java#L445). On clusters with many indices, this will be a big data structure which will impact performance both due to increased serialization overhead and increased network data usage.

The actually desired indices are then selected from these data structures on the various shards targeted by the particular operation: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/configuration/DlsFlsValveImpl.java#L353

Proposals

Optimized data structures

In order to achieve a more efficient privilege evaluation, we should have a closer look at the questions we actually have to answer during privilege evaluation. We can then model data structures which efficiently allow answering these questions.

We have the following effective parameters during privilege evaluation:

An optimized data structure could be a chain of maps looking like this (pseudo-code):

Map<Action, Map<ConcreteIndex, Set<RoleName>>>

This maps action x index to the names of the roles which provide privileges for the particular combination of action and index.

This would be pre-computed based on the role configuration and the cluster state. The index information from the cluster state is used to pre-resolve the index patterns from the roles into concrete index names. Thus, this data structure must be re-computed whenever an index is deleted or created.
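A minimal sketch of how this pre-computation could look (all names are invented; real role configurations also carry action groups and DLS/FLS rules, which are omitted here):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: expand role patterns against the known actions and the
// indices from the cluster state, producing Map<Action, Map<ConcreteIndex, Set<RoleName>>>.
public class PrecomputeSketch {

    // hypothetical shape of a role definition: action patterns plus index patterns
    record RoleDef(Set<String> actionPatterns, Set<String> indexPatterns) {}

    static Map<String, Map<String, Set<String>>> precompute(
            Map<String, RoleDef> roles, Set<String> allActions, Set<String> allIndices) {
        Map<String, Map<String, Set<String>>> result = new HashMap<>();
        for (Map.Entry<String, RoleDef> e : roles.entrySet()) {
            Set<String> actions = expand(e.getValue().actionPatterns(), allActions);
            Set<String> indices = expand(e.getValue().indexPatterns(), allIndices);
            for (String action : actions) {
                for (String index : indices) {
                    result.computeIfAbsent(action, k -> new HashMap<>())
                          .computeIfAbsent(index, k -> new HashSet<>())
                          .add(e.getKey());     // role name grants this action on this index
                }
            }
        }
        return result;
    }

    // simplified pattern expansion, supporting only "*" and trailing-* prefixes
    static Set<String> expand(Set<String> patterns, Set<String> candidates) {
        Set<String> out = new HashSet<>();
        for (String p : patterns) {
            for (String c : candidates) {
                if (p.equals("*") || p.equals(c)
                        || (p.endsWith("*") && c.startsWith(p.substring(0, p.length() - 1)))) {
                    out.add(c);
                }
            }
        }
        return out;
    }
}
```

This work runs once per cluster-state or configuration change, not per request.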

The privilege evaluation algorithm would then look like this:
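As a hedged sketch (invented names), the lookup over the pre-computed structure could look like this:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the optimized lookup over
// Map<Action, Map<ConcreteIndex, Set<RoleName>>>.
public class OptimizedLookupSketch {

    static boolean isAllowed(Map<String, Map<String, Set<String>>> actionToIndexToRoles,
                             Set<String> userRoles, Set<String> actions, Set<String> indices) {
        for (String action : actions) {                       // usually 1, seldom 2
            Map<String, Set<String>> indexToRoles = actionToIndexToRoles.get(action);
            if (indexToRoles == null) {
                return false;                                 // no role grants this action at all
            }
            for (String index : indices) {                    // bounded by the requested indices
                Set<String> roles = indexToRoles.get(index);  // O(1) hash lookup per index
                if (roles == null || Collections.disjoint(roles, userRoles)) {
                    return false;                             // no role of the user covers this index
                }
            }
        }
        return true;
    }
}
```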

While loops are still necessary, they are clearly bounded by the requested objects and the roles of the user.

Special case for *

Additionally, it makes sense to have a further data structure which helps to cover the special but frequently recurring case of operations working on the index pattern *:

Given an action, which roles give privileges for the wildcard * index pattern: Map<Action, Set<RoleName>>
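As an illustrative sketch (invented names), this fast path could be checked before any concrete index names need to be resolved:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the wildcard fast path: if a role of the user grants
// the action on the "*" pattern, the request is allowed without touching the
// per-index structures.
public class WildcardFastPath {

    static boolean allowedViaWildcard(Map<String, Set<String>> actionToWildcardRoles,
                                      String action, Set<String> userRoles) {
        Set<String> roles = actionToWildcardRoles.getOrDefault(action, Set.of());
        return !Collections.disjoint(roles, userRoles);
    }
}
```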

Privilege evaluation on non-resolved index patterns

There are two cases in which the advance resolution will not work:

For these, role index patterns must still be resolved on the fly during the privilege evaluation phase.

Aliases

OpenSearch allows defining privileges on aliases. In that case, the member indices of the alias automatically inherit the privileges of the alias.

The data structures defined above can already support this by resolving aliases to concrete index names. However, this can lead to a potentially high number of indices which need to be checked for each request.

It would be more efficient to avoid resolving the alias to single index names by also keeping data structures which operate on the alias level:

Map<Action, Map<Alias, Set<RoleName>>>

Data Streams

Data streams work similarly to aliases, so similar considerations apply here.

Knowledge of available action names

In order to be able to also expand patterns on action names into concrete action names, knowledge of the names of all available actions is necessary.

Distributed DLS/FLS privilege evaluation

As described above, the DLS/FLS implementation currently includes the whole resolved DLS/FLS configuration in the thread context. This has a significant performance impact.

While this approach reduces the amount of privilege evaluation work that has to be done on the final shards, we can achieve the same by also keeping pre-computed data structures for DLS/FLS. This way, it would be completely unnecessary to put the DLS/FLS configuration into the thread context. The code working on the particular shards would be able to directly query the pre-computed data structures without additional overhead.

In the case of DLS/FLS, the action name is not taken into account. Thus, the data structures do not have to consider actions.

A suitable data structure for DLS could look like this (pseudo-code):

Map<ConcreteIndex, Map<RoleName, DlsQuery>> 

As the code is executed on the shards, there is always only one index name to be taken into account when privilege evaluation takes place. Thus, an evaluation algorithm looks like this:

Additionally, it is necessary to keep track of roles which do not define DLS queries and thus allow unrestricted document access:

Map<ConcreteIndex, Set<RoleName>>
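Combining the two maps above, a hedged sketch of the shard-level DLS evaluation could look like this (all names invented; DLS queries are represented as plain strings for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: given Map<ConcreteIndex, Map<RoleName, DlsQuery>> and
// Map<ConcreteIndex, Set<RoleName>> (roles without DLS restrictions), compute
// the effective DLS queries for one shard's index.
public class DlsShardSketch {

    /**
     * Returns the DLS queries to apply, or an empty Optional if one of the
     * user's roles grants unrestricted document access on this index.
     */
    static Optional<List<String>> effectiveDlsQueries(
            Map<String, Map<String, String>> indexToRoleToQuery,
            Map<String, Set<String>> indexToUnrestrictedRoles,
            String shardIndex, Set<String> userRoles) {
        // if any role of the user allows unrestricted access, no filter is needed
        Set<String> unrestricted = indexToUnrestrictedRoles.getOrDefault(shardIndex, Set.of());
        if (!Collections.disjoint(unrestricted, userRoles)) {
            return Optional.empty();
        }
        // loop through the roles of the user and aggregate their DLS queries
        Map<String, String> roleToQuery = indexToRoleToQuery.getOrDefault(shardIndex, Map.of());
        List<String> queries = new ArrayList<>();
        for (String role : userRoles) {
            String query = roleToQuery.get(role);
            if (query != null) {
                queries.add(query);
            }
        }
        return Optional.of(queries);
    }
}
```

An empty Optional signals unrestricted access; otherwise the returned queries would be combined into the shard-level document filter.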

Special case for *

Like for the normal privilege evaluation, it makes sense to have dedicated data structures for the special case of the * index pattern.

Privilege evaluation on non-resolved index patterns

The advance resolving of index patterns will not work for dynamic index patterns (containing ${...} variables).

For these, role index patterns must still be resolved on the fly during the privilege evaluation phase.

In contrast to the normal privilege evaluation, operations on non-existing indices do not need to be considered as DLS/FLS only operates on already existing indices.

Migration strategy

As the change of the implementation changes the thread context, which acts as an interface between cluster nodes, we have to take care of the case of a mixed cluster (i.e., a cluster with nodes on different OpenSearch versions). A strategy is necessary to ensure that nodes on different versions can safely interoperate.

Performance tests

Of course, the result needs to be validated. At the moment, there exists a basic performance test suite for OpenSearch. However, it focuses on core operations rather than on the different aspects of the security layer. Thus, further effort is needed to build tests that verify security performance from the needed perspective.

We want to build a dedicated performance test suite focused on the security aspects of performance. An RFC covering this topic exists at #3903.

Low-level configurability of privilege evaluation

Currently, there are a number of static configuration options that influence the behaviour of the privilege evaluation code:

Both can be re-implemented using the new data structures. However, supporting them will of course make the code more complicated. It might be worth checking whether some of these should be discontinued, as to a certain extent they seem to only guard legacy behaviour.

Request for comments

Please feel very welcome to share your thoughts on this concept. Are there missing considerations? Do you have suggestions for further improvements? Do you have questions?

Side-note: A related software project already uses a similar approach: https://git.floragunn.com/search-guard/search-guard-suite-enterprise/-/blob/master/security/src/main/java/com/floragunn/searchguard/authz/RoleBasedActionAuthorization.java (Apache 2 licensed code)

stephen-crawford commented 10 months ago

[Triage] Hi @nibix, thanks for the very detailed RFC description. It would be great to get some community input here, so I am going to mark this as help wanted etc. Thanks again!

nibix commented 10 months ago

@scrawfor99 To increase visibility, do you think it might make sense to tag this issue with the performance label?

stephen-crawford commented 10 months ago

Hey @nibix, sure let me add that :). You can also mention it on the public slack if you think that would help.

peternied commented 10 months ago

@nibix I like all of these ideas - I think we should explore them.

As you mention, there is a diverse set of cluster configurations that would work better with different implementations. To get around this, what do you think about adding interfaces around the current evaluation checks? By adding a mechanism to dynamically shift from the existing implementation to the proposed one, the trade-offs could be studied in depth, which should give us better insights before shifting.

expani commented 10 months ago

@nibix Thanks for opening this RFC. This will be very useful for performance improvements in the security plugin.

A few questions for you :

While loops are still necessary, these are clearly bounded by the requested objects and the roles a user has.

Can we pre-compute this data asynchronously as well, based on changes in user roles and requested objects? Similar to how it's already being done for indices.

Map<ConcreteIndex, Set<RoleName>>

Map<Action, Map<Alias, Set<RoleName>>>

We could allow the data structures to be hidden from the code using them for evaluation of an action by a user. This could easily allow a custom extension to change the data to use a custom L1 + L2 Cache implementation.

Are there already separate efforts in this direction not covered by this RFC?

Loop through roles of user

Aggregate DLS queries

Are the DLS queries aggregated for every evaluation or pre-computed like other data ?

Some suggestions on the design:

On clusters with many indices, this will be a big data structure which will impact performance both due to increased serialization overhead and increased network data usage. Same for other data structures used for privilege evaluation with DLS/FLS.

We can provide some knobs to control the size of the data structures, allowing a granular memory footprint estimation when using the security plugin as an advanced user in a custom installation.

Loop through requested actions (usually 1, seldomly 2)

Roles of a user: Set<RoleName>; a varying but relatively low number. There are edge cases where users have hundreds of roles, though.

We can use these dimensions to perform a memory estimation and decide on a scale factor from memory footprint to cached users.

Diagnostic APIs like node/stats to detect these would be very helpful.

PS: I am still going through the RFC and will follow up with some more thoughts around this initiative. Thanks for taking the time to write up the detailed explanation.

nibix commented 10 months ago

@peternied

As you mention, there is a diverse set of cluster configurations that would work better with different implementations. To get around this, what do you think about adding interfaces around the current evaluation checks? By adding a mechanism to dynamically shift from the existing implementation to the proposed one, the trade-offs could be studied in depth, which should give us better insights before shifting.

Yes, that's a good idea. For the basic privilege evaluation, that will be straightforward. For the DLS/FLS improvements, though, it will be a bit more complicated, as we have to touch several locations. That should still be doable, though.

nibix commented 10 months ago

@anijain-Amazon

Thanks for your comments!

Can we pre-compute this data asynchronously as well, based on changes in user roles and requested objects? Similar to how it's already being done for indices.

Precomputing is a bit more difficult here, as we have quite a big space of possible combinations (number of all users multiplied by number of indices). But one could consider caching the evaluation results and re-using them later. However, I would propose to first check how the performance is without the cache. I would hope that it will already be so fast that the additional gains would not be worth the hassle of a cache.

We could allow the data structures to be hidden from the code using them for evaluation of an action by a user. This could easily allow a custom extension to change the data to use a custom L1 + L2 Cache implementation.

I guess this should be doable with the suggestion from @peternied to introduce interfaces to decouple the places which need privilege evaluation from the actual implementation.

Are there are already separate efforts towards this direction not covered under this RFC ?

I am not aware of any.

Are the DLS queries aggregated for every evaluation or pre-computed like other data ?

It is not really feasible to pre-compute user-specific information, as there might be many users. Additionally, depending on the authc mechanism, users might actually not be known up-front. One could also think of a caching solution here.

On clusters with many indices, this will be a big data structure which will impact performance both due to increased serialization overhead and increased network data usage.

We can provide some knobs to control the size of data structures for a granular memory foot print estimation when using security plugin as an advanced user in a custom installation.

Just to avoid confusion: my sentence above refers to the current state. The goal is to make this data structure unnecessary.

Still, we can think about introducing such APIs for the newly introduced data structures.

expani commented 10 months ago

I agree we should not pre-optimise before running the preliminary performance tests.

Will there be an eviction logic for all the data structures mentioned in the document ?

For instance, Map<Action, Map<ConcreteIndex, Set<RoleName>>> will cause a significant memory overhead if the number of concrete indices is large.

An end user would want to control the overall cached indices and the size of these data structures. We can provide cache configurations to make this tuneable and APIs to allow users to inspect the current size.

Also, what are your thoughts on skipping cache evictions for certain heavily used actions and concrete indices, controlled via configs?
This would help make automated workflows like ingestion more optimized.

nibix commented 10 months ago

Will there be an eviction logic for all the data structures mentioned in the document ?

The data structures mentioned in the document so far are not intended to be caches. They are pre-computed and updated when the cluster state changes or the security configuration changes. Also, they are independent of users.

For instance, Map<Action, Map<ConcreteIndex, Set<RoleName>>> will cause a significant overhead on memory if number of concrete indices are more.

Indeed, these structures will contain quite a few objects. On the other hand, the leaf objects themselves will be quite small. Also, many objects can be re-used, because there will be many identical Set<RoleName> sets around, which strong normalization can de-duplicate. In the end, it is a de-normalization of the security/index config.

But yes, we should do some tests/math to estimate the expected heap usage, also in extreme cases. My hope would be that it won't be bigger than the cluster state, which is already required now.

If this turns out to be too large, we can still think about turning these data structures into caches.

cwperks commented 10 months ago

@nibix Thank you for the detailed RFC. I agree with a lot of the comments left here on memory, and I am wondering if there are optimizations that can be made to the data structure to lower the memory usage.

One common example could be a daily rollover of logs. Example:

logs-2024-01-26
logs-2024-01-25
logs-2024-01-24
logs-2024-01-23
logs-2024-01-22
...

Many roles may have access to logs-* which would give access to each concrete index corresponding to a day's worth of logs. They may be the same set of roles as well.

In the data structure of this RFC (Map<Action, Map<ConcreteIndex, Set<RoleName>>>), there would be a lot of repetition of the same set of roles for each concrete index corresponding to each day of logs. On the data structure side, would it be necessary to store an entry for each concrete index if many indices match a pattern and permit the same action to the same set of roles?

There may be a further opportunity to optimize privilege evaluation too, by avoiding the need to iterate over all concrete indices matching logs-* if all of them permit the same set of roles for the same action.

I agree we should not pre-optimise before running the preliminary performance tests.

+1. I've got the famous quote popularized by Donald Knuth stuck in my head as I type this reply: "Premature optimization is the root of all evil".

nibix commented 10 months ago

On the data structure side, would it be necessary to store an entry for each concrete index if there are many indices that match a pattern and permit the same action to the same set of roles?

The goal of the hash map with the concrete indices is to keep complexity as low as possible. If you want to have O(1) complexity, you need to use concrete indices in a hash map. Everything else would have at least O(log n) complexity, and O(n) for non-trivial (prefix) patterns.

In the end, we save CPU resources by investing in heap resources. ATM, I am working on some example figures, will update soon.

nibix commented 10 months ago

So, here are some test results for heap usage of the Map<Action, Map<ConcreteIndex, Set<RoleName>>> data structure.

I used this test program:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.openjdk.jol.info.GraphLayout;

public class Test {

    final static Set<String> ACTIONS = sampleStrings("action", 50);

    final static Set<String> INDICES = sampleStrings("index", 1000);

    final static Set<String> ROLES = sampleStrings("role", 100);

    public static void main(String ...args) {
        Map<String, Map<String, Set<String>>> object = new HashMap<>();

        for (String action : ACTIONS) {
            for (String index : INDICES) {
                // one computeIfAbsent chain per action/index pair; addAll inserts all roles at once
                object.computeIfAbsent(action, (k) -> new HashMap<>())
                    .computeIfAbsent(index, (k) -> new HashSet<>())
                    .addAll(ROLES);
            }
        }

        GraphLayout g = GraphLayout.parseInstance(object);

        System.out.println(g.toFootprint());

        System.out.println("TOTAL " + (g.totalSize() / 1024 / 1024) + " MB");
    }

    static Set<String> sampleStrings(String prefix, int number) {
        HashSet<String> result = new HashSet<>();

        for (int i = 0; i < number; i++) {
            result.add(prefix + "_" + i);
        }

        return result;
    }    
}

Note: The test code creates the same number of roles for each index. In reality, it is likely that there will be a different, lower number of roles for each index.

Additionally, as the java.util collection classes are not the most efficient in terms of heap space, I tested with an alternative collection library, in this case com.carrotsearch.hppc. However, this only serves as an example; other libraries are also conceivable.

I assumed 50 actions; this roughly matches the number of index actions provided by OpenSearch.

Number of indices | java.util, 100 roles | hppc, 100 roles | java.util, 1000 roles | hppc, 1000 roles
100               | 20 MB                | 5 MB            | 192 MB                | 39 MB
1000              | 207 MB               | 53 MB           | 1922 MB               | 394 MB
10000             | 2071 MB              | 529 MB          | OOM                   | 3947 MB

So, judging from the numbers, I guess that we do indeed need to limit the size of the object. Even though reasonable cases show reasonable heap usage figures, we cannot really exclude the case of very big numbers of roles or indices. Using a pre-warmed LRU cache might then be a good choice. This will add some further overhead, though, as we will need more thread synchronization.

DarshitChanpura commented 10 months ago

Thank you @nibix for this in-depth RFC. I like the approach of pre-computing the actions + index + role names. This would be easy for a fresh cluster; however, how do we plan to approach this for existing clusters? I know you mentioned backwards compatibility and mixed clusters, and I believe that a working PoC with bwc-tests would raise our confidence in this approach. This RFC also seems like a great candidate for an iterative approach in terms of implementation: concrete indices first, then supporting wildcards, then aliases, followed by other smaller changes.

My question here is: since these data structures sit in memory, would a cluster restart require a re-computation, or are we planning to save state before restarts? How do we plan to address wildcard expansions? Would there be a case where a user has access to all indices except 1-2 and a * pattern is supplied?

Using a pre-warmed LRU cache might then be a good choice. This will add some further overhead, though, as we will need more thread synchronization.

I like the idea of using an LRU cache, but would computing it add extra overhead to each operation in order to maintain a fresh state of the cache?

Privilege evaluation on non-resolved index patterns

I'm not sure how often this scenario presents itself, but if it is quite often, then we would essentially be in the same state as today, computing this evaluation on the fly. Can the new pre-computed data structures be leveraged in any way?

Do you think that creating a meta index to store and update details of these data structures would address some concerns with the new data structures?

nibix commented 10 months ago

@DarshitChanpura

This would be easy for a fresh cluster, however, how do we plan to approach this for existing clusters.

For the basic (non-DLS/FLS) privilege evaluation, the logic is confined to the single node which processes the current action. If we do not change the semantics, both the old and new implementation can co-exist and interoperate in a mixed cluster.

For the proposed DLS evaluation, one would indeed need some more complex strategy. For this, BwC tests are certainly indicated.

My question here is since these data-structures sit in-memory, would a cluster-restart require a re-computation or we planning to save state before restarts?

Re-computing the data will be pretty quick, thus there is no real point in persisting these data. This also avoids the hassle of making sure that the data is up-to-date.

How do we plan to address wildcard expansions? Would there be a case where a user has access to all indices except 1-2 and a * pattern is supplied?

Can you elaborate a bit more on that, please?

Privilege evaluation on non-resolved index patterns
I'm not sure how often this scenario presents itself, but if it is quite often, then we would essentially be in the same state as today, computing this evaluation on the fly. Can the new pre-computed data structures be leveraged in any way?

This generally only occurs for operations on the meta-level, especially for creating a new index. As the index does not exist beforehand, it also cannot exist in the pre-computed data. Such operations are executed comparatively seldom, so I think it won't be an issue to handle these with more conventional algorithms.

Do you think that creating a meta index to store and update details of these data structures address some concerns with the new data-structures?

At the moment, I would rather just keep this information in the heap. Having an index would of course alleviate the size concerns. However, this would also annihilate any performance advantage.

DarshitChanpura commented 10 months ago

Thank you @nibix for the quick response!

For the basic (non-DLS/FLS) privilege evaluation, the logic is confined to the single node which processes the current action. If we do not change the semantics, both the old and new implementation can co-exist and interoperate in a mixed cluster. For the proposed DLS evaluation, one would indeed need some more complex strategy. For this, BwC tests are certainly indicated.

Agreed.

Re-computing the data will be pretty quick, thus there is no real point in persisting these data. This also avoids the hassle of making sure that the data is up-to-date.

There won't be any performance issues as they sit in memory and there will be no I/O retrieval/storage involved. The updates will be similar to ConfigUpdateAction and the listener we have today. Correct?

Can you elaborate a bit more on that, please?

Sure thing. Let's say there are 5 indices in the cluster: aa, ab, ac, ad, and ae. What would be the expected outcome if I have a role with * access but exclude indices ad and ae? Would this have any effect on the proposed logic in terms of pre-computing and storing?

This generally only occurs for operations on the meta-level, especially for creating a new index. As the index does not exist beforehand, it also cannot exist in the pre-computed data. Such operations are executed comparatively seldom, so I think it won't be an issue to handle these with more conventional algorithms.

Okay. I think this is a known unknown.

At the moment, I would rather just keep this information in the heap. Having an index would of course alleviate the size concerns. However, this would also annihilate any performance advantage.

I'm not sure how this would affect performance except for the I/O retrieval/storage operations. In that case, how often would these updates be triggered? If it is quite often, maybe heap is faster. If not, IMO there won't be any need to implement/maintain an LRU or something similar to reduce heap size, and no need to recompute all indices/roles/actions; instead, we could just load the index into memory upon restart and then update it as required.

nibix commented 10 months ago

There won't be any performance issues as they sit in memory and there will be no I/O retrieval/storage involved. The updates will be similar to ConfigUpdateAction and the listener we have today. Correct?

Yes.

What would be the expected outcome when I have a role with * access but exclude indices ad and ae? Would this have any effect on proposed logic in terms of pre-computing and storing?

How exactly can you exclude indices ad and ae?

I'm not sure how this would affect performance except for the I/O retrieval/storage operations. In that case, how often would these updates be triggered? If it is quite often, maybe heap is faster.

An update would be necessary upon each index or alias creation or deletion and upon each change of the security roles configuration.

instead we can just load the index in memory upon restart, and then update as required.

The issue is that the data structure could, in extreme cases, be too big for the heap. That's why I was thinking of an LRU in order to be able to set an upper usage bound.

DarshitChanpura commented 9 months ago

I like the overall approach here, @nibix. I think I was confused about the exclude-index feature in roles, but I believe we don't support that yet. Do you think we are in a good place to get a PoC with the in-memory data structure you proposed along with an LRU implementation?

nibix commented 9 months ago

@DarshitChanpura

Do you think we are in a good place to get a POC with an in-memory data structure you proposed along with LRU implementation?

Absolutely. One open question is still how we shall verify the resulting performance gain. We filed https://github.com/opensearch-project/security/issues/3903 for this, but we still have a couple of open threads/questions there. We can possibly start with simple micro benchmarks, but there are limits to how far these can be used: First, one does not get an actual figure on improvements in indexing/search performance. Second, micro benchmarks for DLS/FLS won't really be possible.

peternied commented 9 months ago

there we still have a couple of open threads/questions.

@nibix Would you mind pulling out anything that you feel is still unresolved so we can be sure to provide feedback on next steps?

nibix commented 9 months ago

@peternied

I have added my thoughts at https://github.com/opensearch-project/security/issues/3903#issuecomment-1938822478

nibix commented 6 months ago

@peternied @cwperks @DarshitChanpura @scrawfor99 An incomplete first draft implementation can be found at https://github.com/opensearch-project/security/pull/4380

Would be keen on hearing your opinions.

nibix commented 4 months ago

I would like to follow up on the old discussion on the size of the heap required for the denormalized data structure: https://github.com/opensearch-project/security/issues/3870#issuecomment-1916949519

There, we came to the conclusion to use an LRU cache in order to limit the amount of required heap memory.

In the meantime, I tried to further detail this approach. However, I came to the conclusion that an LRU cache is not really feasible here. This is mainly for these reasons:

Due to that, I would now change the strategy regarding heap size:

I think it would be best to limit and compress the data structure using several methods: