nibix opened 1 month ago
Please have a look at the approaches. I would be very interested in your opinions.
As mentioned above, the implementation is not complete yet. The code contains a couple of TODO comments to indicate what work needs to be done.
I would also like to discuss whether a couple of things are really necessary or whether there might be a chance to simplify the implementation by removing them.
These are:

- The ability to set `multi_rolespan_enabled` to `false`. The OpenSearch docs do not mention this flag. In my perception, there is no real use for this setting except maintaining backwards compatibility. However, for OpenSearch, the default has always been `true` since its inception. Are there really users who have it set to `false`?
I have also started to work on the micro benchmarks as discussed in #3903. The generally accepted standard for micro benchmarks in Java is the JMH framework. However, it is licensed as GPL v2 with classpath exception: https://github.com/openjdk/jmh/blob/master/LICENSE

Is the inclusion of a dependency with such a license acceptable in OpenSearch?
@cwperks Can you look into this?
We're looking into this and will get back with an answer.
> However, this is licensed as GPL v2 with classpath exception: https://github.com/openjdk/jmh/blob/master/LICENSE Is the inclusion of a dependency with such a license acceptable in OpenSearch?
The feedback we received was that the code can be used only for internal operations. Since JMH usage would be part of OpenSearch Security, my understanding is that this is not approved.
Attention: Patch coverage is 82.36220% with 112 lines in your changes missing coverage. Please review.

Project coverage is 65.37%. Comparing base (9caf5cb) to head (f653945). Report is 5 commits behind head on main.

:exclamation: Current head f653945 differs from pull request most recent head b5ff5c8. Please upload reports for the commit b5ff5c8 to get more accurate results.
@cwperks @peternied @DarshitChanpura @scrawfor99
Just FYI:
I worked a bit on the micro benchmarking part of this issue. As JMH was out due to its license, I reviewed other frameworks. It is remarkable that in most cases the frameworks' own descriptions say "rather use JMH instead of this framework".
Anyway, I tried out https://github.com/noconnor/JUnitPerf because the idea of building on the JUnit infrastructure seemed nice. The big downside of JUnitPerf is that it does not work well with parameterized JUnit tests.
See here for an example:
The high number of very similar methods is caused by the lack of parameter support; in the end, we need to test quite a few different dimensions (like the number of indices, the number of roles, etc.) on the same operation.
As I was really keen on getting some broader results, I went down the "roll your own" path and quickly threw together some naive micro benchmarking code. This is just a temporary thing and thus very messy, but it gives me some numbers. See here:
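For illustration only: the hand-rolled harness itself is linked above, but the core of such a naive micro benchmark is typically just a warm-up loop followed by a timed loop. The sketch below is hypothetical (class and method names are mine, not from this PR) and shows the general shape, including why a warm-up phase is needed on the JVM:

```java
// Hypothetical sketch of a naive micro benchmark loop; NOT the actual code from this PR.
public class NaiveBenchmark {

    /** Runs the given operation repeatedly and returns the measured throughput in operations per second. */
    public static double measureOpsPerSecond(Runnable operation, int warmupIterations, int measuredIterations) {
        // Warm-up phase: give the JIT compiler a chance to compile and optimize the hot path
        // before we start measuring; otherwise the numbers mostly reflect interpreter overhead.
        for (int i = 0; i < warmupIterations; i++) {
            operation.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredIterations; i++) {
            operation.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return measuredIterations / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        double opsPerSecond = measureOpsPerSecond(() -> {
            // Stand-in for a call like privilegeEvaluator.evaluate(user, action, request)
            Math.sqrt(42.0);
        }, 10_000, 100_000);
        System.out.println("throughput: " + opsPerSecond + " ops/s");
    }
}
```

This is exactly the kind of thing JMH does much more rigorously (dead-code elimination guards, forked JVMs, statistical analysis), which is why such hand-rolled numbers should only be read as rough indicators.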
So, I ran some tests; here are some preliminary results.
Generally, the real-world meaningfulness of micro benchmarks is limited. On a full real cluster, things can look totally different due to:
On the other hand, micro benchmarks make some tests much easier. For micro benchmarking, a `Metadata` instance with 100000 indices can be mocked within a few seconds, while creating that many indices on a real cluster would take much, much longer.
Full cluster benchmarks are also coming up, but these are still in the works.
The micro benchmarks were applied to the following code:
For comparison, we also applied the micro benchmarks to the following code on the old code base:
Due to refactorings, the code looks different. However, what happens under the hood is effectively the same.
Additionally, some further code changes were necessary to make `PrivilegeEvaluator` independent of dependencies like `ClusterService`, in order to make it really unit testable/benchmarkable. I first tried to use Mockito to mock `ClusterService` instances, but had to learn that the performance characteristics of Mockito are so bad that it is unsuitable for micro benchmarking.
As we only look at the `evaluate()` method, DLS/FLS evaluation is disabled for this scope.
We tested privilege evaluation with three different actions:

- `indices:data/write/bulk[s]` with `BulkShardRequest`
- `indices:data/write/bulk` with `BulkRequest`
- `indices:data/read/search` with `SearchRequest`
We tested with these indices:

- 10 indices: `index_a0`, `index_a1`, `index_b0`, `index_b1`, `index_c0`, `index_c1`, ..., `index_e0`, `index_e1`
- 30 indices: `index_a0`, ..., `index_a5`, ..., `index_e0`, ..., `index_e5`
- 100 indices: `index_a0`, ..., `index_a19`, ..., `index_e0`, ..., `index_e19`
- 300 indices
- 1000 indices
- 3000 indices
- 10000 indices
- 30000 indices
- 100000 indices
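To make the naming scheme above concrete: the test indices are distributed evenly over the letters a through e, with a numeric suffix per letter. A small sketch (my own illustration, not code from the PR) that generates names following this scheme:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generates test index names like index_a0 ... index_e1, following the
// naming scheme described above (totalCount names, spread evenly over letters a..e).
public class TestIndexNames {

    public static List<String> create(int totalCount) {
        int perLetter = totalCount / 5; // five letters: a, b, c, d, e
        List<String> result = new ArrayList<>(totalCount);
        for (char letter = 'a'; letter <= 'e'; letter++) {
            for (int i = 0; i < perLetter; i++) {
                result.add("index_" + letter + i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // For totalCount = 10, this yields index_a0, index_a1, index_b0, ..., index_e1
        System.out.println(create(10));
    }
}
```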
We tested with these user configurations:

- A user with full privileges (using `*` for `index_permissions` and `cluster_permissions`)
- A user with a single role giving CRUD permissions on `index_a*` and `index_b*`
- A user with 20 roles giving CRUD permissions individually on `index_a0`, `index_a1`, ...
- A user with 40 roles in total: 20 roles giving READ permissions individually on `index_a0`, `index_a1`, ... and 20 more roles giving WRITE permissions on the same indices
- A user with a single role which uses a regex index pattern with a user attribute. This is interesting because it makes certain pre-computations impossible and requires re-evaluating the index patterns for each request.
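For reference, the last configuration above could be expressed in `roles.yml` roughly like this. This is a hypothetical sketch: the role name and the attribute name are my own inventions, and it assumes the OpenSearch Security conventions of `/.../` for regex index patterns and `${attr...}` for user attribute substitution:

```yaml
# Hypothetical role using a regex index pattern with a user attribute.
# Because ${attr.internal.department} is only known per request, the pattern
# cannot be pre-compiled against the index set and must be re-evaluated each time.
attribute_based_role:
  index_permissions:
    - index_patterns:
        - "/index_${attr.internal.department}_.*/"
      allowed_actions:
        - "read"
```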
The raw result data can be found here: https://docs.google.com/spreadsheets/d/1Hd6pZFICTeplXIun3UpEplANAwQDE0jXbqLnnJz61AI/edit?usp=sharing
In the charts below, dashed lines indicate the performance of the old privilege evaluation code for a particular combination of test dimensions. Solid lines of the same color indicate the performance of the new code with the same test dimensions. The x-axis represents the number of indices on the cluster; the y-axis represents the throughput in operations per second.
`bulk[s]`, `BulkShardRequest`
The performance of `BulkShardRequest`s is the most interesting factor on clusters doing heavy ingestion. A single bulk request will be broken down by the individual indices and shards, resulting in quite a few `BulkShardRequest`s for which privilege evaluation needs to be done in parallel; thus, the performance characteristics here have a high impact.
The privilege evaluation for the top-level `BulkRequest` is less interesting because it is just an index-independent cluster privilege, which is easy to evaluate. Still, we will also review this below.
The performance of the old code degrades with an increasing number of indices. Starting at 30000 indices, we see method call latencies above 10 ms. This is where users on ingestion-heavy clusters often start to experience performance issues and the method calls start to show up in hot thread dumps.
In contrast, the throughput of the new code stays constant, independent of the number of indices. It can be seen that the number of roles still has quite an effect on the throughput, but here we are talking about time differences below 0.1 ms, which should not be significant in a real-world cluster.
`bulk`, `BulkRequest`
The top level bulk action is a cluster action, so it does not require considering the indices on a cluster.
As expected, performance is independent of the number of indices, both in the new implementation and in the old one. However, the new implementation improves throughput by a factor between 2 and 3.
`search`, `SearchRequest`
Search operations become interesting when there are monitoring/alerting solutions issuing search requests on broad index patterns in short time intervals.
Both the old and the new code degrade with a growing number of indices. Profiling shows that this is mostly due not to privilege evaluation, but to index pattern expression resolution.
However, the new code retains method call latencies below 20 ms even on clusters with 100000 indices, while the old code takes up to 5 seconds for a single method call at that scale.
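To illustrate why index pattern expression resolution dominates here: resolving a pattern such as `index_a*` naively requires scanning every index name on the cluster, so the cost grows linearly with the number of indices regardless of how fast the privilege check itself is. A hypothetical sketch (not the actual resolver used by OpenSearch):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: naive wildcard resolution that scans all index names.
// The linear scan is O(number of indices), which is why pattern resolution
// becomes the bottleneck on clusters with very many indices.
public class NaiveIndexResolver {

    /** Resolves a trailing-wildcard pattern like "index_a*" against all index names on the cluster. */
    public static List<String> resolve(String pattern, List<String> allIndices) {
        boolean wildcard = pattern.endsWith("*");
        String prefix = wildcard ? pattern.substring(0, pattern.length() - 1) : pattern;
        List<String> matches = new ArrayList<>();
        for (String index : allIndices) { // this scan is what grows with cluster size
            if (wildcard ? index.startsWith(prefix) : index.equals(pattern)) {
                matches.add(index);
            }
        }
        return matches;
    }
}
```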
See the following chart for a zoomed in section of the 2% of indices case for 10000-100000 indices:
Description
This is a preview of a possible implementation for #3870: Optimized Privilege Evaluation.
The major purpose is to give an initial impression of the approach and to facilitate review. However, more implementation work will be necessary.
Performance tests indicate that the OpenSearch security layer adds a noticeable overhead to the indexing throughput of an OpenSearch cluster. The overhead may vary depending on the number of indices, the use of aliases, the number of roles and the size of the user object. The goal of these changes is to improve privilege evaluation performance and to make it less dependent on the number of indices, etc.
The main behavior will not change. However, I would like to discuss whether this opportunity can be used to get rid of special behaviors which seem obscure and mostly useless.
Issues Resolved
#3870
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.