opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0
181 stars 264 forks

Optimized Privilege Evaluation [DRAFT] #4380

Open nibix opened 1 month ago

nibix commented 1 month ago

Description

This is a preview of a possible implementation for #3870: Optimized Privilege Evaluation.

The main purpose is to give an initial impression of the approach and to facilitate review. However, more implementation work will be necessary.

Performance tests indicate that the OpenSearch security layer adds a noticeable overhead to the indexing throughput of an OpenSearch cluster. The overhead may vary depending on the number of indices, the use of aliases, the number of roles and the size of the user object. The goal of these changes is to improve privilege evaluation performance and to make it less dependent on the number of indices, etc.

The main behavior will not change. However, I would like to discuss whether this opportunity can be used to get rid of special behaviors which seem to be obscure and mostly useless.

Issues Resolved

Testing

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

nibix commented 1 month ago

Please have a look at the approaches. I would be very interested in your opinions.

As mentioned above, the implementation is not complete yet. The code contains a couple of TODO comments to indicate what work needs to be done.

I would also like to discuss whether a couple of things would be really necessary or whether there might be a chance to simplify the implementation by abolishing them.

These are:

nibix commented 1 month ago

I have also started to work on the micro benchmarks as discussed in #3903. The generally accepted standard for micro benchmarks in Java is the JMH framework. However, it is licensed under GPL v2 with the Classpath Exception (see https://github.com/openjdk/jmh/blob/master/LICENSE). Is the inclusion of a dependency with such a license acceptable in OpenSearch?

peternied commented 1 month ago

> I have also started to work on the micro benchmarks as discussed in #3903. The generally accepted standard for micro benchmarks in Java is the JMH framework. However, it is licensed under GPL v2 with the Classpath Exception (see https://github.com/openjdk/jmh/blob/master/LICENSE). Is the inclusion of a dependency with such a license acceptable in OpenSearch?

@cwperks Can you look into this?

cwperks commented 1 month ago

> @cwperks Can you look into this?

We're looking into this and will get back with an answer.

DarshitChanpura commented 3 weeks ago

> However, it is licensed under GPL v2 with the Classpath Exception (see https://github.com/openjdk/jmh/blob/master/LICENSE). Is the inclusion of a dependency with such a license acceptable in OpenSearch?

The feedback we received was that the code can be used only for internal operations. Since JMH usage would be part of the open-source security plugin, my understanding is that this is not approved.

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 82.36220% with 112 lines in your changes missing coverage. Please review.

Project coverage is 65.37%. Comparing base (9caf5cb) to head (f653945). Report is 5 commits behind head on main.

:exclamation: Current head f653945 differs from pull request most recent head b5ff5c8

Please upload reports for the commit b5ff5c8 to get more accurate results.

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/opensearch-project/security/pull/4380/graphs/tree.svg?width=650&height=150&src=pr&token=rBpySfQXMt&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project)](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project)

```diff
@@            Coverage Diff             @@
##             main    #4380      +/-   ##
==========================================
+ Coverage   65.27%   65.37%   +0.10%
==========================================
  Files         313      318       +5
  Lines       22058    22567     +509
  Branches     3563     3666     +103
==========================================
+ Hits        14398    14753     +355
- Misses       5889     6024     +135
- Partials     1771     1790      +19
```

| [Files](https://app.codecov.io/gh/opensearch-project/security/pull/4380?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | Coverage Δ | |
|---|---|---|
| [...rch/security/configuration/DlsFlsRequestValve.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fconfiguration%2FDlsFlsRequestValve.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9jb25maWd1cmF0aW9uL0Rsc0Zsc1JlcXVlc3RWYWx2ZS5qYXZh) | `0.00% <ø> (ø)` | |
| [...search/security/configuration/DlsFlsValveImpl.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fconfiguration%2FDlsFlsValveImpl.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9jb25maWd1cmF0aW9uL0Rsc0Zsc1ZhbHZlSW1wbC5qYXZh) | `59.80% <100.00%> (+0.75%)` | :arrow_up: |
| [...org/opensearch/security/filter/SecurityFilter.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Ffilter%2FSecurityFilter.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9maWx0ZXIvU2VjdXJpdHlGaWx0ZXIuamF2YQ==) | `66.51% <100.00%> (+0.79%)` | :arrow_up: |
| [...ch/security/privileges/PitPrivilegesEvaluator.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fprivileges%2FPitPrivilegesEvaluator.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9wcml2aWxlZ2VzL1BpdFByaXZpbGVnZXNFdmFsdWF0b3IuamF2YQ==) | `96.15% <100.00%> (-0.15%)` | :arrow_down: |
| [...urity/privileges/RestLayerPrivilegesEvaluator.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fprivileges%2FRestLayerPrivilegesEvaluator.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9wcml2aWxlZ2VzL1Jlc3RMYXllclByaXZpbGVnZXNFdmFsdWF0b3IuamF2YQ==) | `93.10% <100.00%> (-1.02%)` | :arrow_down: |
| [...earch/security/resolver/IndexResolverReplacer.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fresolver%2FIndexResolverReplacer.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9yZXNvbHZlci9JbmRleFJlc29sdmVyUmVwbGFjZXIuamF2YQ==) | `66.84% <100.00%> (-1.17%)` | :arrow_down: |
| [...ecurityconf/impl/SecurityDynamicConfiguration.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fsecurityconf%2Fimpl%2FSecurityDynamicConfiguration.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9zZWN1cml0eWNvbmYvaW1wbC9TZWN1cml0eUR5bmFtaWNDb25maWd1cmF0aW9uLmphdmE=) | `81.02% <100.00%> (+0.71%)` | :arrow_up: |
| [.../opensearch/security/OpenSearchSecurityPlugin.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2FOpenSearchSecurityPlugin.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9PcGVuU2VhcmNoU2VjdXJpdHlQbHVnaW4uamF2YQ==) | `84.33% <50.00%> (+0.02%)` | :arrow_up: |
| [...urity/privileges/SecurityIndexAccessEvaluator.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fprivileges%2FSecurityIndexAccessEvaluator.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9wcml2aWxlZ2VzL1NlY3VyaXR5SW5kZXhBY2Nlc3NFdmFsdWF0b3IuamF2YQ==) | `71.09% <83.33%> (+0.69%)` | :arrow_up: |
| [...security/privileges/TermsAggregationEvaluator.java](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree&filepath=src%2Fmain%2Fjava%2Forg%2Fopensearch%2Fsecurity%2Fprivileges%2FTermsAggregationEvaluator.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-c3JjL21haW4vamF2YS9vcmcvb3BlbnNlYXJjaC9zZWN1cml0eS9wcml2aWxlZ2VzL1Rlcm1zQWdncmVnYXRpb25FdmFsdWF0b3IuamF2YQ==) | `61.29% <85.71%> (+4.14%)` | :arrow_up: |
| ... and [8 more](https://app.codecov.io/gh/opensearch-project/security/pull/4380?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | |

... and [6 files with indirect coverage changes](https://app.codecov.io/gh/opensearch-project/security/pull/4380/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project)

nibix commented 1 day ago

@cwperks @peternied @DarshitChanpura @scrawfor99

Just FYI:

I worked a bit on the micro benchmarking part of this issue. As JMH was out due to its license, I reviewed other frameworks. It is remarkable that in most cases the descriptions of the frameworks will say "rather use JMH instead of this framework".

Anyway, I tried out https://github.com/noconnor/JUnitPerf because the idea of using JUnit infrastructure seemed to be nice. The big downside of JUnitPerf is that it does not work well together with parameterized JUnit tests.

See here for an example:

https://github.com/opensearch-project/security/blob/004df3bbdc69514f0c95acd2a1653a01e71758b9/src/performanceTest/java/org/opensearch/security/privileges/PrivilegesEvaluatorPerformanceTest.java

The high number of very similar methods is caused by the lack of parameter support. In the end, we need to test quite a few different dimensions (like number of indices, number of roles, etc.) for the same operation.

As I was really keen on getting some broader results, I went down the "roll your own" path and quickly threw together some naive micro benchmarking code. This is just a temporary thing and thus very messy, but it gives me some numbers. See here:

https://github.com/opensearch-project/security/blob/004df3bbdc69514f0c95acd2a1653a01e71758b9/src/performanceTest/java/org/opensearch/security/privileges/PrivilegesEvaluatorPeformanceTest2.java
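To illustrate what a "roll your own" approach to parameterized micro benchmarks can look like, here is a minimal sketch. It is not the code from the linked file; the class `NaiveBenchmark`, its methods, and the placeholder workload are hypothetical names chosen for this example. It warms up each operation, then counts how many calls fit into a fixed time window, and repeats this for every value of one test dimension.

```java
import java.util.function.IntFunction;

/** Naive micro benchmark sketch: warm up, then count operations in a fixed time window. */
final class NaiveBenchmark {

    // Results of the measured work are written here so the JIT cannot discard the work entirely.
    static volatile long sink;

    /** Runs op repeatedly for roughly the given number of milliseconds; returns operations per second. */
    static double measureThroughput(Runnable op, long millis) {
        long windowNanos = millis * 1_000_000L;
        long start = System.nanoTime();
        long ops = 0;
        long elapsed;
        do {
            op.run();
            ops++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < windowNanos);
        return ops * 1_000_000_000.0 / elapsed;
    }

    /** Benchmarks the same operation for several values of a single dimension. */
    static double[] runAcross(int[] dimensionValues, IntFunction<Runnable> opFactory,
                              long warmupMillis, long measureMillis) {
        double[] results = new double[dimensionValues.length];
        for (int i = 0; i < dimensionValues.length; i++) {
            Runnable op = opFactory.apply(dimensionValues[i]);
            measureThroughput(op, warmupMillis); // warm-up run, result discarded
            results[i] = measureThroughput(op, measureMillis);
        }
        return results;
    }

    public static void main(String[] args) {
        int[] indexCounts = {10, 1_000, 100_000};
        // Placeholder workload (an assumption): a linear scan whose cost grows with the dimension value.
        double[] opsPerSecond = runAcross(indexCounts, n -> () -> {
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += i;
            }
            sink += sum;
        }, 200, 500);
        for (int i = 0; i < indexCounts.length; i++) {
            System.out.printf("indices=%d throughput=%.0f ops/s%n", indexCounts[i], opsPerSecond[i]);
        }
    }
}
```

Passing the dimension value into a factory is what avoids the many near-identical test methods; a sketch like this still lacks the safeguards JMH provides (forked JVMs, dead-code elimination guards, statistical reporting), so its numbers should be read as rough indications only.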

So, I ran some tests; here are some preliminary results.

Micro benchmark test results

Disclaimer

Generally, the real world meaningfulness of micro benchmarks is limited. On a full real cluster, this can look totally different due to:

On the other hand, micro benchmarks make some tests much easier. For micro benchmarking, a Metadata instance with 100000 indices can be mocked within a few seconds, whereas creating that many indices on a real cluster would take much longer.

Full cluster benchmarks are also coming up, but these are still in the works.

Scope

The micro benchmarks were applied to the following code:

https://github.com/opensearch-project/security/blob/004df3bbdc69514f0c95acd2a1653a01e71758b9/src/performanceTest/java/org/opensearch/security/privileges/PrivilegesEvaluatorPeformanceTest2.java#L501-L512

For comparison, we also applied the micro benchmarks to the following code on the old code base:

https://github.com/nibix/security/blob/300d138578ef853071d649d647335d8430320f14/src/performanceTest/java/org/opensearch/security/privileges/PrivilegesEvaluatorPeformanceTest2.java#L502-L510

Due to refactorings, the code looks different. However, what happens under the hood is effectively the same.

Additionally some further code changes were necessary to make PrivilegeEvaluator independent from dependencies like ClusterService in order to make it really unit testable/benchmarkable. I first tried to use Mockito to mock ClusterService instances but had to learn that the performance characteristics of Mockito are so bad that it is unsuitable for micro benchmarking.
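The general idea of decoupling can be sketched as follows. This is a simplified illustration, not the actual refactoring in this PR: the class `DecoupledEvaluatorSketch`, its constructor, and the flat permission set are assumptions made for the example. Instead of a full `ClusterService`, the evaluator receives a narrow `Supplier`, which a benchmark can satisfy with a plain lambda and no mocking framework.

```java
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical evaluator that depends on a narrow Supplier rather than a full cluster service. */
final class DecoupledEvaluatorSketch {

    private final Supplier<Set<String>> indexNames; // stands in for cluster state access
    private final Set<String> permittedIndices;     // simplification: one flat permission set

    DecoupledEvaluatorSketch(Supplier<Set<String>> indexNames, Set<String> permittedIndices) {
        this.indexNames = indexNames;
        this.permittedIndices = permittedIndices;
    }

    /** Allows access only if the index exists and is contained in the permitted set. */
    boolean isAllowed(String index) {
        return indexNames.get().contains(index) && permittedIndices.contains(index);
    }

    public static void main(String[] args) {
        Set<String> clusterIndices = Set.of("logs-1", "logs-2", "metrics-1");
        // In a benchmark, the supplier is a plain lambda; no mock object is involved.
        DecoupledEvaluatorSketch evaluator =
            new DecoupledEvaluatorSketch(() -> clusterIndices, Set.of("logs-1"));
        System.out.println("logs-1 allowed: " + evaluator.isAllowed("logs-1"));
        System.out.println("metrics-1 allowed: " + evaluator.isAllowed("metrics-1"));
    }
}
```

With this shape, the cost measured by a benchmark is the cost of the evaluation logic itself rather than the overhead of intercepted mock calls.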

As we only look at the evaluate() method, DLS/FLS evaluation is disabled for this scope.

Tested dimensions

Action requests

We tested privilege evaluation with three different actions:

Number of indices on cluster

We tested with these indices:

Results

The raw result data can be found here: https://docs.google.com/spreadsheets/d/1Hd6pZFICTeplXIun3UpEplANAwQDE0jXbqLnnJz61AI/edit?usp=sharing

In the charts below, dashed lines indicate the performance of the old privilege evaluation code on a particular combination of test dimensions. Solid lines with the same color indicate the performance of the new code with the same test dimensions. The x-axis represents the number of indices on the cluster, the y-axis represents the throughput in operations per second.

bulk[s], BulkShardRequest

The performance of BulkShardRequests is the most interesting factor on clusters doing heavy ingestion. A single bulk request is broken down into the individual indices and shards, resulting in quite a few BulkShardRequests for which privilege evaluation needs to be done in parallel. The performance characteristics here therefore have a high impact.
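The fan-out from one bulk request to many shard-level requests can be sketched like this. It is a toy model, not OpenSearch's routing code: the class `BulkSplitSketch`, the `Item` type, and the `hashCode`-based shard assignment are assumptions for illustration only. Each resulting group corresponds to one shard-level request that needs its own privilege evaluation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of how one bulk request fans out into per-index, per-shard groups. */
final class BulkSplitSketch {

    /** One bulk item: target index plus document id (hypothetical minimal shape). */
    static final class Item {
        final String index;
        final String id;

        Item(String index, String id) {
            this.index = index;
            this.id = id;
        }
    }

    /** Groups items by "index[shard]"; shard assignment via hashCode is a simplification. */
    static Map<String, List<Item>> splitByShard(List<Item> items, int shardsPerIndex) {
        Map<String, List<Item>> groups = new LinkedHashMap<>();
        for (Item item : items) {
            int shard = Math.floorMod(item.id.hashCode(), shardsPerIndex);
            String key = item.index + "[" + shard + "]";
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Item> bulk = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            bulk.add(new Item("index_" + (i % 20), "doc-" + i));
        }
        Map<String, List<Item>> shardRequests = splitByShard(bulk, 5);
        // Every group stands for one shard-level request with its own privilege check.
        System.out.println("items: " + bulk.size() + ", shard-level requests: " + shardRequests.size());
    }
}
```

In this model, a bulk request touching 20 indices with 5 shards each can produce up to 100 shard-level requests, which is why the per-call cost of the evaluation is multiplied on ingestion-heavy clusters.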

The privilege evaluation for the top level BulkRequest is less interesting because it is just an index-independent cluster privilege, which is easy to evaluate. Still, we will also review this below.

Requests with 10 items

chart

Requests with 1000 items

chart(1)

Observation

The performance of the old code degrades with the increasing number of indices. Starting with 30000 indices, we have a method call latency which is > 10 ms. This is where users on ingestion heavy clusters often start to experience performance issues and the method calls start to show up in the hot thread dumps.

In contrast, the throughput of the new code stays constant, independent of the number of indices. It can be seen that the number of roles still has quite an effect on the throughput. But here we talk about time differences below 0.1 ms, which should not be significant in a real world cluster.
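Because the charts show throughput while the observations speak of latency, the relation between the two may be worth spelling out. The following sketch assumes a single-threaded benchmark loop, in which average latency is simply the reciprocal of throughput; the class and method names are hypothetical.

```java
/** Converts between throughput and average per-call latency for a single-threaded loop. */
final class ThroughputLatency {

    /** Average latency in milliseconds for a loop running at the given operations per second. */
    static double latencyMillis(double opsPerSecond) {
        return 1000.0 / opsPerSecond;
    }

    /** Operations per second for a loop whose calls take the given average latency. */
    static double opsPerSecond(double latencyMillis) {
        return 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        // A latency of 10 ms per call corresponds to 100 calls per second.
        System.out.printf("10 ms  -> %.0f ops/s%n", opsPerSecond(10.0));
        // A latency of 0.1 ms per call corresponds to 10000 calls per second.
        System.out.printf("0.1 ms -> %.0f ops/s%n", opsPerSecond(0.1));
    }
}
```

Under that assumption, the 10 ms latency mentioned above is where a curve drops below 100 operations per second, and a latency of 0.1 ms corresponds to 10000 operations per second.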

bulk, BulkRequest

The top level bulk action is a cluster action, so it does not require considering the indices on a cluster.

chart(3)

Observation

As expected, performance is independent of number of indices, both on the new implementation and on the old implementation. However, the new implementation improves throughput by a factor between 2 and 3.

search, SearchRequest

Search operations become interesting when there are monitoring/alerting solutions issuing search requests on broad index patterns in short time intervals.

Search with search patterns that match 2% of the indices

chart(4)

Search with search patterns that match 20% of the indices

chart(5)

Observation

Both the old and new code degrade with the growing number of indices. Profiling shows that this is mostly not due to privilege evaluation, but due to the index pattern expression resolution.
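Why pattern resolution scales with the number of indices can be shown with a simplified model. The sketch below is not OpenSearch's resolver; `PatternResolutionSketch` and its methods are hypothetical, and only the `*` wildcard is supported. The point is that every index name has to be tested against the pattern, so the work per request grows linearly with the size of the cluster's index list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified model of wildcard index pattern resolution (supports '*' only). */
final class PatternResolutionSketch {

    /** Translates a wildcard pattern into a regex, quoting all literal parts. */
    static Pattern compile(String wildcard) {
        String[] parts = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** Tests every index name against the pattern: work is linear in the number of indices. */
    static List<String> resolve(String wildcard, List<String> allIndices) {
        Pattern pattern = compile(wildcard);
        List<String> matches = new ArrayList<>();
        for (String index : allIndices) {
            if (pattern.matcher(index).matches()) {
                matches.add(index);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> indices = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            indices.add("index_" + i);
        }
        long start = System.nanoTime();
        List<String> matched = resolve("index_1*", indices);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("matched " + matched.size() + " of " + indices.size() + " indices in " + micros + " µs");
    }
}
```

A cost of this kind remains even when the privilege check itself is fast, which is consistent with the profiling result described above.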

However, the new code retains method call latencies below 20 ms even on clusters with 100000 indices, whereas the old code takes up to 5 seconds for a single method call on such clusters.

See the following chart for a zoomed in section of the 2% of indices case for 10000-100000 indices:

chart