opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

[RFC] Security Performance Test Suite #3903

Open nibix opened 6 months ago

nibix commented 6 months ago

Background

The OpenSearch project already conducts systematic benchmark runs and publishes the results: https://opensearch.org/benchmarks/. These are based on the pre-defined benchmark-workloads, which are executed using the tool opensearch-benchmark.

Additionally, there is ongoing work to define testing processes for performance related quality assurance: https://github.com/opensearch-project/OpenSearch/issues/7499

These efforts, however, focus on the OpenSearch core; the efficiency of the code implementing data retrieval and indexing is tested. In some cases, there is a comparison between OpenSearch with security enabled vs OpenSearch without security enabled. While this can give a first idea of the overhead produced by the security plugin, the results cannot be generalized. This is because the performance overhead of the security plugin depends on a number of different factors. See below for more information on this.

Ideally, there would be an additional security-focused performance test suite. This test suite could be used to ensure that the security plugin does not suffer performance regressions. Additionally, it could be used to determine the actual benefit of performance-enhancing code changes (for example https://github.com/opensearch-project/security/issues/3870).

The goal of this RFC is to collect requirements for such a test suite, to learn about ongoing efforts, and to find a way to integrate it into the software development process.

Performance influencing factors

The performance overhead of the security plugin depends on a number of different factors:

Proposal for test dimensions

Given the complex dependencies, a three dimensional test matrix seems to be necessary. I would propose these dimensions:

  1. Operations: bulk index, search on specific index, search on *, search on alias, search with aggregation
  2. Number of indices: Cluster with 10 indices, cluster with 1000 indices
  3. Security configuration: Basic, large user object, utilizing DLS/FLS, dynamic index patterns, user with many roles

Note: Some combinations in this matrix might not make sense and can be skipped; for example, DLS/FLS makes indices read-only, so testing indexing does not make sense in that case.
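
As an illustration only, here is a minimal sketch of how such a matrix could be enumerated while skipping invalid combinations. All enum and type names are hypothetical placeholders, not part of any existing test suite:

```java
import java.util.ArrayList;
import java.util.List;

public class SecurityPerfMatrix {

    // Hypothetical dimension names mirroring the proposal above.
    enum Operation { BULK_INDEX, SEARCH_SPECIFIC_INDEX, SEARCH_WILDCARD, SEARCH_ALIAS, SEARCH_AGGREGATION }

    enum IndexCount { INDICES_10, INDICES_1000 }

    enum SecurityConfig { BASIC, LARGE_USER_OBJECT, DLS_FLS, DYNAMIC_INDEX_PATTERNS, MANY_ROLES }

    record Scenario(Operation op, IndexCount indices, SecurityConfig config) { }

    static List<Scenario> buildMatrix() {
        List<Scenario> scenarios = new ArrayList<>();
        for (Operation op : Operation.values()) {
            for (IndexCount indices : IndexCount.values()) {
                for (SecurityConfig config : SecurityConfig.values()) {
                    // Skip combinations that do not make sense, e.g. indexing into
                    // read-only DLS/FLS-protected indices (see the note above).
                    if (config == SecurityConfig.DLS_FLS && op == Operation.BULK_INDEX) {
                        continue;
                    }
                    scenarios.add(new Scenario(op, indices, config));
                }
            }
        }
        return scenarios;
    }

    public static void main(String[] args) {
        buildMatrix().forEach(System.out::println);
    }
}
```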

Questions:

Process definition

We need to define when and how the performance tests should be executed.

Factors which might influence the decision are:

Related work

There are a number of other ongoing efforts/proposals covering OpenSearch benchmarking. In some cases, it might make sense to coordinate with these efforts:

Question: Are there further efforts?

peternied commented 6 months ago

Proposal for test dimensions

Good list of scenarios, as there are clear classes where the security plugin has undefined CPU/memory performance characteristics, O(n^2) or O(2^N), based on the number of indices. This would help identify where we can provide better guidance on best practices to existing clusters and also give us a way to measure the impact of performance improvements over time.

Should a comparison of TLS versions, ciphers or implementations also be covered here?

I feel like this represents a different 'class' of test case - one that is different from correctness. I think of the correctness and perf areas as covering measures such as the number of indices or the number of roles assigned to a user. For TLS versions, the considerations depend on compliance with standards such as FIPS - they need a level of configurability.

I think it is a very useful area to test - but I'd break those out into a separate dimension, and I'm less concerned about getting results until we are looking to add additional configuration options and want to make sure they are in alignment.

A test run will take a significant time. Thus, it might not be suitable to be included in a pull request CI run. Should there be an optional way to execute the test for a pull request?

I'd prefer we keep it simple and always run them; see ResourceFocusedTests.java and the associated GitHub workflow for an example of these 'constrained' tests.

What do you think of addressing specific performance/time issues, such as having a snapshot that can be restored with the test data set rather than creating items with individual requests? Maybe the benchmarks folks can help with recommendations.

A test run will require significant CPU resources. Is the GitHub Actions environment sufficient for (small) test runs? Is it necessary to set up dedicated environments?

I would be concerned about creating new infrastructure for this - while it would be possible, such as by using OpenSearch-ci, that becomes a management burden compared to the managed GitHub runners. Using this infrastructure would also create a bottleneck for testing and deployment of new features or test areas in forks.

Note: I know this makes a trade-off where you have a hard time testing locally - I'd rather have PRs get this testing than rely on contributors/maintainers to remember to run the tests.

  • How shall the results of the test runs be published?
  • How shall the results of the test runs be monitored?

Not sure what is available in this space; I'd love to see proposals. What Codecov provides on pull requests is really nice - but I don't know what platforms are out there for these multi-dimensional perf metrics where you could see pull request vs. main.

In a more bootstrapping way, we could use an LKG (last known good) build that runs the tests, then the PR build, and then another step that compares the results. This might also offer a way for these tests to be run locally via Gradle, triggered by a GHA.


Closing thoughts - Reliability

I believe the most important part of these tests would be that it is clear what they are testing, and it's also clear what is wrong when they fail. If we have a failing perf test, it would be unfortunate for it to be ignored off-hand because its impact isn't clear. I'd rather have only a few highly trusted test cases than a bunch of edge case coverage with high variation in results and reliability.

peternied commented 6 months ago

Have you seen github-action-benchmark? Maybe this would be a good approach to tackle the data collection / analysis.

nibix commented 6 months ago

@peternied

I feel like this represents a different 'class' of test case - one that is different from correctness.

Good point! This helps define the scope of the issue.

I'd prefer we keep it simple and always run them

Do you have an intuition of what the maximum duration of the performance tests should be, then?

Have you seen github-action-benchmark?

Good idea! Will have a look.

peternied commented 6 months ago

Do you have an intuition of what the maximum duration of the performance tests should be, then?

@nibix How about we start with a max of 20 minutes and work backwards from there? I pulled that number from the current max PR check times.

nibix commented 6 months ago

@peternied

Sorry for the delay, but I can now follow up with some further responses and items to discuss.

I would be concerned about creating new infrastructure for this - while it would be possible, such as by using OpenSearch-ci, that becomes a management burden compared to the managed GitHub runners. Using this infrastructure would also create a bottleneck for testing and deployment of new features or test areas in forks.

I can totally follow your reasoning here. Yet, there's one further problem to consider, which also plays into the trade-offs for the infrastructure question.

It is about the variance of benchmark results we will have to expect. If you look at the existing OpenSearch benchmarks, you can see for some operations a variation of up to 18% between runs:

[Screenshot from 2024-01-18: OpenSearch benchmark results showing run-to-run variation]

This poses a problem if one wants to use the benchmark results for automatically verifying the absence of performance regressions in pull requests. This is because we cannot just test for "equal or better" performance, but have to add some tolerance interval. However:

The Benchmark GitHub action will by default trigger an error only if the performance regressed by a factor of 2: https://github.com/benchmark-action/github-action-benchmark#alert-comment-on-commit-page

To be honest, I have some doubts whether this provides sufficient control.
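
To make the concern concrete, here is a minimal sketch of a regression gate with a configurable tolerance, tighter than the factor-2 default. The threshold values and method names are illustrative assumptions, not an existing implementation:

```java
public class RegressionGate {

    /**
     * @param baselineMillis  median latency of the last-known-good run
     * @param candidateMillis median latency of the pull request run
     * @param tolerance       allowed relative slowdown, e.g. 0.25 for 25%
     * @return true if the candidate run is considered a regression
     */
    static boolean isRegression(double baselineMillis, double candidateMillis, double tolerance) {
        return candidateMillis > baselineMillis * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        // With ~18% run-to-run jitter, a 25% tolerance still flags large regressions
        // but lets small ones through - which is exactly the concern described above.
        System.out.println(isRegression(100.0, 115.0, 0.25)); // false: within jitter
        System.out.println(isRegression(100.0, 140.0, 0.25)); // true: flagged
    }
}
```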

Let's consider what factors influence the magnitude of the variance:

Cluster benchmarks vs micro benchmarks

Originally, I was thinking about running benchmarks using "real" clusters. This is the "natural" way to benchmark OpenSearch; there is a dedicated tool to perform such tests: opensearch-benchmark.

However, we see in the discussion that there might be some requirements which are possibly not compatible with this approach. One of these requirements is the ability to provide trustworthy benchmark results in the context of a pull request.

Another approach is micro-benchmarks, which focus on a particular code section. The relationship between these two approaches is similar to the relationship between integration tests and unit tests.

Let's compare the approaches a bit more!

Cluster level benchmarks

Micro benchmarks

Discussion

nibix commented 6 months ago

@peternied

What do you think of addressing specific performance/time issues, such as having a snapshot that can be restored with the test data set rather than creating items with individual requests? Maybe the benchmarks folks can help with recommendations.

At the moment, I am not too concerned about the initialization of test data. Most test cases will only depend on the number of indices, not on the number of documents. Even if we need tens of thousands of indices, it should be possible to create these quickly. An exception might be DLS/FLS, but even there it is debatable whether a significant number of documents is required.

Still, having the benchmark folks' opinion on the whole topic would of course be extremely helpful :-)
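
As a rough sketch of what pre-creating a large number of empty test indices could look like (assuming a local test cluster reachable over plain HTTP with basic auth; the endpoint, credentials and index naming are placeholders, and a real setup against the security plugin would use TLS):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TestIndexSetup {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String auth = "Basic " + Base64.getEncoder()
            .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        for (int i = 0; i < 10_000; i++) {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/perf-test-index-" + i))
                .header("Authorization", auth)
                .header("Content-Type", "application/json")
                // One shard, no replicas keeps setup fast; no documents are indexed,
                // since most scenarios only depend on the number of indices.
                .PUT(HttpRequest.BodyPublishers.ofString(
                    "{\"settings\":{\"number_of_shards\":1,\"number_of_replicas\":0}}"))
                .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 300) {
                throw new IllegalStateException("Failed to create index " + i + ": " + response.body());
            }
        }
    }
}
```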

nibix commented 6 months ago

BTW, to increase visibility, it might make sense to tag this with the performance label.

peternied commented 6 months ago

Cluster benchmarks vs micro benchmarks ...

I like this thought process quite a bit - it satisfies my desire for test reliability.

I suppose where I'm curious is what micro encompasses; for example, I think the 'base' test case for the privilege evaluator's 'allowed' response is something that we should capture. Is that too big already? What kind of examples did you have in mind?

nibix commented 5 months ago

@peternied

I suppose where I'm curious is what micro encompasses; for example, I think the 'base' test case for the privilege evaluator's 'allowed' response is something that we should capture.

I would think the same - roughly the call to PrivilegesEvaluator.evaluate(). However, this only captures one aspect of #3870. I would think the proposed improvements for DLS/FLS cannot be completely covered by such a micro benchmark, as this would be more of a systemic improvement which is not locatable in a single method.
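
A micro benchmark along those lines could be expressed with JMH; the sketch below only shows the harness, while the setup and the evaluatePrivileges() helper are hypothetical placeholders for however the suite ends up constructing a PrivilegesEvaluator and a request context:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class PrivilegesEvaluationBenchmark {

    // Dimensions from the test matrix that a micro benchmark can still cover.
    @Param({"10", "1000"})
    public int numberOfIndices;

    @Param({"1", "50"})
    public int rolesPerUser;

    @Setup
    public void setUp() {
        // Build the role configuration, user object and resolved index structures
        // for the given parameters here (omitted - depends on plugin internals).
    }

    @Benchmark
    public boolean evaluateAllowed() {
        // Stands in for the actual call, roughly PrivilegesEvaluator.evaluate(),
        // returning whether the simulated request is allowed.
        return evaluatePrivileges();
    }

    private boolean evaluatePrivileges() {
        // Hypothetical helper representing the real evaluation call.
        return true;
    }
}
```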

rishabh6788 commented 5 months ago

Great proposal @nibix. The biggest pain point of adding any new workload is sourcing the data, prepping it, and then integrating it with opensearch-benchmark. But if we are not too keen on indexing performance, we can quickly get started using a snapshot from an existing cluster that caters to our requirements. We can work in parallel to generate the corpus to test security performance on the indexing path.

scrawfor99 commented 5 months ago

[Triage] Thanks for filing this issue @nibix. Since this is an RFC we can mark this triaged for more engagement and steps for moving forward.

peternied commented 5 months ago

Just to clarify what I'd expect of a performance test: I don't think we can reliably calibrate tests on microsecond/millisecond scales. There are far too many low-level system characteristics that might be hard to reproduce without dedicated hardware, which I'm not in favor of.

Tests that I do think would be viable without environment constraints:

There might be ways to do more/better but I think we can establish a baseline performance and detect regressions as we build features over time. What do you think @nibix?

nibix commented 5 months ago

Generally, that makes sense. However, I am still a bit skeptical that the jitter induced by the environment will allow detecting regressions on a realistic scale. If the tolerance X% is too high (like the 200% used by default by github-action-benchmark), the benchmarks will have no practical benefit and might actually give a false sense of security. Minor regressions will just go unnoticed and accumulate over time.

In the end, doing the benchmarks as part of pull requests has the downside that you can only consider a single A/B test. This won't give you a sense of the present jitter and also won't detect accumulated minor regressions. On the other hand, a graph similar to the ones at https://opensearch.org/benchmarks/ gives you these insights - while lacking automatic actions upon regressions.

But, in the end, I guess we just have to try things out. So, I would now propose the following. Let's create a prototype which:

We would need to collect some data points with this prototype to see whether the results are reliable enough to be used for automatic regression testing. If we get confident that this is suitable, we can implement these checks in the CI. If not, we would have to reconsider what to do with the data.
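
One simple reliability check the prototype could perform before any CI gating: run the same scenario repeatedly on unchanged code and compute the coefficient of variation; the tolerance for A/B comparisons would then have to sit well above that value. A minimal sketch with illustrative numbers:

```java
import java.util.List;

public class JitterCheck {

    static double coefficientOfVariation(List<Double> samples) {
        double mean = samples.stream().mapToDouble(Double::doubleValue).average().orElseThrow();
        double variance = samples.stream()
            .mapToDouble(s -> (s - mean) * (s - mean))
            .average()
            .orElseThrow();
        return Math.sqrt(variance) / mean;
    }

    public static void main(String[] args) {
        // Latencies (ms) of the same benchmark scenario repeated on identical code.
        List<Double> runs = List.of(102.0, 97.0, 118.0, 101.0, 110.0);
        double cv = coefficientOfVariation(runs);
        System.out.printf("coefficient of variation: %.1f%%%n", cv * 100);
        // If the observed jitter is, say, 8%, an A/B tolerance of 25% may be usable;
        // if it is 18% or more, single-run PR comparisons are unlikely to be reliable.
    }
}
```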

peternied commented 5 months ago

If we get confident that this is suitable, we can implement these checks in the CI.

👍

If the tolerance X% is too high (like the 200% used by default by github-action-benchmark), the benchmarks will have no practical benefit

Yea - this is definitely a concern in my mind as well. Given that we currently have no capacity to detect even a 200% regression, I'd say even at a minimal level this is a great improvement. Beyond detecting large regressions when prompted investigations [1] [2] arrive, having the framework in place would help provide a starting point for measurement and validation in a developer's environment.

cwperks commented 2 months ago

@nibix Thank you for the detailed RFC! I like the idea of adding performance tests in this repo that can help to establish a baseline for the "cost" of common (security) plugin operations under different scenarios. For instance, this user reports that there are CPU spikes when there is a config update in a cluster with 300 users: https://github.com/opensearch-project/security/issues/4008.

A performance test suite can help to ensure that performance issues are properly mitigated and help us understand how performance degrades or improves as changes are merged in. This will be especially helpful when assessing the performance gains from the related RFC: https://github.com/opensearch-project/security/issues/3903

I think other plugins could benefit from this test suite as well.

nibix commented 2 months ago

@cwperks #4008 is an interesting issue, indeed.

Generally, I think performance needs to be tackled on several dimensions or levels.

In this RFC, we have already concluded that we need two separate approaches:

Additionally, the following thing just crossed my mind looking at #4008. Just thinking out loud:

For facilitating the diagnosis of things like #4008, it might be useful to have built-in instrumentation to gain performance metrics on production systems. The security plugin especially often needs to talk to external systems - like LDAP servers. This can quickly have a big impact on cluster performance. Thus, having a low-threshold diagnosis system would be beneficial. A "related" project :wink: has this: https://git.floragunn.com/search-guard/search-guard-suite-enterprise/-/merge_requests/200
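
As a rough illustration of what such low-threshold instrumentation could look like, here is a sketch of a timing wrapper around calls to an external authentication backend. The AuthenticationBackend interface and the metrics fields are hypothetical; only the pattern of recording call counts and durations is the point:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class TimedAuthBackend {

    // Hypothetical stand-in for whatever backend abstraction the plugin uses.
    interface AuthenticationBackend {
        boolean authenticate(String user, char[] password);
    }

    private final AuthenticationBackend delegate;
    private final ConcurrentHashMap<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, LongAdder> callCounts = new ConcurrentHashMap<>();

    TimedAuthBackend(AuthenticationBackend delegate) {
        this.delegate = delegate;
    }

    boolean authenticate(String user, char[] password) {
        long start = System.nanoTime();
        try {
            return delegate.authenticate(user, password);
        } finally {
            // Record duration and call count per backend implementation, so slow
            // external systems (e.g. an overloaded LDAP server) become visible.
            String key = delegate.getClass().getSimpleName();
            totalNanos.computeIfAbsent(key, k -> new LongAdder()).add(System.nanoTime() - start);
            callCounts.computeIfAbsent(key, k -> new LongAdder()).increment();
        }
    }

    // Averages like this could be exposed via an admin API or node stats, allowing
    // diagnosis on production systems without attaching a profiler.
    double averageMillis(String backend) {
        long calls = callCounts.getOrDefault(backend, new LongAdder()).sum();
        long nanos = totalNanos.getOrDefault(backend, new LongAdder()).sum();
        return calls == 0 ? 0.0 : nanos / 1_000_000.0 / calls;
    }
}
```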

DarshitChanpura commented 2 months ago

@nibix Agreed on making this a two-part approach. What would be the plan going forward, based on all the discussions so far?

nibix commented 2 months ago

@DarshitChanpura

Agreed on making this a two-part approach. What would be the plan going forward, based on all the discussions so far?

There's already a draft PR for the test suite, but a bit of work is still necessary. The micro benchmarks for privilege evaluation should come with PR #3870, which should also come up very soon.