@cosmincatalin Thanks for your interest in the project.
I can provide background about Regorus.
Regorus started off as a project to bring Rego into Confidential Containers. The following are the key goals/attributes:
Written in Rust due to its safety, performance and reduced attack surface. Parts of Kata Containers themselves were rewritten in Rust for the same reasons.
Fast. Kata Containers can bring up entire VMs in milliseconds. The policy validation therefore shouldn't take more than a few milliseconds even for large policies. See Performance.
Define semantics cleanly and enforce them. Regorus, for example, defaults to strict errors, unlike OPA.
Smaller codebase. We want to keep the codebase as small as possible for easy vetting. The core interpreter is only about 2.8K lines of code currently.
Modularized. Parts of the Rego language can be selectively enabled/disabled via cargo features.
We also want to use it as a playground for language exploration and to expose it to other environments.
Hi @anakrish, thank you for the description, but I'm still in doubt regarding the differences from regorust. It sounds to me like you could have just used regorust for the same purpose and vice versa. However, I understand that currently, one key advantage of regorus over regorust is performance due to it being native Rust.
@cosmincatalin
It sounds to me like you could have just used regorust for the same purpose and vice versa.
Excellent question!
There is a strong push towards writing software in memory-safe languages like Rust for security, reliability and performance. Here are a few public links:
Microsoft Secure Future Initiative "We will accelerate and automate threat modeling, deploy CodeQL for code analysis to 100 percent of commercial products, and continue to expand Microsoft’s use of memory safe languages (such as C#, Python, Java, and Rust), building security in at the language level and eliminating whole classes of traditional software vulnerability."
NSA Software Memory Safety "Microsoft® revealed at a conference in 2019 that from 2006 to 2018 70 percent of their vulnerabilities were due to memory safety issues. [1] Google® also found a similar percentage of memory safety vulnerabilities over several years in Chrome®...Commonly used languages, such as C and C++, provide a lot of freedom and flexibility in memory management while relying heavily on the programmer to perform the needed checks on memory references. Simple mistakes can lead to exploitable memory-based vulnerabilities. Therefore, the overarching software community across the private sector, academia, and the U.S. Government have begun initiatives to drive the culture of software development towards utilizing memory safe languages."
Azure CTO: https://twitter.com/markrussinovich/status/1571995117233504257
Writing software in C/C++ and exposing it to Rust unfortunately does not reap the benefits of using Rust directly. There will still be memory access violations, buffer overruns, double-frees, etc. in the C/C++ code, despite programmers' best intentions. Such issues are to a large extent avoided by the use of Rust.
Thus, it is better to write code in Rust and expose it to C, not the other way around.
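To make the direction concrete, here is a minimal, hypothetical sketch (not from Regorus itself) of the layering being advocated: the logic lives in safe Rust, and only a thin C-ABI wrapper is exposed to C callers.

use std::ffi::{c_char, CStr};

// Safe Rust logic: memory safety is enforced by the compiler.
fn count_bytes(s: &str) -> i64 {
    s.len() as i64
}

// Thin C-ABI wrapper; only this boundary needs manual care.
#[no_mangle]
pub extern "C" fn input_len(s: *const c_char) -> i64 {
    if s.is_null() {
        return -1;
    }
    // SAFETY: the caller guarantees `s` is a valid NUL-terminated string.
    match unsafe { CStr::from_ptr(s) }.to_str() {
        Ok(rust_str) => count_bytes(rust_str),
        Err(_) => -1,
    }
}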
one key advantage of regorus over regorust is performance due to it being native Rust.
Based on initial measurements, Regorus is considerably faster. A benchmark script is available here: https://github.com/anakrish/rego-compare.
In my experience, such large differences in performance are often not due to the implementation being in native Rust. They have more to do with software design, algorithms used, trade-offs evaluated, etc.
Thank you for this detailed response, @anakrish . Things are quite clear now.
For future readers of this issue: JavaScript/WASM and Python bindings are supported. C, C++, C#, and Go bindings are being worked on.
I have just switched to using regorus; your latest changes supporting Arc tipped the scales in your favor. Please continue doing what you're doing.
@cosmincatalin
I'm trying to make improvements to the design.
Can you share why you need the arc feature? Do you share Value between threads, or do you share Engine between threads?
I share an Engine instance with one policy that needs to be used over and over again.
Thanks for sharing your usage.
The starting state of the Engine, which consists of the parsed policies and the data value, is immutable and Arc counted. I'm thinking of a scheme where each thread works with its own cheap clone of the engine, sharing that Arc-counted starting state instead of locking. This scheme may be used instead of a global Arc<Mutex<Engine>>.
It may not be that useful if the Engine instance is part of an object that is already shared between threads using Arc+Mutex.
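A minimal sketch of how that could look from the user's side, assuming the arc cargo feature is enabled so that Engine is Send, and that Engine::clone shares the Arc-counted state (illustrative usage, not the final design):

use regorus::{Engine, Value};

fn main() -> anyhow::Result<()> {
    // Build and populate the engine once.
    let mut engine = Engine::new();
    engine.add_policy(
        "example.rego".to_string(),
        r#"
        package example
        allow { input.user == "admin" }
        "#
        .to_string(),
    )?;

    // Instead of a global Arc<Mutex<Engine>>, each thread gets its own
    // clone; the parsed policies and data are shared, not re-parsed.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let mut e = engine.clone();
            std::thread::spawn(move || {
                e.set_input(Value::from_json_str(&format!(
                    r#"{{"user": "user{i}"}}"#
                ))?);
                e.eval_rule("data.example.allow".to_string())
            })
        })
        .collect();

    for h in handles {
        println!("{:?}", h.join().unwrap()?);
    }
    Ok(())
}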
I think that would be a nice addition, certainly. Although I kind of got used to using Mutex, I'd much rather avoid it if I can, without losing performance, of course.
@anakrish correct me if I'm wrong here -- but the referenced benchmark of calling a binary with json input seems to be nothing like a production use case.
What we want to do is create an engine and load policy/data to precalculate AST, then cache and re-use this object for an array of inputs. Was investigating the usage of golang bindings yesterday and performance-wise, this approach was getting crushed by opa/rego which can make use of precomputed AST and has some hooks for further optimizations.
For the regorus approach, I cached an Engine w/ policies loaded and cloned it for each evaluation. Certainly some of the performance delta can be attributed to cgo overhead; however, I'm also wondering if there is performance data available for native Rust use cases more similar to what I just described?
Thanks!
@adams-shaun
but the referenced benchmark of calling a binary with json input seems to be nothing like a production use case.
This comparison is intended to give a peek into the performance of the different components (parser, interpreter) of the 3 engines (regorus, opa, rego-cpp) when evaluating the same policy. Hence no caching is done.
For most production use cases, yes, the policy will be preloaded, and evaluation will be done reusing this parsed policy. We've observed Regorus to be much faster in this case as well.
Was investigating the usage of golang bindings yesterday and performance-wise, this approach was getting crushed by opa/rego which can make use of precomputed AST and has some hooks for further optimizations.
Would it be possible to share your benchmark? In your benchmark is Regorus parsing the policy for each evaluation whereas OPA isn't?
For the regorus approach, I cached an Engine w/ policies loaded and cloned it for each evaluation.
Typically, one doesn't need to clone the engine for each evaluation. The same engine can be reused. I guess you need to clone because you are using it from Go?
Certainly some of the performance delta can be attributed to cgo overhead
Likely. But I'm curious about the query evaluation numbers in your benchmark. I would like to fix any performance issues in the Go bindings nonetheless.
I'm also wondering if there is performance data available for native RUST uses cases more similar to what I just described?
The last performance experiment I worked on was to run the benchmarks in https://arxiv.org/pdf/2403.04651 using Regorus in place of OPA. The paper finds that Cedar is 80x faster than OPA. We observed that with unmodified Rego policies, Cedar is only 5x faster than Regorus and with tweaked Rego policy, Regorus is faster than Cedar. Note that in this benchmark, only the policy evaluation is measured, similar to the real-world use case you describe. My experiments are here https://github.com/anakrish/cedar-examples. Let me know if you want instructions to try it out.
Hi @anakrish -- Thanks for the response.
I'm still trying to grok the results of our testing, and I'm not sure how much of it I can break apart for public consumption, but I'll try to give you a gist of the test setup:
My knowledge of CGO and Rust performance profiling is limited. Admittedly, I'm probably not doing a fair apples-to-apples comparison here. I was trying to determine whether Regorus could serve as a drop-in replacement in the interim while we transition towards a native Rust policy evaluation service.
Questions that still come to mind:
Is CGO overhead a limiting factor here?
Could the exported FFI interface be optimized for further improvement (take input and evaluate a pre-parsed query, returning the result in a single call), i.e. limiting stack swaps and getting a fairer comparison?
Hope that all makes sense.
@adams-shaun
I have pushed a compare-eval script to https://github.com/anakrish/rego-compare that benchmarks the policy evaluation time using OPA and Regorus. For the chosen policies, you should see output similar to below, indicating a speedup of 2x to 3x.
./compare-eval
Testing example policy
evaluating data.example.allow using OPA average eval time = 136.22 microseconds
evaluating data.example.allow using regorus eval-rule average eval time = 67.03837 microseconds
evaluating data.example using OPA average eval time = 160.16 microseconds
evaluating data.example using regorus eval-query average eval time = 77.98542 microseconds
evaluating data using OPA average eval time = 163.71 microseconds
evaluating data using regorus eval-query average eval time = 75.61737 microseconds
Testing ACI policy
evaluating data.framework.mount_overlay using OPA average eval time = 194.37 microseconds
evaluating data.framework.mount_overlay using regorus eval-rule average eval time = 70.82748 microseconds
evaluating data.framework using OPA average eval time = 1545.58 microseconds
evaluating data.framework using regorus eval-query average eval time = 548.64717 microseconds
evaluating data using OPA average eval time = 2054.17 microseconds
evaluating data using regorus eval-query average eval time = 650.39836 microseconds
The script uses two binaries, opa-profile and regorus-profile, that take policies, data and input, evaluate a query or rule for a given number of iterations, and print the average eval time.
You could try out your policy, data and input using opa-profile and regorus-profile to determine how the performance would look in regorus and OPA without cgo overhead.
Note that regorus provides both eval-rule and eval-query. The former will always be faster than the latter, since eval-query has to parse the query each time whereas eval-rule doesn't.
@adams-shaun
Thanks for sharing aspects of your use case.
Input data is transformed into rego text
By this, do you mean that rego code is generated for input instead of regorus::Value? Using Engine::add_data and Engine::set_input will be faster than emitting input/data values as rego code within a policy. In the former case, values are directly available, whereas in the latter case the rego expressions need to be evaluated on each query to compute values.
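For example (a hedged sketch of the suggested approach; exact signatures may differ slightly across regorus versions, and data.example.allow is a placeholder rule):

use regorus::{Engine, Value};

fn allow(engine: &mut Engine) -> anyhow::Result<Value> {
    // Preferred: pass data and input as values; they are directly
    // available to the interpreter without further evaluation.
    engine.add_data(Value::from_json_str(r#"{"roles": {"alice": "admin"}}"#)?)?;
    engine.set_input(Value::from_json_str(r#"{"user": "alice"}"#)?);

    // Avoid: generating a rego module such as
    //     package generated
    //     user := "alice"
    // since those rego expressions would be re-evaluated on each query.
    engine.eval_rule("data.example.allow".to_string())
}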
I was using "clone" because re-use of an engine across goroutines would require some locking mechanism
Makes sense. Is the cost of cloning also included in the measurement? Currently Engine::clone is not as optimized as it can be; deep copies are performed instead of just incrementing a ref count.
However, all things being equal, we still seem to spend a greater % of CPU cycles in the CGO evaluate-policy path.
You could try eval_rule instead of eval_query. The former will be faster due to not having to parse the query, as well as returning a Value directly instead of an OPA-style Result.
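In Rust terms (a sketch; the Go bindings wrap the same underlying calls, and data.example.allow stands in for your rule):

use regorus::Engine;

fn compare(engine: &mut Engine) -> anyhow::Result<()> {
    // eval_query parses the query string on every call and returns
    // OPA-style results.
    let _results = engine.eval_query("data.example.allow".to_string(), false)?;

    // eval_rule skips query parsing and yields the rule's Value directly,
    // making it the faster path when a single rule is needed.
    let _value = engine.eval_rule("data.example.allow".to_string())?;
    Ok(())
}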
Maybe the performance overhead could also be due to having to marshal the result back to json?
while we transition towards a native Rust policy evaluation service.
Would the policies be written in Rust or just the evaluator would be in Rust? I'm curious what is the motivation for the move to native Rust based policy evaluation service.
On a related note, a major feature in the plan for Regorus is the introduction of a policy intermediate language (IL) that different policy languages including Rego will be compiled into. In addition to being optimally evaluated, the IL would support code generation to Rust, C# etc.
Is CGO overhead a limiting factor here?
I'm not yet sure. It could also be that OPA implements some optimizations that regorus doesn't. Trying out the policy using opa-profile and regorus-profile would help determine whether it is the CGO overhead or policy evaluation overhead.
Could the exported FFI interface be optimized for further improvement (take input and evaluate a pre-parsed query, returning the result in a single call), i.e. limiting stack swaps and getting a fairer comparison?
Yes, we can do this if this is what is causing the performance difference.
@adams-shaun
You might also want to try evaluating a rule immediately after all the policies and data have been loaded into the engine instance. This rule evaluation causes the engine instance to be "prepared for evaluation". If this engine is then cloned, the clones will inherit the preparation without having to redo it. This will speed up evaluation using the clones, though I'm not sure whether it will make a difference in your case.
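A sketch of the idea, where data.example.allow stands in for any rule in the loaded policies:

use regorus::Engine;

// Trigger "preparation" once, then hand out clones.
fn make_prepared(mut engine: Engine) -> Engine {
    // Evaluate any rule once; the result itself is ignored here.
    // This causes the engine to be "prepared for evaluation", and
    // clones made afterwards inherit that preparation.
    let _ = engine.eval_rule("data.example.allow".to_string());
    engine
}

Clones of the returned engine then skip the one-time preparation cost.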
Hey @anakrish,
Circling back to look at your data... thanks again.
I think for the rego-compare delta, most of the gap (for my use case) is closed by doing:
// Parse the input into an AST value once, outside the loop, so that
// each Eval call reuses the pre-parsed input instead of re-parsing it.
rawPtr := util.Reference(input)
parsedInput, _ := ast.InterfaceToValue(*rawPtr)
start := time.Now()
for i := 0; i < numIterations; i++ {
	// Reuse the prepared query with the pre-parsed input.
	rs, err = query.Eval(ctx, rego.EvalParsedInput(parsedInput))
	if err != nil {
		return err
	}
}
I think this is analogous to parsing the input into a Value object in the regorus script. Can you provide any more details on the "tweaked" policies you referenced for the white paper demo repo? I'm just digging into the code now, but wondering if there are any guidelines for optimizing performance.
Our policies are quite small, and we iterate through multiple of them to arrive at a deny/allow. The input object is shared across multiple policies and is pre-filtered to only include relevant fields. The folks who worked on this before me have obviously put in a great deal of effort to squeeze as much performance as possible out of OPA.
I've made a few changes to my local regorus repo: a variant of eval_rule that avoids passing JSON responses back through cgo -- now just a struct with a couple of string fields.
Taking all of the above, minus some CGO overhead, we are now beating the optimized opa/rego policy engine in local benchmarks (and much fewer heap allocs)! As OPA has not been performant enough for us, there has been an effort to hand-write native Golang rule evaluation code. I was hoping to show regorus as an alternative to this maintenance-heavy approach, but I'm not sure I can close the gap. My policies in regorus are evaluating in the 5-20us range, but these hand-written functions are <1us. I should note, even 5us is quite small as a percentage of CPU time spent processing the gRPC request -- probably with a high-performance Rust gRPC server, that gap disappears.
I'm fairly new (as of yesterday) to Rust, so I'm unsure how much faster I can push this. I'm hoping to continue this effort, learning the language and someday push some contributions back upstream.
I can share some examples with ya if you are curious, are you in Slack somewhere?
Thanks again!
@adams-shaun Thanks for all the investigation. Great work! Let me address the various points in your post via multiple comments.
I think for the rego-compare delta, most of the gap (for my use case) is closed by doing:
Thanks for pointing out rego.EvalParsedInput. It seems to improve OPA performance for the example policy, not so much for the ACI policy.
An interesting thing I noticed was that, even with all the parsing and preparation done up front, if a new PreparedEvalQuery instance is created each time, there is a noticeable performance slowdown when specific rules are evaluated.
Testing example policy
evaluating data.example.allow using OPA average eval time = 88.96 microseconds
evaluating data.example.allow using OPA (fresh instance each time) average eval time = 101.58 microseconds
evaluating data.example.allow using regorus eval-rule average eval time = 69.33359 microseconds
evaluating data.example using OPA average eval time = 108.55 microseconds
evaluating data.example using OPA (fresh instance each time) average eval time = 116.01 microseconds
evaluating data.example using regorus eval-query average eval time = 77.91995 microseconds
evaluating data using OPA average eval time = 111.65 microseconds
evaluating data using OPA (fresh instance each time) average eval time = 117.74 microseconds
evaluating data using regorus eval-query average eval time = 75.10071 microseconds
Testing ACI policy
evaluating data.framework.mount_overlay using OPA average eval time = 185.59 microseconds
evaluating data.framework.mount_overlay using OPA (fresh instance each time) average eval time = 498.73 microseconds
evaluating data.framework.mount_overlay using regorus eval-rule average eval time = 70.60009 microseconds
evaluating data.framework using OPA average eval time = 1568.31 microseconds
evaluating data.framework using OPA (fresh instance each time) average eval time = 1717.96 microseconds
evaluating data.framework using regorus eval-query average eval time = 550.49225 microseconds
evaluating data using OPA average eval time = 2082.92 microseconds
evaluating data using OPA (fresh instance each time) average eval time = 2202.94 microseconds
evaluating data using regorus eval-query average eval time = 650.8659 microseconds
This leads me to believe that some kind of caching or initialization happens the first time PreparedEvalQuery is used.
Do you know what could explain this? We could add similar caching to Regorus to improve performance further.
@adams-shaun
Can you provide any more details on the "tweaked" policies you referenced for the white paper demo repo? I'm just digging into the code now, but wondering if there are any guidelines for optimizing performance.
We have an email thread with the authors of the OOPSLA paper (some of the MSR researchers I work with are well connected with the creators of the Cedar language who authored the paper) showing our findings that Regorus is much faster than OPA for the rego policies in the paper and when policies are tweaked, Regorus even outperforms Cedar. This is remarkable given that the policies being benchmarked are tailored for the authorization scenarios which Cedar specialized in. The understanding is that the Cedar team will redo the benchmarks and update the paper as appropriate to reflect the new findings.
The tweaks to the policies to get better performance involve evaluating a preprocess rule and adding the results to the data document for use by the actual query evaluation.
Taking all of the above, minus some CGO overhead, we are now beating the optimized opa/rego policy engine in local benchmarks (and much fewer heap allocs)! ... I'm hoping to continue this effort, learning the language and someday push some contributions back upstream. I can share some examples with ya if you are curious, are you in Slack somewhere?
It is quite impressive that you are able to beat OPA in local benchmarks for policies/input that have been heavily optimized for OPA. I think your work will greatly benefit other users of Regorus from Go. Let me know if you need any help contributing upstream. I seem to have a Slack account (anakrish@microsoft.com) for the confidential-containers workspace but am not too active there.
As Opa has not been performant enough for us, there has been an effort to hand-write native Golang rule evaluation code. I was hoping to show regorus as an alternative to this maintenance-heavy approach, but I'm not sure I can close the gap. My policies in regorus are evaluating in the 5-20us range, but these hand written functions are <1us.
We will continue to improve the current Regorus interpreter, which is an AST-walking interpreter. That, coupled with the policy tweaks outlined earlier, should result in reduced evaluation time. However, it still might not be close in performance to hand-written, optimized policy evaluation code in, say, Rust/Go.
Our long-term plan is to introduce a policy intermediate language (IL) that many policy languages, including Rego, will be compiled into. The IL will be optimized and will be faster to evaluate than the current regorus AST-walking approach. The IL will also support generating policy evaluation code in Rust, C#, Go, etc. This generated code will likely be much faster than interpreted evaluation and will likely be very competitive with hand-written policy evaluation code.
Not an issue, per se, but what is the difference between regorus and regorust? I'm interested in this especially in the context of both projects being under the microsoft organisation.