open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0

Proposal: Add metrics in OPA-WASM library #2341


zyy17 commented 4 years ago

I've been working with OPA WASM for a while and have developed a simple metrics mechanism (counters only) in the OPA-WASM library to show how many times the low-level functions execute.

For example, if we are interested in how many times opa_strcmp() runs, we can add a counter at the function entry:

int opa_strcmp(const char *a, const char *b)
{
#ifdef ENABLE_METRICS
    OPA_METRICS_COUNTER_INC(string_compare);
#endif
    /* ... existing comparison logic ... */
}

We provide a metrics counter struct to collect all the metrics, and a dump_metrics() function to print them all (one per line). For performance, the metrics code needs to be wrapped in a conditional-compilation macro. It looks like this:

/* ... */
typedef struct opa_metrics_counter opa_metrics_counter;

struct opa_metrics_counter
{
    unsigned int string_compare;
    /* ... one field per counter ... */
};

const char *dump_metrics(void)
{
    OPA_PRINT_METRICS_TO_BUF(metrics_buf, METRICS_BUF_SIZE, metrics, string_compare);
    /* ... one macro call per counter ... */
    return metrics_buf;
}
/* ... */
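For concreteness, one possible implementation of the macros (the single global metrics instance and the buffer handling here are just illustrative, not part of the proposal):

#include <stdio.h>   /* snprintf; a wasm build would use the library's own string helpers */
#include <string.h>  /* strlen */

#define METRICS_BUF_SIZE 1024

static opa_metrics_counter metrics;        /* one global instance of the counters */
static char metrics_buf[METRICS_BUF_SIZE]; /* zero-initialized, so it starts empty */

#ifdef ENABLE_METRICS
#define OPA_METRICS_COUNTER_INC(field) ((void)(metrics.field++))
#else
#define OPA_METRICS_COUNTER_INC(field) ((void)0)
#endif

/* Append one "name: value" line for a counter field to the buffer. */
#define OPA_PRINT_METRICS_TO_BUF(buf, size, m, field)                    \
    do {                                                                 \
        size_t n = strlen(buf);                                          \
        snprintf((buf) + n, (size) - n, "%s: %u\n", #field, (m).field);  \
    } while (0)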

When the user calls dump_metrics(), it prints all the metrics:

===========================
...
opa_value_compare_array: 11604
opa_value_compare_object: 33203
string_compare: 216654
...
===========================
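A hypothetical host-side call that produces a dump like the one above (assuming a hosted environment with printf):

#include <stdio.h>

extern const char *dump_metrics(void);

int main(void)
{
    /* ... evaluate some policies so the counters accumulate ... */
    printf("===========================\n");
    printf("%s", dump_metrics());
    printf("===========================\n");
    return 0;
}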

Because of the lack of WASM profiling tools, we had to develop this metrics mechanism to find out exactly how many times key functions run (such as __builtin_wasm_grow_memory()), and it works very well.

If this is an acceptable way to add metrics, I can open a PR for code review :)

patrick-east commented 4 years ago

Thanks for submitting this proposal @zyy17 !

Can you share more about the use-case for these types of metrics? Is this primarily for developing the WASM internals of OPA, or for policy authors trying to modify their policies to reduce evaluation time?

Stepping back a little bit from the WASM-specific part of this, there are two main sets of metrics that users/devs have access to for OPA through the golang APIs, CLI, etc.: the high-level metrics (timers and counters around things like parsing, compilation, and evaluation) and the per-expression profiler results (what opa eval --profile reports).

The proposed counters sit somewhere in the middle of the two.

I ask about the use-case because there are a couple of sorts of "end users" I can think of that might consume these. There are the policy authors writing Rego, compiling into WASM, and wanting to do performance analysis of their policies to make improvements, track regressions, etc. Then there are the OPA developers working on the wasm compiler/planner code who are looking to improve the generated WASM program.

What we probably want to do is be mindful of which group would be using these types of metrics. For policy authors I would lean towards aiming for the same level of tooling for WASM as what is offered in the regular OPA tooling. That way someone can write policies using the normal golang OPA runtime, become familiar with the tooling there, and have that knowledge translate well to the WASM environment.

For the OPA developers, e.g. if we wanted the ability to generate debug WASM binaries for dev & test, these kinds of counters are, I think, potentially more interesting/useful.

I guess maybe another follow-up question relating to the use-case: if you had metrics similar to what opa eval --profile gives, would that be as useful? Or for your WASM use-case do you really need the exact counts and timing for something like the string comparisons?

zyy17 commented 4 years ago

@patrick-east Thank you for your response!

While adding metrics to the OPA-WASM library, I ran into two issues related to performance:

Because I'm not so familiar with the core logic of the planner and internal/wasm, I need some observability into what's happening when the WASM binary executes. The metrics data surfaces unusual behavior that I can then track down in the specific code.

When we run Rego code in the Golang environment, profiling is simple and helpful, but I think the WASM environment is still quite different.

I think my proposal is just another auxiliary way for OPA developers to observe WASM, and it can work alongside --profile. Most of the time users won't need it unless they want to debug performance issues.

patrick-east commented 4 years ago

That makes perfect sense, and I think it makes a good case for this type of metric being useful for developers working on OPA. Like you said, without being familiar with the planner logic, having that extra observability is important.

> I think my proposal is just another auxiliary way for OPA developers to observe WASM, and it can work alongside --profile. Most of the time users won't need it unless they want to debug performance issues.

+1, I think that we would probably want these types of metrics to be exposed only for OPA developers, or very advanced users. Similar to the pprof category of metrics for OPA.

As far as the implementation is concerned, using macros requires rebuilding OPA from source. This is maybe OK for the really low-level stuff (similar to using pprof on opa eval), but we would need a different approach for the equivalent of --profile, or even the opa run -s --pprof option, where the end user can get metrics without having to build OPA from source.

Ideally for the equivalent of --profile we would have a flag for opa build to signal that the compiler should generate wasm code that includes profiling/metrics collection.

> For performance, the metrics code needs to be wrapped in a conditional-compilation macro.

Out of curiosity, how much does it affect performance? Would it be possible to have the important functions (memory alloc/free, etc.) call the metric API hooks all the time, and move the implementation of the hooks into code generated by the compiler? That way, when someone supplies the low-level instrumentation flag to opa build, we could emit code to track them, but default to a no-op implementation. There would be extra overhead from calling the function, but it might be pretty minimal compared to the whole policy evaluation.
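Roughly something like this sketch (all names here are made up; the point is just that the call sites stay unconditional):

#include <stddef.h>

/* Illustrative counter IDs the library would know about. */
enum opa_counter_id { OPA_COUNTER_MALLOC, OPA_COUNTER_STRING_COMPARE };

void opa_metrics_hit(enum opa_counter_id id);
extern void *heap_alloc(size_t size); /* stand-in for the existing allocator */

void *opa_malloc(size_t size)
{
    opa_metrics_hit(OPA_COUNTER_MALLOC); /* hook is always called, no #ifdef */
    return heap_alloc(size);
}

/* Default no-op definition; an instrumented build of the generated
   module would swap in a version that actually bumps a counter. */
void opa_metrics_hit(enum opa_counter_id id) { (void)id; }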

I was also thinking about whether we would maybe want to have these low-level metric APIs (counter increment, timer start/stop, etc.) call into external functions that could be implemented by the runtime environment. E.g., for the JavaScript WASM runtime they could be hooked into console.time() and console.timeEnd(), so the user would be able to integrate with the native debugging/profiling tools in their environment.
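E.g., a sketch using clang's WASM import attributes (the import names here are made up):

/* Declared as WASM imports so the embedding runtime supplies them;
   a JS host could forward these to console.time()/console.timeEnd(). */
__attribute__((import_module("env"), import_name("opa_metrics_timer_start")))
void opa_metrics_timer_start(const char *name);

__attribute__((import_module("env"), import_name("opa_metrics_timer_end")))
void opa_metrics_timer_end(const char *name);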

Last up, WRT dump_metrics(): I think we would probably want a more structured API. I would imagine it sending back a JSON object or something, similar to the evaluation results. Then the runtime environment can handle the metrics in whatever way is best for the integration.
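E.g., very roughly, reusing the counter struct from the proposal (a shape sketch, not a real API, and assuming an snprintf is available):

#include <stdio.h>

extern opa_metrics_counter metrics; /* the global counter instance */

const char *dump_metrics_json(void)
{
    static char buf[128];
    snprintf(buf, sizeof(buf), "{\"string_compare\": %u}",
             metrics.string_compare);
    return buf;
}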

stale[bot] commented 2 years ago

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days.