Data API: Allow to overwrite/patch/delete bundle-owned documents

mdemierre commented 3 years ago

Our use case

In our OPA use case, we have the following situation:

OPA runs as a separate service
The policies and some of the data are fully static (stored in a directory of the OPA Docker image and loaded on startup)
Some of the data needed for evaluation is dynamic (controlled by an external service and filled in by users).
The dynamic data doesn't change often, but when it does it's important that it gets to OPA fast. This is because the users set some options and expect the resulting policy decision changes to be visible shortly.

We initially implemented the Bundle API with the bundle being generated on-demand. It works well because when OPA starts it contacts the external service and downloads the full data to initialize.

Then, when there is a dynamic change of the data, we wanted to use the Data API to replace/patch/delete the relevant subdocument. However, this is not allowed by the Data API since the data is owned by the downloaded bundle.
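For readers unfamiliar with the Data API, the three calls mentioned above look roughly like the sketch below. It only builds the requests and does not send them. The OPA address is the default one, and the document path `tenants/acme/options` and its values are made up for illustration. Against a bundle-owned path, OPA rejects these calls, which is the behavior this issue is about.

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"  # assumption: OPA listening on its default address


def data_api_request(method, path, body=None):
    """Build (but do not send) a Data API request for the document at `path`."""
    # PATCH bodies are JSON Patch documents and use a dedicated content type.
    if method == "PATCH":
        content_type = "application/json-patch+json"
    else:
        content_type = "application/json"
    payload = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=f"{OPA_URL}/v1/data/{path}",
        data=payload,
        method=method,
        headers={"Content-Type": content_type},
    )


# Hypothetical document path and values, for illustration only.
replace = data_api_request("PUT", "tenants/acme/options", {"mfa_required": True})
patch = data_api_request(
    "PATCH",
    "tenants/acme/options",
    [{"op": "replace", "path": "/mfa_required", "value": False}],
)
delete = data_api_request("DELETE", "tenants/acme/options")

# To send one of them to a running OPA: urllib.request.urlopen(replace)
```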

The documentation is not very clear about this, as it's not mentioned in the Data API docs, but only hinted at in https://www.openpolicyagent.org/docs/latest/management/#bundles:

By default, the OPA REST APIs will prevent you from modifying policy and data loaded via bundles. If you need to load policy and data from multiple sources, see the section below.

The "by default" wording seems to imply there is a way to change this. Also the section it refers to doesn't mention the Data API.

With the current setup, the only way we found is to stop using the Bundle API and use the Data API instead: push the whole data periodically (or when OPA is restarted) and push small diffs when needed.

However, this inverts the relationship between the external service and OPA. Now:

The external service needs some way to know that OPA has restarted in order to push the whole data to it
The external service needs to track all the OPA instances and their status in order to push to them
If the external service is scaled, it needs to have logic to not push N times.
To avoid routing policy evaluation to an empty (just restarted) OPA instance we need some custom logic
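As an illustration of the "custom logic" in the last point, one possible approach (an assumption, not something we have in place) is for the external service to write a sentinel document last, and for a readiness probe to check that it exists. A Data API GET for an undefined document returns HTTP 200 with a JSON object that has no `result` key, so the probe only needs to look for that key. The sentinel path `sync/marker` is hypothetical.

```python
import json


def opa_has_data(response_body: str) -> bool:
    """Return True if a Data API GET response contains a defined document.

    OPA answers a GET for an undefined document with HTTP 200 and a JSON
    object without a "result" key, so the key's presence is the signal.
    """
    return "result" in json.loads(response_body)


# Example response bodies for GET /v1/data/sync/marker, where "sync/marker"
# is a hypothetical sentinel written by the external service after a full push.
after_full_push = '{"result": {"pushed_at": "2021-02-10T12:00:00Z"}}'
just_restarted = "{}"

initialized = opa_has_data(after_full_push)
empty = opa_has_data(just_restarted)
```

A load balancer or platform health check would still need to call this probe, which is the extra moving part we would like to avoid.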

This is not ideal either, as we run in a dynamic environment (CloudFoundry) and services change.

Feature request

We would see the ability to use the Data API to push changes to Bundle-owned documents as an elegant solution to this problem:

OPA would download the whole data on startup, thus be initialized right away
OPA would receive changes as needed through Data API PUT/PATCH/DELETE
Regular bundle re-downloads (with a longish interval, say 3 hours) would allow OPA to recover from scenarios where data couldn't be pushed (network issues, OPA restarting, auto-scaling...)
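The three points above can be sketched as a small simulation. `DataStore` is a toy stand-in for OPA's in-memory data, not OPA's actual store, and the user names and options are invented; the point is only to show how a periodic full download repairs a lost push.

```python
import copy


class DataStore:
    """Toy stand-in for OPA's in-memory data (not OPA's real store)."""

    def __init__(self):
        self.data = {}

    def load_bundle(self, snapshot):
        # A bundle download replaces the whole document tree it owns.
        self.data = copy.deepcopy(snapshot)

    def push(self, key, value):
        # A pushed change touches a single sub-document.
        self.data[key] = value


source_of_truth = {"alice": {"beta": False}, "bob": {"beta": False}}
opa = DataStore()
opa.load_bundle(source_of_truth)          # startup: initialized right away

source_of_truth["alice"] = {"beta": True}
opa.push("alice", {"beta": True})         # change delivered through a push

source_of_truth["bob"] = {"beta": True}   # this push is lost (e.g. network issue)
stale = opa.data["bob"]                   # OPA still holds the old value

opa.load_bundle(source_of_truth)          # periodic re-download reconciles
```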

In fact if I'm not wrong it works when not doing multi-bundle: if I point OPA to a directory with policies and data, I can push changes to this data. It seems the behavior I described is specific to bundles downloaded from bundle servers.

This could be an option set in the "bundles" part of the config if it's desired to prevent such updates by default.

Questions

Is there a fundamental reason that such document updates are not allowed?
If we switch to the Data API for everything, what would be the recommended way to handle the "OPA cold startup" issue?

Related issues

#1055: would allow the same kind of reactive updates but has not seen any development yet and is quite a major change.

tsandall commented 3 years ago

Adding the ability to patch bundle owned documents is an interesting idea. It's attractive because (minimally) we could just relax the path check on the data API and suddenly it would work! On the other hand, I'm a bit concerned about the side effects (no pun intended). E.g., the revision ID on decision logs would be meaningless (or worse, harmful).

I have to wonder, if you're prepared to write code that sends PUT/PATCH/DELETE calls to OPA, have you considered extending OPA with a custom plugin? With a custom plugin you could extend OPA to read updates from a data source like Kafka and then apply them to the in-memory store. You could still use bundles for distribution of policy and static data but dynamic data could be sourced from elsewhere. I've heard of a few folks that have tried this out with success.

Alternatively, #1055 would provide an OPA-native solution to this. Let's say we implemented the changes in OPA to support deltas and push updates; do you still feel like implementation of the server-side would be more work than what you've proposed?

mdemierre commented 3 years ago

Hi @tsandall thanks for the quick reply.

I understand the Revision ID issue: there would be no way to know what state OPA really was in since PUTs might have been done in the meantime.

In our case the revision is actually not included in the manifest. Maybe this could be a requirement (that the bundle has no revision)? But it's not elegant I agree. Also, if we use PUT we already don't have the revision feature. One way could also be to add an optional revision number in the PUT.

Custom plugin

We didn't consider going the custom plugin route yet. The implementation you suggested is very elegant. In fact we used this exact pattern of reading changes for another project (not OPA related, but also for policy enforcement based on dynamic data). The Kafka topic was compacted and contained the latest version of each key. It worked quite well.

The client (OPA plugin in this case) would:

On startup, read the whole Kafka topic and commit the final state to memory (this is in order to not apply old changes)
Mark itself as ready/healthy
Continue to receive new elements from Kafka and apply them as they come
Re-sync the whole when disconnected and reconnected
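The steps above can be sketched without any Kafka client. This is a minimal illustration under stated assumptions: records arrive as `(key, value)` pairs in topic order, a `None` value is a tombstone for a deleted key, and `TopicSync` and its method names are hypothetical, not part of OPA's plugin API or of a Kafka library.

```python
class TopicSync:
    """Replays a compacted topic into a dict, then applies live records."""

    def __init__(self):
        self.state = {}
        self.ready = False

    @staticmethod
    def _apply(target, key, value):
        if value is None:
            # Tombstone: the key was deleted upstream.
            target.pop(key, None)
        else:
            target[key] = value

    def bootstrap(self, records):
        # Fold the backlog into a scratch dict and commit it in one step,
        # so older intermediate values are never visible to queries.
        scratch = {}
        for key, value in records:
            self._apply(scratch, key, value)
        self.state = scratch
        self.ready = True

    def on_record(self, key, value):
        # Live record received after the initial replay.
        self._apply(self.state, key, value)

    def on_reconnect(self, records):
        # After a disconnect, report not ready and replay everything again.
        self.ready = False
        self.bootstrap(records)


sync = TopicSync()
sync.bootstrap([
    ("alice", {"beta": False}),
    ("bob", {"beta": False}),
    ("alice", {"beta": True}),   # newer value for the same key wins
    ("bob", None),               # bob was deleted
])
sync.on_record("carol", {"beta": False})
```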

In terms of implementation and maintenance effort we are quite constrained, and I would assume switching to PUSH only would take less time (with the drawbacks I mentioned). The plugin implementation has the following challenges:

Our team doesn't really have Go experience -> maintenance risk
The plugin API is experimental -> maintenance effort + could be removed at some point before OPA 1.0
The source of truth is a PostgreSQL DB -> we would need something that syncs it to Kafka reliably, instead of the "fire and forget when inserting, let the bundle download reconcile" approach we would have with push

On the other hand, the custom plugin would be something I'd personally love to implement, and we have a Kafka cluster at our disposal. We'll think about it.

Feature #1055

It would solve the same problem. I think the implementation on the bundle server side is a bit more complex, with the constant connection and the creation of bundle deltas, but the need would be met. Probably the difficulty is more in making this constant connection work with web frameworks than anything else. Some don't support this kind of pattern very well.

It's actually very similar to the Kafka-based solution, except with a different (more coupled) transport mechanism. It's even more similar to how many Kubernetes components work (with the Watch API).

mdemierre commented 3 years ago

@tsandall Is the final decision on this to use a custom plugin or to wait for #1055?

tsandall commented 3 years ago

@mdemierre sorry for the delayed reply...

Yes, the recommendation for the time being would be to implement a custom plugin. We can keep this open for now in case other folks have a need to patch/modify bundle owned data.

trysetnull commented 3 years ago

I stumbled across this issue the other day - we were using the Data API to change the contents of the data.json file in our root directory. In v0.29.4 everything worked but the bug (feature) was closed in v0.30.0.

The solution was to use the --watch flag and list the data.json file explicitly; then we were able to edit the data.json file in the root directory and OPA automatically reloaded the changes.

Documenting here in case others have a similar workflow.

stale[bot] commented 2 years ago

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days.

open-policy-agent / opa

Data API: Allow to overwrite/patch/delete bundle-owned documents #3138
