mvaalexp opened this issue 1 year ago
Hello,
As far as I know, there is no synchronization between kube-mgmt pods. Also, as far as I know, a single kube-mgmt instance can't work simultaneously with multiple OPA instances. So, currently, it is impossible to scale the kube-mgmt Helm chart.
Suggestions, and especially contributions, are welcome.
While they aren't synchronized, the "synchronization" in this case happens via the annotation, which seems to cause a race condition. But yes, ideas / fixes most definitely welcome :)
I've been chatting with Ievgenii in Slack, cross-posting my ideas here.
This would be a more 'ideal scenario' fix imo and might not be what we want to do to address this issue.
My premise is that users should only run one kube-mgmt per cluster because memory use of kube-mgmt in a large cluster is likely to be very large. However, it might still be desirable to have many OPAs so that decisions can be made quickly, or OPAs can be colocated with workloads.
At the moment, kube-mgmt and OPA are deployed 1:1 (one kube-mgmt instance replicating into one OPA).
kube-mgmt is notified of k8s data and it replicates it into opa using the REST API. This makes adding support for a large number of OPA instances challenging as kube-mgmt needs to go around and send out the data many times. This feels like it could be quite error prone.
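For context, the replication the current mode does is essentially an HTTP `PUT` of the discovered resources into OPA's data API. A minimal sketch, assuming an OPA at localhost and a made-up data path and payload (this is not the actual kube-mgmt code):

```go
// Minimal sketch of pushing one Kubernetes resource into a single OPA via
// the REST data API (PUT /v1/data/<path>). Address, path, and payload are
// illustrative only.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func replicate(opaURL, path string, resource any) error {
	body, err := json.Marshal(resource)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, fmt.Sprintf("%s/v1/data/%s", opaURL, path), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status from OPA: %s", resp.Status)
	}
	return nil
}

func main() {
	// With N OPA instances, a call like this has to be fanned out N times per
	// update, which is where it gets error prone.
	ns := map[string]any{"metadata": map[string]any{"name": "default"}}
	if err := replicate("http://localhost:8181", "kubernetes/namespaces/default", ns); err != nil {
		fmt.Println("replication failed:", err)
	}
}
```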
Instead, I think it might be best to add a new mode to kube-mgmt where it exposes a bundle API endpoint. Then a single kube-mgmt instance could serve bundles to many OPAs.
An example of such a server is here: https://github.com/ashutosh-narkar/example-delta-bundle-server/blob/main/server.go#L267. This example server sends a snapshot (complete) bundle when an OPA is new and then incrementally updates the OPA as data changes. I think this example could be expanded into a working solution for kube-mgmt.
Using the bundle service API in this way would allow the kube-mgmt instance to easily serve many OPAs at the same time in a consistent manner.
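As a very rough sketch of what that mode could look like, assuming kube-mgmt keeps a pre-built snapshot bundle on disk and rebuilds it on Kubernetes events (the port, path, and endpoint name here are my assumptions, not the linked example's layout):

```go
// Sketch of a bundle-serving mode: OPAs would be configured with this endpoint
// as a bundle service and poll it for the "kubernetes" bundle. A production
// version would also support ETag/If-None-Match and, ideally, delta bundles
// as in the linked example server.
package main

import (
	"log"
	"net/http"
)

func main() {
	const bundlePath = "/var/run/kube-mgmt/kubernetes.tar.gz" // rebuilt by the watcher on k8s events

	http.HandleFunc("/bundles/kubernetes", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/gzip")
		http.ServeFile(w, r, bundlePath)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Each OPA replica would then just point its bundle configuration at this service, so adding or removing OPAs needs no change on the kube-mgmt side.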
Not related to the above, but we think this popped up somewhere in our upgrade (we did a huge upgrade from 0.12.1 -> 7.1.1 at some point).
Originally, the order of the configmap loads was important, so we named the rules something like
1_helpers.rego
2_rule.rego
At some point this changed to a listener/retry style, where loads come in out of order and eventually resolve.
This doesn't solve the sync issue, but if the order came in correctly, you wouldn't need to retry continuously (which causes the problem). This would be more controllable if the load order was somewhat configurable.
My suggestion here is to add an annotation "hint" for the load priority of the Rego.
1_helpers.rego -> openpolicyagent.org/kube-mgmt-load-priority: 1
2_helpers.rego -> openpolicyagent.org/kube-mgmt-load-priority: 2
Not sure how this would be achieved in the current implementation (I didn't look where the initial configmap load happens).
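A sketch of how the hint could be honoured during the initial load, assuming the annotation key above and a simple sort of the policy ConfigMaps before loading (this is not how the current implementation lists ConfigMaps):

```go
// Sketch: order policy ConfigMaps by a hypothetical load-priority annotation
// before loading them into OPA. Lower values load first; unannotated
// ConfigMaps keep a stable order at the end.
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const priorityAnnotation = "openpolicyagent.org/kube-mgmt-load-priority"

func loadPriority(cm corev1.ConfigMap) int {
	v, ok := cm.Annotations[priorityAnnotation]
	if !ok {
		return math.MaxInt
	}
	p, err := strconv.Atoi(v)
	if err != nil {
		return math.MaxInt
	}
	return p
}

func sortByLoadPriority(cms []corev1.ConfigMap) {
	sort.SliceStable(cms, func(i, j int) bool {
		return loadPriority(cms[i]) < loadPriority(cms[j])
	})
}

func main() {
	cms := []corev1.ConfigMap{
		{ObjectMeta: metav1.ObjectMeta{Name: "2_rule", Annotations: map[string]string{priorityAnnotation: "2"}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "1_helpers", Annotations: map[string]string{priorityAnnotation: "1"}}},
	}
	sortByLoadPriority(cms)
	for _, cm := range cms {
		fmt.Println(cm.Name) // 1_helpers, then 2_rule
	}
}
```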
@mvaalexp we should not invent workarounds for kube-mgmt itself. The idea is that there should be only one instance of kube-mgmt running.
@charlieegan3 I am thinking about the idea of introducing a third container that would be a bundle-serving service and take responsibility for authorization, something like nginx with Lua or oauth2-proxy. This would avoid having to program all those authorization mechanisms in Go (which I don't know and avoid as much as possible). So kube-mgmt would only listen to k8s events and create OPA bundle(s). The proxy and kube-mgmt can share a volume, so bundle(s) created by kube-mgmt can be served from the file system by the proxy.
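To make the shared-volume handoff concrete, here is a small sketch of the piece kube-mgmt would own in that design: building a snapshot bundle and writing it atomically to a path that the sidecar proxy serves. The directory, file names, and bundle contents are assumptions; a real implementation would build a proper OPA bundle (.manifest, data.json, policies).

```go
// Sketch: write a gzipped tarball bundle to a volume shared with a serving
// sidecar (e.g. nginx). The rename at the end is atomic on the same
// filesystem, so the proxy never serves a half-written bundle.
package main

import (
	"archive/tar"
	"compress/gzip"
	"log"
	"os"
	"path/filepath"
)

func writeBundle(dir string, files map[string][]byte) error {
	tmp, err := os.CreateTemp(dir, "bundle-*.tar.gz")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	gz := gzip.NewWriter(tmp)
	tw := tar.NewWriter(gz)
	for name, content := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(content); err != nil {
			return err
		}
	}
	if err := tw.Close(); err != nil {
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dir, "kubernetes.tar.gz"))
}

func main() {
	files := map[string][]byte{
		"kubernetes/data.json": []byte(`{"namespaces": {}}`), // placeholder data
	}
	if err := writeBundle("/var/run/kube-mgmt", files); err != nil {
		log.Fatal(err)
	}
}
```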
I really like @charlieegan3's idea of having an alternative mode for kube-mgmt, where it acts as a bundle server for any number of OPAs. You'd still want to run more than one instance, or you'd very much have a single point of failure, but I don't think that would be an issue.
It seems like, even with only 1 replica, the problem still exists: when the pod updates (version update, deployment roll, etc.) there are inevitably two pods running at the same time (the current deployment and the new one), and the problem occurs. We were able to recreate it with just 1 replica in our clusters relatively easily; sometimes no policies would load because they were already loaded in the old pod. So even making kube-mgmt a separate deployment does not solve the problem. Some possible solutions:

- add a locking annotation (this risks it being set and never unset, causing permanent locks)
- make the policy state in the ConfigMaps pod-specific
- remove state from the ConfigMaps entirely
- use a Recreate deployment strategy (which would cause downtime)
- replace the retry logic with something else

A quick and dirty solution would be to attach the pod hash to the retry annotation key, because the retry update only checks that annotation and not the status (or to store the retry count in an in-memory cache instead of an annotation). https://github.com/open-policy-agent/kube-mgmt/blob/2ae02bd1bdbd76ea85758373e98d0befa2b4f089/pkg/configmap/configmap.go#L177
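As a sketch of the "in-memory cache instead of an annotation" option: the retry state would live inside each kube-mgmt process, keyed per ConfigMap, so one pod resetting its counter could never hide pending work from another pod. The types and names below are illustrative, not the existing configmap.go code.

```go
// Sketch: track retries per ConfigMap in process-local memory rather than in
// a shared annotation, so pod A succeeding cannot zero out pod B's retries.
package main

import (
	"fmt"
	"sync"
)

type retryTracker struct {
	mu       sync.Mutex
	counts   map[string]int // keyed by ConfigMap namespace/name (or UID)
	maxTries int
}

func newRetryTracker(maxTries int) *retryTracker {
	return &retryTracker{counts: map[string]int{}, maxTries: maxTries}
}

// shouldRetry records a failed load and reports whether another attempt is allowed.
func (r *retryTracker) shouldRetry(key string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[key]++
	return r.counts[key] <= r.maxTries
}

// reset clears the counter once this pod has successfully loaded the policy.
func (r *retryTracker) reset(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.counts, key)
}

func main() {
	rt := newRetryTracker(3)
	key := "opa/1_helpers"
	fmt.Println(rt.shouldRetry(key)) // true: first failure, a retry is allowed
	rt.reset(key)                    // success in this pod affects only this pod
}
```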
We have been having a recurring issue where not all the policies get loaded into OPA. It is most easily reproducible when running with 2 replicas that launch at exactly the same time.
We discovered there is a race condition that causes the Informer (https://github.com/open-policy-agent/kube-mgmt/blob/2ae02bd1bdbd76ea85758373e98d0befa2b4f089/pkg/configmap/configmap.go#L148) to behave incorrectly when two pods handle the same policy ConfigMap at the same time.
When the informers of the two pods compete and the policies load out of order and fail in both pods, each failure adds a retry count to an annotation, which retriggers the informer, and the informer tries again. When the load resolves properly in the first pod, the retry count is set to 0; on the second pod's retry, it then no longer tries to load the policy, since the retry count is already marked as 0. https://github.com/open-policy-agent/kube-mgmt/blob/2ae02bd1bdbd76ea85758373e98d0befa2b4f089/pkg/configmap/configmap.go#L179
This is what it looks like in the logs. Pod 1:
Pod 2:
You can see from the timestamps that pod 1 becomes healthy, but pod 2 never resolves the rule (and it never loads).