mvaalexp opened this issue 1 year ago
Hello,
As far as I know, there is no synchronization between kube-mgmt pods. Also, as far as I know, a single kube-mgmt instance can't work simultaneously with multiple OPA instances. So, currently, it is impossible to scale the kube-mgmt Helm chart.
Suggestions, and especially contributions, are welcome.
While they aren't synchronized, the "synchronization" in this case happens via the annotation, which seems to cause a race condition. But yes, ideas / fixes most definitely welcome :)
I've been chatting with Ievgenii in Slack, cross-posting my ideas here.
This would be a more 'ideal scenario' fix imo and might not be what we want to do to address this issue.
My premise is that users should only run one kube-mgmt per cluster because memory use of kube-mgmt in a large cluster is likely to be very large. However, it might still be desirable to have many OPAs so that decisions can be made quickly, or OPAs can be colocated with workloads.
At the moment, kube-mgmt and OPA are deployed 1:1 (one kube-mgmt instance replicating into one OPA).
kube-mgmt is notified of k8s data and it replicates it into opa using the REST API. This makes adding support for a large number of OPA instances challenging as kube-mgmt needs to go around and send out the data many times. This feels like it could be quite error prone.
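For context, the replication the current mode does is essentially an HTTP `PUT` of the discovered resources into OPA's data API. A minimal sketch, assuming an OPA at localhost and a made-up data path and payload (this is not the actual kube-mgmt code):

```go
// Minimal sketch of pushing one Kubernetes resource into a single OPA via
// the REST data API (PUT /v1/data/<path>). Address, path, and payload are
// illustrative only.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func replicate(opaURL, path string, resource any) error {
	body, err := json.Marshal(resource)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, fmt.Sprintf("%s/v1/data/%s", opaURL, path), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status from OPA: %s", resp.Status)
	}
	return nil
}

func main() {
	// With N OPA instances, a call like this has to be fanned out N times per
	// update, which is where it gets error prone.
	ns := map[string]any{"metadata": map[string]any{"name": "default"}}
	if err := replicate("http://localhost:8181", "kubernetes/namespaces/default", ns); err != nil {
		fmt.Println("replication failed:", err)
	}
}
```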
Instead, I think it might be best to add a new mode to kube-mgmt where it exposes a bundle API endpoint. Then a single kube-mgmt instance could serve bundles to many OPAs.
An example of such a server is here: https://github.com/ashutosh-narkar/example-delta-bundle-server/blob/main/server.go#L267. This example server sends a snapshot (complete) bundle when an OPA is new and then incrementally updates the OPA as data changes. I think this example could be expanded into a working solution for kube-mgmt.
Using the bundle service API in this way would allow the kube-mgmt instance to easily serve many OPAs at the same time in a consistent manner.
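As a very rough sketch of what that mode could look like, assuming kube-mgmt keeps a pre-built snapshot bundle on disk and rebuilds it on Kubernetes events (the port, path, and endpoint name here are my assumptions, not the linked example's layout):

```go
// Sketch of a bundle-serving mode: OPAs would be configured with this endpoint
// as a bundle service and poll it for the "kubernetes" bundle. A production
// version would also support ETag/If-None-Match and, ideally, delta bundles
// as in the linked example server.
package main

import (
	"log"
	"net/http"
)

func main() {
	const bundlePath = "/var/run/kube-mgmt/kubernetes.tar.gz" // rebuilt by the watcher on k8s events

	http.HandleFunc("/bundles/kubernetes", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/gzip")
		http.ServeFile(w, r, bundlePath)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Each OPA replica would then just point its bundle configuration at this service, so adding or removing OPAs needs no change on the kube-mgmt side.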
Not related to the above, but we think this popped up somewhere in our upgrade (we did a huge upgrade from 0.12.1 -> 7.1.1 at some point).
Originally, the order of the configmap loads was important, so we named the rules something like
1_helpers.rego
2_rule.rego
At some point this changed to a listener/retry style, where loads come in out of order and eventually resolve.
This doesn't solve the sync issue, but if the order came in correctly, you wouldn't need to retry continuously (which causes the problem). This would be more controllable if the load order was somewhat configurable.
My suggestion here is to add an annotation "hint" for the load priority of the Rego.
1_helpers.rego -> openpolicyagent.org/kube-mgmt-load-priority: 1
2_helpers.rego -> openpolicyagent.org/kube-mgmt-load-priority: 2
Not sure how this would be achieved in the current implementation (I didn't look where the initial configmap load happens).
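A sketch of how the hint could be honoured during the initial load, assuming the annotation key above and a simple sort of the policy ConfigMaps before loading (this is not how the current implementation lists ConfigMaps):

```go
// Sketch: order policy ConfigMaps by a hypothetical load-priority annotation
// before loading them into OPA. Lower values load first; unannotated
// ConfigMaps keep a stable order at the end.
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const priorityAnnotation = "openpolicyagent.org/kube-mgmt-load-priority"

func loadPriority(cm corev1.ConfigMap) int {
	v, ok := cm.Annotations[priorityAnnotation]
	if !ok {
		return math.MaxInt
	}
	p, err := strconv.Atoi(v)
	if err != nil {
		return math.MaxInt
	}
	return p
}

func sortByLoadPriority(cms []corev1.ConfigMap) {
	sort.SliceStable(cms, func(i, j int) bool {
		return loadPriority(cms[i]) < loadPriority(cms[j])
	})
}

func main() {
	cms := []corev1.ConfigMap{
		{ObjectMeta: metav1.ObjectMeta{Name: "2_rule", Annotations: map[string]string{priorityAnnotation: "2"}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "1_helpers", Annotations: map[string]string{priorityAnnotation: "1"}}},
	}
	sortByLoadPriority(cms)
	for _, cm := range cms {
		fmt.Println(cm.Name) // 1_helpers, then 2_rule
	}
}
```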
@mvaalexp we should not invent workarounds for kube-mgmt itself. The idea is that there should be only one instance of kube-mgmt running.
@charlieegan3 I am thinking about the idea of introducing a third container that would be a bundle-serving service and take responsibility for authorization, something like nginx with Lua or oauth2-proxy. This would avoid having to program all those authorization mechanisms in Go (which I don't know and avoid as much as possible). So kube-mgmt would only listen to k8s events and create OPA bundle(s). The proxy and kube-mgmt can share a volume, so bundle(s) created by kube-mgmt can be served from the file system by the proxy.
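To make the shared-volume handoff concrete, here is a small sketch of the piece kube-mgmt would own in that design: building a snapshot bundle and writing it atomically to a path that the sidecar proxy serves. The directory, file names, and bundle contents are assumptions; a real implementation would build a proper OPA bundle (.manifest, data.json, policies).

```go
// Sketch: write a gzipped tarball bundle to a volume shared with a serving
// sidecar (e.g. nginx). The rename at the end is atomic on the same
// filesystem, so the proxy never serves a half-written bundle.
package main

import (
	"archive/tar"
	"compress/gzip"
	"log"
	"os"
	"path/filepath"
)

func writeBundle(dir string, files map[string][]byte) error {
	tmp, err := os.CreateTemp(dir, "bundle-*.tar.gz")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	gz := gzip.NewWriter(tmp)
	tw := tar.NewWriter(gz)
	for name, content := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(content); err != nil {
			return err
		}
	}
	if err := tw.Close(); err != nil {
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dir, "kubernetes.tar.gz"))
}

func main() {
	files := map[string][]byte{
		"kubernetes/data.json": []byte(`{"namespaces": {}}`), // placeholder data
	}
	if err := writeBundle("/var/run/kube-mgmt", files); err != nil {
		log.Fatal(err)
	}
}
```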
I really like @charlieegan3's idea of having an alternative mode for kube-mgmt, where it acts as a bundle server for any number of OPAs. You'd still want to run more than one instance, or you'd very much have a single point of failure, but I don't think that would be an issue.
It seems like, even with only 1 replica, the problem still exists: when the pod updates (version update, deployment roll, etc.) there are inevitably two pods running at the same time (the current deployment and the new one), and the problem occurs. We were able to recreate it with just 1 replica in our clusters relatively easily; sometimes no policies would load because they were already loaded in the old pod. So even making kube-mgmt a separate deployment does not solve the problem. Some possible solutions:

- add a locking annotation (this risks it being set and never unset, causing permanent locks)
- make the policy state in the ConfigMaps pod-specific
- remove state from the ConfigMaps entirely
- use a Recreate deployment strategy (which would cause downtime)
- replace the retry logic with something else

A quick and dirty solution would be to attach the pod hash to the retry annotation key, because the retry update only checks that annotation and not the status (or to store the retry count in an in-memory cache instead of an annotation). https://github.com/open-policy-agent/kube-mgmt/blob/2ae02bd1bdbd76ea85758373e98d0befa2b4f089/pkg/configmap/configmap.go#L177
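As a sketch of the "in-memory cache instead of an annotation" option: the retry state would live inside each kube-mgmt process, keyed per ConfigMap, so one pod resetting its counter could never hide pending work from another pod. The types and names below are illustrative, not the existing configmap.go code.

```go
// Sketch: track retries per ConfigMap in process-local memory rather than in
// a shared annotation, so pod A succeeding cannot zero out pod B's retries.
package main

import (
	"fmt"
	"sync"
)

type retryTracker struct {
	mu       sync.Mutex
	counts   map[string]int // keyed by ConfigMap namespace/name (or UID)
	maxTries int
}

func newRetryTracker(maxTries int) *retryTracker {
	return &retryTracker{counts: map[string]int{}, maxTries: maxTries}
}

// shouldRetry records a failed load and reports whether another attempt is allowed.
func (r *retryTracker) shouldRetry(key string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[key]++
	return r.counts[key] <= r.maxTries
}

// reset clears the counter once this pod has successfully loaded the policy.
func (r *retryTracker) reset(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.counts, key)
}

func main() {
	rt := newRetryTracker(3)
	key := "opa/1_helpers"
	fmt.Println(rt.shouldRetry(key)) // true: first failure, a retry is allowed
	rt.reset(key)                    // success in this pod affects only this pod
}
```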
We have been having a recurring issue where not all the policies get loaded into OPA. It is most easily reproducible when running with 2 replicas that launch at exactly the same time.
We discovered there is a race condition that causes the Informer (https://github.com/open-policy-agent/kube-mgmt/blob/2ae02bd1bdbd76ea85758373e98d0befa2b4f089/pkg/configmap/configmap.go#L148) to behave incorrectly when two pods handle the same policy ConfigMap at the same time.
When the informers of the two pods compete and the policies load out of order and fail in both pods, each failure adds a retry count to an annotation, which retriggers the informer, and the informer tries again. When the load resolves properly in the first pod, the retry count is set to 0; on the second pod's retry, it then no longer tries to load the policy, since the retry count is already marked as 0. https://github.com/open-policy-agent/kube-mgmt/blob/2ae02bd1bdbd76ea85758373e98d0befa2b4f089/pkg/configmap/configmap.go#L179
This is what it looks like in the logs. Pod 1:
Pod 2:
You can see from the timestamps that pod 1 becomes healthy, but pod 2 never resolves the rule (and it never loads).