open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0
9.62k stars 1.33k forks source link

"undefined document, document missing: data.system.main" on OPA pod restart #5914

Closed porwalameet closed 1 year ago

porwalameet commented 1 year ago

kube-mgmt-logs-failed.txt

Short description

OPA pod complains about "undefined document" for a policy when the node where OPA pod is running is rebooted.

When API hitting the problematic pod:

* TLSv1.2 (IN), TLS header, Supplemental data (23):
{
  "code": "undefined_document",
  "message": "document missing: data.system.main"
}

When API hitting the working pod:

* TLSv1.2 (IN), TLS header, Supplemental data (23):
{"apiVersion":"admission.k8s.io/v1beta1","kind":"AdmissionReview","response":{"allowed":true}}
* Connection #0 to host 192.168.116.208 left intact

Environment: OPA version: 0.50.2 kube-mgmt version: 8.0.2 Kubernetes: 1.24.8 OPA replicas: 3 kube-mgmt is running as side-car to OPA container

Steps to Reproduce:

  1. Run OPA deployment with 3 replicas spread across different nodes
  2. Power Cycle the entire cluster. This will result in even master node reboot.
  3. Once the Cluster comes online, out of 3 replicas, one OPA pod (bad-pod)is not loading policies from configmap and throws the above error of "undefined document".
  4. The other 2 pods, works as expected.
  5. from Kube-mgmt logs of bad-pod, initially complaints about localhost:8181 as connection refused and later updates the policies.
time="2023-05-11T05:38:22Z" level=info msg="Policy/data ConfigMap processor connected to K8s: namespaces=[*]"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-core/core.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-core/core.rego\": dial tcp [::1]:8181: connect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-captureagent/captureagent.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-captureagent/captureagent.rego\": dial tcp [::1]:8181: connect: connection
 refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-security-context/security-context.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-security-context/security-context.rego\": dial tcp [::1]:8181: con
nect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-istioproxyannotations/istioproxyannotations.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-istioproxyannotations/istioproxyannotations.rego\": dial
 tcp [::1]:8181: connect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-core/core.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-core/core.rego\": dial tcp [::1]:8181: connect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-captureagent/captureagent.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-captureagent/captureagent.rego\": dial tcp [::1]:8181: connect: connection
 refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-security-context/security-context.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-security-context/security-context.rego\": dial tcp [::1]:8181: con
nect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-istioproxyannotations/istioproxyannotations.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-istioproxyannotations/istioproxyannotations.rego\": dial
 tcp [::1]:8181: connect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-core/core.rego, err=Put \"http://localhost:8181/v1/policies/fed-opa/policy-core/core.rego\": dial tcp [::1]:8181: connect: connection refused"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-captureagent/captureagent.rego, err=<nil>"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-security-context/security-context.rego, err=<nil>"
time="2023-05-11T05:38:22Z" level=info msg="Added policy fed-opa/policy-istioproxyannotations/istioproxyannotations.rego, err=<nil>"

the other 2 pods, there was no such connection refused error seen.

Even after policy being successfully loaded, the command curl -k -X post https://<pod-ip> returns undefined document, where as other 2 pods gives 200 OK.

Expected Output: the policy should be loaded and admission request should be evaluated accordingly.[](url))

charlieegan3 commented 1 year ago

Hey @porwalameet, would you also be able to share the logs of the OPA container? Is it crashing on start up or showing any errors which might explain why it hasn't started?

porwalameet commented 1 year ago

The OPA pod is up and running and kube-mgmt reports policy loaded for after initial attempt which was rejected because of "connection refused" on host: localhost:8181

I am attaching the poa logs, i have removed admission request for sake of brevity. opa-logs-github.txt the decision logs keep repeating.

ashutosh-narkar commented 1 year ago

This does not look like an issue with OPA itself. @porwalameet you should look into Bundles for loading policies into OPA. It's a more robust and production-tested approach. You can refer to this tutorial for an example. Also with the config map approach, you need to make sure the openpolicyagent.org/policy-statusannotation is set to {"status": "ok"} before making any requests. Anyways please use the bundle approach for loading policies into OPA. You can even set a readinessProbe to ensure that policies are loaded before OPA handles any requests. Also the bundle approach is much easier to monitor using the Status API.

porwalameet commented 1 year ago

Thanks @ashutosh-narkar . Why I suspect OPA issue here is - in the kubernetes environment, the 3 replicas of OPA deployment only one is in error state. The config map which holds the rego policy reports status as{"status": "ok"} since the other 2 pod where issue is not seen have loaded.

infact on problematic mode, kube-mgmt reports policies are put successfuly if you see the kube-mgmt logs. The problem is when OPA pod gets admission request.

We dont want to use bundle approach at we want all resources confined in kubernetes only and no interaction with external server.

ashutosh-narkar commented 1 year ago

@porwalameet as seen in your logs, some policies are not loaded correctly. The PUT calls that kube-mgmt makes to OPA seem to be failing in some instances. I don't think kube-mgmt has any retry mechanism. So it's possible that not all policies are loaded correctly which results in the error you see. Hence I mentioned that the issue seems to be more related to the deployment/environment than OPA itself.

sba30 commented 1 year ago

Encountered this exact issue, the node where the OPA pod lives gets bounced , then the pod is restarted and in a running state but lots of error":{"code":"undefined_document","message":"document missing: data.system.main"}, errors.

Having to delete the pod in order to resolve the issue.