permitio / opal-helm-chart

You know, for Kubernetes
Apache License 2.0

Installing v0.0.7 with default configuration on EKS not working #23

Closed philipclaesson closed 1 year ago

philipclaesson commented 1 year ago

Hey @RazcoDev!

I'm trying to deploy OPAL with the default configuration using this helm chart, v0.0.7, on AWS EKS. The Kubernetes version is v1.21.5-eks-9017834.

helm install --create-namespace -n opal-ns --version 0.0.7 myopal opal/opal

This gives me three pods. The pgsql and server work fine but the client is not healthy.

NAME                          READY   STATUS             RESTARTS   AGE
xxx-client-7db887db78-rb99m   0/1     CrashLoopBackOff   31         139m
xxx-pgsql-6dcd6dbd64-pmbsw    1/1     Running            0          139m
xxx-server-5db6656dcc-2bjs4   1/1     Running            0          18m

Pulling the logs from the client pod shows me that it is crashing in the healthcheck method of client.py: https://github.com/permitio/opal/blob/master/packages/opal-client/opal_client/client.py#L212

2023-02-13T15:29:16.352136+0000 | uvicorn.protocols.http.httptools_impl   | INFO  | 10.11.14.224:53402 - "GET /healthcheck HTTP/1.1" 500
2023-02-13T15:29:16.352461+0000 | uvicorn.protocols.http.httptools_impl   |ERROR  | Exception in ASGI application

<enormous python trace redacted>

  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
                         └ <function run_endpoint_function at 0x7f12773c70a0>
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
                 │         │      └ {}
                 │         └ <function OpalClient._configure_api_routes.<locals>.healthcheck at 0x7f12771b7d00>
                 └ <fastapi.dependencies.models.Dependant object at 0x7f12770e1ea0>
  File "/usr/local/lib/python3.10/site-packages/opal_client-0.4.0-py3.10.egg/opal_client/client.py", line 212, in healthcheck
    healthy = resp["result"]
              └ {}

KeyError: 'result'

I understand that the client is trying to query the healthcheck policy in OPA, and for some reason that data is not there.
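
For illustration: when OPA is asked for a document that is undefined, it answers with `{}` and no `result` key, which matches the `resp` shown in the trace. A minimal sketch of a tolerant reader, assuming the response has already been parsed into a dict. `read_health` is a hypothetical helper, not OPAL's actual code:

```python
from typing import Any, Mapping


def read_health(resp: Mapping[str, Any]) -> bool:
    """Return True only if the response explicitly reports a healthy state.

    Hypothetical helper, not OPAL's code: a missing "result" key
    (the `{}` seen in the trace) counts as unhealthy instead of
    raising KeyError.
    """
    result = resp.get("result")
    if isinstance(result, Mapping):
        # Shape when querying the parent document: {"result": {"healthy": ...}}
        return result.get("healthy") is True
    # Shape when querying the boolean directly: {"result": true}
    return result is True
```

With this, `read_health({})` returns `False` rather than crashing the endpoint with a 500.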

OPA is up and running: I can reach the web interface and also curl /v1/data or /v1/policies. However, /v1/data/system/opal/healthy just times out.
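
A rough way to tell a request that fails or times out apart from a document that is simply undefined is to query with an explicit timeout and check for the `result` key. This is only a sketch using the Python standard library. The helper names `opa_data_url` and `probe` are made up for illustration, and the base URL is whatever address OPA is reachable on:

```python
import json
import urllib.request


def opa_data_url(base_url: str, path: str) -> str:
    """Build an OPA Data API URL for a document path like 'system/opal/healthy'."""
    return f"{base_url.rstrip('/')}/v1/data/{path.strip('/')}"


def probe(base_url: str, path: str, timeout: float = 3.0) -> str:
    """Classify a query as 'unreachable or error', 'undefined', or 'defined'."""
    try:
        with urllib.request.urlopen(opa_data_url(base_url, path), timeout=timeout) as r:
            body = json.load(r)
    except OSError:
        # Connection failures, timeouts, and HTTP error statuses all
        # surface as OSError subclasses in urllib.
        return "unreachable or error"
    # OPA omits the "result" key when the document is undefined.
    return "defined" if "result" in body else "undefined"
```

For example, `probe("http://localhost:8181", "system/opal/healthy")` would separate the timeout case from the empty-document case.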

Any ideas of what could be the error here?

RazcoDev commented 1 year ago

Hey @philipclaesson, I think we have a missing part in OPAL that allows the /healthcheck to fail. As a workaround, you can set the OPAL_OPA_HEALTH_CHECK_POLICY_ENABLED environment variable to true in the Client's deployment. This should allow the /healthcheck endpoint to work properly.

Lmk how it goes.

philipclaesson commented 1 year ago

Thanks @RazcoDev! Setting OPAL_OPA_HEALTH_CHECK_POLICY_ENABLED helped fix the healthcheck endpoint.

However, the healthcheck itself now fails. Looking at opa_client.py, this seems to mean that either a data transaction or a policy transaction did not succeed or never happened.

Looking at the data in http://localhost:8181/v1/data/system/opal, it looks like there have been no data transactions:

{
    "result": {
        "healthy": false,
        "last_data_transaction": {},
        "last_failed_data_transaction": {},
        "last_failed_policy_transaction": {},
        "last_policy_transaction": {
            "actions": [
                "set_policies"
            ],
            "creation_time": "2023-02-14T07:55:58.901524",
            "end_time": "2023-02-14T07:55:58.955286",
            "error": "",
            "id": "b21a57c305783805bc28b4fc134cb5c27cda967b",
            "remotes_status": [
                {
                    "error": null,
                    "remote_url": "http://nv-authorization-server:7002/policy",
                    "succeed": true
                }
            ],
            "success": true,
            "transaction_type": "policy"
        },
        "ready": false,
        "transaction_data_statistics": {
            "failed": 0,
            "successful": 0
        },
        "transaction_policy_statistics": {
            "failed": 0,
            "successful": 1
        }
    }
}

I have not set any dataConfigSources yet, I'm assuming this is the problem?

  dataConfigSources:
    config:
      entries: []

Setting OPAL_DATA_UPDATER_ENABLED to false made the healthcheck pass and deployment succeed!
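
To make the reasoning above concrete, here is a small sketch that reads a status document shaped like the one under `result` and lists why it would be unhealthy. The rule it encodes (each enabled updater needs a successful last transaction) is my assumption based on what I observed, not a copy of OPAL's policy. `explain_unhealthy` and its `data_updater_enabled` parameter are hypothetical names:

```python
from typing import Any, List, Mapping


def explain_unhealthy(
    status: Mapping[str, Any], data_updater_enabled: bool = True
) -> List[str]:
    """List reasons the status would be unhealthy under the assumed rule."""
    reasons = []
    # An empty transaction ({}) has no "success" key, so it counts as not succeeded.
    if not status.get("last_policy_transaction", {}).get("success", False):
        reasons.append("no successful policy transaction")
    # Assumption: the data check only applies while the data updater is enabled.
    if data_updater_enabled and not status.get("last_data_transaction", {}).get(
        "success", False
    ):
        reasons.append("no successful data transaction")
    return reasons
```

For the document above, this reports only the missing data transaction, and nothing once the data updater is treated as disabled.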

philipclaesson commented 1 year ago

I think we have a missing part in OPAL that allows the /healthcheck to fail.

Would this missing part be the empty dataConfigSources? Or something else?

RazcoDev commented 1 year ago

Yes, the moment you set data config sources, these statistics will get updated. As for the failing healthcheck, it's actually because of a missing condition there. It's a pretty simple thing, and we'll take care of it.

philipclaesson commented 1 year ago

Cool, thanks a lot for helping out!