sighupio / gatekeeper-policy-manager

A simple to use web-based OPA Gatekeeper policy manager
https://sighup.io
BSD 3-Clause "New" or "Revised" License

GPM is not working for multi-cluster #647

Closed Prasanna-543 closed 10 months ago

Prasanna-543 commented 1 year ago

Hello, we are using GPM and it works for a single cluster, but when I set the multi-cluster config to true I was facing the below error in the dashboard. I could even see the context field in the web page and select the clusters.

Error: Can't connect to cluster due to an invalid kubeconfig file. Please verify your kubeconfig file and location.

The config looks like this:

    kind: Config
    preferences: {}
    users:
    - name: cluster_name
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          args:
            - --region
            - region_name
            - eks
            - get-token
            - --cluster-name
            - cluster_name
          command: aws
          env: null
          interactiveMode: IfAvailable
          provideClusterInfo: false
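
(A note for anyone debugging this: the exec stanza above is roughly equivalent to running the following locally, with the same placeholder region and cluster name, so it can be used to check that the aws-based authentication itself works.)

    aws eks get-token --region region_name --cluster-name cluster_name
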
ralgozino commented 1 year ago

Hi @Prasanna-543

It seems that you are using aws-based authentication; have you followed the instructions in the readme for AWS auth? https://github.com/sighupio/gatekeeper-policy-manager#aws-iam-authentication

Prasanna-543 commented 1 year ago

I haven't tried with aws-iam-authenticator. When I check the logs, the error was: ERROR:root:[Errno 2] No such file or directory: 'aws'

Do we just need this part:

FROM curlimages/curl:7.81.0 as downloader
RUN curl https://github.com/kubernetes-sigs/aws-iam-authenticator/releases/download/v0.5.5/aws-iam-authenticator_0.5.5_linux_amd64 --output /tmp/aws-iam-authenticator
RUN chmod +x /tmp/aws-iam-authenticator

FROM quay.io/sighup/gatekeeper-policy-manager:v1.0.3
COPY --from=downloader --chown=root:root /tmp/aws-iam-authenticator /usr/local/bin/

or does this need to be added to the original image?

Prasanna-543 commented 1 year ago

Do we need to add the aws-cli too?

ralgozino commented 1 year ago

Correct.

From the kubeconfig you pasted it seems that it is configured to use the aws command to authenticate, so you will need the aws cli for authenticating instead of the aws-iam-authenticator.

In other words, you will need to build your own image starting from GPM's image and including the aws cli binary. Another option would be modifying the kubeconfig to use another auth mechanism, but I don't know if that will be possible in your environment.
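
For illustration, a kubeconfig user entry that uses aws-iam-authenticator instead of the aws CLI would look roughly like this (a sketch only; cluster_name is the same placeholder as in the kubeconfig above):

    users:
    - name: cluster_name
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: aws-iam-authenticator
          args:
            - token
            - -i
            - cluster_name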

Prasanna-543 commented 1 year ago

How do we add the aws-cli? I found the below one on Google:

FROM alpine:latest
RUN apk --no-cache add python3 py3-pip
RUN pip3 install --upgrade pip \
  && pip3 install --no-cache-dir awscli

I added it to the Dockerfile in this repo and attached the image to the policy-manager deployment; the pod is showing CrashLoopBackOff:

gatekeeper-policy-manager-ui-5cc7545fb6-4n5gd 0/1 CrashLoopBackOff 1 (21s ago) 24s

When I describe the pod: Warning BackOff 1s (x7 over 28s) kubelet Back-off restarting failed container

State: Waiting
  Reason: CrashLoopBackOff
Last State: Terminated
  Reason: Error

I can see the image and imageID, volumes and volume-mounts; they are fine!

Can you help me with how to add the aws-cli?

ralgozino commented 1 year ago

The following Dockerfile should work:

FROM quay.io/sighup/gatekeeper-policy-manager:v1.0.3

# Add awscli to GPM image
USER root
WORKDIR /tmp
ADD "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" /tmp
RUN apt-get update && apt-get install -y unzip && rm -rf /var/lib/apt/lists/*
RUN unzip awscli-exe-linux-x86_64.zip && aws/install && rm -rf aws && rm awscli-exe-linux-x86_64.zip

# Go back to the original image settings
WORKDIR /app
USER 999

Edit: changed USER gpm to USER 999
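
For reference, building and pushing the custom image would be something along these lines (the registry name and tag are placeholders), after which the GPM deployment's image field is pointed at the pushed tag:

    docker build -t <your-registry>/gatekeeper-policy-manager:v1.0.3-aws .
    docker push <your-registry>/gatekeeper-policy-manager:v1.0.3-aws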

ralgozino commented 1 year ago

Hey @Prasanna-543, were you able to make the custom image? Is GPM working?

Regards,

Prasanna-543 commented 1 year ago

The image was built successfully, but is there a possibility that the built image may change on a different architecture?

ralgozino commented 1 year ago

Sorry, I don't think I understand the question. The image will be different from the "official" one because you are adding stuff to it.

Could you please elaborate a little more? Is GPM not working with the built image?

Prasanna-543 commented 1 year ago

Hi @ralgozino, I can build the image but I'm facing the below error: Error: container has runAsNonRoot and image has non-numeric user (gpm), cannot verify user is non-root (pod: "gatekeeper-policy-manager-ui-d59d54b7f-bs6dg_gatekeeper-system", container: gatekeeper)

ralgozino commented 1 year ago

Hi @Prasanna-543 , change the line:

USER gpm

to

USER 999

in the Dockerfile and rebuild the image
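
For context, the error comes from the pod requiring a non-root user: with runAsNonRoot the kubelet can only verify the user when the image declares a numeric UID, which is why USER 999 works and USER gpm does not. A minimal sketch of the kind of setting that triggers the check (not necessarily the exact one in the chart):

    securityContext:
      runAsNonRoot: true  # requires a numeric UID in the image to be verifiable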

Prasanna-543 commented 1 year ago

Yeah, trying.

Prasanna-543 commented 1 year ago
[screenshot]

When I select a cluster in the context dropdown, this is the error; when I don't select any context, the error below is shown.

[screenshot]

I have tried with the new image; I checked the pod and the logs too and there are no errors, but the UI shows the same.

ralgozino commented 1 year ago

It seems that you have no context selected; see the dropdown at the top right? Try choosing one context.

Prasanna-543 commented 1 year ago

I have edited the reply above! The context just disappears when I select it.

ralgozino commented 1 year ago

You should see some logs on the GPM pods about what is going on. If you still see no logs, please try the following two things:

  1. Open the browser console (F12) and you should see a request in red with a 500 error (or similar); check what the response to that request is.
  2. Put the backend in debug mode by setting the environment variable GPM_LOG_LEVEL to DEBUG; you should see more detailed logs in the pod with this change (see the snippet below).
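
For reference, setting that variable on the GPM container would look roughly like this (a sketch of the env section of the Deployment; the rest of the container spec is omitted):

    env:
      - name: GPM_LOG_LEVEL
        value: DEBUG
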
Prasanna-543 commented 1 year ago

I had GPM_LOG_LEVEL set to DEBUG before, and it still is.

[2023-04-21 08:13:41 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-04-21 08:13:41 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2023-04-21 08:13:41 +0000] [1] [INFO] Using worker: gthread
[2023-04-21 08:13:41 +0000] [7] [INFO] Booting worker with pid: 7
[2023-04-21 08:13:41 +0000] [8] [INFO] Booting worker with pid: 8
[2023-04-21 08:13:44 +0000] [7] [INFO] gunicorn log level is set to: DEBUG
[2023-04-21 08:13:44 +0000] [7] [INFO] application log level is set to: DEBUG
[2023-04-21 08:13:44 +0000] [7] [INFO] RUNNING WITH AUTHENTICATION DISABLED
[2023-04-21 08:13:44 +0000] [7] [INFO] Attempting init with KUBECONFIG from path '~/.kube/config'
[2023-04-21 08:13:44 +0000] [7] [INFO] KUBECONFIG '~/.kube/config' successfuly loaded.
[2023-04-21 08:13:44 +0000] [7] [DEBUG] GET /health
[2023-04-21 08:13:44 +0000] [7] [DEBUG] GET /health
[2023-04-21 08:13:44 +0000] [7] [DEBUG] GET /health
[2023-04-21 08:13:44 +0000] [7] [DEBUG] Ignoring connection epipe
[2023-04-21 08:13:44 +0000] [7] [DEBUG] Ignoring connection epipe
[2023-04-21 08:13:44 +0000] [7] [DEBUG] Closing connection.
[2023-04-21 08:13:44 +0000] [8] [INFO] gunicorn log level is set to: DEBUG
[2023-04-21 08:13:44 +0000] [8] [INFO] application log level is set to: DEBUG
[2023-04-21 08:13:44 +0000] [8] [INFO] RUNNING WITH AUTHENTICATION DISABLED
[2023-04-21 08:13:44 +0000] [8] [INFO] Attempting init with KUBECONFIG from path '~/.kube/config'
[2023-04-21 08:13:44 +0000] [8] [INFO] KUBECONFIG '~/.kube/config' successfuly loaded.
[2023-04-21 08:13:45 +0000] [7] [DEBUG] GET /
[2023-04-21 08:13:45 +0000] [8] [DEBUG] GET /
[2023-04-21 08:13:45 +0000] [7] [DEBUG] Closing connection.
[2023-04-21 08:13:45 +0000] [8] [DEBUG] Closing connection.
...
[2023-04-21 08:14:01 +0000] [7] [DEBUG] GET /constraints/arn:aws:eks:*cluster
[2023-04-21 08:14:01 +0000] [7] [DEBUG] GET /static/js/main.079229dd.js
[2023-04-21 08:14:01 +0000] [7] [DEBUG] GET /static/css/main.e9dfd109.css
[2023-04-21 08:14:02 +0000] [8] [DEBUG] GET /api/v1/contexts/
[2023-04-21 08:14:02 +0000] [8] [DEBUG] GET /api/v1/auth/
[2023-04-21 08:14:02 +0000] [7] [DEBUG] GET /static/media/github-logo.2384f056f07cd6da5d2a11e846a50566.svg
[2023-04-21 08:14:02 +0000] [7] [DEBUG] GET /static/js/icon.heart.6a5439c3.chunk.js
[2023-04-21 08:14:02 +0000] [7] [DEBUG] GET /static/js/icon.arrow_right.b4dff9f3.chunk.js
[2023-04-21 08:14:02 +0000] [8] [DEBUG] GET /static/js/icon.popout.415e5814.chunk.js
[2023-04-21 08:14:02 +0000] [8] [DEBUG] GET /static/media/Poppins-Medium.9e1bb626874ed49aa343.ttf
[2023-04-21 08:14:02 +0000] [7] [DEBUG] GET /static/js/icon.arrow_down.64fbca8c.chunk.js
[2023-04-21 08:14:02 +0000] [7] [DEBUG] GET /static/media/Poppins-Bold.404e299be26d78e66794.ttf
[2023-04-21 08:14:03 +0000] [7] [DEBUG] GET /static/media/Poppins-Regular.8081832fc5cfbf634aa6.ttf
[2023-04-21 08:14:03 +0000] [8] [DEBUG] GET /favicon.ico

And I'm using a Mac, and I don't know how F12 works here.

ralgozino commented 1 year ago

And I'm using a Mac, and I don't know how F12 works here.

Right-click anywhere on the page -> Inspect. When the inspector opens up, go to the constraints view, for example, and check the "Network" tab of the inspector for requests in red. See what the response to that request is.

Another way to test is with curl.

First, test the contexts endpoint (replace http://localhost:8080 with the address of GPM):

curl http://localhost:8080/api/v1/contexts/

You should see something like this:

[[{"name":"kind-kind","context":{"cluster":"kind-kind","user":"kind-kind"}}],{"name":"kind-kind","context":{"cluster":"kind-kind","user":"kind-kind"}}]

Then try the constraints endpoint:

curl http://localhost:8080/api/v1/constraints/

And let me know what you get as a response.
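
If GPM is not exposed outside the cluster, one way to run these curl tests is to port-forward to the service first. The service name, namespace, and port below are assumptions; adjust them to your install:

    kubectl -n gatekeeper-system port-forward svc/gatekeeper-policy-manager 8080:80
    # in another terminal:
    curl http://localhost:8080/api/v1/contexts/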

Prasanna-543 commented 1 year ago
[screenshot]
Prasanna-543 commented 1 year ago

curl https://GPM_host/api/v1/contexts/

302 Found

302 Found


curl https://GPM_host/api/v1/constraints/

302 Found

302 Found

[screenshots]

The error codes are 302 and 500.

ralgozino commented 1 year ago

Mmmmm, there's something in your network that is messing up CORS and breaking the frontend communication with the backend. Are you behind a corporate proxy or something similar?

Can you try the same curl commands but add the -L flag? i.e.:

curl -L http://GPM_host/api/v1/contexts/

and

curl -L http://GPM_host/api/v1/constraints/

Let's see if you still get the CORS error with curl.

A workaround to see if you can get it working is to disable the CORS check in the backend; to do that, add an environment variable APP_ENV with the value development.
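
As with GPM_LOG_LEVEL, this would be another entry in the container's env (illustrative sketch, and only meant as a temporary workaround while debugging):

    env:
      - name: APP_ENV
        value: development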

Prasanna-543 commented 1 year ago

Yes, I'm using a company laptop and it always needs a VPN connection! And there are many restrictions!

Prasanna-543 commented 1 year ago

Page Expired, please close your browser and start a new request.

Both curls return the same thing.
ralgozino commented 1 year ago

Some questions:

  1. Are you using OIDC?

  2. Can you tell me what you see if you click on this request?

    [screenshot]
  3. Is this request that has CORS problems being made to GPM or to another host? I don't remember GPM making requests like that:

    [screenshot]
Prasanna-543 commented 1 year ago
  1. OIDC was not enabled in the values file, but the service account included in the deployment has an IAM role annotation that uses the EKS OIDC provider; without multi-cluster, with the same service account, it worked fine.

  2. [screenshot]
  3. All other hosts are working fine, only GPM has the problem.

ralgozino commented 1 year ago
  1. Interesting

  2. Please go to the Network tab, then click on the request with error 500 and then on the Preview tab, see:

    [screenshot]
  3. OK, but to what host / URL is the request that starts with authorize?... being made?

Prasanna-543 commented 1 year ago

When I open one of the requests with the 500 code I found this: {"action":"Please verify your kubeconfig file and location","description":"Invalid kube-config file. Expected object with name arn:aws:eks:region:iacc_id:cluster in /home/gpm/.kube/config/contexts list","error":"Can't connect to cluster due to an invalid kubeconfig file"}

Even the cluster's full name is not loading.

[screenshot]

The request that starts with authorize? is pointing to manifest.json; see the name in the line above in the image.

[screenshot]
ralgozino commented 1 year ago

This is much better, the actual error is this:

Invalid kube-config file. Expected object with name arn:aws:eks:us-east-1:***:cluster in /home/gpm/.kube/config/contexts list

I believe it is caused by this bug in the Python Kubernetes client library: https://github.com/kubernetes-client/python/issues/1193

Could you please:

  1. Confirm whether the kubeconfig that you are using has a current context set; you can check with kubectl config current-context or just inspect the file.
  2. If it is empty, set it to one of the available contexts (see the commands below).
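
Something along these lines (the context name is a placeholder):

    # check the currently selected context
    kubectl config current-context
    # list the available contexts and select one if none is set
    kubectl config get-contexts
    kubectl config use-context <context-name>
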
Prasanna-543 commented 1 year ago

@ralgozino

kubectl config current-context
arn:aws:eks:us-east-1:ID:cluster/Cluster_name

and the current-context has been set already.

And do you think this will have any effect? https://github.com/sighupio/gatekeeper-policy-manager/blob/ad6259d757d6e57920cbeaac9579a221f4ab5132/chart/values.yaml#L110

ralgozino commented 1 year ago

@ralgozino can you remove the 84**** number in the above comment, because I don't want it to be public!

done

And do you think this will have any effect?

https://github.com/sighupio/gatekeeper-policy-manager/blob/ad6259d757d6e57920cbeaac9579a221f4ab5132/chart/values.yaml#L110

I don't think that is relevant to the problem.

Have you tried setting the env var APP_ENV to development? I'm running out of ideas.

Prasanna-543 commented 1 year ago

Not yet, I will try and let you know.

Prasanna-543 commented 1 year ago

It's not working!

ralgozino commented 1 year ago

Sorry to hear that. I'd need to try to replicate the issue myself, but I believe it is something particular to your environment. Probably the corporate VPN or an HTTP proxy in the middle is changing something and breaking the front-end <> backend communication.

The last thing we can try is building the image from the main branch, which has updated dependencies. Maybe we are lucky and the problem goes away.

Change the FROM quay.io/sighup/gatekeeper-policy-manager:v1.0.3 line in the Dockerfile to FROM quay.io/sighup/gatekeeper-policy-manager:bf1d36477f9291a06e7109b9193dbbe6546cbd37 and rebuild the image.

Hope that helps 🀞

Prasanna-543 commented 1 year ago
[screenshot]

This time the error details are shown as an extra part!

ralgozino commented 1 year ago

Yes! We improved the error messages in the frontend a few days ago πŸ˜ƒ

It keeps saying, though, that there's something wrong with the kubeconfig file. Are you 100% sure that the kubeconfig works? Can you test it somehow?

The very last test we can do is using the development version of GPM that swaps the Python backend for a Go backend; this way we can rule out issues with the Python library. To do so, change the FROM line like before to use this image instead: FROM quay.io/sighup/gatekeeper-policy-manager:20751b146c9093e7ca191b770671f7db869bf62d

If the Go backend has the same issue, I would not know what else to try 😞

Prasanna-543 commented 1 year ago

Yes, the kubeconfig works, because we are using the same one locally (in ~/.kube/config). I just copy-pasted the cluster details and contexts.

I tried building the image from the source you gave above, but it only builds with the previous one πŸ˜‘

FROM quay.io/sighup/gatekeeper-policy-manager:20751b146c9093e7ca191b770671f7db869bf62d

USER root
WORKDIR /tmp
ADD "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" /tmp
RUN apt-get update && apt-get install -y unzip && rm -rf /var/lib/apt/lists/*
RUN unzip awscli-exe-linux-x86_64.zip && aws/install && rm -rf aws && rm awscli-exe-linux-x86_64.zip

WORKDIR /app
USER 999

ERROR: => ERROR [4/6] RUN apt-get update && apt-get install -y unzip && rm -rf /var/lib/apt/lists/* 0.3s

[4/6] RUN apt-get update && apt-get install -y unzip && rm -rf /var/lib/apt/lists/*:

8 0.309 runc run failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory


executor failed running [/bin/sh -c apt-get update && apt-get install -y unzip && rm -rf /var/lib/apt/lists/*]: exit code: 1

Prasanna-543 commented 1 year ago

Is there a way that we can use the local config file?

ralgozino commented 1 year ago

Yes, the kubeconfig works, because we are using the same one locally (in ~/.kube/config). I just copy-pasted the cluster details and contexts.

What do you mean by this? Did you edit the kubeconfig file manually? Please try with the same copy that you are using locally.

I forgot that the Docker image for the Go version starts from scratch (there's no OS), so that is why it is failing to build. For testing the Go version of GPM with the aws binary you can use the following Dockerfile:

FROM public.ecr.aws/amazonlinux/amazonlinux:2
ARG EXE_FILENAME=awscli-exe-linux-x86_64.zip
ADD "https://awscli.amazonaws.com/$EXE_FILENAME" /tmp
# COPY $EXE_FILENAME .
RUN yum update -y \
  && yum install -y unzip \
  && unzip /tmp/$EXE_FILENAME \
  && ./aws/install 

COPY --from=quay.io/sighup/gatekeeper-policy-manager:20751b146c9093e7ca191b770671f7db869bf62d /app /app
WORKDIR /app
ENTRYPOINT ["./gpm"]
kecebon9 commented 10 months ago

I have the same issue, but I could make it work by renaming the context with kubectl:

bash-4.2$ kubectl config rename-context arn:aws:eks:abcd:123456:cluster/cluster1 cluster1
Context "arn:aws:eks:abcd:123456:cluster/cluster1" renamed to "cluster1".

So the URL looked like http://localhost:8080/constraints/arn:aws:eks:abcd:123456:cluster/cluster1 (Not Found), and after renaming the context with kubectl it becomes http://localhost:8080/constraints/cluster1 and it is accessible; it seems the URL with the ARN is not working correctly.

ralgozino commented 10 months ago

Thank you @kecebon9! This is great feedback. I think that URL-encoding the context name (or at least escaping the forward slashes /) should fix the issue. I will do some tests.
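
For reference, URL-encoding the context name means turning the ARN into something like the following (a quick illustration of the encoding only, not a claim that the current backend already accepts it):

    python3 -c 'import urllib.parse; print(urllib.parse.quote("arn:aws:eks:abcd:123456:cluster/cluster1", safe=""))'
    # -> arn%3Aaws%3Aeks%3Aabcd%3A123456%3Acluster%2Fcluster1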