hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

API Gateway pod is stuck in Init state and only changes to Running once we delete the api-gateway service. #3934

Open tejnar opened 6 months ago

tejnar commented 6 months ago

Question

The API Gateway pod is stuck in the Init state and only changes to Running once we delete the api-gateway service.

CLI Commands (consul-k8s, consul-k8s-control-plane, helm)

Helm Configuration

The attached values.yaml file is used to deploy Consul to EKS: values.yaml.txt

Chart Details:

version: 1.4.1
appVersion: 1.18.1

Logs

[api-gw]$ kubectl -n mesh-client get all
NAME                                                      READY   STATUS                  RESTARTS        AGE
pod/api-gateway-68bcd79b4d-p62bm                          0/1     Init:CrashLoopBackOff   5 (2m24s ago)   17m
pod/consul-consul-connect-injector-8cf9849c4-ksxd9        1/1     Running                 0               30h
pod/consul-consul-connect-injector-8cf9849c4-nm6zq        1/1     Running                 0               3d9h
pod/consul-consul-server-0                                1/1     Running                 0               3d9h
pod/consul-consul-server-1                                1/1     Running                 0               37h
pod/consul-consul-server-2                                1/1     Running                 0               3d14h
pod/consul-consul-webhook-cert-manager-5dc74f9bbb-ltssk   1/1     Running                 0               30h

NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP                               PORT(S)                                                                            AGE
service/api-gateway                      LoadBalancer   172.20.137.66   api-gateway.elb.us-east-1.amazonaws.com   80:32370/TCP                                                                       111m
service/consul-consul-connect-injector   ClusterIP      172.20.22.178   <none>                                    443/TCP                                                                            3d14h
service/consul-consul-server             ClusterIP      None            <none>                                    8500/TCP,8502/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP   3d14h
service/consul-consul-ui                 NodePort       172.20.63.191   <none>                                    80:30904/TCP                                                                       3d14h

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/api-gateway                          0/1     1            0           3d
deployment.apps/consul-consul-connect-injector       2/2     2            2           3d14h
deployment.apps/consul-consul-webhook-cert-manager   1/1     1            1           3d14h

NAME                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/api-gateway-68bcd79b4d                          1         1         0       3d
replicaset.apps/consul-consul-connect-injector-8cf9849c4        2         2         2       3d14h
replicaset.apps/consul-consul-webhook-cert-manager-5dc74f9bbb   1         1         1       3d14h

NAME                                    READY   AGE
statefulset.apps/consul-consul-server   3/3     3d14h
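
The listing above shows the pod in Init:CrashLoopBackOff but not which init container is failing. As a rough diagnostic sketch (not part of the original report), the following Python reads `kubectl get pods -o json` and lists init containers that are stuck in a waiting state. The function names are hypothetical; the only assumptions are that `kubectl` is on the PATH and that the namespace is `mesh-client`, as in the command above.

```python
import json
import subprocess


def failing_init_containers(pod_list):
    """Return (pod, init container, reason) for init containers in a waiting state."""
    results = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("initContainerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting")
            if waiting:
                results.append((pod_name, status["name"], waiting.get("reason", "")))
    return results


def load_pods(namespace):
    """Fetch the pod list for a namespace as parsed JSON (requires kubectl)."""
    completed = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pods", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


# Example (needs cluster access, so it is left commented out):
# for pod, container, reason in failing_init_containers(load_pods("mesh-client")):
#     print(pod, container, reason)
```

Once the failing init container is known, `kubectl -n mesh-client logs <pod> -c <init-container>` should show why it keeps restarting.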

Current understanding and Expected behavior

We use spot instances in our cluster, so the api-gateway pod can be rescheduled onto any other node. I expect the rescheduled pod to reach the Running state, since the associated api-gateway service already exists. I have also defined an HTTPRoute as described in the documentation (https://developer.hashicorp.com/consul/tutorials/kubernetes/kubernetes-api-gateway#deploy-api-gateway).

Once I delete the api-gateway service, the pod reaches the Running state and works as expected. I can also get responses from the services deployed inside the EKS cluster.

This issue happens only when the service is exposed as a LoadBalancer; with NodePort it works as expected.

Environment details

EKS version: 1.29 with Calico CNI enabled

Additional Context

The connect-inject-deployment.yaml was modified to use hostNetwork: true.

pawellegowski89 commented 5 months ago

After creating an API gateway custom resource and then deleting it, the role remains in Consul, and creating the same resource again causes an error.

The error also occurs without any intervention: if we shut the environment down and bring it back up, the API gateway no longer runs. It hangs in Init while trying to re-add the existing role, which produces the same error:

Reconciler error {"controller": "gateway", "controllerGroup": "gateway.networking.k8s.io", "controllerKind": "Gateway", "Gateway": {"name":"mesh-api-gateway","namespace":"data"}, "namespace": "data", "name": "mesh-api-gateway", "reconcileID": "739cd7fb-540e-46f2-b6dd-653baf933f1a", "error": "Unexpected response code: 500 (Invalid Role: A Role with Name \"managed-gateway-acl-role-mesh-api-gateway\" already exists)"}

Manually removing the role in the UI helps, but it is only a workaround.
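
For anyone who needs to script that workaround instead of clicking through the UI, here is a minimal Python sketch against Consul's ACL HTTP API (read a role by name, then delete it by ID). It is an illustration under assumptions, not an official fix: the address `http://127.0.0.1:8500` assumes a plain-HTTP port-forward to a Consul server, the token must be allowed to write ACLs, and all function names are hypothetical.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumption: Consul's HTTP API is reachable here, e.g. through a port-forward.
CONSUL_ADDR = "http://127.0.0.1:8500"


def role_name_from_error(log_line):
    """Extract the conflicting role name from a 'Reconciler error {...}' log line."""
    payload = json.loads(log_line[log_line.index("{"):])
    match = re.search(r'Role with Name "([^"]+)" already exists', payload["error"])
    return match.group(1) if match else None


def role_lookup_url(base, name):
    """URL that reads an ACL role by name."""
    return f"{base.rstrip('/')}/v1/acl/role/name/{urllib.parse.quote(name, safe='')}"


def role_delete_url(base, role_id):
    """URL that deletes an ACL role by ID."""
    return f"{base.rstrip('/')}/v1/acl/role/{urllib.parse.quote(role_id, safe='')}"


def _call(method, url, token):
    """Send one request to Consul, passing the ACL token if one is given."""
    request = urllib.request.Request(url, method=method)
    if token:
        request.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def delete_role_by_name(base, name, token=""):
    """Look the role up by name, delete it by ID, and return that ID."""
    role = json.loads(_call("GET", role_lookup_url(base, name), token))
    _call("DELETE", role_delete_url(base, role["ID"]), token)
    return role["ID"]


# Example (talks to a live Consul, so it is left commented out):
# delete_role_by_name(CONSUL_ADDR, "managed-gateway-acl-role-mesh-api-gateway", token="...")
```

If the role does not exist, the lookup raises `urllib.error.HTTPError`, so the delete is never attempted. Like the manual removal, this only clears the stale role; it does not address why the role is left behind.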

pawellegowski89 commented 1 month ago

With chart version 1.5.3 this works fine.