gocd / helm-chart

Official GoCD Helm Chart Repository
Apache License 2.0

Elastic Agent Plugin can no longer create new agents after pod restart (or 1 hour) #98

Closed andcri closed 9 months ago

andcri commented 9 months ago

Hello, I'm using the latest version of this plugin through the GoCD Helm chart. When I deploy the Helm chart for the first time, everything works fine, but when I restart the pod (or the pod gets restarted by a `helm upgrade`) I get the following error:

2024-01-05 10:36:54,589  INFO [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:72 [plugin-cd.go.contrib.elasticagent.kubernetes] - Did not find any running kubernetes pods.
2024-01-05 10:37:54,579 ERROR [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:127 [plugin-cd.go.contrib.elasticagent.kubernetes] - Error occurred while trying to list kubernetes pods:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).

Is this the expected behavior when running in a cluster? do I need to add extra configuration in order to avoid this? Thanks

chadlwilson commented 9 months ago

It's not expected that, if it works fine via the Helm chart the first time, it stops working thereafter with no other changes made on your side of things. I assume by "works fine" you mean it can create new pods etc., and then after the restart nothing works at all and it cannot create new pods?

There should be many more errors in the logs than that if it cannot create pods, I'd have thought.

Which Kubernetes version, Kubernetes type (EKS? GKE? a local single node deployment? etc?) and GoCD Helm chart version are you using?

andcri commented 9 months ago

Thanks for the reply.

By "works fine" I mean that it creates agent pods whenever I trigger the example pipeline in GoCD; the pipeline completes and then the agent pod shuts down. When I restart the gocd-server pod and trigger the example pipeline, nothing happens and I get the logs shown.

I'm using EKS with Kubernetes 1.25, and the chart version I'm using is the latest one: 2.5.1.

These are the full logs that I get from the gocd-server pod; after this, it repeats the same error:

jvm 1    | 2024-01-05 10:36:54,589  INFO [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:72 [plugin-cd.go.contrib.elasticagent.kubernetes] - Did not find any running kubernetes pods.
jvm 1    | 2024-01-05 10:37:54,579 ERROR [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:127 [plugin-cd.go.contrib.elasticagent.kubernetes] - Error occurred while trying to list kubernetes pods:
jvm 1    | io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).
jvm 1    |     at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:450)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:418)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:97)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesAgentInstances.refreshAll(KubernetesAgentInstances.java:156)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.refreshInstancesForCluster(KubernetesPlugin.java:140)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.refreshInstancesForAllClusters(KubernetesPlugin.java:122)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.handle(KubernetesPlugin.java:90)
jvm 1    |     at com.thoughtworks.go.plugin.infra.DefaultPluginManager.lambda$submitTo$0(DefaultPluginManager.java:134)
jvm 1    |     at com.thoughtworks.go.plugin.infra.FelixGoPluginOSGiFramework.executeActionOnTheService(FelixGoPluginOSGiFramework.java:205)
jvm 1    |     at com.thoughtworks.go.plugin.infra.FelixGoPluginOSGiFramework.doOn(FelixGoPluginOSGiFramework.java:164)
jvm 1    |     at com.thoughtworks.go.plugin.infra.DefaultPluginManager.submitTo(DefaultPluginManager.java:131)
jvm 1    |     at com.thoughtworks.go.plugin.access.PluginRequestHelper.submitRequest(PluginRequestHelper.java:49)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.v5.ElasticAgentExtensionV5.serverPing(ElasticAgentExtensionV5.java:152)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.ElasticAgentExtension.serverPing(ElasticAgentExtension.java:63)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.ElasticAgentPluginRegistry.serverPing(ElasticAgentPluginRegistry.java:51)
jvm 1    |     at com.thoughtworks.go.server.messaging.elasticagents.ServerPingListener.onMessage(ServerPingListener.java:31)
jvm 1    |     at com.thoughtworks.go.server.messaging.elasticagents.ServerPingListener.onMessage(ServerPingListener.java:21)
jvm 1    |     at com.thoughtworks.go.server.messaging.activemq.JMSMessageListenerAdapter.runImpl(JMSMessageListenerAdapter.java:83)
jvm 1    |     at com.thoughtworks.go.server.messaging.activemq.JMSMessageListenerAdapter.run(JMSMessageListenerAdapter.java:63)
jvm 1    |     at java.base/java.lang.Thread.run(Unknown Source)
jvm 1    | Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:136)
jvm 1    |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
jvm 1    |     ... 1 common frames omitted
chadlwilson commented 9 months ago

It kind of looks like the service account token you're using somehow doesn't have the permissions for the namespace it's trying to look for agent pods in.
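One way to probe the permissions side of this is `kubectl auth can-i` with impersonation. This is a hypothetical diagnostic: `gocd` / `gocd-server` are assumed names for the namespace and service account, so substitute whatever your Helm release actually created.

```shell
# Hypothetical check: can the plugin's service account list pods in the
# gocd namespace? "gocd-server" is an assumed service account name.
kubectl auth can-i list pods \
  --namespace gocd \
  --as system:serviceaccount:gocd:gocd-server
```

Note that a 401 Unauthorized (as in the logs above) usually points to an invalid or expired token rather than missing RBAC, which would return 403 Forbidden.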

chadlwilson commented 9 months ago

OK, I have confirmed that this is because of changes in the way "bound service account tokens" work in Kubernetes 1.24+. They are now bound to the pod lifecycle rather than to the service account itself, so while they used to last essentially forever, they no longer do.

What this means is that the "magic" the GoCD Helm chart does on first start to populate the service account token into the "cluster profile" relies on the auto-mounted token, which is now bound to the pod. It only does this once. When the pod restarts, that token is invalidated by the Kubernetes API and is no longer valid. It may also stop working after 1 hour, since the plugin and GoCD cannot re-read the refreshed value.


Essentially, this makes the automation in the Helm chart deceptive as it won't work for long.

To fix this right now, you need to do something like what's mentioned in gocd/kubernetes-elastic-agents#328: create a long-lived token, extract it from Kubernetes, and then replace the value in your cluster profile with it.
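For reference, a long-lived token can be created by declaring a Secret of type `kubernetes.io/service-account-token` bound to the chart's service account, then reading the token back out. This is a sketch under assumed names (`gocd` namespace, `gocd-server` service account); adjust them to match your release.

```shell
# Hypothetical: create a long-lived token Secret bound to the chart's
# service account ("gocd" / "gocd-server" are placeholders).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: gocd-server-long-lived-token
  namespace: gocd
  annotations:
    kubernetes.io/service-account.name: gocd-server
type: kubernetes.io/service-account-token
EOF

# Extract the token to paste into the cluster profile's token field.
kubectl get secret gocd-server-long-lived-token -n gocd \
  -o jsonpath='{.data.token}' | base64 -d

# The same Secret also carries the CA bundle, usable for the
# "CA Certificate Data" field of the cluster profile.
kubectl get secret gocd-server-long-lived-token -n gocd \
  -o jsonpath='{.data.ca\.crt}' | base64 -d
```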

Additionally, if you don't manually set the CA Certificate Data, you might experience a different bug I have fixed in this release included in the latest Helm chart version (2.5.2).

In any case, this needs a better fix, as the Helm chart is currently misleading. I'll move this issue there, since the main problem is the misleading behaviour of the Helm chart: it makes you think you don't need to do anything with service account tokens to keep it working, because on first install it works for a while without the user doing anything.

chadlwilson commented 9 months ago

The root of this problem is now fixed with chart 2.6.0 (which defaults to plugin v4.0.0-505) if you want to give it a go.

Note that the server configuration only happens once, so you'd either need to start with a fresh server (probably not ideal), or upgrade the plugin via the Helm chart and then clear out the older configuration values from the Cluster Profile, so the new plugin version falls back to the defaults that the Helm chart has always ensured are set up.

andcri commented 9 months ago

Thank you very much for the quick fix and the explanation, works perfectly now 👍