gocd / helm-chart

Official GoCD Helm Chart Repository
Apache License 2.0

Elastic Agent Plugin can no longer create new agents after pod restart (or 1 hour) #98

Closed andcri closed 9 months ago

andcri commented 9 months ago

Hello, I'm using the latest version of this plugin through the GoCD Helm chart. When I deploy the Helm chart for the first time, everything works fine, but when I restart the pod (or the pod gets restarted by a `helm upgrade`) I get the following error:

2024-01-05 10:36:54,589  INFO [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:72 [plugin-cd.go.contrib.elasticagent.kubernetes] - Did not find any running kubernetes pods.
2024-01-05 10:37:54,579 ERROR [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:127 [plugin-cd.go.contrib.elasticagent.kubernetes] - Error occurred while trying to list kubernetes pods:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).

Is this the expected behavior when running in a cluster? do I need to add extra configuration in order to avoid this? Thanks

chadlwilson commented 9 months ago

It's not expected that, if it works fine via the Helm chart the first time, it stops working thereafter with no other changes made on your side of things. I assume by "works fine" you mean it can create new pods etc., and then after the restart nothing works at all and it cannot create new pods?

There should be many more errors in the logs than that if it cannot create pods, I'd have thought.

Which Kubernetes version, Kubernetes type (EKS? GKE? a local single node deployment? etc?) and GoCD Helm chart version are you using?

andcri commented 9 months ago

Thanks for the reply.

By "works fine" I mean that it creates agent pods whenever I trigger the example pipeline in GoCD; the pipeline completes and then the agent pod shuts down. When I restart the gocd-server pod and trigger the example pipeline, nothing happens and I get the logs shown.

I'm using EKS with Kubernetes 1.25, and the chart version I'm using is the latest one: 2.5.1.

These are the full logs that I get from the gocd-server pod; after this, it repeats the same error:

jvm 1    | 2024-01-05 10:36:54,589  INFO [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:72 [plugin-cd.go.contrib.elasticagent.kubernetes] - Did not find any running kubernetes pods.
jvm 1    | 2024-01-05 10:37:54,579 ERROR [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:127 [plugin-cd.go.contrib.elasticagent.kubernetes] - Error occurred while trying to list kubernetes pods:
jvm 1    | io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).
jvm 1    |     at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:450)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:418)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:97)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesAgentInstances.refreshAll(KubernetesAgentInstances.java:156)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.refreshInstancesForCluster(KubernetesPlugin.java:140)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.refreshInstancesForAllClusters(KubernetesPlugin.java:122)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.handle(KubernetesPlugin.java:90)
jvm 1    |     at com.thoughtworks.go.plugin.infra.DefaultPluginManager.lambda$submitTo$0(DefaultPluginManager.java:134)
jvm 1    |     at com.thoughtworks.go.plugin.infra.FelixGoPluginOSGiFramework.executeActionOnTheService(FelixGoPluginOSGiFramework.java:205)
jvm 1    |     at com.thoughtworks.go.plugin.infra.FelixGoPluginOSGiFramework.doOn(FelixGoPluginOSGiFramework.java:164)
jvm 1    |     at com.thoughtworks.go.plugin.infra.DefaultPluginManager.submitTo(DefaultPluginManager.java:131)
jvm 1    |     at com.thoughtworks.go.plugin.access.PluginRequestHelper.submitRequest(PluginRequestHelper.java:49)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.v5.ElasticAgentExtensionV5.serverPing(ElasticAgentExtensionV5.java:152)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.ElasticAgentExtension.serverPing(ElasticAgentExtension.java:63)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.ElasticAgentPluginRegistry.serverPing(ElasticAgentPluginRegistry.java:51)
jvm 1    |     at com.thoughtworks.go.server.messaging.elasticagents.ServerPingListener.onMessage(ServerPingListener.java:31)
jvm 1    |     at com.thoughtworks.go.server.messaging.elasticagents.ServerPingListener.onMessage(ServerPingListener.java:21)
jvm 1    |     at com.thoughtworks.go.server.messaging.activemq.JMSMessageListenerAdapter.runImpl(JMSMessageListenerAdapter.java:83)
jvm 1    |     at com.thoughtworks.go.server.messaging.activemq.JMSMessageListenerAdapter.run(JMSMessageListenerAdapter.java:63)
jvm 1    |     at java.base/java.lang.Thread.run(Unknown Source)
jvm 1    | Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:136)
jvm 1    |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
jvm 1    |     ... 1 common frames omitted
chadlwilson commented 9 months ago

It kind of looks like the service account token you're using somehow doesn't have the permissions for the namespace it's trying to look for agent pods in.
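One way to probe the permissions side of this is `kubectl auth can-i` with impersonation. This is a hypothetical diagnostic: `gocd` / `gocd-server` are assumed names for the namespace and service account, so substitute whatever your Helm release actually created.

```shell
# Hypothetical check: can the plugin's service account list pods in the
# gocd namespace? "gocd-server" is an assumed service account name.
kubectl auth can-i list pods \
  --namespace gocd \
  --as system:serviceaccount:gocd:gocd-server
```

Note that a 401 Unauthorized (as in the logs above) usually points to an invalid or expired token rather than missing RBAC, which would return 403 Forbidden.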

chadlwilson commented 9 months ago

OK, I have confirmed that this is because of changes in the way "bound service account tokens" work in Kubernetes 1.24+. They are now bound to the pod lifecycle rather than to the service account itself, so while they used to last essentially forever, they no longer do.

What this means is that the "magic" the GoCD Helm chart does on first start to populate the service account token into the "cluster profile" relies on the auto-mounted token, which is now bound to the pod. It only does this once. When the pod restarts, that token is invalidated by the Kubernetes API and is no longer valid. It may also stop working after 1 hour, since the plugin and GoCD cannot re-read the refreshed value.


Essentially, this makes the automation in the Helm chart deceptive as it won't work for long.

To fix this right now, you need to do something like what's mentioned in gocd/kubernetes-elastic-agents#328: create a long-lived token, extract it from Kubernetes, and then replace the value in your cluster profile with it.
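For reference, a long-lived token can be created by declaring a Secret of type `kubernetes.io/service-account-token` bound to the chart's service account, then reading the token back out. This is a sketch under assumed names (`gocd` namespace, `gocd-server` service account); adjust them to match your release.

```shell
# Hypothetical: create a long-lived token Secret bound to the chart's
# service account ("gocd" / "gocd-server" are placeholders).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: gocd-server-long-lived-token
  namespace: gocd
  annotations:
    kubernetes.io/service-account.name: gocd-server
type: kubernetes.io/service-account-token
EOF

# Extract the token to paste into the cluster profile's token field.
kubectl get secret gocd-server-long-lived-token -n gocd \
  -o jsonpath='{.data.token}' | base64 -d

# The same Secret also carries the CA bundle, usable for the
# "CA Certificate Data" field of the cluster profile.
kubectl get secret gocd-server-long-lived-token -n gocd \
  -o jsonpath='{.data.ca\.crt}' | base64 -d
```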

Additionally, if you don't manually set the CA Certificate Data, you might experience a different bug I have fixed in this release included in the latest Helm chart version (2.5.2).

In any case, this needs a better fix, as the Helm chart is currently misleading. I'll move this issue there, since the main problem is the misleading behaviour of the Helm chart: it makes you think you don't need to do anything with service account tokens to keep it working, because on first install it works for a while without the user doing anything.

chadlwilson commented 9 months ago

The root of this problem is now fixed with chart 2.6.0 (which defaults to plugin v4.0.0-505) if you want to give it a go.

Note that the server configuration only happens once, so you'd either need to start with a fresh server (probably not ideal), or upgrade the plugin via the Helm chart and then clear out the older configuration values from the Cluster Profile, so the new plugin version falls back to the defaults that the Helm chart has always ensured are set up.

andcri commented 9 months ago

Thank you very much for the quick fix and the explanation, works perfectly now 👍