Closed ivanpiyvikov closed 1 month ago
What does the job console show for the job? Did the pod register and start running tasks? Did it get rescheduled?
What do the plugin-specific logs on the server say? (There is a log file specific to the plugin in the logs folder, not just the server log file.)
What does kubernetes say happened to the pod? Did kubernetes kill the pod or the container?
Is this a new setup or did something change before this started happening? (Cluster upgrade? GoCD upgrade?)
kubelet-output.log
root@ip----~]# journalctl -u kubelet | grep pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.015737 2099 kubelet.go:2432] "SyncLoop ADD" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.015781 2099 topology_manager.go:215] "Topology Admit Handler" podUID="084b0538-4d29-4e2e-b709-9cd654b2d764" podNamespace="gocd-agents" podName="pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.016393 2099 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.150651 2099 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume startedfor volume \"kube-api-access-fzmrk\" (UniqueName: \"kubernetes.io/projected/084b0538-4d29-4e2e-b709-9cd654b2d764-kube-api-access-fzmrk\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.150709 2099 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume startedfor volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.251191 2099 reconciler_common.go:220] "operationExecutor.MountVolume started for volume \"kube-api-access-fzmrk\" (UniqueName: \"kubernetes.io/projected/084b0538-4d29-4e2e-b709-9cd654b2d764-kube-api-access-fzmrk\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.251232 2099 reconciler_common.go:220] "operationExecutor.MountVolume started for volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.252357 2099 operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/efs.csi.aws.com/b8b581699bf8984fa0f74c5cbb85276d36cf340579facca7d82ca0ccae7f4351/globalmount\"" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.267058 2099 operation_generator.go:721] "MountVolume.SetUp succeeded for volume \"kube-api-access-fzmrk\" (UniqueName: \"kubernetes.io/projected/084b0538-4d29-4e2e-b709-9cd654b2d764-kube-api-access-fzmrk\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08 kubelet[2099]: I0723 06:59:08.958949 2099 operation_generator.go:721] "MountVolume.SetUp succeeded for volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:09 kubelet[2099]: I0723 06:59:09.223359 2099 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:09 kubelet[2099]: I0723 06:59:09.653326 2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerStarted","Data":"d1c6bcb4fd9589396a538ce2181a281461fbe1f127c725a65185ea97cbe1430d"}
Jul 23 06:59:09 kubelet[2099]: I0723 06:59:09.653369 2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerStarted","Data":"26872e3d74b7c730cded16bbbf98d38d38c0767f8f9344c3156cab063fa48409"}
Jul 23 06:59:09 kubelet[2099]: I0723 06:59:09.667171 2099 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" podStartSLOduration=2.667160723 podStartE2EDuration="2.667160723s" podCreationTimestamp="2024-07-23 06:59:07 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-07-23 06:59:09.66655598 +0000 UTC m=+1537985.519794742" watchObservedRunningTime="2024-07-23 06:59:09.667160723 +0000 UTC m=+1537985.520399474"
Jul 23 07:09:32 kubelet[2099]: I0723 07:09:32.516360 2099 kubelet.go:2448] "SyncLoop DELETE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
Jul 23 07:09:32 kubelet[2099]: I0723 07:09:32.516497 2099 kuberuntime_container.go:779] "Killing container with a grace period" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" podUID="084b0538-4d29-4e2e-b709-9cd654b2d764" containerName="gocd-agent-container-a4f3e55d-b06f-4d8b-9f8f-0b62c6012d1c" containerID="containerd://d1c6bcb4fd9589396a538ce2181a281461fbe1f127c725a65185ea97cbe1430d" gracePeriod=30
Jul 23 07:09:33 kubelet[2099]: I0723 07:09:33.929956 2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerDied","Data":"d1c6bcb4fd9589396a538ce2181a281461fbe1f127c725a65185ea97cbe1430d"}
Jul 23 07:09:33 kubelet[2099]: I0723 07:09:33.929982 2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerDied","Data":"26872e3d74b7c730cded16bbbf98d38d38c0767f8f9344c3156cab063fa48409"}
Jul 23 07:09:33 kubelet[2099]: I0723 07:09:33.950686 2099 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 07:09:34 kubelet[2099]: I0723 07:09:34.931927 2099 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 07:09:34 kubelet[2099]: I0723 07:09:34.962970 2099 kubelet.go:2448] "SyncLoop DELETE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
Jul 23 07:09:34 kubelet[2099]: I0723 07:09:34.969509 2099 kubelet.go:2442] "SyncLoop REMOVE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
[root@ip----~]#
plugin-output.log
2024-07-23 06:59:07,953 INFO [228@MessageListener for CreateAgentListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 06:59:07,970 INFO [228@MessageListener for CreateAgentListener] KubernetesPlugin:72 - [Create Agent] Creating K8s pod with spec: Pod(apiVersion=v1, kind=Pod, metadata=ObjectMeta(annotations={Image=, MaxMemory=, MaxCPU=, Environment=, PodConfiguration=apiVersion: v1
kind: Pod
metadata:
name: pod-gocd-prefix-{{ POD_POSTFIX }}
namespace: gocd-agents
labels:
app: web
spec:
restartPolicy: Never
serviceAccountName: gocd-agent-sa
containers:
- name: gocd-agent-container-{{ CONTAINER_POSTFIX }}
image: accHERE.dkr.ecr.eu-west-1.amazonaws.com/gocd-agent-ubuntu:2.7.0-adf4049
env:
volumeMounts:
- name: persistent-storage
mountPath: /godata/pipelines
volumes:
- name: persistent-storage
persistentVolumeClaim:
claimName: efs-gocd-agent-claim, SpecifiedUsingPodConfiguration=, PodSpecType=yaml, RemoteFile=, RemoteFileType=json, Privileged=false, Elastic-Agent-Job-Identifier={"pipeline_name":"isp_detc_qacore","pipeline_counter":754,"pipeline_label":"detc-manifest-1721148118","stage_name":"create_stack","stage_counter":"2","job_name":"create_stack","job_id":166948}}, creationTimestamp=2024-07-23T06:59:07.970616045Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=null, labels={app=web, Elastic-Agent-Created-By=cd.go.contrib.elasticagent.kubernetes, Elastic-Agent-Job-Id=166948, kind=kubernetes-elastic-agent}, managedFields=[], name=pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192, namespace=gocd-agents, ownerReferences=[], resourceVersion=null, selfLink=null, uid=null, additionalProperties={}), spec=PodSpec(activeDeadlineSeconds=null, affinity=null, automountServiceAccountToken=null, containers=[Container(args=[], command=[], env=[EnvVar(name=GO_EA_SERVER_URL, value=https://gocd.prod.isp.starsops.com/go, valueFrom=null, additionalProperties={}), EnvVar(name=GO_EA_AUTO_REGISTER_KEY, value=fb6145e1-b880-49c1-a603-dfe824fcda50, valueFrom=null, additionalProperties={}), EnvVar(name=GO_EA_AUTO_REGISTER_ELASTIC_AGENT_ID, value=pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192, valueFrom=null, additionalProperties={}), EnvVar(name=GO_EA_AUTO_REGISTER_ELASTIC_PLUGIN_ID, value=cd.go.contrib.elasticagent.kubernetes, valueFrom=null, additionalProperties={})], envFrom=[], image=accHERE.dkr.ecr.eu-west-1.amazonaws.com/gocd-agent-ubuntu:2.7.0-adf4049, imagePullPolicy=null, lifecycle=null, livenessProbe=null, name=gocd-agent-container-a4f3e55d-b06f-4d8b-9f8f-0b62c6012d1c, ports=[], readinessProbe=null, resizePolicy=[], resources=null, restartPolicy=null, securityContext=null, startupProbe=null, stdin=null, stdinOnce=null, terminationMessagePath=null, terminationMessagePolicy=null, tty=null, volumeDevices=[], 
volumeMounts=[VolumeMount(mountPath=/godata/pipelines, mountPropagation=null, name=persistent-storage, readOnly=null, subPath=null, subPathExpr=null, additionalProperties={})], workingDir=null, additionalProperties={})], dnsConfig=null, dnsPolicy=null, enableServiceLinks=null, ephemeralContainers=[], hostAliases=[], hostIPC=null, hostNetwork=null, hostPID=null, hostUsers=null, hostname=null, imagePullSecrets=[], initContainers=[], nodeName=null, nodeSelector={}, os=null, overhead={}, preemptionPolicy=null, priority=null, priorityClassName=null, readinessGates=[], resourceClaims=[], restartPolicy=Never, runtimeClassName=null, schedulerName=null, schedulingGates=[], securityContext=null, serviceAccount=null, serviceAccountName=gocd-agent-sa, setHostnameAsFQDN=null, shareProcessNamespace=null, subdomain=null, terminationGracePeriodSeconds=null, tolerations=[], topologySpreadConstraints=[], volumes=[Volume(awsElasticBlockStore=null, azureDisk=null, azureFile=null, cephfs=null, cinder=null, configMap=null, csi=null, downwardAPI=null, emptyDir=null, ephemeral=null, fc=null, flexVolume=null, flocker=null, gcePersistentDisk=null, gitRepo=null, glusterfs=null, hostPath=null, iscsi=null, name=persistent-storage, nfs=null, persistentVolumeClaim=PersistentVolumeClaimVolumeSource(claimName=efs-gocd-agent-claim, readOnly=null, additionalProperties={}), photonPersistentDisk=null, portworxVolume=null, projected=null, quobyte=null, rbd=null, scaleIO=null, secret=null, storageos=null, vsphereVolume=null, additionalProperties={})], additionalProperties={}), status=null, additionalProperties={}).
2024-07-23 06:59:14,829 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:00:14,829 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:00:14,830 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - Recycling kubernetes client on timer...
2024-07-23 07:01:14,909 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:02:14,830 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:03:14,830 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:04:14,831 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:05:14,832 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:06:14,832 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:07:14,833 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:08:14,833 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:09:14,834 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:10:14,833 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:10:14,833 WARN [230@MessageListener for ServerPingListener] KubernetesPlugin:97 - [Server Ping] Was expecting a containers with IDs [pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192], but it was missing! Removing missing agents from config.
2024-07-23 07:11:14,834 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:12:14,829 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - Recycling kubernetes client on timer...
2024-07-23 07:12:14,902 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:13:14,835 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:14:14,835 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:15:14,836 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:16:14,835 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:17:14,837 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:18:14,836 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:19:14,838 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:20:14,839 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:21:14,838 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:22:14,839 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:23:14,839 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:23:14,840 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - Recycling kubernetes client on timer...
2024-07-23 07:24:14,906 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:25:14,840 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:26:14,841 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:27:14,842 INFO [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
Thanks!
What does the job console show for the job? -> I would like to send the job output log by email, if that is possible.
You should be able to GPG-encrypt it with my key at https://github.com/chadlwilson/chadlwilson/blob/main/gpg-public-key.asc if you'd prefer, but it's usually better to just redact. The bits I am interested in are the "yellow" stuff regarding the server's view of what happened to the job, the job's allocation to a pod, and the timestamps, really - not the specific task details while running.
OK, so what I can see from that is that the pod was deleted via an API call, based on
Jul 23 07:09:32 kubelet[2099]: I0723 07:09:32.516360 2099 kubelet.go:2448] "SyncLoop DELETE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
So the working theory here is perhaps that the GoCD server's elastic agent plugin deleted it? I'm not sure what else could show up as source="api", unless you have some other security-oriented software in your cluster doing so, a manual kubectl call, etc.
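One way to narrow down which client actually issued that DELETE is the API-server audit log. On EKS the control plane can ship audit events to CloudWatch Logs, but only if audit logging is enabled on the cluster. A hedged sketch (the cluster name and time window are placeholders, not values from this thread):

```shell
# Sketch, assuming EKS audit logging is enabled and the AWS CLI is configured.
# <cluster-name> is a placeholder for your actual EKS cluster name.
QUERY_ID=$(aws logs start-query \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --start-time "$(date -u -d '2024-07-23 07:00' +%s)" \
  --end-time   "$(date -u -d '2024-07-23 07:15' +%s)" \
  --query-string 'fields @timestamp, @message
    | filter @logStream like /kube-apiserver-audit/
    | filter @message like /pod-gocd-prefix-4a6f695e/
    | filter @message like /"verb":"delete"/
    | sort @timestamp asc' \
  --query queryId --output text)

# Insights queries run asynchronously; poll for results after a few seconds.
sleep 5
aws logs get-query-results --query-id "$QUERY_ID"
```

The matching audit events carry a `user.username` (and `user.extra`) field, which identifies the service account or IAM principal that deleted the pod.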
Sadly, plugin-output.log isn't telling us clearly why it happened. There's a log line:
2024-07-23 07:10:14,833 WARN [230@MessageListener for ServerPingListener] KubernetesPlugin:97 - [Server Ping] Was expecting a containers with IDs [pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192], but it was missing! Removing missing agents from config.
That is 45 seconds later, noting that it didn't receive a ping, but that is presumably after the pod had already been deleted/killed, so it's possibly just the plugin cleaning up. However, this does seem a bit weird: if the plugin had decided to delete the pod due to non-response, it shouldn't still be waiting for a ping. Hmm.
What do the logs for the actual agent pod say, up until it is deleted/killed/stopped? You can redact the task-specific lines if it is running tasks, but it should at least tell us what was happening, and whether there were any comms issues back to the GoCD server that might cause it to lose contact, or whether it ran out of memory or something.
Is there anything in the kube events for the pod, or in the pod's lifecycle changes, that looks interesting? e.g. OOMKilled events, health/liveness check failures, or pod sandbox problems? (Some of those would normally be in the kubelet log too, but I can't recall off the top of my head, so excuse any ignorance.)
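For reference, the standard commands below surface the pod events and the last container state. This is a sketch; the namespace and pod name are taken from the logs above, and events are only retained briefly (about an hour by default), so they need to be captured soon after a failure:

```shell
# Events recorded against the pod (scheduling, pulls, kills, probe failures)
kubectl -n gocd-agents get events \
  --field-selector involvedObject.name=pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192

# Lifecycle details, including lastState (e.g. OOMKilled) and exit codes
kubectl -n gocd-agents describe pod pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192

# Logs from the previous (terminated) container instance, while the pod object still exists
kubectl -n gocd-agents logs pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192 --previous
```

Once the pod object has been fully removed (as in the `SyncLoop REMOVE` line above), these no longer work, which is why the audit log is the more durable trail for deletions.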
As a sanity check: what is the
gocd-agent-ubuntu:2.7.0-adf4049
image? Is that a child image of https://hub.docker.com/r/gocd/gocd-agent-ubuntu-24.04 or similar, or an entirely custom agent image?

Hi Chad,
Thank you for your detailed analysis and the pointers. After a thorough investigation, we found the root cause of the problem: both the dev and prod GoCD servers were using the same namespace in the cluster, causing conflicts and unexpected pod deletions. Now everything looks like it is working as expected.
Thanks for your help!
Best regards, Ivan Piyvikov
Yes, that makes sense. This could be better supported via the ideas at https://github.com/gocd/kubernetes-elastic-agents/issues/118
Now we've encountered a new issue where the containers exceed their memory limits despite the resource constraints set in the Kubernetes configuration. No matter what we do (setting a limit at the namespace level, on the containers, etc.), the memory usage continues to exceed the specified limits. I suspect this issue may be related to the Java application (Tomcat).
I assume you're launching Tomcat inside a test or task?
Java apps inside containers are tricky, especially with forked processes. While Java processes can be 'container aware' (Java 11+), where there are multiple such processes they are not aware of each other and may collectively request excessive memory. You'll need to control their heap usage when you launch them, with
-Xmx
options, and also allow a buffer for 'off-heap' (aka native) memory usage, which is generally unconstrained in Java apps (it includes the JVM code cache, native buffers used by Java, and the pure native memory needed for the JVM and any native libraries used by Tomcat and your app).
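As an illustration only (the 25% headroom figure below is an assumption for the sketch, not a recommendation from this thread), a rough way to derive an -Xmx value from a container memory limit while leaving room for off-heap usage:

```shell
# Hypothetical sizing sketch: reserve a fraction of the container limit for
# off-heap/native memory and give the remainder to the JVM heap.
LIMIT_MB=2048      # container memory limit (assumed value)
HEADROOM_PCT=25    # off-heap buffer (assumed; tune for your workload)
HEAP_MB=$(( LIMIT_MB * (100 - HEADROOM_PCT) / 100 ))
echo "-Xmx${HEAP_MB}m"
```

On Java 11+ a similar effect is available per-process via -XX:MaxRAMPercentage, but note that each forked process applies it independently of the others, which is exactly how several JVMs together can blow past a single container limit.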
This isn't really specific to GoCD; it's more the nature of multi-process containers, with extra complexities for heap-based (garbage-collected) applications and the way the JVM allocates (and doesn't always release) heap memory.
There are some resources online for tuning containers for JVM ergonomics, but it does depend a lot on your application and JVM usage.
We have identified the issue. There were duplicate metrics from different services left over from monitoring tests. This caused confusion in Grafana when summing the metrics, making it appear that there was a problem with the limits.
Thanks again!
We are experiencing an issue where agents are terminated after 10 minutes while jobs are still running. The setup is as follows:
Steps Taken:
EKS-Cluster-Profile:
Despite these adjustments, the issue persists. We would appreciate any guidance or suggestions for further troubleshooting.