gocd / kubernetes-elastic-agents

Kubernetes Elastic agent plugin for GoCD
https://www.gocd.org
Apache License 2.0

With GoCD Version: 24.2.0 and Kubernetes Elastic Agent plugin: 4.1.0-541, agents are killed after 10 minutes, even when the job has not finished execution yet. #422

Closed ivanpiyvikov closed 1 month ago

ivanpiyvikov commented 1 month ago

We are experiencing an issue where agents are terminated after 10 minutes while jobs are still running. The setup is as follows:

Steps Taken:

EKS-Cluster-Profile:

apiVersion: v1
kind: Pod
metadata:
  name: pod-gocd-prefix-{{ POD_POSTFIX }}
  namespace: gocd-agents
  labels:
    app: web
spec:
  restartPolicy: Never
  serviceAccountName: gocd-agent-sa
  containers:
    - name: gocd-agent-container-{{ CONTAINER_POSTFIX }}
      image: *.dkr.ecr.eu-west-1.amazonaws.com/gocd-agent-ubuntu:2.7.0-adf4049
      env:
      volumeMounts:
        - name: persistent-storage
          mountPath: /godata/pipelines
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-gocd-agent-claim

Despite these adjustments, the issue persists. We would appreciate any guidance or suggestions for further troubleshooting.

chadlwilson commented 1 month ago

What does the job console show for the job? Did the pod register and start running tasks? Did it get rescheduled?

What do the plugin-specific logs on the server say? (There is a log file specific to the plugin in the logs folder, not just the server log file.)

What does kubernetes say happened to the pod? Did kubernetes kill the pod or the container?

Is this a new setup or did something change before this started happening? (Cluster upgrade? GoCD upgrade?)

ivanpiyvikov commented 1 month ago
  1. What does the job console show for the job? -> I would prefer to send the job output log by email, if that is possible.
  2. Did the pod register and start running tasks? -> Yes.
  3. Did it get rescheduled? -> In the pod spec we have set restartPolicy: Never. Without that, it keeps killing and creating new pods.
  4. What do the plugin-specific logs on the server say? -> The kubelet and plugin outputs are below.
  5. What does Kubernetes say happened to the pod? Did Kubernetes kill the pod or the container? -> It is terminating the pod.
  6. Is this a new setup or did something change before this started happening? -> It is a new setup.

kubelet-output.log

[root@ip----~]# journalctl -u kubelet | grep pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.015737    2099 kubelet.go:2432] "SyncLoop ADD" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.015781    2099 topology_manager.go:215] "Topology Admit Handler" podUID="084b0538-4d29-4e2e-b709-9cd654b2d764" podNamespace="gocd-agents" podName="pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.016393    2099 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.150651    2099 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume startedfor volume \"kube-api-access-fzmrk\" (UniqueName: \"kubernetes.io/projected/084b0538-4d29-4e2e-b709-9cd654b2d764-kube-api-access-fzmrk\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.150709    2099 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume startedfor volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.251191    2099 reconciler_common.go:220] "operationExecutor.MountVolume started for volume \"kube-api-access-fzmrk\" (UniqueName: \"kubernetes.io/projected/084b0538-4d29-4e2e-b709-9cd654b2d764-kube-api-access-fzmrk\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.251232    2099 reconciler_common.go:220] "operationExecutor.MountVolume started for volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.252357    2099 operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/efs.csi.aws.com/b8b581699bf8984fa0f74c5cbb85276d36cf340579facca7d82ca0ccae7f4351/globalmount\"" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.267058    2099 operation_generator.go:721] "MountVolume.SetUp succeeded for volume \"kube-api-access-fzmrk\" (UniqueName: \"kubernetes.io/projected/084b0538-4d29-4e2e-b709-9cd654b2d764-kube-api-access-fzmrk\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:08  kubelet[2099]: I0723 06:59:08.958949    2099 operation_generator.go:721] "MountVolume.SetUp succeeded for volume \"efs-gocd-agent-pv\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-03b36c3b9ea54c807\") pod \"pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192\" (UID: \"084b0538-4d29-4e2e-b709-9cd654b2d764\") " pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:09  kubelet[2099]: I0723 06:59:09.223359    2099 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 06:59:09  kubelet[2099]: I0723 06:59:09.653326    2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerStarted","Data":"d1c6bcb4fd9589396a538ce2181a281461fbe1f127c725a65185ea97cbe1430d"}
Jul 23 06:59:09  kubelet[2099]: I0723 06:59:09.653369    2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerStarted","Data":"26872e3d74b7c730cded16bbbf98d38d38c0767f8f9344c3156cab063fa48409"}
Jul 23 06:59:09  kubelet[2099]: I0723 06:59:09.667171    2099 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" podStartSLOduration=2.667160723 podStartE2EDuration="2.667160723s" podCreationTimestamp="2024-07-23 06:59:07 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-07-23 06:59:09.66655598 +0000 UTC m=+1537985.519794742" watchObservedRunningTime="2024-07-23 06:59:09.667160723 +0000 UTC m=+1537985.520399474"
Jul 23 07:09:32  kubelet[2099]: I0723 07:09:32.516360    2099 kubelet.go:2448] "SyncLoop DELETE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
Jul 23 07:09:32  kubelet[2099]: I0723 07:09:32.516497    2099 kuberuntime_container.go:779] "Killing container with a grace period" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" podUID="084b0538-4d29-4e2e-b709-9cd654b2d764" containerName="gocd-agent-container-a4f3e55d-b06f-4d8b-9f8f-0b62c6012d1c" containerID="containerd://d1c6bcb4fd9589396a538ce2181a281461fbe1f127c725a65185ea97cbe1430d" gracePeriod=30
Jul 23 07:09:33  kubelet[2099]: I0723 07:09:33.929956    2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerDied","Data":"d1c6bcb4fd9589396a538ce2181a281461fbe1f127c725a65185ea97cbe1430d"}
Jul 23 07:09:33  kubelet[2099]: I0723 07:09:33.929982    2099 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192" event={"ID":"084b0538-4d29-4e2e-b709-9cd654b2d764","Type":"ContainerDied","Data":"26872e3d74b7c730cded16bbbf98d38d38c0767f8f9344c3156cab063fa48409"}
Jul 23 07:09:33  kubelet[2099]: I0723 07:09:33.950686    2099 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 07:09:34  kubelet[2099]: I0723 07:09:34.931927    2099 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"
Jul 23 07:09:34  kubelet[2099]: I0723 07:09:34.962970    2099 kubelet.go:2448] "SyncLoop DELETE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
Jul 23 07:09:34  kubelet[2099]: I0723 07:09:34.969509    2099 kubelet.go:2442] "SyncLoop REMOVE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]
[root@ip----~]#

plugin-output.log

2024-07-23 06:59:07,953 INFO  [228@MessageListener for CreateAgentListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 06:59:07,970 INFO  [228@MessageListener for CreateAgentListener] KubernetesPlugin:72 - [Create Agent] Creating K8s pod with spec: Pod(apiVersion=v1, kind=Pod, metadata=ObjectMeta(annotations={Image=, MaxMemory=, MaxCPU=, Environment=, PodConfiguration=apiVersion: v1
kind: Pod
metadata:
  name: pod-gocd-prefix-{{ POD_POSTFIX }}
  namespace: gocd-agents
  labels:
    app: web
spec:
  restartPolicy: Never
  serviceAccountName: gocd-agent-sa
  containers:
    - name: gocd-agent-container-{{ CONTAINER_POSTFIX }}
      image: accHERE.dkr.ecr.eu-west-1.amazonaws.com/gocd-agent-ubuntu:2.7.0-adf4049
      env:
      volumeMounts:
        - name: persistent-storage
          mountPath: /godata/pipelines
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-gocd-agent-claim, SpecifiedUsingPodConfiguration=, PodSpecType=yaml, RemoteFile=, RemoteFileType=json, Privileged=false, Elastic-Agent-Job-Identifier={"pipeline_name":"isp_detc_qacore","pipeline_counter":754,"pipeline_label":"detc-manifest-1721148118","stage_name":"create_stack","stage_counter":"2","job_name":"create_stack","job_id":166948}}, creationTimestamp=2024-07-23T06:59:07.970616045Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=null, labels={app=web, Elastic-Agent-Created-By=cd.go.contrib.elasticagent.kubernetes, Elastic-Agent-Job-Id=166948, kind=kubernetes-elastic-agent}, managedFields=[], name=pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192, namespace=gocd-agents, ownerReferences=[], resourceVersion=null, selfLink=null, uid=null, additionalProperties={}), spec=PodSpec(activeDeadlineSeconds=null, affinity=null, automountServiceAccountToken=null, containers=[Container(args=[], command=[], env=[EnvVar(name=GO_EA_SERVER_URL, value=https://gocd.prod.isp.starsops.com/go, valueFrom=null, additionalProperties={}), EnvVar(name=GO_EA_AUTO_REGISTER_KEY, value=fb6145e1-b880-49c1-a603-dfe824fcda50, valueFrom=null, additionalProperties={}), EnvVar(name=GO_EA_AUTO_REGISTER_ELASTIC_AGENT_ID, value=pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192, valueFrom=null, additionalProperties={}), EnvVar(name=GO_EA_AUTO_REGISTER_ELASTIC_PLUGIN_ID, value=cd.go.contrib.elasticagent.kubernetes, valueFrom=null, additionalProperties={})], envFrom=[], image=accHERE.dkr.ecr.eu-west-1.amazonaws.com/gocd-agent-ubuntu:2.7.0-adf4049, imagePullPolicy=null, lifecycle=null, livenessProbe=null, name=gocd-agent-container-a4f3e55d-b06f-4d8b-9f8f-0b62c6012d1c, ports=[], readinessProbe=null, resizePolicy=[], resources=null, restartPolicy=null, securityContext=null, startupProbe=null, stdin=null, stdinOnce=null, terminationMessagePath=null, terminationMessagePolicy=null, tty=null, volumeDevices=[], volumeMounts=[VolumeMount(mountPath=/godata/pipelines, mountPropagation=null, name=persistent-storage, readOnly=null, subPath=null, subPathExpr=null, additionalProperties={})], workingDir=null, additionalProperties={})], dnsConfig=null, dnsPolicy=null, enableServiceLinks=null, ephemeralContainers=[], hostAliases=[], hostIPC=null, hostNetwork=null, hostPID=null, hostUsers=null, hostname=null, imagePullSecrets=[], initContainers=[], nodeName=null, nodeSelector={}, os=null, overhead={}, preemptionPolicy=null, priority=null, priorityClassName=null, readinessGates=[], resourceClaims=[], restartPolicy=Never, runtimeClassName=null, schedulerName=null, schedulingGates=[], securityContext=null, serviceAccount=null, serviceAccountName=gocd-agent-sa, setHostnameAsFQDN=null, shareProcessNamespace=null, subdomain=null, terminationGracePeriodSeconds=null, tolerations=[], topologySpreadConstraints=[], volumes=[Volume(awsElasticBlockStore=null, azureDisk=null, azureFile=null, cephfs=null, cinder=null, configMap=null, csi=null, downwardAPI=null, emptyDir=null, ephemeral=null, fc=null, flexVolume=null, flocker=null, gcePersistentDisk=null, gitRepo=null, glusterfs=null, hostPath=null, iscsi=null, name=persistent-storage, nfs=null, persistentVolumeClaim=PersistentVolumeClaimVolumeSource(claimName=efs-gocd-agent-claim, readOnly=null, additionalProperties={}), photonPersistentDisk=null, portworxVolume=null, projected=null, quobyte=null, rbd=null, scaleIO=null, secret=null, storageos=null, vsphereVolume=null, additionalProperties={})], additionalProperties={}), 
status=null, additionalProperties={}).
2024-07-23 06:59:14,829 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:00:14,829 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:00:14,830 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - Recycling kubernetes client on timer...
2024-07-23 07:01:14,909 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:02:14,830 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:03:14,830 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:04:14,831 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:05:14,832 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:06:14,832 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:07:14,833 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:08:14,833 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:09:14,834 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 1.
2024-07-23 07:10:14,833 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:10:14,833 WARN  [230@MessageListener for ServerPingListener] KubernetesPlugin:97 - [Server Ping] Was expecting a containers with IDs [pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192], but it was missing! Removing missing agents from config.
2024-07-23 07:11:14,834 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:12:14,829 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - Recycling kubernetes client on timer...
2024-07-23 07:12:14,902 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:13:14,835 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:14:14,835 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:15:14,836 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:16:14,835 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:17:14,837 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:18:14,836 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:19:14,838 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:20:14,839 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:21:14,838 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:22:14,839 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:23:14,839 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:23:14,840 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - Recycling kubernetes client on timer...
2024-07-23 07:24:14,906 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:25:14,840 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:26:14,841 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
2024-07-23 07:27:14,842 INFO  [230@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.

chadlwilson commented 1 month ago

Thanks!

What does the job console show for the job? -> I would like to send the job output log to email if it is possible.

You should be able to GPG-encrypt with my key at https://github.com/chadlwilson/chadlwilson/blob/main/gpg-public-key.asc if you'd prefer, but usually better to just redact. The bits I am interested in are the "yellow" stuff regarding the server's view of what happened to the job, the job allocation to a pod, and timestamps, really - not the specific task details when running.

OK, so what I can see from that is that the pod was deleted via an API call, based on:

Jul 23 07:09:32  kubelet[2099]: I0723 07:09:32.516360    2099 kubelet.go:2448] "SyncLoop DELETE" source="api" pods=["gocd-agents/pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192"]

So the working theory here is perhaps that the elastic agent plugin on the GoCD server deleted it? I'm not sure what else could show up as source="api", unless you have some other security-oriented software in your cluster doing so, or a manual kubectl call, etc.

Sadly, plugin-output.log isn't telling us clearly why that happened. There is this line:

2024-07-23 07:10:14,833 WARN  [230@MessageListener for ServerPingListener] KubernetesPlugin:97 - [Server Ping] Was expecting a containers with IDs [pod-gocd-prefix-4a6f695e-12e0-479d-9c5b-d59410c32192], but it was missing! Removing missing agents from config.

That is roughly 40 seconds later, noting that it didn't receive a ping, but that is presumably after the pod has been deleted/killed, so possibly just the plugin cleaning up. However, this does seem a bit weird: if the plugin decided to delete the pod due to non-response, it shouldn't still be waiting for a ping. Hmm.

What do the logs for the actual agent pod say, up until it is deleted/killed/stopped? You can redact the task-specific lines since you say it is running tasks, but it should at least tell us what was happening and whether there were any comms issues back to the GoCD server that might cause it to lose contact, or whether it ran out of memory or something similar.

Is there anything in the kube events for the pod, or in the lifecycle changes for the pod, that looks interesting? e.g. OOMKilled events, health/liveness check failures, or pod sandbox problems? (Some of those would normally be in the kubelet log too, but I can't recall off the top of my head, so excuse any ignorance.)

As a sanity check:

ivanpiyvikov commented 1 month ago

Hi Chad,

Thank you for your detailed analysis and the pointers. After a thorough investigation, we found the root cause of the problem. The issue was due to both dev and prod GoCD servers using the same namespace in the cluster, causing conflicts and unexpected pod deletions. Now everything looks like it is working as expected.

Thanks for your help!

Best regards, Ivan Piyvikov

chadlwilson commented 1 month ago

Yes, that makes sense. This could be better supported via the ideas at https://github.com/gocd/kubernetes-elastic-agents/issues/118
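
For illustration, the separation could look something like this in each server's elastic agent profile pod spec (the namespace names below are hypothetical); with each server creating its agent pods in its own namespace, the two plugins stop interfering with each other's agents:

# Dev server's elastic agent profile pod spec (hypothetical namespace name)
apiVersion: v1
kind: Pod
metadata:
  name: pod-gocd-prefix-{{ POD_POSTFIX }}
  namespace: gocd-agents-dev
spec:
  serviceAccountName: gocd-agent-sa
---
# Prod server's elastic agent profile pod spec (hypothetical namespace name)
apiVersion: v1
kind: Pod
metadata:
  name: pod-gocd-prefix-{{ POD_POSTFIX }}
  namespace: gocd-agents-prod
spec:
  serviceAccountName: gocd-agent-sa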

ivanpiyvikov commented 1 month ago

Now we've encountered a new issue where the containers exceed their memory limits despite the resource constraints set in the Kubernetes configuration. No matter what we do (setting a limit at the namespace level, on the containers, etc.), the memory usage continues to exceed the specified limits. I suspect this may be related to the Java application (Tomcat).
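
(For context, a namespace-level limit of the kind mentioned above would typically be expressed as a LimitRange; the resource name and values here are purely illustrative.)

apiVersion: v1
kind: LimitRange
metadata:
  name: gocd-agent-memory-limits   # illustrative name
  namespace: gocd-agents
spec:
  limits:
    - type: Container
      default:            # limit applied to containers that do not set their own
        memory: 4Gi
      defaultRequest:     # request applied to containers that do not set their own
        memory: 1Gi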

chadlwilson commented 1 month ago

I assume you're launching tomcat inside a test or task?

Java apps inside containers are tricky, especially with forked processes. While Java processes can be 'container aware' (Java 11+), when there are multiple such processes they are not aware of each other and may each request excessive memory. You'll need to control their heap usage when you launch them with -Xmx options, and also allow a buffer for 'off-heap' (aka native) memory usage, which is generally unconstrained in Java apps (this includes the JVM code cache, native buffers used by Java, and the pure native memory needed for the JVM and any native libraries used by Tomcat and your app).

This isn't really specific to GoCD; it's more about the nature of multi-process containers, with extra complexity from heap-based (garbage-collected) applications and the way the JVM allocates (and doesn't always release) heap memory.

There are some resources online for tuning containers for JVM ergonomics, but it does depend a lot on your application and JVM usage.
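
To make that concrete, here is a sketch of the relevant fragment of the agent pod template (the memory and percentage values are assumptions, not recommendations): cap the container's memory via resources, and use JAVA_TOOL_OPTIONS so that any JVM launched inside the container (the agent, Tomcat, test processes) sizes its heap relative to the container limit, leaving headroom for off-heap usage.

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: gocd-agent-container-{{ CONTAINER_POSTFIX }}
      resources:
        requests:
          memory: 2Gi           # illustrative value
        limits:
          memory: 4Gi           # hard cap enforced by Kubernetes (illustrative)
      env:
        # JAVA_TOOL_OPTIONS is read by every JVM started in this container;
        # MaxRAMPercentage (Java 10+) sizes each JVM's heap as a fraction of the
        # container memory limit, leaving room for off-heap/native usage.
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:MaxRAMPercentage=50.0"

Note that several JVMs running at once would each size their heaps independently, so explicit per-process -Xmx values may still be needed, as described above.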

ivanpiyvikov commented 1 month ago

We have identified the issue. There were duplicate metrics from different services left over from monitoring tests. This caused confusion in Grafana when summing the metrics, making it appear that there was a problem with the limits.

Thanks again!