timja / jenkins-gh-issues-poc-06-18

[JENKINS-66209] Jenkins Controller crashing with exitCode: 1 #1597

Open timja opened 3 years ago

timja commented 3 years ago

For the last 4-5 months, we have been experiencing Jenkins controller outages (across multiple versions). The Kubernetes pod fails with exit code 1 and then self-heals:

Output from kubectl describe:

Containers:
  jenkins:
    Container ID:  docker://f494b5939813d0582d7901894a16740178d6c905ff7f87db0bdbbacb63a64367
    Image:         xxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/jenkins-controller:2.277.4-lts-alpine
    Image ID:      docker-pullable://xxxxxxxx5.dkr.ecr.ap-southeast-2.amazonaws.com/jenkins-controller@sha256:56687bb853764312fe1f28d3b8c161738022f4e80da1b9b81ba82ba929426d1d
    Ports:         8080/TCP, 50000/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --httpPort=8080
    State:          Running
      Started:      Fri, 23 Jul 2021 10:24:42 +1000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 23 Jul 2021 09:38:37 +1000
      Finished:     Fri, 23 Jul 2021 10:24:40 +1000
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     3
      memory:  22Gi
    Requests:
      cpu:     2
      memory:  20Gi
    Liveness:   http-get http://:http/login delay=0s timeout=5s period=20s #success=1 #failure=100
    Readiness:  http-get http://:http/login delay=0s timeout=5s period=10s #success=1 #failure=100
    Startup:    http-get http://:http/login delay=0s timeout=5s period=100s #success=1 #failure=5
    Environment:
      POD_NAME:                  jenkins-blue-0 (v1:metadata.name)
      JAVA_OPTS:
      JENKINS_OPTS:
      JENKINS_SLAVE_AGENT_PORT:  50000
      JAVA_OPTS:                 XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxRAMPercentage=50.0 -Xloggc:/var/jenkins_home/log/gc%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=20m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:+AlwaysPreTouch -Duser.timezone=Australia/Sydney
      CASC_JENKINS_CONFIG:       /var/jenkins_home/casc_configs
    Mounts:
      /var/jenkins_config from jenkins-config (ro)
      /var/jenkins_home from jenkins-home (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from jenkins-controller-sa-token-67wfs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  plugins:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  jenkins-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      jenkins-blue
    Optional:  false
  jenkins-home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  jenkins-blue
    ReadOnly:   false
  sc-config-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  jenkins-controller-sa-token-67wfs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  jenkins-controller-sa-token-67wfs
    Optional:    false
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From                                                      Message
  ----     ------     ----                ----                                                      -------
  Warning  Unhealthy  45m (x9 over 91m)   kubelet, ip-10-90-111-59.ap-southeast-2.compute.internal  Liveness probe failed: Get "http://10.90.110.107:8080/login": dial tcp 10.90.110.107:8080: connect: connection refused
  Warning  Unhealthy  45m (x19 over 91m)  kubelet, ip-10-90-111-59.ap-southeast-2.compute.internal  Readiness probe failed: Get "http://10.90.110.107:8080/login": dial tcp 10.90.110.107:8080: connect: connection refused
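
For reference when reading the probe settings in the describe output: a probe only takes effect after its failure threshold of consecutive failures, so the tolerated window is roughly period times threshold. Here is a minimal sketch of that arithmetic, using the three probe lines copied from the output (the helper name failure_window_seconds is made up for illustration, and the probe timeout is ignored):

```python
import re

# Probe lines copied from the kubectl describe output.
PROBES = {
    "Liveness": "http-get http://:http/login delay=0s timeout=5s period=20s #success=1 #failure=100",
    "Readiness": "http-get http://:http/login delay=0s timeout=5s period=10s #success=1 #failure=100",
    "Startup": "http-get http://:http/login delay=0s timeout=5s period=100s #success=1 #failure=5",
}


def failure_window_seconds(probe: str) -> int:
    """Approximate seconds of consecutive failures before the probe trips."""
    period = int(re.search(r"period=(\d+)s", probe).group(1))
    failures = int(re.search(r"#failure=(\d+)", probe).group(1))
    return period * failures


for name, spec in PROBES.items():
    window = failure_window_seconds(spec)
    print(f"{name}: about {window}s ({window / 60:.1f} min) of consecutive failures")
```

With period=20s and #failure=100, the liveness probe would need about 2000 s of consecutive failures before the kubelet restarts the container. The events list only 9 liveness failures over 91 minutes, which suggests (but does not prove) that the restart came from the Jenkins process exiting on its own and not from the probe.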

 

There isn't much in the Jenkins core logs, except the error:

webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
2021-07-22 23:38:54.897+0000 [id=1] INFO winstone.Logger#logInternal: Beginning extraction from war file
Running from: /usr/share/jenkins/jenkins.war
2021-07-22 23:38:54.709+0000 [id=1] INFO org.eclipse.jetty.util.log.Log#initialized: Logging initialized @916ms to org.eclipse.jetty.util.log.JavaUtilLog
exitCode: 0
Scanning success.
exitCode: 1
exitCode: 1
exitCode: 1
exitCode: 1
Terminated Kubernetes instance for agent jenkins/test-automation-virtual-device-virtual-device-e2e-390818--hmjsm
Disconnected computer test-automation-virtual-device-virtual-device-e2e-390818--hmjsm
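
The timestamps in the log excerpt are in UTC (+0000), while the kubectl describe output uses +1000. A small sketch for lining the two up, assuming the node and Jenkins clocks are in sync; the two input strings are copied from the excerpts in this report and everything else is illustrative:

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# "Started" of the terminated container, from the kubectl describe output.
container_started = parsedate_to_datetime("Fri, 23 Jul 2021 09:38:37 +1000")

# Timestamp of the "Beginning extraction from war file" log line.
log_time = datetime.strptime("2021-07-22 23:38:54.897+0000",
                             "%Y-%m-%d %H:%M:%S.%f%z")

# Both values are timezone-aware, so they can be subtracted directly.
offset = log_time - container_started

print("log line in pod-local time:", log_time.astimezone(container_started.tzinfo))
print("seconds after container start:", offset.total_seconds())
```

The log line falls roughly 18 seconds after the start time of the container that later terminated, so these timestamped lines appear to belong to the startup of that container, not to the moment it exited.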

 

For the past week we have been restarting the Jenkins controller nightly, but it hasn't helped.


Originally reported by nikhilp, imported from: Jenkins Controller crashing with exitCode: 1
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 3 years ago

timja:

Do you have any metrics that show the controller CPU and memory usage over time?
Any performance issues?
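
As context for the memory question, here is a rough sketch of the heap ceiling implied by the settings in the report. It assumes the JVM detects the container memory limit and applies -XX:MaxRAMPercentage=50.0 to it; the helper parse_quantity is made up for illustration and only handles the binary suffixes that appear in the pod spec:

```python
# Binary suffixes used by Kubernetes memory quantities.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '22Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


MAX_RAM_PERCENTAGE = 50.0               # from JAVA_OPTS in the report
limit_bytes = parse_quantity("22Gi")    # container memory limit
request_bytes = parse_quantity("20Gi")  # container memory request

# Assumption: max heap = limit * MaxRAMPercentage / 100.
max_heap_bytes = int(limit_bytes * MAX_RAM_PERCENTAGE / 100)
non_heap_headroom = limit_bytes - max_heap_bytes

print(f"max heap: {max_heap_bytes / 2**30:.1f} GiB")
print(f"headroom for non-heap memory: {non_heap_headroom / 2**30:.1f} GiB")
```

Under that assumption the heap is capped at about 11 GiB, leaving about 11 GiB of the 22Gi limit for metaspace, thread stacks, and other native memory. Usage graphs over time would show whether either side gets close to its ceiling before a crash.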