Open ldimaggi opened 6 years ago
This problem seems to be resolved.
My mistake - this problem is still happening.
The sequence I am seeing is that the Jenkins pod fails to start and then hits a quota limit. I am seeing this problem about 100% of the time today (October 22):
3:03:46 PM | jenkins | Deployment Config | Warning | Failed Create | Error creating: pods "jenkins-1-8k57n" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
3:03:30 PM | jenkins-1-5sc4d | Pod | Warning | Unhealthy | Readiness probe failed: HTTP probe failed with statuscode: 503 (14 times in the last 3 minutes)
3:00:19 PM | jenkins-1-znrw5 | Pod | Warning | Failed Mount | Unable to mount volumes for pod "jenkins-1-znrw5_ldimaggi-jenkins(77875b5a-d625-11e8-867f-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "ldimaggi-jenkins"/"jenkins-1-znrw5". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-j259c]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-j259c]
2:59:34 PM | jenkins-1-5sc4d | Pod | Normal | Started | Started container
2:59:34 PM | jenkins-1-5sc4d | Pod | Normal | Created | Created container
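The "exceeded quota" message is machine-parseable. As a quick sanity check (a sketch, not project code; the field names come straight from the event text above), the requested/used/limited values can be pulled apart to confirm that the quota is already fully consumed before the new pod is even created:

```python
import re

MSG = ('Error creating: pods "jenkins-1-8k57n" is forbidden: exceeded quota: '
       'compute-resources, requested: limits.cpu=2,limits.memory=1Gi, '
       'used: limits.cpu=2,limits.memory=1Gi, '
       'limited: limits.cpu=2,limits.memory=1Gi')

def parse_quota_error(msg):
    """Return {'requested': {...}, 'used': {...}, 'limited': {...}}
    parsed from a Kubernetes 'exceeded quota' event message."""
    result = {}
    for field in ("requested", "used", "limited"):
        m = re.search(rf"{field}: (\S+)", msg)
        if m:
            pairs = m.group(1).rstrip(",").split(",")
            result[field] = dict(p.split("=", 1) for p in pairs if "=" in p)
    return result

info = parse_quota_error(MSG)
# True here means the namespace quota is already at its limit, so any
# additional pod creation must be rejected until something is scaled down.
print(all(info["used"][k] == info["limited"][k] for k in info["limited"]))  # -> True
```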
Faced this every time I reset the environment.
Attaching the relevant event logs: exceeded_quota.log
@ldimaggi I am removing the intermittent label. The issue is consistent.
@ldimaggi I experience this too with a user provisioned on the starter-us-east-2a
cluster, today.
Happened to me today as well. Here is the oc get -w ev output:
2018-11-01 14:44:14 +1000 AEST 2018-11-01 14:44:14 +1000 AEST 1 jenkins-3 ReplicationController Warning FailedCreate replication-controller (combined from similar events): Error creating: pods "jenkins-3-nkqjn" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:14 +1000 AEST 2018-11-01 14:44:14 +1000 AEST 2 jenkins-3 ReplicationController Warning FailedCreate replication-controller (combined from similar events): Error creating: pods "jenkins-3-wnjjf" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:14 +1000 AEST 2018-11-01 14:44:14 +1000 AEST 3 jenkins-3 ReplicationController Warning FailedCreate replication-controller (combined from similar events): Error creating: pods "jenkins-3-g4pw5" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:15 +1000 AEST 2018-11-01 14:44:14 +1000 AEST 4 jenkins-3 ReplicationController Warning FailedCreate replication-controller (combined from similar events): Error creating: pods "jenkins-3-cvt2c" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:15 +1000 AEST 2018-11-01 14:44:14 +1000 AEST 5 jenkins-3 ReplicationController Warning FailedCreate replication-controller (combined from similar events): Error creating: pods "jenkins-3-6z95h" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:15 +1000 AEST 2018-11-01 14:44:14 +1000 AEST 6 jenkins-3 ReplicationController Warning FailedCreate replication-controller (combined from similar events): Error creating: pods "jenkins-3-2mljb" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:49 +1000 AEST 2018-11-01 14:44:49 +1000 AEST 1 jenkins-3-6cgv7 Pod Warning FailedMount kubelet, ip-172-31-65-255.us-east-2.compute.internal Unable to mount volumes for pod "jenkins-3-6cgv7_sunil-thaha-jenkins(94348b16-dd90-11e8-867f-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "sunil-thaha-jenkins"/"jenkins-3-6cgv7". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]
And then the mount failure
2018-11-01 14:44:52 +1000 AEST 2018-11-01 14:44:52 +1000 AEST 1 jenkins-3-swccd Pod Warning FailedMount kubelet, ip-172-31-65-255.us-east-2.compute.internal Unable to mount volumes for pod "jenkins-3-swccd_sunil-thaha-jenkins(95b11a9c-dd90-11e8-867f-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "sunil-thaha-jenkins"/"jenkins-3-swccd". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]
@JohnStrunk could this be related to storage as well?
Unlikely. There are 3 "volumes" that failed to attach, and only 1 of them (jenkins-home) is gluster. I believe jenkins-config is a ConfigMap, and jenkins-token-z3pbn is a Secret.
I saw it in yesterday's logs https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/988/ where only jenkins-home is mentioned:
oc get events --sort-by='.lastTimestamp'
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
12m 12m 1 jenkins.1563e09788572b37 DeploymentConfig Normal ReplicationControllerScaled deploymentconfig-controller Scaled replication controller "jenkins-1" from 0 to 1
12m 12m 1 jenkins-1-k2wk6.1563e0979f6acbff Pod Normal SuccessfulMountVolume kubelet, ip-172-22-48-77.ec2.internal MountVolume.SetUp succeeded for volume "jenkins-token-q8x5c"
12m 12m 1 jenkins-1-k2wk6.1563e097b2d253a1 Pod Normal SuccessfulMountVolume kubelet, ip-172-22-48-77.ec2.internal MountVolume.SetUp succeeded for volume "jenkins-config"
12m 12m 1 jenkins-1.1563e0978efc2801 ReplicationController Normal SuccessfulCreate replication-controller Created pod: jenkins-1-k2wk6
12m 12m 1 jenkins-1-k2wk6.1563e0978f766fdf Pod Normal Scheduled default-scheduler Successfully assigned jenkins-1-k2wk6 to ip-172-22-48-77.ec2.internal
5m 10m 3 jenkins-1-k2wk6.1563e0b4379d4112 Pod Warning FailedMount kubelet, ip-172-22-48-77.ec2.internal Unable to mount volumes for pod "jenkins-1-k2wk6_osio-ci-e2e-003-jenkins(081fc8ae-e011-11e8-9da1-12bf27cff69a)": timeout expired waiting for volumes to attach/mount for pod "osio-ci-e2e-003-jenkins"/"jenkins-1-k2wk6". list of unattached/unmounted volumes=[jenkins-home]
3m 3m 1 jenkins-1-k2wk6.1563e11207889c2c Pod Normal SuccessfulMountVolume kubelet, ip-172-22-48-77.ec2.internal MountVolume.SetUp succeeded for volume "1b81dd21-bb90-4a63-a134-973210c1bf2c-07-d1"
3m 3m 1 jenkins-1-k2wk6.1563e117849d79f0 Pod spec.containers{jenkins} Normal Pulling kubelet, ip-172-22-48-77.ec2.internal pulling image "fabric8/jenkins-openshift:v03b76a3"
3m 3m 1 jenkins-1-k2wk6.1563e1180e7a288d Pod spec.containers{jenkins} Normal Pulled kubelet, ip-172-22-48-77.ec2.internal Successfully pulled image "fabric8/jenkins-openshift:v03b76a3"
3m 3m 1 jenkins-1-k2wk6.1563e1183db4d164 Pod spec.containers{jenkins} Normal Created kubelet, ip-172-22-48-77.ec2.internal Created container
3m 3m 1 jenkins-1-k2wk6.1563e118622ee2ad Pod spec.containers{jenkins} Normal Started kubelet, ip-172-22-48-77.ec2.internal Started container
2m 2m 2 jenkins-1-k2wk6.1563e124a9d60f55 Pod spec.containers{jenkins} Warning Unhealthy kubelet, ip-172-22-48-77.ec2.internal Readiness probe failed: Get http://10.128.10.29:8080/login: dial tcp 10.128.10.29:8080: getsockopt: connection refused
11s 1m 7 jenkins-1-k2wk6.1563e1333d525023 Pod spec.containers{jenkins} Warning Unhealthy kubelet, ip-172-22-48-77.ec2.internal Readiness probe failed: HTTP probe failed with statuscode: 503
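For event dumps like the one above, it can help to tally how often each (TYPE, REASON) pair fired using the COUNT column. A rough sketch (the column layout is assumed from the `oc get events` output shown here; since the SUBOBJECT column may be empty, the code locates the TYPE token rather than using a fixed index):

```python
from collections import Counter

def tally_reasons(lines):
    """Sum the COUNT column per (TYPE, REASON) pair in `oc get events` output.

    Columns are whitespace-separated: LAST SEEN, FIRST SEEN, COUNT, NAME,
    KIND, [SUBOBJECT], TYPE, REASON, SOURCE, MESSAGE. SUBOBJECT may be
    empty, so we find the TYPE token (Normal/Warning) and take the next
    token as the REASON instead of relying on a fixed column position.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok in ("Normal", "Warning") and i + 1 < len(tokens):
                counts[(tok, tokens[i + 1])] += int(tokens[2])  # COUNT column
                break
    return counts

# Two lines condensed from the dump above:
sample = [
    '5m 10m 3 jenkins-1-k2wk6.1563e0b4 Pod Warning FailedMount kubelet msg',
    '11s 1m 7 jenkins-1-k2wk6.1563e133 Pod spec.containers{jenkins} Warning Unhealthy kubelet msg',
]
print(tally_reasons(sample))
# -> Counter({('Warning', 'Unhealthy'): 7, ('Warning', 'FailedMount'): 3})
```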
@ppitonak Looks like the mount took a while but did succeed. This is probably the chown issue for which there is a pending fix. I can't really verify however since the PV seems to have been deleted.
I don't know anything about the readiness probe issue.
I have not been seeing this issue recently. Will close in a couple of days if no one else is seeing it too.
There are multiple problems in this issue. One of them is tracked in https://github.com/openshiftio/openshift.io/issues/4598. Another, as @JohnStrunk mentioned, is the slow PV chown/mount issue, which should be faster after the update and after rolling out https://github.com/openshiftio/openshift.io/issues/4568 (we decreased the amount of gluster storage consumed by Jenkins, so less time is spent doing all those chowns).
The fix for the slow mount is now deployed on all clusters; let's see if it makes a difference.
This problem was being seen on December 19-20 by the E2E tests - this resulted in multiple failed test runs.
On December 20, I was able to recreate this problem manually/randomly - but not 100% of the time.
9:42:16 AM | Deployment Config | Warning | Failed Create | Error creating: pods "jenkins-1-w6t7z" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
The problem seems to be most common on this cluster: starter-us-east-2
I am not sure if it is related but starter-us-east-2 is the one that is getting more load compared to other clusters nowadays.
Just noticed that this is still happening:
@sthaha is it related to the issue you were debugging yesterday with adiyta?
[image: screenshot from 2018-12-20 11-33-16] https://user-images.githubusercontent.com/642621/50297581-1d2c1f80-044b-11e9-9932-28d475c8ef6d.png
The pace/rate of retries seems to be much faster now:
1:19:24 PM | jenkins-1-x4h49 | Pod | Warning | Failed Scheduling | persistentvolumeclaim "jenkins-home" not found (6041 times in the last 18 minutes)
And...
12557 times in the last 20 minutes
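For scale, those counts work out to roughly 5-10 failed-scheduling events per second (simple arithmetic on the numbers reported above):

```python
def rate_per_second(count, minutes):
    # Events per second, given a total event count over a window in minutes.
    return count / (minutes * 60)

print(round(rate_per_second(6041, 18), 1))   # -> 5.6
print(round(rate_per_second(12557, 20), 1))  # -> 10.5
```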
For a user account provisioned on starter-us-east-2, this error is being seen ~90% of the time after an environment reset. Are other users also seeing this frequency?
@pbergene did we ever escalate this issue? I feel that we are missing data that would allow us to understand the underlying cause.
@gorkem it looks like a mix of various issues and errors. Is there a common root cause? Can we reproduce it?
The situation is definitely worse than it has been in the past.
Is this the root cause: https://github.com/openshiftio/openshift.io/issues/4668 ?
@pbergene I was thinking about the root cause for 'persistentvolumeclaim "jenkins-home" not found'. I do not think we have been able to identify it.
That's this issue: https://github.com/openshiftio/openshift.io/issues/4475
Issue Overview
User provisioned on starter-us-east-2 seeing 100% failure starting Jenkins after env reset
Expected Behaviour
Jenkins should start after a user resets the environment.
Current Behaviour
After an environment reset, Jenkins is left in this state:
Steps To Reproduce
Additional Information
The user is able to manually start the deployment from the Jenkins project overview tab.