User provisioned on starter-us-east-2 seeing frequent/recurring/sometimes constant failure starting Jenkins after env reset

ldimaggi commented 6 years ago

Issue Overview

User provisioned on starter-us-east-2 seeing 100% failure starting Jenkins after env reset

Expected Behaviour

Jenkins should start after a user resets the environment.

Current Behaviour

After an environment reset, Jenkins is left in this state: screenshot from 2018-10-17 09-53-51

Steps To Reproduce

Reset a user's environment
Observe that Jenkins is not deployed or started

Additional Information

The user is able to manually start the deployment from the Jenkins project overview tab.

ldimaggi commented 6 years ago

This problem seems to be resolved.

ldimaggi commented 6 years ago

My mistake - this problem is still happening.

ldimaggi commented 6 years ago

The sequence that I am seeing is that the Jenkins pod fails to start - and then encounters a quota limit - seeing this problem about 100% of the time today (October 22):

kishansagathiya commented 6 years ago

I hit this every time I reset the environment.

Attaching the relevant event logs: exceeded_quota.log

kishansagathiya commented 6 years ago

@ldimaggi I am removing the intermittent label. The issue is consistent.

pmacik commented 6 years ago

@ldimaggi I experience this too with a user provisioned on the starter-us-east-2a cluster, today.

sthaha commented 6 years ago

Happened to me today as well. Here is the oc get -w ev output

2018-11-01 14:44:14 +1000 AEST   2018-11-01 14:44:14 +1000 AEST   1         jenkins-3   ReplicationController             Warning   FailedCreate   replication-controller   (combined from similar events): Error creating: pods "jenkins-3-nkqjn" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:14 +1000 AEST   2018-11-01 14:44:14 +1000 AEST   2         jenkins-3   ReplicationController             Warning   FailedCreate   replication-controller   (combined from similar events): Error creating: pods "jenkins-3-wnjjf" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:14 +1000 AEST   2018-11-01 14:44:14 +1000 AEST   3         jenkins-3   ReplicationController             Warning   FailedCreate   replication-controller   (combined from similar events): Error creating: pods "jenkins-3-g4pw5" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:15 +1000 AEST   2018-11-01 14:44:14 +1000 AEST   4         jenkins-3   ReplicationController             Warning   FailedCreate   replication-controller   (combined from similar events): Error creating: pods "jenkins-3-cvt2c" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:15 +1000 AEST   2018-11-01 14:44:14 +1000 AEST   5         jenkins-3   ReplicationController             Warning   FailedCreate   replication-controller   (combined from similar events): Error creating: pods "jenkins-3-6z95h" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:15 +1000 AEST   2018-11-01 14:44:14 +1000 AEST   6         jenkins-3   ReplicationController             Warning   FailedCreate   replication-controller   (combined from similar events): Error creating: pods "jenkins-3-2mljb" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
2018-11-01 14:44:49 +1000 AEST   2018-11-01 14:44:49 +1000 AEST   1         jenkins-3-6cgv7   Pod                 Warning   FailedMount   kubelet, ip-172-31-65-255.us-east-2.compute.internal   Unable to mount volumes for pod "jenkins-3-6cgv7_sunil-thaha-jenkins(94348b16-dd90-11e8-867f-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "sunil-thaha-jenkins"/"jenkins-3-6cgv7". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]

And then the mount failure

2018-11-01 14:44:52 +1000 AEST   2018-11-01 14:44:52 +1000 AEST   1         jenkins-3-swccd   Pod                 Warning   FailedMount   kubelet, ip-172-31-65-255.us-east-2.compute.internal   Unable to mount volumes for pod "jenkins-3-swccd_sunil-thaha-jenkins(95b11a9c-dd90-11e8-867f-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "sunil-thaha-jenkins"/"jenkins-3-swccd". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]
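
For readers comparing the FailedCreate events above, here is a minimal Python sketch that splits one of those messages into its requested, used, and limited parts. The helper names are hypothetical and the comparison is a plain string match with no unit conversion. In the sample, used already equals limited for both resources, so any new pod that requests CPU or memory limits is rejected. That would be consistent with another pod still being counted against the quota, though the thread does not confirm the cause.

```python
import re

# Sample message copied from the FailedCreate events in this thread.
MESSAGE = (
    'Error creating: pods "jenkins-3-nkqjn" is forbidden: exceeded quota: '
    "compute-resources, requested: limits.cpu=2,limits.memory=1Gi, "
    "used: limits.cpu=2,limits.memory=1Gi, "
    "limited: limits.cpu=2,limits.memory=1Gi"
)


def parse_quota_message(message):
    """Return the requested/used/limited fields as dicts of raw strings."""
    parsed = {}
    for field in ("requested", "used", "limited"):
        match = re.search(rf"{field}: (\S+)", message)
        if match is None:
            raise ValueError(f"field {field!r} not found in message")
        # Drop the trailing comma that separates one field from the next.
        pairs = match.group(1).rstrip(",").split(",")
        parsed[field] = dict(pair.split("=", 1) for pair in pairs)
    return parsed


def exhausted_resources(message):
    """List resources whose 'used' value already equals the 'limited' value."""
    parsed = parse_quota_message(message)
    return sorted(
        name
        for name, used in parsed["used"].items()
        if parsed["limited"].get(name) == used
    )


if __name__ == "__main__":
    print(exhausted_resources(MESSAGE))
```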

gorkem commented 6 years ago

@JohnStrunk could this be related to storage as well?

JohnStrunk commented 6 years ago

@JohnStrunk could this be related to storage as well?

Unlikely. There are 3 "volumes" that failed to attach, and only 1 of them (jenkins-home) is gluster. I believe jenkins-config is a ConfigMap, and jenkins-token-z3pbn is a Secret.

ppitonak commented 6 years ago

I saw it in yesterday's logs: https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/988/

only jenkins-home is mentioned

oc get events --sort-by='.lastTimestamp'
LAST SEEN   FIRST SEEN   COUNT     NAME                               KIND                    SUBOBJECT                  TYPE      REASON                        SOURCE                                  MESSAGE
12m         12m          1         jenkins.1563e09788572b37           DeploymentConfig                                   Normal    ReplicationControllerScaled   deploymentconfig-controller             Scaled replication controller "jenkins-1" from 0 to 1
12m         12m          1         jenkins-1-k2wk6.1563e0979f6acbff   Pod                                                Normal    SuccessfulMountVolume         kubelet, ip-172-22-48-77.ec2.internal   MountVolume.SetUp succeeded for volume "jenkins-token-q8x5c" 
12m         12m          1         jenkins-1-k2wk6.1563e097b2d253a1   Pod                                                Normal    SuccessfulMountVolume         kubelet, ip-172-22-48-77.ec2.internal   MountVolume.SetUp succeeded for volume "jenkins-config" 
12m         12m          1         jenkins-1.1563e0978efc2801         ReplicationController                              Normal    SuccessfulCreate              replication-controller                  Created pod: jenkins-1-k2wk6
12m         12m          1         jenkins-1-k2wk6.1563e0978f766fdf   Pod                                                Normal    Scheduled                     default-scheduler                       Successfully assigned jenkins-1-k2wk6 to ip-172-22-48-77.ec2.internal
5m          10m          3         jenkins-1-k2wk6.1563e0b4379d4112   Pod                                                Warning   FailedMount                   kubelet, ip-172-22-48-77.ec2.internal   Unable to mount volumes for pod "jenkins-1-k2wk6_osio-ci-e2e-003-jenkins(081fc8ae-e011-11e8-9da1-12bf27cff69a)": timeout expired waiting for volumes to attach/mount for pod "osio-ci-e2e-003-jenkins"/"jenkins-1-k2wk6". list of unattached/unmounted volumes=[jenkins-home]
3m          3m           1         jenkins-1-k2wk6.1563e11207889c2c   Pod                                                Normal    SuccessfulMountVolume         kubelet, ip-172-22-48-77.ec2.internal   MountVolume.SetUp succeeded for volume "1b81dd21-bb90-4a63-a134-973210c1bf2c-07-d1" 
3m          3m           1         jenkins-1-k2wk6.1563e117849d79f0   Pod                     spec.containers{jenkins}   Normal    Pulling                       kubelet, ip-172-22-48-77.ec2.internal   pulling image "fabric8/jenkins-openshift:v03b76a3"
3m          3m           1         jenkins-1-k2wk6.1563e1180e7a288d   Pod                     spec.containers{jenkins}   Normal    Pulled                        kubelet, ip-172-22-48-77.ec2.internal   Successfully pulled image "fabric8/jenkins-openshift:v03b76a3"
3m          3m           1         jenkins-1-k2wk6.1563e1183db4d164   Pod                     spec.containers{jenkins}   Normal    Created                       kubelet, ip-172-22-48-77.ec2.internal   Created container
3m          3m           1         jenkins-1-k2wk6.1563e118622ee2ad   Pod                     spec.containers{jenkins}   Normal    Started                       kubelet, ip-172-22-48-77.ec2.internal   Started container
2m          2m           2         jenkins-1-k2wk6.1563e124a9d60f55   Pod                     spec.containers{jenkins}   Warning   Unhealthy                     kubelet, ip-172-22-48-77.ec2.internal   Readiness probe failed: Get http://10.128.10.29:8080/login: dial tcp 10.128.10.29:8080: getsockopt: connection refused
11s         1m           7         jenkins-1-k2wk6.1563e1333d525023   Pod                     spec.containers{jenkins}   Warning   Unhealthy                     kubelet, ip-172-22-48-77.ec2.internal   Readiness probe failed: HTTP probe failed with statuscode: 503
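
The two FailedMount messages in this thread word their volume lists differently ("list of unmounted volumes=[...]" in the earlier one, "list of unattached/unmounted volumes=[...]" here). A small Python sketch for pulling the volume names out of either form follows. The function name is hypothetical and the regular expression only covers the two wordings seen in this issue.

```python
import re

# Matches both wordings seen in this thread:
#   list of unmounted volumes=[a b c]
#   list of unattached/unmounted volumes=[a]
_VOLUMES = re.compile(r"list of (?:unattached/)?unmounted volumes=\[([^\]]*)\]")


def unmounted_volumes(message):
    """Return the volume names listed in a FailedMount message, or []."""
    match = _VOLUMES.search(message)
    return match.group(1).split() if match else []


if __name__ == "__main__":
    earlier = (
        "timeout expired waiting for volumes to attach or mount for pod "
        '"sunil-thaha-jenkins"/"jenkins-3-6cgv7". list of unmounted '
        "volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]. list of "
        "unattached volumes=[jenkins-home jenkins-config jenkins-token-z3pbn]"
    )
    later = (
        "timeout expired waiting for volumes to attach/mount for pod "
        '"osio-ci-e2e-003-jenkins"/"jenkins-1-k2wk6". list of '
        "unattached/unmounted volumes=[jenkins-home]"
    )
    print(unmounted_volumes(earlier))
    print(unmounted_volumes(later))
```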

JohnStrunk commented 6 years ago

@ppitonak Looks like the mount took a while but did succeed. This is probably the chown issue for which there is a pending fix. I can't really verify however since the PV seems to have been deleted.

I don't know anything about the readiness probe issue.

ldimaggi commented 5 years ago

I have not been seeing this issue recently. I will close it in a couple of days if no one else is seeing it either.

chmouel commented 5 years ago

There are multiple problems in this issue. One is tracked in https://github.com/openshiftio/openshift.io/issues/4598 and another, as @JohnStrunk mentioned, is the slow PV chown/mount, which should be faster after the update and after rolling out https://github.com/openshiftio/openshift.io/issues/4568 (which decreases the amount of gluster storage consumed by Jenkins, so the chowns take less time).

gorkem commented 5 years ago

The fix for the slow mount is now deployed on all clusters. Let's see if it makes a difference.

ldimaggi commented 5 years ago

This problem was being seen on December 19-20 by the E2E tests - this resulted in multiple failed test runs.

On December 20, I was able to recreate this problem manually/randomly - but not 100% of the time.

Failed Create  
Error creating: pods "jenkins-1-w6t7z" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
9:42:16 AM  
Deployment Config

ldimaggi commented 5 years ago

The problem seems to be most common on this cluster: starter-us-east-2

gorkem commented 5 years ago

I am not sure if it is related but starter-us-east-2 is the one that is getting more load compared to other clusters nowadays.

ldimaggi commented 5 years ago

Just noticed that this is still happening:

screenshot from 2018-12-20 11-33-16

chmouel commented 5 years ago

@sthaha is it related to the issue you were debugging yesterday with adiyta?

ldimaggi commented 5 years ago

The pace/rate of retries seems to be much faster now:

1:19:24 PM | jenkins-1-x4h49 | Pod | Warning | Failed Scheduling | persistentvolumeclaim "jenkins-home" not found | 6041 times in the last 18 minutes

And...

12557 times in the last 20 minutes
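
To put the two counts above on a common scale, the arithmetic can be sketched in Python as below. It assumes both counts refer to the same Failed Scheduling event and that the console windows are exact, neither of which the thread states. Under those assumptions the rates come to roughly 5.6 and 10.5 events per second.

```python
def events_per_second(count, minutes):
    """Average event rate over a window given in minutes."""
    return count / (minutes * 60)


# Counts and windows as reported in the comment above.
first_rate = events_per_second(6041, 18)
second_rate = events_per_second(12557, 20)

if __name__ == "__main__":
    print(f"{first_rate:.1f} events/s")
    print(f"{second_rate:.1f} events/s")
```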

ldimaggi commented 5 years ago

screenshot from 2019-01-07 20-51-57

ldimaggi commented 5 years ago

For a user account provisioned on starter-us-east-2, this error is being seen ~90% of the time after an environment reset. Are other users also seeing this frequency?

screenshot from 2019-01-08 11-31-32

gorkem commented 5 years ago

@pbergene did we ever escalate this issue? I feel that we are missing data that would allow us to understand the underlying cause.

pbergene commented 5 years ago

@gorkem it looks like a mix of various issues and errors. Is there a common root cause? Can we reproduce it?

ldimaggi commented 5 years ago

The situation is definitely worse than it has been in the past.

Is this the root cause: https://github.com/openshiftio/openshift.io/issues/4668 ?

gorkem commented 5 years ago

@pbergene I was thinking about the root cause for 'persistentvolumeclaim "jenkins-home" not found '. I do not think we have been able to identify it.

ldimaggi commented 5 years ago

That's this issue: https://github.com/openshiftio/openshift.io/issues/4475
