ppitonak opened this issue 5 years ago
@ppitonak I am not able to reproduce this issue from my account.
Happened again yesterday, again on us-east-1a with the same account.
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1647/ Nov 29, 2018 6:35:00 PM
Out of 22 runs, 3 failed with this error. The job runs every 2 hours.
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1656/ Nov 30, 2018 12:35:00 PM
... again the same cluster
I've seen this error on all clusters 1-2 times during the weekend, for example:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1159/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1152/console
There is one thing common in the oc logs:
oc get all
NAME READY STATUS RESTARTS AGE
pod/jenkins-1-deploy 0/1 DeadlineExceeded 0 3h
Usually, there are no events in the log, maybe because the jobs run only every 4 hours. However, in one job, I've seen quite a lot of log events like this:
Error creating: pods "jenkins-1-j5j6z" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
It seems like the idler is trying to unidle Jenkins, but the new pod could not come up because of the resource quota, while the old pod is stuck in the DeadlineExceeded state.
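To make that failure mode concrete, here is a minimal sketch (assuming a recent client-go; hasQuotaHeadroom is a hypothetical helper, not actual idler code) of the kind of pre-flight check the unidling path could do against the compute-resources quota:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasQuotaHeadroom reports whether the "compute-resources" quota in the
// user's Jenkins namespace still has room for another pod's CPU/memory
// limits. If used == hard (as in the events above), a freshly unidled
// pod can never be created until the old one is cleaned up.
func hasQuotaHeadroom(ctx context.Context, c kubernetes.Interface, ns string) (bool, error) {
	q, err := c.CoreV1().ResourceQuotas(ns).Get(ctx, "compute-resources", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, r := range []corev1.ResourceName{corev1.ResourceLimitsCPU, corev1.ResourceLimitsMemory} {
		hard, used := q.Status.Hard[r], q.Status.Used[r]
		if used.Cmp(hard) >= 0 {
			return false, nil // quota exhausted: the stuck pod must be removed first
		}
	}
	return true, nil
}
```

If the check fails, the idler would need to clean up the DeadlineExceeded deployer pod (or scale the old RC down) before trying to unidle again.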
what is the next step?
This issue still occurs and affects the E2E tests, so I'd also like to see some suggestions on what we could do about it.
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1169/console
@chmouel @stevengutz @ldimaggi
This log also looks interesting: http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1724/oc-jenkins-logs.txt
Extract:
oc get all
NAME READY STATUS RESTARTS AGE
pod/jenkins-1-xz7b7 0/1 Terminating 0 1m
.......
kubelet, ip-172-21-55-225.ec2.internal Killing container with id docker://jenkins:Need to kill Pod
1m 1m 1 jenkins-1-xz7b7.156da5ae712564e6 Pod Normal Scheduled default-scheduler Successfully assigned osio-ci-e2e-002-jenkins/jenkins-1-xz7b7 to ip-172-21-50-152.ec2.internal
1m 1m 1 jenkins-1.156da5ae6ef05c05 ReplicationController Normal SuccessfulCreate replication-controller Created pod: jenkins-1-xz7b7
1m 1m 1 jenkins-1.156da5ae77087b9d ReplicationController Normal SuccessfulDelete replication-controller Deleted pod: jenkins-1-xz7b7
1m 22m 2 jenkins.156da48c68a94215 DeploymentConfig Normal ReplicationControllerScaled deploymentconfig-controller Scaled replication controller "jenkins-1" from 1 to 0
1m 1m 1 jenkins-1.156da5aea63ff670 ReplicationController Warning FailedCreate replication-controller Error creating: pods "jenkins-1-r9xt7" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
@ljelinkova This is hitting the resource quota for sure. But it's not showing any other pod, so I don't know exactly how it's reaching the quota. This is something we need to debug.
@piyush-garg The number of failed e2e tests is increasing, so please give it a high priority.
Happened twice in the last 40 minutes: https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4272/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1181/console
Jenkins seems to unidle properly and start the job, but it failed to promote the build, and accessing the Jenkins UI directly resulted in the error this issue is about.
Jenkins pod is still running and there are no unusual OpenShift events.
pod/jenkins-1-sv9bp 1/1 Running 0 21m
This part of the Jenkins pod log could be useful:
INFO: Terminating Kubernetes instance for agent jenkins-slave-2b956-qb9ft
Dec 06, 2018 3:18:50 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: Computer.threadPoolForRemoting [#32] for jenkins-slave-2b956-qb9ft terminated
java.nio.channels.ClosedChannelException
at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
at hudson.remoting.Channel.close(Channel.java:1450)
at hudson.remoting.Channel.close(Channel.java:1403)
at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:821)
at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:105)
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:737)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Another failure, the same story. There were three successful runs before this one, so it's probably not caused by something broken in a previous run. https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4274/
I have more information about what is going on.
WARNING: Computer.threadPoolForRemoting [#34] for jenkins-slave-fzpvb-tvc99 terminated
SEVERE https://jenkins.api.prod-preview.openshift.io/api/jenkins/start - Failed to load resource: the server responded with a status of 500 (Internal Server Error)
pod/jenkins-1-q9gfr 1/1 Running 0 1h
Then https://github.com/openshiftio/openshift.io/issues/3802 comes into play.
pod/jenkins-1-q9gfr 1/1 Running 0 3h
pod/jenkins-1-q9gfr 1/1 Running 0 19h
Result: when I click "See additional details in Jenkins", I see the following error in the browser instead of the Jenkins UI:
{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}
@ppitonak where are we seeing the following log?
{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}
@ppitonak yeah, can see now with prod-preview Jenkins.
The token was updated. Should be fixed by now. Please reopen if it's still not working for you.
Reopening because the fix did not work and it may not be the root cause of this issue.
@ljelinkova Is it still happening on prod-preview or east-2a cluster?
I can access the Jenkins instance that is on east-2a.
The issue is still that we are getting an invalid token for free-stg. We (the idler) get the clusters from auth via the /api/clusters REST path. On prod-preview we get tokens for two clusters, free-stg and east-2a.
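For debugging, a rough sketch of that flow from the outside (the auth base URL, the AUTH_SA_TOKEN env var, and the JSON field names are my assumptions about the payload, not confirmed API details): fetch the cluster list from /api/clusters and probe each cluster with the token it returns, which is the same check the oc login commands below do manually.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Shape of the /api/clusters response; these field names are a guess at
// the auth service payload, not confirmed against its API docs.
type clusterList struct {
	Data []struct {
		APIURL string `json:"api-url"`
		Token  string `json:"token"`
	} `json:"data"`
}

func main() {
	// Hypothetical: authenticate to auth with a service-account token.
	saToken := os.Getenv("AUTH_SA_TOKEN")
	req, _ := http.NewRequest("GET", "https://auth.prod-preview.openshift.io/api/clusters", nil)
	req.Header.Set("Authorization", "Bearer "+saToken)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list clusterList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}

	for _, c := range list.Data {
		// /oapi/v1/users/~ is the OpenShift 3.x "whoami" endpoint; a 401 here
		// is the same failure oc login reports as "token ... invalid or expired".
		probe, _ := http.NewRequest("GET", c.APIURL+"oapi/v1/users/~", nil) // api-url ends with "/"
		probe.Header.Set("Authorization", "Bearer "+c.Token)
		r, err := http.DefaultClient.Do(probe)
		if err != nil {
			fmt.Println(c.APIURL, "error:", err)
			continue
		}
		r.Body.Close()
		fmt.Println(c.APIURL, "->", r.Status) // 200 = token OK, 401 = bad token
	}
}
```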
Manually using the token provided by auth for east-2a works:
% oc login https://api.starter-us-east-2a.openshift.com/ --token=XXXXX
Logged into "https://api.starter-us-east-2a.openshift.com:443" as "devtools-sre" using the token provided.
You have access to 9120 projects, the list has been suppressed. You can list all projects with 'oc projects'
The token for free-stg is invalid:
% oc login https://api.free-stg.openshift.com/ --token=XXXX
error: The token provided is invalid or expired.
So we really need a proper token for free-stg. Having said that, we should handle these failures gracefully in the idler instead of retrying billions of times and then running out of memory in the pods. I have logged an issue to track that.
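A minimal sketch of what "gracefully" could mean here, assuming a hypothetical op callback rather than real idler code: capped exponential backoff with jitter and a retry budget, so a permanently invalid token degrades into an occasional probe instead of a hot loop.

```go
package main

import (
	"math/rand"
	"time"
)

// retryWithBackoff retries op with capped exponential backoff plus jitter,
// giving up after a fixed budget instead of retrying forever.
func retryWithBackoff(op func() error) error {
	const (
		maxRetries = 8
		baseDelay  = 2 * time.Second
		maxDelay   = 5 * time.Minute
	)
	delay := baseDelay
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Jitter spreads retries out so workers for many tenants
		// don't hammer the cluster API in lockstep.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay/2))))
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return err // surface the last error, e.g. mark the cluster unhealthy
}
```

The same budget could also gate whole-cluster checks, so a cluster with a bad token is probed once per backoff window rather than once per tenant.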
A new token was deployed by the SRE team and it seems to work; we don't see the 401 errors in the idler logs anymore.
@ppitonak @ljelinkova Can you please check if the issue has been resolved now? Thanks
I just started the e2e tests and it seems like Jenkins is working fine.
URLs:
http://artifacts.ci.centos.org/devtools/e2e/devtools-saas-openshiftio-e2e-smoketest-released/393/
http://artifacts.ci.centos.org/devtools/e2e/devtools-saas-openshiftio-e2e-smoketest-beta/387/
@piyush-garg we are monitoring it closely but it's too soon to judge. Let's give it a couple of hours before closing this issue.
Thanks, @ppitonak. No problem. Judging from the last run it seems to be fixed, but close it whenever you are sure about it. Thanks a lot.
This issue has NOT been caused by an expired token. We have new occurrences:
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4372/console https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4377/console https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4373/console
The same in production.
Do you have the link to the openshift logs?
I currently have an issue on prod with Gluster timing out:
45s 45s 1 jenkins-5-w4p6m.156f3e3c44242486 Pod Warning FailedMount kubelet, ip-172-31-66-195.us-east-2.compute.internal Unable to mount volumes for pod "jenkins-5-w4p6m_cboudjna-jenkins(6ea9ed2e-fd29-11e8-a5d6-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "cboudjna-jenkins"/"jenkins-5-w4p6m". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]
34s 34s 1 jenkins-5-xnbjf.156f3e3eaeaa0631 Pod Warning FailedMount kubelet, ip-172-31-66-159.us-east-2.compute.internal Unable to mount volumes for pod "jenkins-5-xnbjf_cboudjna-jenkins(74ce6aff-fd29-11e8-a5d6-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "cboudjna-jenkins"/"jenkins-5-xnbjf". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]
If that's the log, this seems to correlate, although the error seems different:
3m 58m 24950 jenkins-1-hrm5s.156f0a4d8afbbfd4 Pod Warning FailedScheduling default-scheduler persistentvolumeclaim "jenkins-home" not found
Can you show me where the output of oc logs dc/jenkins is, please? I'd like to know why the readiness probe failed.
The deployment config section starts with
---------- Jenkins deployment config ---------------
oc get -o yaml dc/jenkins
If you want us to add any other command to the script, just specify what we should add.
@ljelinkova thanks, can you please get us the output of the following (in the jenkins namespace):
oc logs dc/jenkins
oc get ev
That would be very helpful for debugging these, thanks.
We do have
---------- Get events ---------------
oc get events --sort-by='.lastTimestamp'
and I'll add oc logs dc/jenkins
@chmouel how is oc logs dc/jenkins different from oc logs jenkins-1-somehash?
Okay, right, we have all the information, so the issue is definitely this:
3m 58m 24950 jenkins-1-hrm5s.156f0a4d8afbbfd4 Pod Warning FailedScheduling default-scheduler persistentvolumeclaim "jenkins-home" not found
I think we saw that error before?
You (@ppitonak) or someone else chatted with @sthaha about it: when jenkins-home doesn't get created, it means the tenant creation or recreation hasn't been done properly and the volume is not there.
Somehow an error happens earlier (at tenant creation) where fabric8-tenant fails.
@alexeykazakov @sbose78 do you know how that flow works? fabric8-tenant should report somewhere when it cannot instantiate a resource? My bet is on a Gluster timeout not handled properly in fabric8-tenant.
(And, as with those sev1 issues, multiple problems end up mixed into one catch-all issue.)
The list of issues mentioning jenkins-home is quite long; here are just the open ones: https://github.com/openshiftio/openshift.io/issues?q=jenkins-home+is%3Aopen
I believe it's a duplicate of https://github.com/openshiftio/openshift.io/issues/4121. There is some logic in the test for resetting the tenant that needs to be reviewed properly; nobody addressed aslak's comment here:
https://github.com/openshiftio/openshift.io/issues/4121#issuecomment-410648540
There is a race somewhere in the test that prevents jenkins-home from being created. I would imagine this race shows up when Gluster is being 'slow' (which is something I have seen in my logs when doing some tests).
cc: @MatousJobanek
I was trying to find anything relevant to this issue in the tenant logs; unfortunately, I haven't found anything (which doesn't mean that there is nothing :-)).
there is a race somewhere in the test that prevents jenkins-home from being created
If the Reset, Clean & Apply is fast enough and OpenShift slow enough, then it can happen that some objects are still present in OS while the creation of new ones with the same name is requested. I see two options for fixing this:
1. Check whether the object is in the terminating state and wait until it is completely gone. However, I'm afraid that in the case of a PVC it can happen that it seems to be removed, but when the tenant tries to create a new one with the same name, it fails anyway.
2. If there is either a conflict or a 403 returned from OS for a PVC in the mjobanek-jenkins namespace (or any other one), then the tenant should try to create new namespaces with a different base name: mjobanek1-jenkins.
The second solution seems to be the safest way. Currently, I'm validating this against a local minishift instance.
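A minimal sketch of option 1, assuming a recent client-go (illustrative, not fabric8-tenant code): poll until the old PVC is truly gone before re-creating it. As noted above, even this can fail for PVCs, which is why option 2 looks safer.

```go
package main

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPVCGone polls until the named PVC is fully deleted (NotFound),
// so a re-create with the same name won't hit a conflict on an object
// that is still terminating.
func waitForPVCGone(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return wait.PollImmediateUntil(2*time.Second, func() (bool, error) {
		_, err := c.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // object is completely gone
		}
		return false, err // nil err => still present/terminating, keep polling
	}, ctx.Done())
}
```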
I am also facing the same issue as mentioned above, not able to mount the volume. Attaching a screenshot.
@Preeticp that's a different issue; this is not a case where the volume wasn't created, but one where mounting from Gluster times out. From my experience the dc comes up after a while. There is something planned that hopefully would fix this, but I am not sure where the issue for it is. cc @pbergene
@MatousJobanek I have seen this script floating around: https://github.com/sborenst/ansible_agnostic_deployer/blob/development/ansible/configs/ocp-workshop/files/wack_terminating_project.sh. Not sure if this call could be useful for us when resetting a tenant:
curl -s -k -H "Content-Type: application/json" -X PUT --data-binary @$tmp http://127.0.0.1:28001/api/v1/namespaces/$project/finalize > /dev/null
It could be useful to force termination.
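For reference, the client-go equivalent of that curl call would look roughly like this (a sketch assuming a recent client-go; note that clearing finalizers can orphan backing resources such as Gluster volumes):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceFinalize clears the namespace's finalizers via the /finalize
// subresource so a namespace stuck in Terminating can complete deletion.
// Use with care: whatever the finalizer was waiting on is skipped.
func forceFinalize(ctx context.Context, c kubernetes.Interface, name string) error {
	ns, err := c.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	ns.Spec.Finalizers = []corev1.FinalizerName{} // drop the "kubernetes" finalizer
	_, err = c.CoreV1().Namespaces().Finalize(ctx, ns, metav1.UpdateOptions{})
	return err
}
```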
Regarding "if there is either a conflict or a 403 returned from OS for a PVC in the mjobanek-jenkins namespace (or any other one), then the tenant should try to create new namespaces with a different base name: mjobanek1-jenkins": I am not sure I understand what you are suggesting; would that break some associations?
As for the wack_terminating_project.sh script: I'm not sure that force-terminating and expecting the original PVC to be removed immediately and correctly is the best solution.
As for whether it would break some associations: it shouldn't break anything. We are already using this logic for cases when there are two users with the same name. The TL;DR explanation: currently, the namespace names are constructed by taking the first part of the OS username (for mjobanek@redhat.com it is mjobanek) and joining it with the namespace suffix (for Jenkins it is mjobanek-jenkins). If there were another account mjobanek@ibm.com, it would conflict with the first one, so to solve it we change the base name for the namespaces to mjobanek2, which means mjobanek2-jenkins for the Jenkins namespace. This is only internal linking and no other component/service is affected by it.
Our case could be solved using the same logic: when the creation of the namespaces fails because of either a 409 or a 403, the tenant service could increment the suffix of the base name (mjobanek3) and try to create new namespaces with it. A sketch of this retry logic is below.
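A minimal sketch of that retry logic (illustrative Go, not actual fabric8-tenant code; the retry bound and helper name are my own):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createJenkinsNamespace tries <base>-jenkins, and on a 409 (already exists)
// or 403 (forbidden, e.g. a conflicting terminating object) bumps the base
// name to <base>2, <base>3, ... and retries with the new name.
func createJenkinsNamespace(ctx context.Context, c kubernetes.Interface, base string) (string, error) {
	for i := 1; i <= 5; i++ {
		name := base
		if i > 1 {
			name = fmt.Sprintf("%s%d", base, i) // mjobanek2, mjobanek3, ...
		}
		nsName := name + "-jenkins"
		_, err := c.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: nsName},
		}, metav1.CreateOptions{})
		if err == nil {
			return nsName, nil // persist `name` as the tenant's new base name
		}
		if !apierrors.IsAlreadyExists(err) && !apierrors.IsForbidden(err) {
			return "", err // some other failure: don't mask it by retrying
		}
	}
	return "", fmt.Errorf("could not find a free namespace base name for %s", base)
}
```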
@MatousJobanek sounds good, thanks for taking the time to describe how it works. I think that looks good!
As for force-terminating not being the best solution: I think the terminating state means the volume driver is 'recycling' the Gluster volumes and is stuck in that at the time; perhaps @jfchevrette or @pbergene can give us some advice here.
What's the latest status on this?
@ljelinkova @ppitonak Any updates about how the tests are performing now?
@piyush-garg We observed it twice in the last 24 hours, but there is a problem in Launcher at the moment, so many builds didn't get that far.
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4548/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1248/console
I have unassigned myself; this is not something we can solve from the build side. As stated before, it is either the test or fabric8-tenant that can fix this.
Issue Overview
When a user navigates to http://jenkins.openshift.io, they see an error page instead of the Jenkins UI.
Expected Behaviour
Jenkins UI is displayed
Current Behaviour
Error message
Steps To Reproduce
Additional Information
We saw it twice on us-east-1a-beta
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1634/ Nov 28, 2018 4:35:00 PM UTC https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1635/ Nov 28, 2018 6:35:00 PM UTC
We saw a similar bug for api.openshift.io: http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2-released/14962/01-01-afterEach.png Nov 8, 2018 4:28:00 AM UTC http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2a-released/14953/01-01-afterEach.png Nov 8, 2018 4:32:00 AM UTC