ppitonak opened this issue 5 years ago
@ppitonak I am not able to reproduce this issue from my account.
Happened again yesterday, again on us-east-1a with the same account.
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1647/ Nov 29, 2018 6:35:00 PM
Out of 22 runs, 3 failed with this error. The job runs every 2 hours.
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1656/ Nov 30, 2018 12:35:00 PM
... again the same cluster
I've seen this error on all clusters 1-2 times during the weekend, for example:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1159/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1152/console
There is one thing common in the oc logs:
oc get all
NAME READY STATUS RESTARTS AGE
pod/jenkins-1-deploy 0/1 DeadlineExceeded 0 3h
Usually, there are no events in the log, maybe because the jobs run only every 4 hours. However, in one job, I've seen quite a lot of log events like this:
Error creating: pods "jenkins-1-j5j6z" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
It seems like the idler is trying to unidle Jenkins, but the new pod could not come up because of the resource quota, while the old pod is stuck in the DeadlineExceeded state.
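To make that failure mode concrete, here is a minimal sketch (assuming a recent client-go; hasQuotaHeadroom is a hypothetical helper, not actual idler code) of the kind of pre-flight check the unidling path could do against the compute-resources quota:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasQuotaHeadroom reports whether the "compute-resources" quota in the
// user's Jenkins namespace still has room for another pod's CPU/memory
// limits. If used == hard (as in the events above), a freshly unidled
// pod can never be created until the old one is cleaned up.
func hasQuotaHeadroom(ctx context.Context, c kubernetes.Interface, ns string) (bool, error) {
	q, err := c.CoreV1().ResourceQuotas(ns).Get(ctx, "compute-resources", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, r := range []corev1.ResourceName{corev1.ResourceLimitsCPU, corev1.ResourceLimitsMemory} {
		hard, used := q.Status.Hard[r], q.Status.Used[r]
		if used.Cmp(hard) >= 0 {
			return false, nil // quota exhausted: the stuck pod must be removed first
		}
	}
	return true, nil
}
```

If the check fails, the idler would need to clean up the DeadlineExceeded deployer pod (or scale the old RC down) before trying to unidle again.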
what is the next step?
This issue still occurs and affects the E2E tests, so I'd also like to see some suggestions on what we could do about it.
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1169/console
@chmouel @stevengutz @ldimaggi
This log also looks interesting: http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1724/oc-jenkins-logs.txt
Extract:
oc get all
NAME READY STATUS RESTARTS AGE
pod/jenkins-1-xz7b7 0/1 Terminating 0 1m
.......
kubelet, ip-172-21-55-225.ec2.internal Killing container with id docker://jenkins:Need to kill Pod
1m 1m 1 jenkins-1-xz7b7.156da5ae712564e6 Pod Normal Scheduled default-scheduler Successfully assigned osio-ci-e2e-002-jenkins/jenkins-1-xz7b7 to ip-172-21-50-152.ec2.internal
1m 1m 1 jenkins-1.156da5ae6ef05c05 ReplicationController Normal SuccessfulCreate replication-controller Created pod: jenkins-1-xz7b7
1m 1m 1 jenkins-1.156da5ae77087b9d ReplicationController Normal SuccessfulDelete replication-controller Deleted pod: jenkins-1-xz7b7
1m 22m 2 jenkins.156da48c68a94215 DeploymentConfig Normal ReplicationControllerScaled deploymentconfig-controller Scaled replication controller "jenkins-1" from 1 to 0
1m 1m 1 jenkins-1.156da5aea63ff670 ReplicationController Warning FailedCreate replication-controller Error creating: pods "jenkins-1-r9xt7" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
@ljelinkova This is hitting the resource quota for sure. But it's not showing any other pod, so I don't know exactly how it's reaching the quota. This is something we need to debug.
@piyush-garg The number of failed e2e tests is increasing, so please give it a high priority.
Happened twice in the last 40 minutes: https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4272/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1181/console
Jenkins seems to unidle properly and start the job, but it failed to promote the build, and accessing the Jenkins UI directly resulted in the error this issue is about.
Jenkins pod is still running and there are no unusual OpenShift events.
pod/jenkins-1-sv9bp 1/1 Running 0 21m
This part of the Jenkins pod log could be useful:
INFO: Terminating Kubernetes instance for agent jenkins-slave-2b956-qb9ft
Dec 06, 2018 3:18:50 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: Computer.threadPoolForRemoting [#32] for jenkins-slave-2b956-qb9ft terminated
java.nio.channels.ClosedChannelException
at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
at hudson.remoting.Channel.close(Channel.java:1450)
at hudson.remoting.Channel.close(Channel.java:1403)
at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:821)
at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:105)
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:737)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Another failure, the same story. There were three successful runs before this one, so it's probably not caused by something broken in a previous run. https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4274/
I have more information about what is going on.
WARNING: Computer.threadPoolForRemoting [#34] for jenkins-slave-fzpvb-tvc99 terminated
SEVERE https://jenkins.api.prod-preview.openshift.io/api/jenkins/start - Failed to load resource: the server responded with a status of 500 (Internal Server Error)
pod/jenkins-1-q9gfr 1/1 Running 0 1h
Then https://github.com/openshiftio/openshift.io/issues/3802 comes into play.
pod/jenkins-1-q9gfr 1/1 Running 0 3h
pod/jenkins-1-q9gfr 1/1 Running 0 19h
Result: when I click "See additional details in Jenkins", I see the following error in the browser instead of the Jenkins UI:
{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}
@ppitonak where are we seeing the following log?
{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}
@ppitonak yeah, can see now with prod-preview Jenkins.
The token was updated. Should be fixed by now. Please reopen if it's still not working for you.
Reopening because the fix did not work and it may not be the root cause of this issue.
@ljelinkova Is it still happening on prod-preview or east-2a cluster?
I can access the Jenkins instance that is on east-2a.
The issue is still that we are getting an invalid token for free-stg. We (the idler) get the clusters from auth via the /api/clusters REST path. On prod-preview we get tokens for two clusters, free-stg and east-2a.
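For debugging, a rough sketch of that flow from the outside (the auth base URL, the AUTH_SA_TOKEN env var, and the JSON field names are my assumptions about the payload, not confirmed API details): fetch the cluster list from /api/clusters and probe each cluster with the token it returns, which is the same check the oc login commands below do manually.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Shape of the /api/clusters response; these field names are a guess at
// the auth service payload, not confirmed against its API docs.
type clusterList struct {
	Data []struct {
		APIURL string `json:"api-url"`
		Token  string `json:"token"`
	} `json:"data"`
}

func main() {
	// Hypothetical: authenticate to auth with a service-account token.
	saToken := os.Getenv("AUTH_SA_TOKEN")
	req, _ := http.NewRequest("GET", "https://auth.prod-preview.openshift.io/api/clusters", nil)
	req.Header.Set("Authorization", "Bearer "+saToken)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list clusterList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}

	for _, c := range list.Data {
		// /oapi/v1/users/~ is the OpenShift 3.x "whoami" endpoint; a 401 here
		// is the same failure oc login reports as "token ... invalid or expired".
		probe, _ := http.NewRequest("GET", c.APIURL+"oapi/v1/users/~", nil) // api-url ends with "/"
		probe.Header.Set("Authorization", "Bearer "+c.Token)
		r, err := http.DefaultClient.Do(probe)
		if err != nil {
			fmt.Println(c.APIURL, "error:", err)
			continue
		}
		r.Body.Close()
		fmt.Println(c.APIURL, "->", r.Status) // 200 = token OK, 401 = bad token
	}
}
```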
Manually using the token provided by auth for east-2a works:
% oc login https://api.starter-us-east-2a.openshift.com/ --token=XXXXX
Logged into "https://api.starter-us-east-2a.openshift.com:443" as "devtools-sre" using the token provided.
You have access to 9120 projects, the list has been suppressed. You can list all projects with 'oc projects'
The token for free-stg is invalid:
% oc login https://api.free-stg.openshift.com/ --token=XXXX
error: The token provided is invalid or expired.
So we really need a proper token for free-stg. Having said that, we should handle these failures gracefully in the idler instead of retrying billions of times and then running out of memory in the pods. I have logged an issue to track that.
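A minimal sketch of what "gracefully" could mean here, assuming a hypothetical op callback rather than real idler code: capped exponential backoff with jitter and a retry budget, so a permanently invalid token degrades into an occasional probe instead of a hot loop.

```go
package main

import (
	"math/rand"
	"time"
)

// retryWithBackoff retries op with capped exponential backoff plus jitter,
// giving up after a fixed budget instead of retrying forever.
func retryWithBackoff(op func() error) error {
	const (
		maxRetries = 8
		baseDelay  = 2 * time.Second
		maxDelay   = 5 * time.Minute
	)
	delay := baseDelay
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Jitter spreads retries out so workers for many tenants
		// don't hammer the cluster API in lockstep.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay/2))))
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return err // surface the last error, e.g. mark the cluster unhealthy
}
```

The same budget could also gate whole-cluster checks, so a cluster with a bad token is probed once per backoff window rather than once per tenant.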
A new token was deployed by the SRE team and it seems to work; we don't see the 401 errors in the idler logs anymore.
@ppitonak @ljelinkova Can you please check if the issue has been resolved now? Thanks
I just started the e2e tests and it seems like Jenkins is working fine.
URLs:
http://artifacts.ci.centos.org/devtools/e2e/devtools-saas-openshiftio-e2e-smoketest-released/393/
http://artifacts.ci.centos.org/devtools/e2e/devtools-saas-openshiftio-e2e-smoketest-beta/387/
@piyush-garg we are monitoring it closely but it's too soon to judge. Let's give it a couple of hours before closing this issue.
Thanks, @ppitonak. No problem. Judging from the last run it seems to be fixed, but close it whenever you are sure about it. Thanks a lot.
This issue has NOT been caused by an expired token. We have new occurrences:
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4372/console https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4377/console https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4373/console
The same in production.
Do you have the link to the openshift logs?
I currently have an issue on prod with Gluster timing out:
45s 45s 1 jenkins-5-w4p6m.156f3e3c44242486 Pod Warning FailedMount kubelet, ip-172-31-66-195.us-east-2.compute.internal Unable to mount volumes for pod "jenkins-5-w4p6m_cboudjna-jenkins(6ea9ed2e-fd29-11e8-a5d6-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "cboudjna-jenkins"/"jenkins-5-w4p6m". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]
34s 34s 1 jenkins-5-xnbjf.156f3e3eaeaa0631 Pod Warning FailedMount kubelet, ip-172-31-66-159.us-east-2.compute.internal Unable to mount volumes for pod "jenkins-5-xnbjf_cboudjna-jenkins(74ce6aff-fd29-11e8-a5d6-02d7377a4b17)": timeout expired waiting for volumes to attach or mount for pod "cboudjna-jenkins"/"jenkins-5-xnbjf". list of unmounted volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]. list of unattached volumes=[jenkins-home jenkins-config jenkins-token-lbwxh]
If that's the log, this seems to correlate, although the error seems different:
3m 58m 24950 jenkins-1-hrm5s.156f0a4d8afbbfd4 Pod Warning FailedScheduling default-scheduler persistentvolumeclaim "jenkins-home" not found
Can you show me where the output of oc logs dc/jenkins is, please? I'd like to know why the readiness probe failed.
The deployment config section starts with
---------- Jenkins deployment config ---------------
oc get -o yaml dc/jenkins
If you want us to add any other command to the script, just specify what we should add.
@ljelinkova thanks, can you please get us the output of the following (in the jenkins namespace):
oc logs dc/jenkins
oc get ev
That would be very helpful for debugging these, thanks.
We do have
---------- Get events ---------------
oc get events --sort-by='.lastTimestamp'
and I'll add oc logs dc/jenkins
@chmouel how is oc logs dc/jenkins different from oc logs jenkins-1-somehash?
Okay, right, we have all the information, so the issue is definitely this:
3m 58m 24950 jenkins-1-hrm5s.156f0a4d8afbbfd4 Pod Warning FailedScheduling default-scheduler persistentvolumeclaim "jenkins-home" not found
I think we saw that error before?
You (@ppitonak) or someone else chatted with @sthaha about it: when jenkins-home doesn't get created, it means the tenant creation or recreation hasn't been done properly and the volume is not there.
Somehow an error happens earlier (at tenant creation) where fabric8-tenant fails.
@alexeykazakov @sbose78 do you know how that flow works? fabric8-tenant should report somewhere when it cannot instantiate a resource? My bet is on a Gluster timeout not handled properly in fabric8-tenant.
(And, as with those sev1 issues, multiple problems end up mixed into one catch-all issue.)
The list of issues mentioning jenkins-home is quite long; here are just the open ones: https://github.com/openshiftio/openshift.io/issues?q=jenkins-home+is%3Aopen
I believe it's a duplicate of https://github.com/openshiftio/openshift.io/issues/4121. There is some logic in the test for resetting the tenant that needs to be reviewed properly; nobody addressed aslak's comment here:
https://github.com/openshiftio/openshift.io/issues/4121#issuecomment-410648540
There is a race somewhere in the test that prevents jenkins-home from being created. I would imagine this race shows up when Gluster is being 'slow' (which is something I have seen in my logs when doing some tests).
cc: @MatousJobanek
I was trying to find anything relevant to this issue in the tenant logs; unfortunately, I haven't found anything (which doesn't mean that there is nothing :-)).
there is a race somewhere in the test that prevents jenkins-home from being created
If the Reset, Clean & Apply is fast enough and OpenShift slow enough, then it can happen that some objects are still present in OS while the creation of new ones with the same name is requested. I see two options for fixing this:
1. Check whether the object is in the terminating state and wait until it is completely gone. However, I'm afraid that in the case of a PVC it can happen that it seems to be removed, but when the tenant tries to create a new one with the same name, it fails anyway.
2. If there is either a conflict or a 403 returned from OS for a PVC in the mjobanek-jenkins namespace (or any other one), then the tenant should try to create new namespaces with a different base name: mjobanek1-jenkins.
The second solution seems to be the safest way. Currently, I'm validating this against a local minishift instance.
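A minimal sketch of option 1, assuming a recent client-go (illustrative, not fabric8-tenant code): poll until the old PVC is truly gone before re-creating it. As noted above, even this can fail for PVCs, which is why option 2 looks safer.

```go
package main

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPVCGone polls until the named PVC is fully deleted (NotFound),
// so a re-create with the same name won't hit a conflict on an object
// that is still terminating.
func waitForPVCGone(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return wait.PollImmediateUntil(2*time.Second, func() (bool, error) {
		_, err := c.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // object is completely gone
		}
		return false, err // nil err => still present/terminating, keep polling
	}, ctx.Done())
}
```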
I am also facing the same issue as mentioned above, not able to mount the volume. Attaching a screenshot.
@Preeticp that's a different issue; this is not a case where the volume wasn't created, but one where mounting from Gluster times out. From my experience the dc comes up after a while. There is something planned that hopefully would fix this, but I am not sure where the issue for it is. cc @pbergene
@MatousJobanek I have seen this script floating around: https://github.com/sborenst/ansible_agnostic_deployer/blob/development/ansible/configs/ocp-workshop/files/wack_terminating_project.sh. Not sure if this call could be useful for us when resetting a tenant:
curl -s -k -H "Content-Type: application/json" -X PUT --data-binary @$tmp http://127.0.0.1:28001/api/v1/namespaces/$project/finalize > /dev/null
It could be useful to force termination.
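For reference, the client-go equivalent of that curl call would look roughly like this (a sketch assuming a recent client-go; note that clearing finalizers can orphan backing resources such as Gluster volumes):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceFinalize clears the namespace's finalizers via the /finalize
// subresource so a namespace stuck in Terminating can complete deletion.
// Use with care: whatever the finalizer was waiting on is skipped.
func forceFinalize(ctx context.Context, c kubernetes.Interface, name string) error {
	ns, err := c.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	ns.Spec.Finalizers = []corev1.FinalizerName{} // drop the "kubernetes" finalizer
	_, err = c.CoreV1().Namespaces().Finalize(ctx, ns, metav1.UpdateOptions{})
	return err
}
```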
Regarding "if there is either a conflict or a 403 returned from OS for a PVC in the mjobanek-jenkins namespace (or any other one), then the tenant should try to create new namespaces with a different base name: mjobanek1-jenkins": I am not sure I understand what you are suggesting; would that break some associations?
As for the wack_terminating_project.sh script: I'm not sure that force-terminating and expecting the original PVC to be removed immediately and correctly is the best solution.
As for whether it would break some associations: it shouldn't break anything. We are already using this logic for cases when there are two users with the same name. The TL;DR explanation: currently, the namespace names are constructed by taking the first part of the OS username (for mjobanek@redhat.com it is mjobanek) and joining it with the namespace suffix (for Jenkins it is mjobanek-jenkins). If there were another account mjobanek@ibm.com, it would conflict with the first one, so to solve it we change the base name for the namespaces to mjobanek2, which means mjobanek2-jenkins for the Jenkins namespace. This is only internal linking and no other component/service is affected by it.
Our case could be solved using the same logic: when the creation of the namespaces fails because of either a 409 or a 403, the tenant service could increment the suffix of the base name (mjobanek3) and try to create new namespaces with it. A sketch of this retry logic is below.
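A minimal sketch of that retry logic (illustrative Go, not actual fabric8-tenant code; the retry bound and helper name are my own):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createJenkinsNamespace tries <base>-jenkins, and on a 409 (already exists)
// or 403 (forbidden, e.g. a conflicting terminating object) bumps the base
// name to <base>2, <base>3, ... and retries with the new name.
func createJenkinsNamespace(ctx context.Context, c kubernetes.Interface, base string) (string, error) {
	for i := 1; i <= 5; i++ {
		name := base
		if i > 1 {
			name = fmt.Sprintf("%s%d", base, i) // mjobanek2, mjobanek3, ...
		}
		nsName := name + "-jenkins"
		_, err := c.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: nsName},
		}, metav1.CreateOptions{})
		if err == nil {
			return nsName, nil // persist `name` as the tenant's new base name
		}
		if !apierrors.IsAlreadyExists(err) && !apierrors.IsForbidden(err) {
			return "", err // some other failure: don't mask it by retrying
		}
	}
	return "", fmt.Errorf("could not find a free namespace base name for %s", base)
}
```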
@MatousJobanek sounds good, thanks for taking the time to describe how it works. I think that looks good!
As for force-terminating not being the best solution: I think the terminating state means the volume driver is 'recycling' the Gluster volumes and is stuck in that at the time; perhaps @jfchevrette or @pbergene can give us some advice here.
What's the latest status on this?
@ljelinkova @ppitonak Any updates about how the tests are performing now?
@piyush-garg We observed it twice in the last 24 hours, but there is a problem in Launcher at the moment, so many builds didn't get that far.
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4548/console https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1248/console
I have unassigned myself; this is not something we can solve from the build side. As stated before, it is either the test or fabric8-tenant that can fix this.
Issue Overview
When a user navigates to http://jenkins.openshift.io, they see an error page instead of the Jenkins UI.
Expected Behaviour
Jenkins UI is displayed
Current Behaviour
Error message
Steps To Reproduce
Additional Information
We saw it twice on us-east-1a-beta
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1634/ Nov 28, 2018 4:35:00 PM UTC https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1635/ Nov 28, 2018 6:35:00 PM UTC
We saw a similar bug for api.openshift.io: http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2-released/14962/01-01-afterEach.png Nov 8, 2018 4:28:00 AM UTC http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2a-released/14953/01-01-afterEach.png Nov 8, 2018 4:32:00 AM UTC