Idler is going crazy when there is no dc/jenkins in -jenkins namespace

chmouel commented 6 years ago

Getting TOO_MANY_REDIRECTS when there is no dc in -jenkins namespace (manual reset for example)

hrishin commented 6 years ago

bananas ?

chmouel commented 6 years ago

Weird: when trying this locally I am getting the right error handling:

ppitonak commented 6 years ago

This error breaks E2E tests quite often

http://artifacts.ci.centos.org/devtools/e2e/devtools-saas-openshiftio-e2e-smoketest-released/129/05-03-jenkins-direct-log.png
http://artifacts.ci.centos.org/devtools/e2e/devtools-saas-openshiftio-e2e-smoketest-beta/127/05-03-jenkins-direct-log.png
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/653/05-01-jenkins-log-failed.png

@chmouel why did you decrease severity?

ppitonak commented 6 years ago

Could this be related?

chmouel commented 6 years ago

@ppitonak any chances you can give us a bit more context to this error message?

ppitonak commented 6 years ago

@chmouel everything looks the same as in the too-many-redirects case, but the actual content of http://jenkins.openshift.io is what you see in the screenshot

E2E test does the following:

create a space
create a new Spring Boot Http app from booster
switch to codebases page
start new Che workspace
switch to pipelines page
wait for build to start - View log link doesn't appear in UI
navigate to jenkins.openshift.io - see the screenshot above

sthaha commented 6 years ago

Could this be related?

@chmouel seems to me like an openshift n/w issue again - 5432 AFAICT is the postgres port.

sthaha commented 6 years ago

@ppitonak could you please explain why this is a P1 ?

namespaces missing DC for Jenkins must be a very rare case and is related to E2E tests resetting workspaces. How are you validating if the reset environment has worked?

chmouel commented 6 years ago

But that issue is only about resetting the environment; not sure if that can be PostgreSQL related.

Agree with you @sthaha: oc delete all is only for people who have -edit rights on -jenkins and who do that kind of thing without going through the tenant reset call.

chmouel commented 6 years ago

I tried again to confirm, and looking at the logs it is indeed the deletion that causes the issue:

% oc delete all --all -n $T-jenkins
replicationcontroller "jenkins-8" deleted
replicationcontroller "jenkins-9" deleted
service "bayesian-link" deleted
service "jenkins" deleted
service "jenkins-jnlp" deleted
deploymentconfig.apps.openshift.io "jenkins" deleted
route.route.openshift.io "jenkins" deleted

access jenkins.openshift.io

{"cluster":"https://api.starter-us-east-2.openshift.com/","component":"proxy","level":"info","msg":"found ns : \"cboudjna2-jenkins\", cluster: \"https://api.starter-us-east-2.openshift.com/\"","ns":"cboudjna2-jenkins","part":"token_json","request-hash":3682938397,"time":"2018-10-15T08:15:17Z"}
{"cluster":"https://api.starter-us-east-2.openshift.com/","component":"proxy","level":"info","msg":"Fetched OSO token from OSIO token","ns":"cboudjna2-jenkins","part":"token_json","request-hash":3682938397,"time":"2018-10-15T08:15:17Z"}
{"cluster":"https://api.starter-us-east-2.openshift.com/","component":"proxy","level":"error","msg":"Error when starting Jenkins: 2: openshift client error: got status 404 Not Found (404) from https://api.starter-us-east-2.openshift.com/oapi/v1/namespaces/cboudjna2-jenkins/deploymentconfigs/jenkins","ns":"cboudjna2-jenkins","part":"token_json","request-hash":3682938397,"time":"2018-10-15T08:15:17Z"}
{"component":"proxy","level":"info","msg":"returned: |key: \"\" |ns: \"cboudjna2-jenkins\" |fwd: false|","request-hash":3682938397,"time":"2018-10-15T08:15:17Z"}

chmouel commented 6 years ago

This code [here](https://github.com/fabric8-services/fabric8-jenkins-proxy/blob/47eb3e77936aef48428968d7781a6b8d95a2738a/internal/proxy/ui_requests.go#L66) would do this:

        // we don't care about code here since only the state of jenkins pod -
        // running or not is what is relevant
        state, _, err := p.startJenkins(ns, clusterURL)
        if err != nil {
            nsLogger.Errorf("Error when starting Jenkins: %s", err)
            http.Redirect(w, r, redirectURL.String(), http.StatusTemporaryRedirect)
            return
        }

which, in the case of a NotFound, would redirect forever. The redirect is meant for the timeout case, isn't it? Shouldn't we distinguish between a 404 and other errors?

@sthaha @kishansagathiya

sthaha commented 6 years ago

@chmouel good find! Any thoughts in case of 404 what the proxy should do?

chmouel commented 6 years ago

Return a 404? The user would then know that "jenkins dc cannot be found".

chmouel commented 6 years ago

@ppitonak We are working on that issue,

But it came to my attention that it has been rated as critical for running the E2E tests. From my understanding of the chat we had on Mattermost, you don't do a setUp or tearDown of the environment (i.e. resetting the env). Can you please confirm?

Because unless you do an oc delete all --all inside the $USER-jenkins tenant without a call to the fabric8-tenant service to recreate the objects, you would not see this issue (as per my paste earlier).

The REDIRECT error can occur in different scenarios, which we should track down (and then work on as priority one). To help us with the debugging, please provide all of the following:

jenkins build log (if there is one)
oc get ev inside your jenkins namespace
developer console network activity if the error comes from jenkins
oc logs dc/jenkins

I am going to set this issue as p3 as it should be, please feel free to convince me otherwise,

ppitonak commented 6 years ago

@chmouel we reset the environment (i.e. click "Erase My OpenShift.io Environment" on https://openshift.io/myuser/_cleanup) after the test run

chmouel commented 6 years ago

@ppitonak okay, this could also be an issue with the tenant service.

So can you please let us know whether there are any objects (i.e. the output of oc get all -n $USER-jenkins) before the start of a test run? It would also be very useful to have the output of oc get ev -n $USER-jenkins and oc logs dc/jenkins -n $USER-jenkins, plus the time the test was started, so we can pin down your issue and correlate it with the idler logs.

Thanks a lot,

ppitonak commented 6 years ago

I will provide all data when I see this error next time.

ppitonak commented 6 years ago

We have a failed job

ppitonak commented 6 years ago

I am going to set this issue as p3 as it should be, please feel free to convince me otherwise,

Setting the priority back to P1 because it caused a PR check on saas-openshiftio to fail, which affects the whole OSIO team.

chmouel commented 6 years ago

EDIT: Removing my previous comment about the tests not being run in the jenkins namespace, which they actually are.

So after chatting, it seems that the tenant service doesn't properly recreate the jenkins namespace after an environment reset has been done. Maybe the tenant service has an issue, or the call made by the tests to the UI wasn't done properly.

Either way it's completely different from this issue,

Can we please continue in a new issue, so we don't confuse the fabric8-tenant team, to whom this needs to be assigned?

more details here https://chat.openshift.io/developers/pl/czt3dscodbde8qoabubxqxrxny

chmouel commented 6 years ago

Removing P1 from this issue as this should go to a new one,

chmouel commented 6 years ago

@ppitonak maybe related to your issue https://github.com/openshiftio/openshift.io/issues/4121#issuecomment-410648540

chmouel commented 6 years ago

We have merged https://github.com/fabric8-services/fabric8-jenkins-proxy/pull/334#issuecomment-432577825 and now errors should be more explicit, i.e. for our issue, when there is no dc inside the jenkins namespace, we are now showing a 500:

@ppitonak this may affect your tests (in a good way): you are not going to get a redirect loop but an error message. Let us know how it goes.

openshiftio / openshift.io

Idler is going crazy when there is no dc/jenkins in -jenkins namespace #4180