openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

Investigate what is required to help fabric8-ui CI pipeline to get faster and better #1933

Closed rupalibehera closed 6 years ago

rupalibehera commented 6 years ago
ldimaggi commented 6 years ago

See related issues: https://github.com/openshiftio/openshift.io/issues/1926 https://github.com/openshiftio/openshift.io/issues/1904

michaelkleinhenz commented 6 years ago

Experiencing random failures on Planner UI builds as well. The failures come up in different stages.

Examples: failures even before npm kicks in, timeouts while npm is resolving dependencies, timeouts in the functional tests (while interacting with the Chrome instance), or failures with no visible reason (the log just stops without any error or exit code, or the step produces an empty log).

michaelkleinhenz commented 6 years ago

This looks familiar: https://github.com/npm/npm/issues/9884
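
If registry timeouts are part of it, bumping npm's retry settings in the slave image might at least reduce the flakiness; a rough .npmrc sketch (the values are guesses, not something we have validated on this CI):

# .npmrc on the Jenkins slave image - retry harder on flaky registry connections
fetch-retries=5
fetch-retry-mintimeout=20000
fetch-retry-maxtimeout=120000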

michaelkleinhenz commented 6 years ago

Example PR build fail: https://jenkins.cd.test.fabric8.io/blue/organizations/jenkins/fabric8-ui%2Ffabric8-planner/detail/PR-2419/3/pipeline

chmouel commented 6 years ago

If we have random failures, I think we need some kind of profiling:

jarifibrahim commented 6 years ago

Npm install failed without any error - https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2417/11/console

michaelkleinhenz commented 6 years ago

This is now serious for us. We're blocked because builds fail more often than they succeed. We need this infra to work reliably; a whole team is wasting a lot of time hitting the rebuild button. We need to solve this.

maxandersen commented 6 years ago

I'm upgrading this to sev1 as it does not really matter where the issue is rooted - several teams are affected by "random failures".

If these issues are really caused by exhausted resources on the build server, then let's get the resources increased to see if it gets better - that must be cheaper than staying stuck on this.

If it requires more investigation, then let's please get it listed here and worked through.

chmouel commented 6 years ago

I guess the only option we have here is to increase the memory and CPU; I can go from:

image

to :

image

I don't see where the associated cost shows up, though.

I'll do that just after lunch unless someone objects.
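
For the record, a resize like this on GKE typically means adding a bigger node pool and draining the old one; a rough sketch (cluster/pool names and the machine type below are placeholders, not our actual setup):

# create a larger node pool, then move workloads off the old one
gcloud container node-pools create bigger-pool --cluster=our-ci-cluster --machine-type=n1-standard-8 --num-nodes=3
kubectl cordon <old-node>
kubectl drain <old-node> --ignore-daemonsets --delete-local-data
gcloud container node-pools delete default-pool --cluster=our-ci-cluster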

sbose78 commented 6 years ago

@chmouel, I'm guessing it gets billed to the account automatically. Inform the planner team so that they can quickly try things out after you make the change? @pranavgore09 @michaelkleinhenz ?

jarifibrahim commented 6 years ago

Purely for experimentation, I ran the first 3 stages (which include building and testing) of the fabric8-planner build on Travis-CI [0].

Here's what I found out -

  1. The average running time (for the first 3 stages) is about 5 minutes on Travis-CI and about 18 minutes on jenkins.cd.test.fabric8.io (Picture 1 - average build time on jenkins.cd.test.fabric8.io; Picture 2 - build time of planner on Travis-CI).
  2. I ran 10 builds on Travis-CI and none of them failed. On the other hand, I have not yet been able to get 5 consecutive successful builds on jenkins.cd.test.fabric8.io

[0] - https://travis-ci.org/jarifibrahim/fabric8-planner (check only the travis-test branch; the other branch failed since it didn't have a .travis.yml file)
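
For completeness, the travis-test branch only needed a minimal .travis.yml; a sketch of what such a config can look like (the stage commands here are assumptions, not necessarily the exact file I used):

# minimal .travis.yml for the experiment
language: node_js
node_js:
  - "8"
script:
  - npm install
  - npm run build
  - npm test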

bartoszmajsak commented 6 years ago

Is this really a whole build of fabric8-planner and you literally have 7 specs? Something is really odd here... also from the perspective of the time needed to execute it (even on travis).

rupalibehera commented 6 years ago

@chmouel, I am not sure if we can resize this same instance, or whether we have enough privileges for that. Should we ask @pbergene and @jfchevrette? They both have more access than us.

sthaha commented 6 years ago

@chmouel I am curious what we are resizing here - the Jenkins master or the slave? Isn't the job run on the slave? Can we get a measurement of the CPU and memory utilisation of the slave while the job is running?
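
If metrics are wired up on the cluster (heapster/metrics-server), even something rough like this while a job runs would give us numbers (namespace and names below are placeholders):

# rough utilisation check while a PR build is running
kubectl top nodes
kubectl top pods -n ci | grep jenkins-slave
kubectl describe node <node-running-the-slave> | grep -A5 'Allocated resources'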

rupalibehera commented 6 years ago

This is a resize of the complete cluster, where other services also run: configmapcontroller, content-repository, elasticsearch, exposecontroller, fabric8-docker-registry, hubot-mattermost, jenkins, jenkins-slave-*, kibana, mattermost-db, mattermost, nexus.

jarifibrahim commented 6 years ago

@bartoszmajsak

Is this really a whole build of fabric8-planner and you literally have 7 specs? Something is really odd here... also from the perspective of the time needed to execute it (even on travis).

Yes, these are the UI smoke tests. Since these tests perform actions such as clicking on buttons, they take some time.

chmouel commented 6 years ago

So after spending some time on it:

Increasing the memory available on the cluster may help, but if a job is taking an insane amount of memory (which we will track), then we will just add a limit on the containers it runs in.
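
By a limit I mean the usual Kubernetes resource requests/limits on the slave containers, roughly like this (the numbers are placeholders; we would pick real ones from the graphs):

# sketch of per-container limits for a Jenkins slave pod
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"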

We will do the memory increase of the cluster later this afternoon (EU time)

Thanks,

chmouel commented 6 years ago

I have just increased the memory/CPU available; let me know if that makes things better. We will monitor the graphs as well:

(screenshot: resource graphs after the increase)

xyntrix commented 6 years ago

Following up on open SEV1 items.

Thanks!

rupalibehera commented 6 years ago

@mmclanerh, I guess we can remove the SEV1 label. After the upgrade there is a lot of improvement in the success rate of the PR jobs; we can make this SEV2 and keep it open for some monitoring from the build team side.

cc @maxandersen @michaelkleinhenz

rupalibehera commented 6 years ago

From some monitoring yesterday via the GKE console I could see that UI pods were in an error state (screenshot: ui-pod-slave); my guess is that this is due to high resource usage.

I will try to reproduce this situation to get some more details.
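
When it happens again, roughly this is what I plan to capture (pod names and namespace below are placeholders):

# capture the state of failing slave pods
kubectl get pods -n ci | grep -v Running
kubectl describe pod <failing-ui-slave-pod> -n ci
kubectl logs <failing-ui-slave-pod> -n ci --previous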

pranavgore09 commented 6 years ago

Yes, I can see improvements in builds at https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2426/ - the success rate for this build was much higher after the upgrade. The few failures after build #36 were caused by a code change of mine and were expected, but the build still goes through. Sometimes I see that slaves are not available for ~15-20 minutes, but the job later gets a chance to run and the build continues. I did see one random failure for #40 (https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2426/40/) but it worked well afterwards.

jarifibrahim commented 6 years ago

Planner builds are more stable now. The time taken to run the tests has reduced from ~6 minutes to ~3 minutes

rupalibehera commented 6 years ago

I was monitoring the Jenkins instance and tried to reproduce the behaviour that leaves the slaves in an error state. I triggered 5 different fabric8-planner UI PR builds simultaneously. My observations are below:

From the above observations I can conclude that:

Below are some screenshots of the above observations:

cc: @pradeepto @chmouel @pranavgore09

pradeepto commented 6 years ago

Thanks @rupalibehera for the detailed report.

cc @jfchevrette @maxandersen

joshuawilson commented 6 years ago

Thanks @rupalibehera, that confirms what I have long suspected.

pranavgore09 commented 6 years ago

thanks @rupalibehera for finding this

debloper commented 6 years ago

@rupalibehera nice and detailed observation. I'd like to amend some of the points you mentioned:

A host running 12 other critical services while also doing this isn't ideal at all. The spec mentioned is quite similar to our work laptops, but I'm not running 12 other live services on mine, and I can't imagine how my system would handle it if I did. Some sort of encapsulation/separation of concerns has to be in place here, so we don't put all the eggs in one basket.

joshuawilson commented 6 years ago

@dgutride and I are actively working on cleaning up and optimizing the UI build. Once we have that worked out in fabric8-ui we will provide those changes/fixes to the other UI repos.

pbergene commented 6 years ago

Downgraded severity after no objections on triage.

michaelkleinhenz commented 6 years ago

New behaviour: I have crashed builds when moving around files:

https://jenkins.cd.test.fabric8.io/blue/organizations/jenkins/fabric8-ui%2Ffabric8-planner/detail/PR-2419/32/pipeline

npm ERR! path /home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/lodash
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! syscall rename
npm ERR! enoent ENOENT: no such file or directory, rename '/home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/lodash' -> '/home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/.lodash.DELETE'
npm ERR! enoent This is related to npm not being able to find a file.
npm ERR! enoent 
npm ERR! A complete log of this run can be found in:
npm ERR!     /home/jenkins/.npm/_logs/2018-02-19T12_43_00_159Z-debug.log
script returned exit code 254

I can't reproduce this locally, but it is reproducible on CI.

jarifibrahim commented 6 years ago

@michaelkleinhenz This is a known issue with npm https://github.com/npm/npm/issues/17444

michaelkleinhenz commented 6 years ago

@jarifibrahim that thread says removing package-lock.json solves the issue. I just don't know what to say about all of this anymore. It is just so sad.

jarifibrahim commented 6 years ago

@michaelkleinhenz We should try Yarn https://yarnpkg.com/en/ for once.
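
If we try it, the drop-in would roughly be the following (this assumes we commit a yarn.lock first; the exact CI wiring would still need to be worked out):

# rough yarn equivalents of the npm install/test steps
yarn install --frozen-lockfile
yarn test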

michaelkleinhenz commented 6 years ago

Update: the above issue has been fixed by removing and re-generating the lock files.
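
For reference, the regeneration boiled down to roughly this locally, then committing the result (exact commands are from memory, not a script we run on CI):

rm -rf node_modules package-lock.json
npm install
git add package-lock.json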

michaelkleinhenz commented 6 years ago

New problem I encounter right now:

npm ERR! code Z_BUF_ERROR
npm ERR! errno -5
npm ERR! unexpected end of file

...while installing deps.

See https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2419/39/console
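
Z_BUF_ERROR usually points at a truncated download or a corrupted npm cache; if it keeps happening, clearing the cache on the slave is probably worth a try (these are the generic npm commands, not something the pipeline currently does):

npm cache verify
npm cache clean --force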

maxandersen commented 6 years ago

Just to check - this error only occurs on GKE Jenkins, correct?

It's not happening when running locally, nor when running builds in cico?

maxandersen commented 6 years ago

is this a duplicate of https://github.com/openshiftio/openshift.io/issues/2235 - can one be closed over the other ?

sthaha commented 6 years ago

Since this issue seems to not occur anymore, I am going to close it. Please reopen it if that isn't the case.