Experiencing random failures on Planner UI builds as well. The failures come up in different stages.
Examples: failures even before npm kicks in, timeouts while npm is resolving dependencies, timeouts in the functional tests (while interacting with the Chrome instance), or failures with no visible reason (the log just stops without any error or exit code, or the step's log is empty altogether).
This looks familiar: https://github.com/npm/npm/issues/9884
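If the dependency-resolution timeouts are the flaky part, one small thing we could try on the slaves is making npm retry harder before giving up. A minimal sketch, assuming the default retry settings are still in place (the values below are guesses to be tuned, not something tested on our setup):

```sh
# Increase npm's retry budget for registry fetches so transient
# network hiccups do not immediately fail the whole install.
npm config set fetch-retries 5
npm config set fetch-retry-mintimeout 20000
npm config set fetch-retry-maxtimeout 120000
```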
If we keep getting random failures, I think we need some kind of profiling:
npm install failed without any error: https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2417/11/console
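For the installs that die without printing anything, it might help to run npm with verbose logging in the pipeline so the Jenkins console at least shows the last thing npm was doing before the step stopped. A sketch of what I mean (nothing project-specific here):

```sh
# Chattier install: the console then shows the last action before a hang or kill.
npm install --loglevel verbose

# npm also writes a per-run debug log; archiving these as build artifacts
# would let us inspect failed runs after the slave pod is gone.
ls ~/.npm/_logs/
```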
This is now serious for us. We're blocked because builds fail more often than they succeed. We need this infra to work reliably; a whole team is wasting a lot of time hitting that rebuild button. We need to solve this.
I'm upgrading this to SEV1 as it does not really matter where the issue is rooted - several teams are affected by "random failures".
If these issues are really caused by exhausted resources on the build server, then let's get the resources increased and see if things get better - that must be cheaper than staying stuck like this.
If it requires more investigation, then let's please get the open questions listed here and work through them.
I guess the only option we have here is to increase the memory and CPU. I can go from:
to :
I don't see where the associated cost shows up, though.
I'll do that just after lunch unless someone objects.
@chmouel, I'm guessing it gets billed to the account automatically. Could you inform the planner team so that they can quickly try things out after you make the change? @pranavgore09 @michaelkleinhenz ?
Purely for experimentation purposes, I ran the first 3 stages (which include building and testing) of the fabric8-planner build on Travis CI [0].
Here's what I found out -
[0] - https://travis-ci.org/jarifibrahim/fabric8-planner (check only the travis-test branch; the other branch failed since it didn't have a .travis.yml file)
Is this really a whole build of fabric8-planner, and you literally have 7 specs? Something is really odd here... also from the perspective of the time needed to execute it (even on Travis).
@chmouel, I am not sure whether we can resize this same instance or whether we don't have enough privileges for that. Should we ask @pbergene or @jfchevrette? They both have more access than us.
@chmouel I am curious what we are resizing here - the Jenkins master or the slave? Isn't the job run on the slave? Can we get a measurement of the CPU and memory utilisation of the slave while a job is running?
This is a resize of the complete cluster, where other services also run: configmapcontroller, content-repository, elasticsearch, exposecontroller, fabric8-docker-registry, hubot-mattermost, jenkins, jenkins-slave-*, kibana, mattermost-db, mattermost, nexus.
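On the measurement question above: if metrics collection is enabled on the cluster, something along these lines should show what the slave pods consume while a job runs. The namespace is a placeholder, I don't know the exact one used for CI:

```sh
# Node-level view: is the cluster as a whole running out of memory?
kubectl top nodes

# Pod-level view in the CI namespace, broken down per container.
kubectl top pods -n <ci-namespace> --containers
```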
@bartoszmajsak
> Is this really a whole build of fabric8-planner and you literally have 7 specs? Something is really odd here... also from the perspective of the time needed to execute it (even on travis).
Yes, these are the UI smoke tests. Since these tests perform actions such as clicking buttons, they take some time.
So, after spending some time on it:
There is no memory limit on the containers; it's literally "be my guest, mister container, use whatever memory is available on the cluster".
We have a total of 15GB of memory on the cluster currently (as seen in the screenshot I sent earlier).
We sometimes see peaks going up to 15GB on a job, which I guess is one of your UI tests consuming heaps of memory:
Increasing the memory available on the cluster may help, but if a job keeps eating an insane amount of memory (which we will track), then we will simply put a limit on the containers being run (see the sketch below).
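If we do end up capping the containers, here is a rough sketch of what that could look like for the long-running services; the slave pods would need equivalent values set in the Jenkins pod template instead. The names and numbers below are placeholders, not a recommendation:

```sh
# Cap memory/CPU on a deployment so one runaway pod cannot starve the whole node.
kubectl -n <ci-namespace> set resources deployment/jenkins \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2,memory=4Gi
```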
We will do the memory increase on the cluster later this afternoon (EU time).
Thanks,
I have just increased the available memory/CPU. Let me know if that makes things better; we will monitor the graphs as well:
Following up on open SEV1 items.
Thanks!
@mmclanerh, I guess we can remove the SEV1 label. After the upgrade there is a big improvement in the success rate of the PR jobs. We can make this SEV2 and keep it open for some monitoring from the build team's side.
cc @maxandersen @michaelkleinhenz
From some monitoring yesterday via the GKE console, I could see that UI pods were in an error state (screenshot below); high resource usage is my guess.
I will try to reproduce this situation to get some more details.
Yes, I can see improvements in the builds at https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2426/. The success rate for this job was much higher after the upgrade. The few failures after build #36 were caused by a code change of mine and were expected, but the builds still go through. Sometimes I see that slaves are not available for ~15-20 minutes, but the build later gets a chance to run and continues. I did see one random failure in #40 (https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2426/40/), but it worked fine afterwards.
Planner builds are more stable now. The time taken to run the tests has gone down from ~6 minutes to ~3 minutes.
I was monitoring the Jenkins instance and tried to reproduce the behaviour that leaves the slaves in an error state. I triggered 5 different fabric8-planner UI PR builds simultaneously. My observations are below:
Total container cap of 5 reached, not provisioning: 5 running or errored in namespace c
even though only 3 build pods are running, because two pods are in an error state. From the above observation I can conclude that:
Below are some screenshots of the above observations:
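For anyone who wants to poke at this directly on the cluster, a quick sketch of how to spot the errored slave pods that still count against the cap and clear them out so fresh slaves can be provisioned (namespace and pod name are placeholders):

```sh
# List the Jenkins slave pods and their current state; errored pods still count
# towards the Kubernetes plugin's container cap of 5.
kubectl -n <ci-namespace> get pods -o wide | grep jenkins-slave

# Delete a stuck/errored slave pod by name so the plugin can provision a new one.
kubectl -n <ci-namespace> delete pod <errored-slave-pod-name>
```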
cc: @pradeepto @chmouel @pranavgore09
Thanks @rupalibehera for the detailed report.
cc @jfchevrette @maxandersen
Thanks @rupalibehera that confirms what I have long suspected.
thanks @rupalibehera for finding this
@rupalibehera nice and detailed observation. I'd like to amend some of the points you mentioned:
A host running 12 other critical services while also doing this isn't ideal at all. The spec mentioned is quite similar to our work laptops, but I'm not running 12 other live services on mine, and I can't imagine how my system would handle it if I did. We need some sort of encapsulation/separation of concerns here, so we don't put all the eggs in one basket.
@dgutride and I are actively working on cleaning up and optimizing the UI build. Once we have that worked out in fabric8-ui, we will provide those changes/fixes to the other UI repos.
Downgraded severity after no objections on triage.
New behaviour: builds now crash while npm is moving files around:
npm ERR! path /home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/lodash
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! syscall rename
npm ERR! enoent ENOENT: no such file or directory, rename '/home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/lodash' -> '/home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/.lodash.DELETE'
npm ERR! enoent This is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! A complete log of this run can be found in:
npm ERR! /home/jenkins/.npm/_logs/2018-02-19T12_43_00_159Z-debug.log
script returned exit code 254
I can't reproduce this locally, but it is reproducible on CI.
@michaelkleinhenz This is a known issue with npm: https://github.com/npm/npm/issues/17444
@jarifibrahim that thread says removing package-lock.json solves the issue. I don't even know what to say about all of this anymore. It is just so sad.
@michaelkleinhenz We should give Yarn (https://yarnpkg.com/en/) a try for once.
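If we do experiment with it, the change to the install step would roughly be the following (untested sketch, and it assumes a yarn.lock gets generated and committed first):

```sh
# Generate yarn.lock once locally and commit it alongside package.json.
yarn install

# In CI, install strictly from the committed lockfile; fail instead of updating it.
yarn install --frozen-lockfile
```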
Update: the above issue has been fixed by removing and re-generating the lock files.
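For anyone hitting the same ENOENT rename error: the fix boiled down to regenerating the lockfile from a clean state, roughly like this (the exact steps may have differed slightly):

```sh
# Drop the stale lockfile and the half-renamed node_modules tree,
# then let npm rebuild both from package.json.
rm -f package-lock.json
rm -rf node_modules
npm install
```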
New problem I'm encountering right now:
npm ERR! code Z_BUF_ERROR
npm ERR! errno -5
npm ERR! unexpected end of file
...while installing deps.
See https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2419/39/console
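Z_BUF_ERROR / "unexpected end of file" usually points at a truncated download or a corrupted npm cache on the slave. A possible mitigation to try before simply re-running the job (standard npm 5+ commands, nothing specific to our setup):

```sh
# Check the cache for corrupted entries and garbage-collect them.
npm cache verify

# If that is not enough, wipe the cache entirely and reinstall.
npm cache clean --force
npm install
```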
Just to check: this error only occurs on the GKE Jenkins, correct?
It's not happening when running locally, nor when running builds in cico?
Is this a duplicate of https://github.com/openshiftio/openshift.io/issues/2235 - can one be closed in favour of the other?
Since this issue seems to no longer occur, I am going to close it. Please reopen it if that isn't the case.
Development time increases because of long-running PRs that fail across multiple builds, whether due to slave-related issues, a genuine failure in the code, or something else.
It would be nice to get the pipelines a bit faster so the feedback cycle is quicker.
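One lever that might help once the lockfiles are stable again (just an idea, not something we have tried here): npm ci, available since npm 5.7, installs straight from package-lock.json and tends to be faster and more reproducible in CI than a full npm install:

```sh
# Clean, lockfile-only install; removes node_modules first and fails fast
# if package.json and package-lock.json are out of sync.
npm ci
```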
The CD team will need help from the fabric8-ui team on this, as none of us knows or understands much of it.
https://github.com/fabric8-ui/fabric8-ui/
https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-ui/