Experiencing random failures on Planner UI builds as well. The failures come up in different stages.
Examples: failures even before npm kicks in, timeouts while npm is resolving dependencies, timeouts in the functional tests (while interacting with the Chrome instance), or failures with no visible reason (the log just stops without any error or exit code, or the step's log is empty altogether).
This looks familiar: https://github.com/npm/npm/issues/9884
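If the dependency-resolution timeouts are the flaky part, one small thing we could try on the slaves is making npm retry harder before giving up. A minimal sketch, assuming the default retry settings are still in place (the values below are guesses to be tuned, not something tested on our setup):

```sh
# Increase npm's retry budget for registry fetches so transient
# network hiccups do not immediately fail the whole install.
npm config set fetch-retries 5
npm config set fetch-retry-mintimeout 20000
npm config set fetch-retry-maxtimeout 120000
```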
If we keep getting random failures, I think we need some kind of profiling:
npm install failed without any error: https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2417/11/console
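For the installs that die without printing anything, it might help to run npm with verbose logging in the pipeline so the Jenkins console at least shows the last thing npm was doing before the step stopped. A sketch of what I mean (nothing project-specific here):

```sh
# Chattier install: the console then shows the last action before a hang or kill.
npm install --loglevel verbose

# npm also writes a per-run debug log; archiving these as build artifacts
# would let us inspect failed runs after the slave pod is gone.
ls ~/.npm/_logs/
```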
This is now serious for us. We're blocked because builds fail more often than they succeed. We need this infra to work reliably; a whole team is wasting a lot of time hitting that rebuild button. We need to solve this.
I'm upgrading this to SEV1 as it does not really matter where the issue is rooted - several teams are affected by "random failures".
If these issues are really caused by exhausted resources on the build server, then let's get the resources increased and see if things get better - that must be cheaper than staying stuck like this.
If it requires more investigation, then let's please get the open questions listed here and work through them.
I guess the only option we have here is to increase the memory and CPU. I can go from:
to :
I don't see where the associated cost shows up, though.
I'll do that just after lunch unless someone objects.
@chmouel, I'm guessing it gets billed to the account automatically. Could you inform the planner team so that they can quickly try things out after you make the change? @pranavgore09 @michaelkleinhenz ?
Purely for experimentation purposes, I ran the first 3 stages (which include building and testing) of the fabric8-planner build on Travis CI [0].
Here's what I found out -
[0] - https://travis-ci.org/jarifibrahim/fabric8-planner (check only the travis-test branch; the other branch failed since it didn't have a .travis.yml file)
Is this really a whole build of fabric8-planner, and you literally have 7 specs? Something is really odd here... also from the perspective of the time needed to execute it (even on Travis).
@chmouel, I am not sure whether we can resize this same instance or whether we don't have enough privileges for that. Should we ask @pbergene or @jfchevrette? They both have more access than us.
@chmouel I am curious what we are resizing here - the Jenkins master or the slave? Isn't the job run on the slave? Can we get a measurement of the CPU and memory utilisation of the slave while a job is running?
This is a resize of the complete cluster, where other services also run: configmapcontroller, content-repository, elasticsearch, exposecontroller, fabric8-docker-registry, hubot-mattermost, jenkins, jenkins-slave-*, kibana, mattermost-db, mattermost, nexus.
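On the measurement question above: if metrics collection is enabled on the cluster, something along these lines should show what the slave pods consume while a job runs. The namespace is a placeholder, I don't know the exact one used for CI:

```sh
# Node-level view: is the cluster as a whole running out of memory?
kubectl top nodes

# Pod-level view in the CI namespace, broken down per container.
kubectl top pods -n <ci-namespace> --containers
```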
@bartoszmajsak
> Is this really a whole build of fabric8-planner and you literally have 7 specs? Something is really odd here... also from the perspective of the time needed to execute it (even on travis).
Yes, these are the UI smoke tests. Since these tests perform actions such as clicking buttons, they take some time.
So, after spending some time on it:
There is no memory limit on the containers; it's literally "be my guest, mister container, use whatever memory is available on the cluster".
We have a total of 15GB of memory on the cluster currently (as seen in the screenshot I sent earlier).
We sometimes see peaks going up to 15GB on a job, which I guess is one of your UI tests consuming heaps of memory:
Increasing the memory available on the cluster may help, but if a job keeps eating an insane amount of memory (which we will track), then we will simply put a limit on the containers being run (see the sketch below).
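If we do end up capping the containers, here is a rough sketch of what that could look like for the long-running services; the slave pods would need equivalent values set in the Jenkins pod template instead. The names and numbers below are placeholders, not a recommendation:

```sh
# Cap memory/CPU on a deployment so one runaway pod cannot starve the whole node.
kubectl -n <ci-namespace> set resources deployment/jenkins \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2,memory=4Gi
```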
We will do the memory increase on the cluster later this afternoon (EU time).
Thanks,
I have just increased the available memory/CPU. Let me know if that makes things better; we will monitor the graphs as well:
Following up on open SEV1 items.
Thanks!
@mmclanerh, I guess we can remove the SEV1 label. After the upgrade there is a big improvement in the success rate of the PR jobs. We can make this SEV2 and keep it open for some monitoring from the build team's side.
cc @maxandersen @michaelkleinhenz
From some monitoring yesterday via the GKE console, I could see that UI pods were in an error state (screenshot below); high resource usage is my guess.
I will try to reproduce this situation to get some more details.
Yes, I can see improvements in the builds at https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2426/. The success rate for this job was much higher after the upgrade. The few failures after build #36 were caused by a code change of mine and were expected, but the builds still go through. Sometimes I see that slaves are not available for ~15-20 minutes, but the build later gets a chance to run and continues. I did see one random failure in #40 (https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2426/40/), but it worked fine afterwards.
Planner builds are more stable now. The time taken to run the tests has gone down from ~6 minutes to ~3 minutes.
I was monitoring the Jenkins instance and tried to reproduce the behaviour that leaves the slaves in an error state. I triggered 5 different fabric8-planner UI PR builds simultaneously. My observations are below:
Total container cap of 5 reached, not provisioning: 5 running or errored in namespace c
even though only 3 build pods are running, because two pods are in an error state. From the above observation I can conclude that:
Below are some screenshots of the above observations:
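For anyone who wants to poke at this directly on the cluster, a quick sketch of how to spot the errored slave pods that still count against the cap and clear them out so fresh slaves can be provisioned (namespace and pod name are placeholders):

```sh
# List the Jenkins slave pods and their current state; errored pods still count
# towards the Kubernetes plugin's container cap of 5.
kubectl -n <ci-namespace> get pods -o wide | grep jenkins-slave

# Delete a stuck/errored slave pod by name so the plugin can provision a new one.
kubectl -n <ci-namespace> delete pod <errored-slave-pod-name>
```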
cc: @pradeepto @chmouel @pranavgore09
Thanks @rupalibehera for the detailed report.
cc @jfchevrette @maxandersen
Thanks @rupalibehera that confirms what I have long suspected.
thanks @rupalibehera for finding this
@rupalibehera nice and detailed observation. I'd like to amend some of the points you mentioned:
A host running 12 other critical services while also doing this isn't ideal at all. The spec mentioned is quite similar to our work laptops, but I'm not running 12 other live services on mine, and I can't imagine how my system would handle it if I did. We need some sort of encapsulation/separation of concerns here, so we don't put all the eggs in one basket.
@dgutride and I are actively working on cleaning up and optimizing the UI build. Once we have that worked out in fabric8-ui, we will provide those changes/fixes to the other UI repos.
Downgraded severity after no objections on triage.
New behaviour: builds now crash while npm is moving files around:
npm ERR! path /home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/lodash
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! syscall rename
npm ERR! enoent ENOENT: no such file or directory, rename '/home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/lodash' -> '/home/jenkins/workspace/-ui_fabric8-planner_PR-2419-J5GLLZH4I7563GRZMERRKYWFHX4MIY25QPQMODLQRAHYHVB5LMQA@2/dist/node_modules/angular-tree-component/node_modules/.lodash.DELETE'
npm ERR! enoent This is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! A complete log of this run can be found in:
npm ERR! /home/jenkins/.npm/_logs/2018-02-19T12_43_00_159Z-debug.log
script returned exit code 254
I can't reproduce this locally, but it is reproducible on CI.
@michaelkleinhenz This is a known issue with npm: https://github.com/npm/npm/issues/17444
@jarifibrahim that thread says removing package-lock.json solves the issue. I don't even know what to say about all of this anymore. It is just so sad.
@michaelkleinhenz We should give Yarn (https://yarnpkg.com/en/) a try for once.
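If we do experiment with it, the change to the install step would roughly be the following (untested sketch, and it assumes a yarn.lock gets generated and committed first):

```sh
# Generate yarn.lock once locally and commit it alongside package.json.
yarn install

# In CI, install strictly from the committed lockfile; fail instead of updating it.
yarn install --frozen-lockfile
```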
Update: the above issue has been fixed by removing and re-generating the lock files.
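For anyone hitting the same ENOENT rename error: the fix boiled down to regenerating the lockfile from a clean state, roughly like this (the exact steps may have differed slightly):

```sh
# Drop the stale lockfile and the half-renamed node_modules tree,
# then let npm rebuild both from package.json.
rm -f package-lock.json
rm -rf node_modules
npm install
```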
New problem I'm encountering right now:
npm ERR! code Z_BUF_ERROR
npm ERR! errno -5
npm ERR! unexpected end of file
...while installing deps.
See https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-planner/job/PR-2419/39/console
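Z_BUF_ERROR / "unexpected end of file" usually points at a truncated download or a corrupted npm cache on the slave. A possible mitigation to try before simply re-running the job (standard npm 5+ commands, nothing specific to our setup):

```sh
# Check the cache for corrupted entries and garbage-collect them.
npm cache verify

# If that is not enough, wipe the cache entirely and reinstall.
npm cache clean --force
npm install
```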
Just to check: this error only occurs on the GKE Jenkins, correct?
It's not happening when running locally, nor when running builds in cico?
Is this a duplicate of https://github.com/openshiftio/openshift.io/issues/2235 - can one be closed in favour of the other?
Since this issue seems to no longer occur, I am going to close it. Please reopen it if that isn't the case.
Development time increases because of long-running PRs that fail across multiple builds, whether due to slave-related issues, a genuine failure in the code, or something else.
It would be nice to get the pipelines a bit faster so the feedback cycle is quicker.
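One lever that might help once the lockfiles are stable again (just an idea, not something we have tried here): npm ci, available since npm 5.7, installs straight from package-lock.json and tends to be faster and more reproducible in CI than a full npm install:

```sh
# Clean, lockfile-only install; removes node_modules first and fails fast
# if package.json and package-lock.json are out of sync.
npm ci
```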
The CD team will need help from the fabric8-ui team on this, as none of us knows or understands much of it.
https://github.com/fabric8-ui/fabric8-ui/
https://jenkins.cd.test.fabric8.io/job/fabric8-ui/job/fabric8-ui/