Closed: jodh-intel closed this issue 6 years ago
I'm un-assigning this from me, as this is more a QA issue than a metrics issue - I just happen to have written the original after finding the problems while running metrics tests. See also #215.
note to self: I realised I can add a check to count `/run/vc/sbs/*` as well as `/var/lib/vc/sbs/*` - let me go add that.
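A minimal sketch of what that check could look like. The two paths are the ones mentioned above; the function names and error message are made up for illustration, not the actual script's:

```shell
count_sandbox_dirs() {
	# Count entries directly under the given directory, tolerating it
	# being absent or empty (find fails quietly, wc reports 0).
	find "$1" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l
}

check_sandbox_leftovers() {
	# Fail if any of the given state directories still contains entries.
	local dir count rc=0
	for dir in "$@"; do
		count=$(count_sandbox_dirs "$dir")
		if [ "$count" -ne 0 ]; then
			echo "ERROR: $count leftover entries under $dir" >&2
			rc=1
		fi
	done
	return "$rc"
}

# e.g.: check_sandbox_leftovers /run/vc/sbs /var/lib/vc/sbs
```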
I liked the idea we had elsewhere of checking the proxy sock count (iirc), but also seem to remember that was going to maybe be tricky to do. I'll leave that one for now.
I will note though, if we are considering adding this to the default CI runs, that the PR in its current default configuration (5 cycles of 110 nginx containers) takes ~18 minutes to run on my desktop - a time overhead we are probably unwilling to add to the CIs. We can of course trim that - it is a balance of runtime vs likelihood of uncovering issues. It may be better to take the sanity check parts of this script and integrate them into the CI as suggested over in #215.
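One way to make that trimming easy would be to let the CI override the cycle/container counts via the environment. This is only a sketch of the idea - the variable names are illustrative, not the ones the actual script uses, and the container launch itself is stubbed out:

```shell
# Defaults match the configuration mentioned above (5 cycles of 110
# containers); a CI job could export smaller values to keep runtime down.
SOAK_CYCLES="${SOAK_CYCLES:-5}"
SOAK_CONTAINERS="${SOAK_CONTAINERS:-110}"

run_soak() {
	local cycle
	for cycle in $(seq 1 "$SOAK_CYCLES"); do
		echo "cycle $cycle: launching $SOAK_CONTAINERS containers"
		# actual container launch/teardown and sanity checks go here
	done
}
```

A CI invocation might then look like `SOAK_CYCLES=1 SOAK_CONTAINERS=20 ./soak.sh`.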
It would be interesting to know how long it takes to run under zuul, and on ppc64le and arm64.
/cc @Pennyzct, @nitkon, @cboylan.
If the CI becomes too slow with this test included, maybe we could run it daily? But what is clear in my mind is that we must run it "regularly" to avoid silently introducing changes that break things. We've done this twice now (and @grahamwhaley has found the issue twice - thanks @grahamwhaley! :).
But we really need to know as soon as possible if we've introduced a bad change to avoid any more nightmare debug sessions.
Any thoughts on this @chavafg, @sboeuf, @bergwolf, @gnawux?
Running on master merges might be an option - we care less about how long those take to test, as long as it is obvious to somebody that a master merge has broken (that is, as long as we get good notifications).
That's what worries me - we might not notice that master broke until we come to make a new release, and then we're back to the situation we're in now: having to spend lots of time debugging / bisecting / etc. to resolve the issue (and identify which PR introduced it).
Right. We already have Jenkins CI master jobs - the question is, what happens when one of those fails to build or pass? To pick on one just as an example: we can see that the agent build on Ubuntu 16.04 has been failing on master half the time. @chavafg - I guess we currently have no feedback hooks for the master branch builds? (Not blaming, btw - I suspect that is how we have always had it for eons.) We should consider whether we can get the master branches to feed back - either to GitHub (but where would that go? Maybe ideally onto the last PR to be merged), or via email (which would need an MTA to be available to us somewhere, I guess).
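For the "feed back to GitHub" option, a master job could post a commit status onto the merge commit it built, via the GitHub commit status API. A rough sketch, assuming curl - the repo name, token variable, and context string are placeholders, not real configuration:

```shell
# Post the master build result back to GitHub as a commit status.
# $GITHUB_TOKEN is assumed to hold a token with repo:status scope.
report_master_status() {
	local state="$1"   # "success" or "failure"
	local sha="$2"     # the merge commit the master job built
	curl -s \
		-H "Authorization: token ${GITHUB_TOKEN}" \
		-d "{\"state\": \"${state}\", \"context\": \"master-build\"}" \
		"https://api.github.com/repos/kata-containers/runtime/statuses/${sha}"
}
```

That would make a master breakage visible on the merged PR's commit, which is roughly the "onto the last PR to be merged" idea above.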
Ideally, we'd just run the soak test on the PR branch since that's where a problem is going to be introduced. So:
/me goes to see if there is a "buy reconditioned kit" option on https://www.top500.org... ;)
@grahamwhaley no, we currently do not have any feedback on master branches. We only have the badges on the main README.md [1], which show if the builds are ok or not. I am not sure what would be the best way to have feedback on the master branches.
Regarding adding this test, we can also run it in a new job, so it can run in parallel with the other ones. This way we don't have to wait another 18 minutes.
[1] http://jenkins.katacontainers.io/job/kata-containers-runtime-ubuntu-16-04-master/badge/icon
Hi @chavafg
> I am not sure what would be the best way to have feedback on the master branches.
It would be great if the master builds could send a pass/fail message to slack+irc+ML, but I don't know how feasible that is?
> Regarding adding this test, we can also run it in a new job, so it can run in parallel with the other ones. This way we don't have to wait another 18 minutes.
Awesome.
@jodh-intel will investigate how feasible it is to post, but I was wondering if that would be too much spam? Maybe only post when the build is not successful?
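The "only post on failure" idea is easy to sketch as a wrapper around the build step. `notify_channel` here is a placeholder for whatever mechanism ends up being chosen (mailing list, webhook, ...); none of these names are real infrastructure:

```shell
notify_channel() {
	# Placeholder: in reality this might mail a list or hit a webhook.
	echo "NOTIFY: $*" >&2
}

run_and_report() {
	# Run the build command; stay quiet on success, notify on failure.
	"$@" && return 0
	local rc=$?
	notify_channel "master build failed: '$*' (exit $rc)"
	return "$rc"
}
```

e.g. `run_and_report make check` would only generate a message when `make check` fails, keeping channel noise to actual breakages.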
I sort of agree @chavafg - I don't want to over-post too widely or to all the channels. If we can post to one channel that is guaranteed to reach the right people (my gut instinct says the mailing list, or maybe a new mailing list just for this), then I think that would do it. I know @jodh-intel is going to say 'the more people who see it, and the more annoying it is, the better' - but I don't want it to turn into spam that folks filter to /dev/null (at the risk of them also filtering a whole channel).
On having a separate build for this (and/or other long-running tests) on a per-PR basis - we could, but we'd then need one for every repo, and those builds would run longer than our current CIs, so by my estimation we would more than double our CI hardware requirements. That is something we'd want to consider and confirm was OK before doing it, I suspect (one reason I lean towards the master builds: we are already building them, they are just not producing much output).
Clearly, I don't want to spam people so I say we opt for the "Travis approach":
But we don't want to invest a lot of engineering time into creating this if we are switching to zuul, which might already have such features built in?
/cc @cboylan.
Hi @grahamwhaley - any update on this? It would be awesome to have this running regularly, even if it does take a long time to run. A few of us have had to debug stray process issues (myself and yourself included, I know) and it's a royal PITA (aka "very "$^$£!"%£"£! hard" :), so if we can minimise the delta (and thus the pain) by having it run on every PR, so much the better.
I'd rather we just land it to run on every PR. That then becomes a forcing function to refine its invocation if it's taking too long. But at least we'll be running it! ;)
We really need the soak test to be imported into Kata so we can run this as part of a CI run:
See https://github.com/kata-containers/runtime/pull/136 (and also https://github.com/kata-containers/agent/pull/196/).