Closed: jodh-intel closed this issue 6 years ago
I'm un-assigning this from me, as this is more a QA issue than a metrics issue - I just happen to have written the original after finding the problems while running metrics tests. See also #215.
note to self: I realised I can add a check to count `/run/vc/sbs/*` as well as `/var/lib/vc/sbs/*` - let me go add that.
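A minimal sketch of what that check could look like. The two paths are the ones mentioned above; the function names and error message are made up for illustration, not the actual script's:

```shell
count_sandbox_dirs() {
	# Count entries directly under the given directory, tolerating it
	# being absent or empty (find fails quietly, wc reports 0).
	find "$1" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l
}

check_sandbox_leftovers() {
	# Fail if any of the given state directories still contains entries.
	local dir count rc=0
	for dir in "$@"; do
		count=$(count_sandbox_dirs "$dir")
		if [ "$count" -ne 0 ]; then
			echo "ERROR: $count leftover entries under $dir" >&2
			rc=1
		fi
	done
	return "$rc"
}

# e.g.: check_sandbox_leftovers /run/vc/sbs /var/lib/vc/sbs
```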
I liked the idea we had elsewhere of checking the proxy sock count (iirc), but also seem to remember that was going to maybe be tricky to do. I'll leave that one for now.
I will note though, if we are considering adding this to the default CI runs, that the PR in its current default configuration (5 cycles of 110 nginx containers) takes ~18 minutes to run on my desktop - a time overhead we are probably unwilling to add to the CIs. We can of course trim that - it is a balance of runtime vs likelihood of uncovering issues. It may be better to take the sanity check parts of this script and integrate them into the CI as suggested over in #215.
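One way to make that trimming easy would be to let the CI override the cycle/container counts via the environment. This is only a sketch of the idea - the variable names are illustrative, not the ones the actual script uses, and the container launch itself is stubbed out:

```shell
# Defaults match the configuration mentioned above (5 cycles of 110
# containers); a CI job could export smaller values to keep runtime down.
SOAK_CYCLES="${SOAK_CYCLES:-5}"
SOAK_CONTAINERS="${SOAK_CONTAINERS:-110}"

run_soak() {
	local cycle
	for cycle in $(seq 1 "$SOAK_CYCLES"); do
		echo "cycle $cycle: launching $SOAK_CONTAINERS containers"
		# actual container launch/teardown and sanity checks go here
	done
}
```

A CI invocation might then look like `SOAK_CYCLES=1 SOAK_CONTAINERS=20 ./soak.sh`.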
It would be interesting to know how long it takes to run under zuul, and on ppc64le and arm64.
/cc @Pennyzct, @nitkon, @cboylan.
If the CI becomes too slow with this test included, maybe we could run it daily? But what is clear in my mind is that we must run it "regularly" to avoid silently introducing changes that break things. We've done this twice now (and @grahamwhaley has found the issue twice - thanks @grahamwhaley! :).
But we really need to know as soon as possible if we've introduced a bad change to avoid any more nightmare debug sessions.
Any thoughts on this @chavafg, @sboeuf, @bergwolf, @gnawux?
Running on master merges might be an option - we care less about how long those take to test, as long as it is obvious to somebody that a master merge has broken (that is, as long as we get good notifications).
That's what worries me - we might not notice that master broke until we come to make a new release, and then we're back to the situation we're in now: having to spend lots of time debugging / bisecting / etc. to resolve the issue (and identify which PR introduced it).
Right. We already have Jenkins CI master jobs - the question is, what happens when one of those fails to build or pass? To pick on one just as an example: we can see that the agent build on Ubuntu 16.04 has been failing on master half the time. @chavafg - I guess we currently have no feedback hooks for the master branch builds? (Not blaming, btw - I suspect that is how we have always had it for eons.) We should consider whether we can get the master branches to feed back - either to GitHub (but where would that go? Maybe ideally onto the last PR to be merged), or via email (which would need an MTA to be available to us somewhere, I guess).
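For the "feed back to GitHub" option, a master job could post a commit status onto the merge commit it built, via the GitHub commit status API. A rough sketch, assuming curl - the repo name, token variable, and context string are placeholders, not real configuration:

```shell
# Post the master build result back to GitHub as a commit status.
# $GITHUB_TOKEN is assumed to hold a token with repo:status scope.
report_master_status() {
	local state="$1"   # "success" or "failure"
	local sha="$2"     # the merge commit the master job built
	curl -s \
		-H "Authorization: token ${GITHUB_TOKEN}" \
		-d "{\"state\": \"${state}\", \"context\": \"master-build\"}" \
		"https://api.github.com/repos/kata-containers/runtime/statuses/${sha}"
}
```

That would make a master breakage visible on the merged PR's commit, which is roughly the "onto the last PR to be merged" idea above.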
Ideally, we'd just run the soak test on the PR branch since that's where a problem is going to be introduced. So:
/me goes to see if there is a "buy reconditioned kit" option on https://www.top500.org... ;)
@grahamwhaley no, we currently do not have any feedback on master branches. We only have the badges on the main README.md [1], which show if the builds are ok or not. I am not sure what would be the best way to have feedback on the master branches.
Regarding adding this test, we can also run it in a new job, so it can run in parallel with the other ones. This way we don't have to wait another 18 minutes.
[1] http://jenkins.katacontainers.io/job/kata-containers-runtime-ubuntu-16-04-master/badge/icon
Hi @chavafg
> I am not sure what would be the best way to have feedback on the master branches.
It would be great if the master builds could send a pass/fail message to slack+irc+ML, but I don't know how feasible that is?
> Regarding adding this test, we can also run it in a new job, so it can run in parallel with the other ones. This way we don't have to wait another 18 minutes.
Awesome.
@jodh-intel will investigate how feasible it is to post, but I was wondering if that would be too much spam? Maybe only post when the build is not successful?
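The "only post on failure" idea is easy to sketch as a wrapper around the build step. `notify_channel` here is a placeholder for whatever mechanism ends up being chosen (mailing list, webhook, ...); none of these names are real infrastructure:

```shell
notify_channel() {
	# Placeholder: in reality this might mail a list or hit a webhook.
	echo "NOTIFY: $*" >&2
}

run_and_report() {
	# Run the build command; stay quiet on success, notify on failure.
	"$@" && return 0
	local rc=$?
	notify_channel "master build failed: '$*' (exit $rc)"
	return "$rc"
}
```

e.g. `run_and_report make check` would only generate a message when `make check` fails, keeping channel noise to actual breakages.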
I sort of agree @chavafg - I don't want to over-post too widely or to all the channels. If we can post to one channel that is guaranteed to reach the right people (my gut instinct says the mailing list, or maybe a new mailing list just for this), then I think that would do it. I know @jodh-intel is going to say 'the more people who see it, and the more annoying it is, the better' - but I don't want it to turn into spam that folks filter to /dev/null (at the risk of them also filtering a whole channel).
On having a separate build for this (and/or other long-running tests) on a per-PR basis - we could, but we'd then need one for every repo, and those builds would run longer than our current CIs, so by my estimation we would more than double our CI hardware requirements. That is something we'd want to consider and confirm was OK before doing it, I suspect (one reason I lean towards the master builds: we are already building them, they are just not producing much output).
Clearly, I don't want to spam people so I say we opt for the "Travis approach":
But we don't want to invest a lot of engineering time into creating this if we are switching to zuul, which might already have such features built in?
/cc @cboylan.
Hi @grahamwhaley - any update on this? It would be awesome to have this running regularly, even if it does take a long time to run. A few of us have had to debug stray process issues (myself and yourself included, I know) and it's a royal PITA (aka "very "$^$£!"%£"£! hard" :), so if we can minimise the delta (and thus the pain) by having it run on every PR, so much the better.
I'd rather we just land it to run on every PR. That then becomes a forcing function to refine its invocation if it's taking too long. But at least we'll be running it! ;)
We really need the soak test to be imported into Kata so we can run this as part of a CI run:
See https://github.com/kata-containers/runtime/pull/136 (and also https://github.com/kata-containers/agent/pull/196/).