I went to see if the outliers were bogus or what. I grabbed the individual boot times (each test does 20 boots and takes the mean) for the last 5 tests repo runs and plotted them (x is run, y is time in seconds):
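For anyone wanting to reproduce that kind of plot, a minimal sketch is below. The numbers are made up purely for illustration, and I'm assuming the per-boot times have already been pulled out of the results JSON - this is not the actual plotting code.

```python
# Minimal sketch: scatter individual boot times grouped by run.
# The values below are made-up placeholders, not real results.
import matplotlib.pyplot as plt

runs = {
    1: [8.1, 8.2, 8.0, 8.3],
    2: [8.9, 9.0, 8.8, 9.1],
    3: [8.2, 8.1, 8.3, 8.2],
}

for run_id, times in runs.items():
    # x is the run number, y is each individual boot time in seconds
    plt.scatter([run_id] * len(times), times, label=f"run {run_id}")

plt.xlabel("tests repo run")
plt.ylabel("boot time (s)")
plt.title("individual boot times per run")
plt.legend()
plt.show()
```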
Oh dear. It would seem that within a run it is fairly consistent, but across runs it is not. That might have something to do with the type of server we are running on right now. I have access to try out the next couple of server sizes up on packet.net to see if that helps with the noise (and ideally the boot speed itself ;-). I'll get on that, but what I'd really like to do is at the same time set up the deployment scripts (ansible and/or cloud-init most likely) so anybody can deploy a new metrics server more easily in the future. That will take me a little bit. In the meantime, I'll bump the metrics bounds so it stops failing so often on us.
Overnight we got a bunch of fails, some 'JSON files missing' ones I don't understand (yet), and some hitting the bounds. The machine seems to be acting pretty noisy, so I've extended the bounds to try and avoid the 'false failures' that are hindering our merges, but that does mean we are probably being ineffective in our metrics checks. I'm working on looking at other machines to compare the noise level. Hopefully we'll get this sorted soon and reinstate a more refined metrics CI.
I ran up the metrics report on both packet.net t1.small and c1.small machines, and did 3 runs each on bare metal and inside the metrics VM setup on both. I've attached the report files if anybody is interested in the details, but the bottom line is that bare metal runs are substantially more stable and repeatable than nested VM runs. To that end, and to make the metrics CI system more stable and therefore more useful, I'm going to move it off the nested VMs and onto bare metal. To avoid any 'dirty system' issues we might see from doing repeat runs on a system (as detailed around #39 etc.), I will adopt and adapt the cleanup scripts that were added for aarch64, where they have the same issue (as nesting VMs was not an option there).
This will take a short while, as I adapt the yet-to-be-PR'd Ansible scripts to bring up bare metal packet machines in the correct state (with keys etc.) to be attached to the Jenkins master.
Attachments: report_c1_bare.pdf, report_c1_vm.pdf, report_t1_bare.pdf, report_t1_vm.pdf
An update then. We have now deployed a bare metal metrics Jenkins slave on packet.net. It is integrated into the Jenkins master and running, but something seems not quite right about the job triggering and/or the results. The slave worked fine with a tests repo run: http://jenkins.katacontainers.io/computer/x86_packet02/builds
but on the few instances it has triggered since, it is failing the yq checkcommits check (I have a feeling I may have noted this in another Issue already).
Looking into it, it looks like the builds that are triggering are using a very old version of the tests repo code that does not encapsulate the JSON results into test-specific subsections, and hence the checkcommits jq query does not find the results. Here is a 'bad' snippet from a broken build:
},
"Results": [
    {
        "total": {
            "Result": 3.688,
            "Units": "s"
I may not have much time or connectivity this week to look at this. @chavafg, if you get a chance to have a poke around and figure out what is not happening with the triggering, that'd be great. I'll continue thinking on it and nudging it when I can.
Update: @chavafg has checked and tweaked the Jenkins configs for metrics. Let's keep an eye on them to ensure they are now triggering before we close this issue.
Looks like the metrics CI is now triggering (hooray - thanks @chavafg), but the builds are failing (boo!): http://jenkins.katacontainers.io/computer/x86_packet02/builds
I have a feeling the bare metal build slave may have somehow become corrupted :-( That is not meant to happen, so maybe there is something we are not cleaning up thoroughly enough in our bare metal cleanup scripts, and/or maybe I have to redeploy the slave... I'll see if I can find a slot to look at the logs and dial into the slave.
I'm going to close this one. With the last addition of the 'reboot task' to the metrics jenkins configs, things seem to have settled down.
Recently we have had a little spate of metrics CI failures, mostly on the boot time check (the check is saying it has crept up), but also sometimes on memory footprint (where at least it is saying it has gone down...).
It's quite hard to see from the Jenkins UI what is going on (in part as we don't have #60 up and running yet), and thus hard to see if we have had a bump, or are dying from the 'death of a thousand cuts', or if it is noise in the test setup.
I have some noddy scripts that curl the JSON results files down from the Jenkins server and calculate avg/min/max figures so I can try to set the upper/lower bounds on the metrics machine. I took those scripts and used them to plot out recent data for the runtime and tests repos so I could try and see what is going on.
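For anyone curious, the calculation those scripts do can be sketched roughly as below. This is an illustrative Python sketch rather than the actual scripts, and the Jenkins URL, artifact path and JSON layout are placeholders, not the real ones.

```python
# Rough shape of the "curl the results and compute avg/min/max" step.
# The base URL, artifact path and JSON layout are placeholders.
import json
import statistics
import urllib.request

BASE = "http://jenkins.example.com/job/some-metrics-job"  # placeholder URL

def fetch_boot_time(build_number):
    # Placeholder artifact path; the real results file name differs.
    url = f"{BASE}/{build_number}/artifact/results/boot-times.json"
    with urllib.request.urlopen(url) as resp:
        report = json.load(resp)
    return report["Results"][0]["total"]["Result"]

builds = range(100, 120)  # hypothetical range of recent builds
times = [fetch_boot_time(b) for b in builds]

print(f"avg: {statistics.mean(times):.3f}s")
print(f"min: {min(times):.3f}s")
print(f"max: {max(times):.3f}s")
```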
For the runtime repo (x is metrics CI builds over time, and y is time in seconds; yes, the specific machine we use is slow, but what we need is consistency ;-)):
If we grab the average of those times, we get 8.45s - but, look, we have outliers. If we drop those, we get 8.127s.
For the tests repo:
Avg. 8s, but dropping the outlier gives us 7.89s.
Our ceiling on boot is currently set at 8.14s, which is feeling a little low. I set that a while back from empirical data, but looking at those graphs it seems our variance is too large for my liking. I'll see what I can find out in the next few days and try out some other packet machine types. In the meantime, don't be too concerned about the metrics CI complaining...
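One way to make the 'drop the outliers and pick a ceiling' step a bit more systematic would be something like the sketch below. The 1.5×IQR trim rule and the 5% headroom are arbitrary choices of mine for illustration, not what the CI currently enforces.

```python
# Simple outlier trim plus ceiling suggestion.
# The 1.5*IQR rule and 5% headroom are illustrative choices only.
import statistics

def trim_outliers(times):
    # Drop anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = statistics.quantiles(times, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in times if lo <= t <= hi]

def suggest_ceiling(times, headroom=0.05):
    # Mean of the trimmed data plus a little headroom for machine noise.
    trimmed = trim_outliers(times)
    return statistics.mean(trimmed) * (1 + headroom)

boot_times = [8.1, 8.0, 8.2, 9.6, 8.1, 8.3]  # hypothetical samples
print(f"ceiling suggestion: {suggest_ceiling(boot_times):.2f}s")
```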
/cc @egernst @bergwolf