This issue is to reintroduce Timeline (#849) and keep track of some related work.

What is the timeline?

The timeline is a log file that makes system performance more transparent.

Each log entry has a timestamp with cycle (~nanosecond) precision, a short message, a long explanation, some optional counter values, and other relevant details.

The log level is dynamically adjusted for each iteration of the event loop (breath): most breaths are not logged, many are logged at a high level, and some are logged with fine detail. The log therefore provides samples of operation in units of whole breaths.

The timeline will typically contain one million events recorded over the past few minutes.

Who is it for?

The timeline is intended as a common reference for many different people:

Software developers optimizing their applications.
Network engineers troubleshooting production deployments.
Test engineers performing system stress tests.
Infrastructure engineers providing logs from automatic tests.

Having a common tool will hopefully make it easier for these people to communicate with each other.

I think of the timeline like the flight recorder ("black box") in an airplane. It is a buffer containing data about recent operation and it is robust enough to survive a crash. If a system crashes or behaves strangely during a test run then the first step for every interested party would be to review the timeline and compare their observations.

Example

Here is an excerpt from the timeline file showing a breath of the snabbmark basic1 synthetic benchmark:

numa core       cycles category         message
0    2            2540 engine           start breath(3129): total packets(1595790) bytes(95747400) etherbits(880876080)
0    2             252 engine           initialized breath
0    2           11608 engine           called app Source (type: apps.basic.basic_apps)
0    2             540 engine           pulled input packets
0    2             680 engine           called app Sink (type: apps.basic.basic_apps)
0    2             588 engine           called app Sink (type: apps.basic.basic_apps)
0    2           16888 engine           called app Tee (type: apps.basic.basic_apps)
0    2             184 engine           pushed output packets
0    2            9084 engine           called app Sink (type: apps.basic.basic_apps)
0    2             512 engine           called app Sink (type: apps.basic.basic_apps)
0    2             280 engine           pushed output packets
0    2             176 engine           pushed output packets
0    2             208 engine           end breath(3129): processed packets(510) bytesPerPacket(60)

The first message shows the breath starting along with the total number of packets/bytes/bits that have been processed. The last message shows the breath ending along with the number of packets processed and their average size. The messages in between show more detail: in this each callback made to an app in the engine configuration. The cycle count is shown as a delta compared with the previous log entry, which roughly estimates how long each event took to complete. The currently scheduled CPU core and NUMA node are also included as additional details.

See also longer example in which events are shown from newest to oldest.

Status

The timeline is currently living on a couple of branches:

snabbco/timeline contains the core logging mechanism (doc, source).
lukego/timeline-engine contains event definitions and instrumentation for the engine.

next action: @lukego needs to submit a version for upstreaming that is unlikely to cause problems for people. This requires ensuring the overhead is negligible and perhaps changing default settings e.g. smaller log file or even logging disabled by default.

Grand vision and parallel work

The timeline is one part of a grand vision for making it easy and productive to optimize and troubleshoot Snabb applications.

More related work:

Getting our Hydra CI system (hydra.snabb.co) to automatically execute large benchmarking campaigns and publish both metrics and timeline logs. @domenkozar is leading this effort.
Getting Hydra to also generate Rmarkdown reports (early draft example) to summarize and compare benchmark campaigns and highlight bad results that may indicate bugs. @lukego and @domenkozar are working on this.
Creating an interactive GUI (early example screenshots) for exploring timeline files and "finding the needles in the million-event haystack." Alexandre Bergel's team at ObjectProfile will pick up @lukego's Snabb Data Studio prototype and take the lead on this.
<insert your own timeline-related initiative here :-)>

More indirectly, I hope that timeline will provide the framework for writing fast applications. How do you optimize your application? Break it into pieces that each run for around a thousand CPU cycles and can easily be measured with timeline. This would have the double benefit of making system performance transparent and also dividing the work into the small loops that CPUs and tracing JITs love. We shall see!

Backend operational

The backend is up and running now! Hydra runs test campaigns automatically across a cluster of machines, archives the logs (including timeline files), and publishes an Rmarkdown report to highlight anomalies.

This hopefully provides the data feed that we need for developing and using the Snabb Studio front-end i.e. to identify the interesting results, download their timeline files, and use the GUI to understand what is going on.

Here is an overview of how everything works and how you can access it. I will provide both links and, mostly for convenience, screenshots. I will link to recent runs and code on my own development branches.

Code

bench.nix is a Nix script that runs the benchmarks and converts their output into CSV. This is built on shell fragments to call our existing benchmarks.

[report.Rmd]() is the Rmarkdown document that produces the report. This is an early draft and can be extended, tweaked, and pruned over time. Just now it seems reasonable to keep adding more graphs and then later remove the ones that are not useful in practice. I am trying to start the report with the most intuitive graphs and then move on to more exotic ones. (We also need an Rmarkdown document that presents a comparison between two or more Snabb branches.)

I have mostly learned R from two O'Reilly books: R Cookbook and R Graphics Cookbook. This is where I picked up using the "ggplot2" library for most graphs.

Hydra

I have defined a Hydra Jobset to run a benchmark campaign using software versions from specific git repos/branches:

jobset edit

This could be enabled to run automatically when any of the branches is updated. For the moment that is disabled and I run it manually.

The Jobset Evaluations lists the results of each execution of this jobset.

jobset

Each execution only rebuilds what is needed: for example if only the Rmarkdown document has been changed then the previous benchmark results will be reused, but if the Snabb code was changed then all the benchmarks would be rerun. This is the normal behavior for Nix i.e. it looks at the SHA256 hash of what it needs to run and checks if it has already done that in the past.

Clicking on one of these Evaluations takes me to a page that lists all of the Jobs that Hydra has built for this Jobset.

jobs

This example has 122 jobs: the CSV file, the R report, and each of the 120 individual benchmarks (20 runs each of 6 different benchmarks.)

Clicking through to the benchmark-report job page leads me to benchmark.html that is the actual report. This includes a series of graphs, the first of which is a line graph.

report

Here we can see that three results are interestingly slower than the others: basic1 test number 6 and iperf1500 tests number 8 and 15. I can use these numbers to identify which Hydra jobs contain the relevant logs, for example benchmark basic1 run 6 is job snabb-basic1-100e6-num-6.

basic1 job

Here I can click the links to download either log.txt, the text output from running the test, or state.tar.xz, a tarball that includes the timeline file for the Snabb process(es) that were executed.

These timeline files can then be loaded into the GUI to see what is going on. Getting this working smoothly is the next step.

Anticipating the questions we will want to answer with the GUI...

Is there one specific event that takes much longer in the slow runs than in other runs?
Or is slowness distributed more evently between all events?
And is the slowness spread across the whole trace or localized to just the start/middle/end?

... but perhaps the visualizations will speak for themselves and we will not have to come into the analysis with such specific questions :).

snabbco / snabb

Timeline: Performance analysis for everybody #916