lukego opened this issue 7 years ago
(This is actually a really awful graphic. However it should serve very well to contrast with a good one. I'm working on that now. The important missing ingredient above is showing the variance.)
So! Please indulge me in some experimenting aloud.
The weakness of the graph above is that it only shows averages over one second. This leaves interesting questions unanswered. For example, what were the typical numbers of packets processed in each breath? The timeline contains detailed information that can be used to answer these questions. The challenge is to find a suitable visualization that shows more about the distribution of values than simply their average.
This is actually quite important in the timeline I am using as an example. The graphic shows that the packets/breath burst size averages around 10-30 packets. However, check out this excerpt from the log entries:
```
   packets bytes  cycles cycles_per_packet
1        6  3048    1869          311.5000
2      166 84328  121259          730.4759
3        0     0    1506               Inf
4       10  5080    2222          222.2000
5        0     0    1317               Inf
6        0     0   48941               Inf
7        8  4064    1814          226.7500
8        0     0    1407               Inf
9        0     0   33948               Inf
10      25 12700    2527          101.0800
11       0     0    1356               Inf
12      14  7112   21985         1570.3571
13       2  1016    1754          877.0000
14       0     0    1415               Inf
15     168 85344   93893          558.8869
16       0     0    1425               Inf
17       1   508   14519        14519.0000
18       2  1016    1751          875.5000
19       0     0    1285               Inf
20       0     0    1534               Inf
21       0     0    1299               Inf
```
Here we see that each breath is different. Most of the time there are no packets available, but occasionally more than 150 arrive at once. So we can speculate that there is some burstiness in our I/O sources (in this case a VM running DPDK l2fwd and a ConnectX-4 100G NIC).
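To make the average-vs-distribution point concrete, here is a quick sketch that summarizes the packets column from the excerpt above (the numbers are transcribed from the 21 log rows, nothing else is assumed):

```python
# packets/breath transcribed from the 21-row log excerpt above.
packets = [6, 166, 0, 10, 0, 0, 8, 0, 0, 25, 0,
           14, 2, 0, 168, 0, 1, 2, 0, 0, 0]

mean = sum(packets) / len(packets)
idle = sum(1 for p in packets if p == 0)  # breaths that found no packets
peak = max(packets)

# The mean lands in the 10-30 range the graph shows, but it hides that
# more than half the breaths do no work while a few carry big bursts.
print(f"mean packets/breath: {mean:.1f}")
print(f"idle breaths: {idle}/{len(packets)}")
print(f"peak burst: {peak}")
```

The mean comes out around 19 packets/breath, yet 11 of 21 breaths are empty and the peaks exceed 160 — exactly the structure the one-second average erases.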
Here is a first attempt to illustrate the distribution of values for each metric:
There is a lot of information here. This puts some things into a different light.

Looking at `packets/breath`:
This suggests a very specific way to approach performance optimization. There is no point breaking out the code profiler and trying to shave off CPU cycles: we are already processing the packets faster than they arrive from the I/O sources. However, we should look closely at why we are receiving packets in batches like this and how we can influence it (could it be a feedback loop involving our own internal batching too?).
Cool to build tools! I don't think this visualization is quite the right one but it seems to be on the right track. (I did find an absolutely gorgeous example of showing every quantile value in a band but it's a bit hard to reproduce.)
I am deep down the Timeline (#916) analysis and visualization rabbit hole at the moment. I am close to having some standard visualizations to "productionize" so that they are easy to generate & always available from CI.
Here is the current draft of the first one: an overview of how a Snabb process has been behaving over time. The graph may seem overwhelming due to the number of metrics but my hope is that on closer inspection they are easy to understand and relate. (Click to zoom.)
Quick reading:
Can say a few things about this process:
(`packet.free()` is called - not necessarily equal to ingress/egress.)

Overall this looks to me like a Snabb process whose performance is limited by the rate of ingress traffic. If it were busy then we should see more packets/breath (~100) and more microseconds per breath (due to doing more work). It seems we could estimate the point at which we would become CPU bound based on the ~750 cycles per packet cost when packets/breath is looking respectable.
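The CPU-bound estimate is simple arithmetic: divide the core clock rate by the per-packet cycle cost. A minimal sketch, where the 3.5 GHz clock is an assumed figure for illustration (substitute the actual core frequency of the machine):

```python
# Back-of-envelope throughput ceiling from the ~750 cycles/packet cost.
cycles_per_packet = 750      # observed cost when packets/breath is healthy
clock_hz = 3.5e9             # ASSUMED core frequency, for illustration only

max_pps = clock_hz / cycles_per_packet
print(f"estimated ceiling: {max_pps / 1e6:.1f} Mpps per core")
```

At the assumed 3.5 GHz this works out to roughly 4.7 Mpps per core; the real number scales linearly with the actual clock rate.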
I am thinking that this could be a candidate for the first graph that you look at when a Snabb process is misbehaving, and a common view that everybody in the support chain could refer to for troubleshooting and root cause analysis and so on. Then we could have a stock of more specialized graphs e.g. zooming in to individual apps, zooming out to groups of processes, etc.
This graph was automatically generated by a small R program that reads the `shm` folder of a Snabb process directly. The `shm` folder can be captured using e.g. `tar`, either while the process is running or after it terminates (if the `SNABB_SHM_KEEP` environment variable is set). This should be safe to do even on busy production systems.
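As a sketch, the capture step could look like the following. The path is hypothetical — adjust it to wherever your Snabb runtime keeps its per-process `shm` folders — and this uses Python's `tarfile` in place of the `tar` command for a self-contained example:

```python
# Sketch: archive a Snabb process's shm folder for offline analysis.
import tarfile

def capture_shm(shm_dir, archive_path):
    """Pack shm_dir into a gzipped tarball so it can be analyzed later."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(shm_dir, arcname="shm")

# Example (hypothetical path; substitute the real shm location and PID):
# capture_shm("/var/run/snabb/12345", "snabb-shm.tar.gz")
```

Since the counters in `shm` are plain files, reading them while the process runs only races on individual counter values, which is why a live capture is reasonable on production systems.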