lukego opened this issue 7 years ago
(This is actually a really awful graphic. However it should serve very well to contrast with a good one. I'm working on that now. The important missing ingredient above is showing the variance.)
So! Please indulge me in some experimenting aloud.
The weakness of the graph above is that it only shows averages over one second. This leaves interesting questions unanswered. For example, what were the typical numbers of packets processed in each breath? The timeline contains detailed information that can be used to answer these questions. The challenge is to find a suitable visualization that shows more about the distribution of values than simply their average.
This is actually quite important in the timeline I am using as an example. The graphic shows that the packets/breath burst size averages around 10-30 packets. However, check out this excerpt from the log entries:
```
   packets bytes  cycles cycles_per_packet
1        6  3048    1869          311.5000
2      166 84328  121259          730.4759
3        0     0    1506               Inf
4       10  5080    2222          222.2000
5        0     0    1317               Inf
6        0     0   48941               Inf
7        8  4064    1814          226.7500
8        0     0    1407               Inf
9        0     0   33948               Inf
10      25 12700    2527          101.0800
11       0     0    1356               Inf
12      14  7112   21985         1570.3571
13       2  1016    1754          877.0000
14       0     0    1415               Inf
15     168 85344   93893          558.8869
16       0     0    1425               Inf
17       1   508   14519        14519.0000
18       2  1016    1751          875.5000
19       0     0    1285               Inf
20       0     0    1534               Inf
21       0     0    1299               Inf
```
Here we see that each breath is different. Most of the time there are no packets available, but occasionally more than 150 arrive at once. So we can speculate that there is some burstiness in our I/O sources (in this case a VM running DPDK l2fwd and a ConnectX-4 100G NIC).
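To make the average-vs-distribution point concrete, here is a quick sketch that summarizes the packets column from the excerpt above (the numbers are transcribed from the 21 log rows, nothing else is assumed):

```python
# packets/breath transcribed from the 21-row log excerpt above.
packets = [6, 166, 0, 10, 0, 0, 8, 0, 0, 25, 0,
           14, 2, 0, 168, 0, 1, 2, 0, 0, 0]

mean = sum(packets) / len(packets)
idle = sum(1 for p in packets if p == 0)  # breaths that found no packets
peak = max(packets)

# The mean lands in the 10-30 range the graph shows, but it hides that
# more than half the breaths do no work while a few carry big bursts.
print(f"mean packets/breath: {mean:.1f}")
print(f"idle breaths: {idle}/{len(packets)}")
print(f"peak burst: {peak}")
```

The mean comes out around 19 packets/breath, yet 11 of 21 breaths are empty and the peaks exceed 160 — exactly the structure the one-second average erases.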
Here is a first attempt to illustrate the distribution of values for each metric:
There is a lot of information here. This puts some things into a different light.

Looking at `packets/breath`:
This suggests a very specific way to approach performance optimization. There is no point breaking out the code profiler and trying to shave off CPU cycles: we are already processing the packets faster than they arrive from the I/O sources. However, we should look closely at why we are receiving packets in batches like this and how we can influence it (could it be a feedback loop involving our own internal batching too?).
Cool to build tools! I don't think this visualization is quite the right one but it seems to be on the right track. (I did find an absolutely gorgeous example of showing every quantile value in a band but it's a bit hard to reproduce.)
I am deep down the Timeline (#916) analysis and visualization rabbit hole at the moment. I am close to having some standard visualizations to "productionize" so that they are easy to generate & always available from CI.
Here is the current draft of the first one: an overview of how a Snabb process has been behaving over time. The graph may seem overwhelming due to the number of metrics but my hope is that on closer inspection they are easy to understand and relate. (Click to zoom.)
Quick reading:
Can say a few things about this process:
(`packet.free()` is called - not necessarily equal to ingress/egress.)

Overall this looks to me like a Snabb process whose performance is limited by the rate of ingress traffic. If it were busy then we should see more packets/breath (~100) and more microseconds per breath (due to doing more work). It seems we could estimate the point at which we would become CPU bound based on the ~750 cycles per packet cost when packets/breath is looking respectable.
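The CPU-bound estimate is simple arithmetic: divide the core clock rate by the per-packet cycle cost. A minimal sketch, where the 3.5 GHz clock is an assumed figure for illustration (substitute the actual core frequency of the machine):

```python
# Back-of-envelope throughput ceiling from the ~750 cycles/packet cost.
cycles_per_packet = 750      # observed cost when packets/breath is healthy
clock_hz = 3.5e9             # ASSUMED core frequency, for illustration only

max_pps = clock_hz / cycles_per_packet
print(f"estimated ceiling: {max_pps / 1e6:.1f} Mpps per core")
```

At the assumed 3.5 GHz this works out to roughly 4.7 Mpps per core; the real number scales linearly with the actual clock rate.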
I am thinking that this could be a candidate for the first graph that you look at when a Snabb process is misbehaving, and a common view that everybody in the support chain could refer to for troubleshooting and root cause analysis and so on. Then we could have a stock of more specialized graphs e.g. zooming in to individual apps, zooming out to groups of processes, etc.
This graph was automatically generated by a small R program that reads the `shm` folder of a Snabb process directly. The `shm` folder can be captured using e.g. `tar`, either while the process is running or after it terminates (if the `SNABB_SHM_KEEP` environment variable is set). This should be safe to do even on busy production systems.
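As a sketch, the capture step could look like the following. The path is hypothetical — adjust it to wherever your Snabb runtime keeps its per-process `shm` folders — and this uses Python's `tarfile` in place of the `tar` command for a self-contained example:

```python
# Sketch: archive a Snabb process's shm folder for offline analysis.
import tarfile

def capture_shm(shm_dir, archive_path):
    """Pack shm_dir into a gzipped tarball so it can be analyzed later."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(shm_dir, arcname="shm")

# Example (hypothetical path; substitute the real shm location and PID):
# capture_shm("/var/run/snabb/12345", "snabb-shm.tar.gz")
```

Since the counters in `shm` are plain files, reading them while the process runs only races on individual counter values, which is why a live capture is reasonable on production systems.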