snabbco / snabb

Snabb: Simple and fast packet networking

Cache coherence and Virtio-net performance #735

Open lukego opened 8 years ago

lukego commented 8 years ago

I am thinking about the Intel multicore cache coherence protocol and its performance implications for shared memory performance e.g. for Virtio-net I/O. This issue is for capturing ideas.

microbenchmarks paper

One excellent paper I have read for background information is Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture. From the abstract:

A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persevering need for cache coherence. To achieve this, the memory subsystem steadily gains complexity that has evolved to levels beyond comprehension of most application performance analysts.

which seems hard to argue with :-). The authors have done a lot of microbenchmarking, and I found the latency figures they report very interesting (based on a 2.5 GHz CPU).

Those figures tell us that the latency of an L3 cache access can more than double when the L3 cache has to fetch the latest value from the L1/L2 cache of another core. This suggests that high-speed shared-memory I/O protocols may need to take cache coherence states into account for optimal performance.

napkin math and throughput vs latency

Just to put processing throughput and latency into perspective, consider that a Haswell CPU can simultaneously load 64 bytes per cycle and store 32 bytes per cycle. (This is because it is engineered for scientific computing workloads that want to execute a SIMD operation each cycle with two source operands and one destination.) At a 2 GHz clock speed this means you have 1024 Gbps of load bandwidth and 512 Gbps of store bandwidth per core, so you could copy around 3.3 KB of data in the time it takes to wait for one 53 ns cache access. (If you were copying a 64-byte packet then you'd have spent 98% of your time waiting and 2% copying - a different perspective on #648.)
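
To make the napkin math concrete, here is a tiny sketch that just spells out the arithmetic (the 2 GHz clock, 64 B/cycle load, 32 B/cycle store, and 53 ns access figures are the assumptions from above):

```c
/* Napkin math from the paragraph above, spelled out.
 * Assumes a 2 GHz core, 64 B/cycle load and 32 B/cycle store bandwidth,
 * and one 53 ns coherence-miss cache access. */
#include <stdio.h>

int main(void)
{
    double ghz        = 2.0;                   /* core clock (GHz)          */
    double load_bw    = 64 * ghz * 1e9;        /* bytes/second loaded       */
    double store_bw   = 32 * ghz * 1e9;        /* bytes/second stored       */
    double latency_s  = 53e-9;                 /* one expensive access      */
    double copy_bytes = store_bw * latency_s;  /* copying is store-limited  */

    printf("load  bandwidth: %.0f Gbit/s\n", load_bw  * 8 / 1e9);  /* 1024 */
    printf("store bandwidth: %.0f Gbit/s\n", store_bw * 8 / 1e9);  /*  512 */
    printf("bytes copyable during one 53 ns access: %.0f\n", copy_bytes); /* ~3392 */
    printf("64-byte packet: %.1f%% copying, %.1f%% waiting\n",
           100.0 * 64 / copy_bytes, 100.0 * (1 - 64 / copy_bytes));
    return 0;
}
```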

This suggests to me that in the context of #710 it is worth doing some detailed profiling with the PMU to see what kind of cache accesses are really happening in our Virtio-net interactions. It is possible that we are paying heavy latency costs when one core is writing data into its L1 cache and then another core is accessing this immediately afterwards. It may be possible to do something creative to reduce the latency (e.g. drop data from sender's L1/L2 cache immediately) or manage it (e.g. know which memory accesses are expensive and find a way to parallelize them).

I would also like to review the microbenchmarks from the paper above and see how closely they resemble our Virtio-net code.

Open questions

Here is what I am wondering next:

I learned two more interesting things from the paper above and chasing its references:

  1. Xeon CPUs with > 8 cores are internally NUMA. That is, the L3 cache is shared by at most 8 cores, and if there are more than 8 cores then there will be two L3 caches. This is ordinarily hidden from the operating system.
  2. Xeon CPUs reduce their clock speed to run AVX instructions. This is like a reverse turbo: AVX instructions cannot run at the processor's base clock speed and have to clock down by ~20%. If you read the fine print, processors have a separate base clock speed for AVX instructions.
lukego commented 8 years ago

cc @xrme

ghost commented 8 years ago

The issue with the cache hierarchy has been bothering me for a long time too. I completely share the thoughts expressed here and appreciate the paper you pointed to. And it seems that the situation is even worse than I had imagined.

I would say that this is not only a Virtio-net problem. The moment SnabbSwitch "parallelises" itself, it will become its problem too. On the other hand, since the virtio-net driver is now in master, how are those caching issues affecting the KVM VMs?

There is a nice tool called hwloc-ls (part of the hwloc package in Ubuntu) which prints the topology: NUMA nodes, CPU cores, L1/L2/L3 caches and PCI allocation. hwloc-ps and hwloc-distances could be useful too. Another useful tool is likwid, which also provides a number of microbenchmarks.

One note from the paper: an optional Cluster-on-Die (COD) mode can be enabled in the BIOS. So the "internal NUMA" can be shown to the world, but instead of 8+4 (for a 12-core CPU) it will present itself as 6+6. So again, far from perfect. And then: "The asymmetrical chip layout which is mapped to a balanced NUMA topology causes performance variations." I could say that we have seen this many, many times.

xrme commented 8 years ago

I expect that it will be important for multi-process Snabb Switch to manage (via taskset or whatever) which cores the cooperating Snabb Switch processes run on. They will almost certainly need to be in the same NUMA domain.
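
For illustration, here is a minimal sketch of pinning a process to a specific core from C with sched_setaffinity (the core index is just an example; in practice you would pick cores within one NUMA domain, e.g. based on hwloc output):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);   /* hypothetical core choice; pick cores in one NUMA domain */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 2\n");
    return 0;
}
```

This is equivalent to launching the process under `taskset -c 2` on the command line.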

As I understand it, the cache line is the unit of storage that gets moved around between cores. As you probably know, that's the source of the false sharing problem. (That's where two different cores are writing to different parts of the same cache line, obliging the system to keep transferring the dirty line back and forth, even though process 1 may only be interested in byte 0 of the line, and process 2 only interested in byte 63.)

wingo commented 8 years ago

IIRC L2 can only have 16 outstanding misses on Haswell: http://www.realworldtech.com/haswell-cpu/5/

I think this could be the throughput bottleneck when working on data outside of L2.

lukego commented 8 years ago

Nikolay: True - this will be even more critical in the Snabb internal parallelism context. @xrme is working on this aspect right now i.e. taking the multiprocess design and doing a low-level implementation geared towards 100G with software dispatching. Could be for example that a software packet dispatcher needs to inspect payload using cache-bypassing instructions (MOVNTDQA and so on) in order to avoid populating its own L1 cache with data that will be immediately accessed by another core.
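
A minimal sketch of what that could look like with intrinsics (`_mm_stream_load_si128` compiles to MOVNTDQA; note that the non-temporal hint is only guaranteed to bypass the cache hierarchy for write-combining memory, so whether it helps here would need measuring rather than assuming):

```c
/* Sketch: peek at a packet header with a non-temporal (streaming) load.
 * _mm_stream_load_si128 compiles to MOVNTDQA (requires SSE4.1).
 * The non-temporal hint only reliably bypasses the caches on
 * write-combining memory, so treat this as an experiment to profile. */
#include <immintrin.h>

/* Return the first 16 bytes of a packet without (ideally) pulling the
 * rest of the payload into this core's L1/L2. 'pkt' must be 16-byte aligned. */
static inline __m128i peek_header(const void *pkt)
{
    return _mm_stream_load_si128((__m128i *)pkt);
}
```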

@xrme Yes. False sharing is an interesting case: I suppose we would expect to see that today if two cores were to access the same struct link, since each one updates different fields (read vs write cursor) that happen to be on the same cache line, so they might frequently block on synchronous RFO (Read For Ownership) requests to each other to take over the line. Likely we will need to add some padding so that each core updates a different cache line. (This is what Virtio-net does with its shared memory ring buffers IIRC.)
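
A minimal sketch of the padding idea, with illustrative field names rather than the actual struct link layout:

```c
/* Sketch: keep producer- and consumer-owned cursors on separate cache
 * lines so each core only dirties its own line (no false sharing).
 * Field names are illustrative, not the real Snabb 'struct link'. */
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64

struct ring_cursors {
    alignas(CACHE_LINE) volatile uint64_t write;  /* updated by the sender core   */
    alignas(CACHE_LINE) volatile uint64_t read;   /* updated by the receiver core */
};
```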

Overall I feel like the only hope to make sense of this stuff is with profiling e.g. using the PMU to isolate exactly what is happening and take deliberate (non-voodoo :-)) corrective measures. Could be that it is not really so difficult to optimize these aspects once we have the right mental model and the right tools. The PMU is able to count events like RFO requests on the cache coherence protocol and so on.
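
As a starting point, here is a hedged sketch of counting a cache event around a region of interest via the Linux perf_event_open interface. It counts generic LLC read misses as a stand-in; the precise RFO/HITM coherence events are model-specific raw event codes that I have not filled in here.

```c
/* Count last-level-cache read misses around a region of interest
 * using perf_event_open. Error handling kept minimal for brevity. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* Measure the calling thread on any CPU. */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    /* LLC | read op | miss result, encoded as the perf cache-event config. */
    int fd = perf_open(PERF_TYPE_HW_CACHE,
                       PERF_COUNT_HW_CACHE_LL |
                       (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                       (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... workload under test, e.g. touching the Virtio-net ring ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("LLC read misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```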

lukego commented 8 years ago

@wingo Interesting observation. Overall it seems like there are two ways to deal with latency: minimize/shorten it or parallelize/amortize it. It could be that both are needed, if there are fundamental bounds on what is possible with each approach.