Open lukego opened 8 years ago
cc @xrme
The issue with the cache hierarchy has been bothering me for a long time too. I completely share the thoughts expressed here and appreciate the pointer to the paper. It seems the situation is even worse than I had imagined.
I would say that this is not only a virtio-net problem: the moment Snabb Switch parallelises itself, it will become its problem too. On the other hand, since the virtio-net driver is now in master, how are those caching issues affecting the KVM VMs?
There is a nice tool called `hwloc-ls` (part of the `hwloc` package in Ubuntu) which prints the topology: NUMA nodes, CPU cores, L1/L2/L3 caches, and PCI allocation. `hwloc-ps` and `hwloc-distances` could be useful too. Another useful tool is `likwid`, which also provides a number of microbenchmarks.
One note from the paper: "An optional Cluster-on-Die (COD) mode can be enabled in the BIOS." So the "internal NUMA" could be shown to the world, but instead of 8+4 (for a 12-core CPU) it will present itself as 6+6. So again, far from perfect. And then: "The asymmetrical chip layout which is mapped to a balanced NUMA topology causes performance variations." I can say that we have seen this many, many times.
I expect that it will be important for multi-process Snabb Switch to manage (via taskset or whatever) which cores the cooperating Snabb Switch processes run on. They will almost certainly need to be in the same NUMA domain.
As I understand it, the cache line is the unit of storage that gets moved around between cores. As you probably know, that's the source of the false sharing problem. (That's where two different cores are writing to different parts of the same cache line, obliging the system to keep transferring the dirty line back and forth, even though process 1 may only be interested in byte 0 of the line, and process 2 only interested in byte 63.)
IIRC the L2 can only have 16 outstanding misses on Haswell: http://www.realworldtech.com/haswell-cpu/5/. I think this could be the throughput bottleneck when working on data outside of L2.
Nikolay: True - this will be even more critical in the Snabb internal parallelism context. @xrme is working on this aspect right now, i.e. taking the multiprocess design and doing a low-level implementation geared towards 100G with software dispatching. It could be, for example, that a software packet dispatcher needs to inspect payload using cache-bypassing instructions (`MOVNTDQA` and so on) in order to avoid populating its own L1 cache with data that will be immediately accessed by another core.
@xrme Yes. False sharing is an interesting case: I suppose we would expect to have that today if two cores accessed the same `struct link`, since each one is updating different fields (`read` vs `write` cursor) that happen to be on the same cache line, and so they might frequently block in synchronous RFO (Read For Ownership) requests to each other to take over the cache line. Likely we will need to add some padding so that each core updates a different cache line. (This is what Virtio-net does with its shared-memory ring buffers, IIRC.)
Overall I feel like the only hope to make sense of this stuff is with profiling e.g. using the PMU to isolate exactly what is happening and take deliberate (non-voodoo :-)) corrective measures. Could be that it is not really so difficult to optimize these aspects once we have the right mental model and the right tools. The PMU is able to count events like RFO requests on the cache coherence protocol and so on.
@wingo Interesting observation. Overall it seems like there are two ways to deal with latency: minimize/shorten it or parallelize/amortize it. Can be that both are needed if there are fundamental bounds on what is possible with each approach.
I am thinking about the Intel multicore cache coherence protocol and its performance implications for shared memory performance e.g. for Virtio-net I/O. This issue is for capturing ideas.
microbenchmarks paper
One excellent paper I have read for background information is Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture. From the abstract:
which seems hard to argue with :-). The authors have done a lot of microbenchmarking and I found these latency figures very interesting (based on 2.5 GHz CPU):
that tells us that the latency of an L3 cache access can be more than doubled if the L3 cache needs to fetch the latest value from the L1/L2 cache of another core. This suggests that high-speed shared memory I/O protocols may need to take cache coherence states into account for optimal performance.
napkin math and throughput vs latency
Just to put processing throughput and latency into perspective, consider that a Haswell CPU can simultaneously load 64 bytes per cycle and store 32 bytes per cycle. (This is because it is engineered for scientific computing workloads that want to execute a SIMD operation each cycle with two source operands and one destination.) On a 2GHz clock speed this means you have 1024 Gbps of load bandwidth and 512 Gbps of store bandwidth per core. This means that you could copy around 3.3KB of data in the time it takes to wait for one 53ns cache access. (If you were copying a 64-byte packet then you'd have spent 98% of your time waiting and 2% copying - a different perspective on #648).
This suggests to me that in the context of #710 it is worth doing some detailed profiling with the PMU to see what kind of cache accesses are really happening in our Virtio-net interactions. It is possible that we are paying heavy latency costs when one core is writing data into its L1 cache and then another core is accessing this immediately afterwards. It may be possible to do something creative to reduce the latency (e.g. drop data from sender's L1/L2 cache immediately) or manage it (e.g. know which memory accesses are expensive and find a way to parallelize them).
I would also like to review the microbenchmarks from the paper above and see how closely they resemble our Virtio-net code.
Open questions
Here is what I am wondering next:
(What does the `CLFLUSH` instruction do on a Haswell core?)
Other notes
I learned two more interesting things from the paper above and chasing its references: