lukego opened 8 years ago
cc @xrme @kbara in case this is handy as a reference one day.
I ran this with `sudo taskset -c 0 ./snabb snsh ./snabbmark_pmu.lua`. Actually running this in the lab depends on https://github.com/snabblab/snabblab-nixos/issues/23; I used a sneaky way to enable the msr module for PMU on lugano-1, like this:

    $ nix-env -i module-init-tools
    $ sudo insmod /nix/store/gxk7qq6s8hx2c20dch634xf2bh5hx6l9-linux-4.3.3/lib/modules/4.3.3/kernel/arch/x86/kernel/msr.ko
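
For reference, here is a quick way to sanity-check that step from Lua: loading `msr.ko` is what creates `/dev/cpu/<n>/msr`, which the PMU code needs in order to program the counters. This is just a sketch (run it as root, like the `insmod` above):

```lua
-- Sketch: verify that the msr module is loaded by trying to open the
-- per-CPU MSR device that it creates. Needs root, like the insmod above.
local f = io.open("/dev/cpu/0/msr", "rb")
if f then
   print("ok: /dev/cpu/0/msr is accessible")
   f:close()
else
   print("msr module not loaded (or not running as root)?")
end
```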
Any reason you run it on CPU 0? That CPU gets the most work from the kernel since it handles all the interrupts.
Finger memory! Looking forward to merging and then following the new recommendations from you and @kbara :)
I braindumped some thoughts on lukego/blog#15. Basically I am wondering whether the PMU can work as a wireshark for catching expensive MESIF interactions (or at least as a `netstat -s`). It seems to me that all performance issues around multicore are likely to be interactions with the L3 cache.
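
As a very rough sketch of that `netstat -s` idea: periodically snapshot the cross-snoop counters and print the deltas. (Hypothetical code: the `pmu.setup` / `new_counter_set` / `switch_to` / `to_table` names are my reading of `lib/pmu.lua` and should be double-checked, and `do_work()` is a placeholder for the real workload.)

```lua
-- Hypothetical "netstat -s for the L3": print per-interval deltas of the
-- cross-snoop counters. The pmu function names here are assumptions.
local pmu = require("lib.pmu")

pmu.setup({"mem_load_uops_l3_hit_retired"})  -- select the snoop events
local set = pmu.new_counter_set()
pmu.switch_to(set)                           -- start counting into 'set'
local prev = pmu.to_table(set)

for interval = 1, 10 do
   do_work()                                 -- placeholder workload
   local now = pmu.to_table(set)
   for event, count in pairs(now) do
      print(("%-45s %12d"):format(event, count - (prev[event] or 0)))
   end
   prev = now
end
```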
Reflection from lukego/blog#15: I suspect that L3 cache access is the key thing to optimize for any multiprocess application (including Snabb+VM over virtio-net). The L3 cache is "the network" that connects the cores together and has much higher latency than anything they do locally.
Here is a side-by-side comparison of some L3 PMU counters for the `mp-ring` benchmark (#804, #809), comparing single-process operation (left) with multi-process operation (right):

    [luke@lugano-1:~/git/snabbswitch/src]$ pr -w 160 -m -t a b
    Benchmark configuration:                                                   Benchmark configuration:
    burst: 100                                                                 burst: 100
    writebytes: 0                                                              writebytes: 0
    processes: 1                                                               processes: 2
    readbytes: 0                                                               readbytes: 0
    packets: 10000000                                                          packets: 10000000
    mode: basic                                                                mode: basic
    pmuevents: mem_load_uops_l3                                                pmuevents: mem_load_uops_l3
    69.20 Mpps ring throughput per process                                     5.26 Mpps ring throughput per process
    PMU report for child #0:                                                   PMU report for child #0:
    EVENT                                               TOTAL   /packet        EVENT                                               TOTAL   /packet
    cycles                                        500,751,679    50.075        cycles                                      6,651,599,204   665.160
    ref_cycles                                              0     0.000        ref_cycles                                              0     0.000
    instructions                                  562,015,029    56.202        instructions                                  593,294,818    59.329
    mem_load_uops_l3_hit_retired.xsnp_hit                 447     0.000        mem_load_uops_l3_hit_retired.xsnp_hit             432,740     0.043
    mem_load_uops_l3_hit_retired.xsnp_hitm                  4     0.000        mem_load_uops_l3_hit_retired.xsnp_hitm          9,410,279     0.941
    mem_load_uops_l3_hit_retired.xsnp_miss                280     0.000        mem_load_uops_l3_hit_retired.xsnp_miss              9,189     0.001
    mem_load_uops_l3_hit_retired.xsnp_none                567     0.000        mem_load_uops_l3_hit_retired.xsnp_none             11,856     0.001
    mem_load_uops_l3_miss_retired.local_dram                0     0.000        mem_load_uops_l3_miss_retired.local_dram                0     0.000
    mem_load_uops_l3_miss_retired.remote_dram               0     0.000        mem_load_uops_l3_miss_retired.remote_dram               0     0.000
    mem_load_uops_l3_miss_retired.remote_fwd                0     0.000        mem_load_uops_l3_miss_retired.remote_fwd                0     0.000
    mem_load_uops_l3_miss_retired.remote_hitm               0     0.000        mem_load_uops_l3_miss_retired.remote_hitm               0     0.000
    packet                                         10,000,000     1.000        packet                                         10,000,000     1.000

(also as a gist in case that is easier to read.)
I see two interesting and likely related things here:

- a whopping ~665 cycles per packet in the multi-process run, versus ~50 cycles per packet with a single process;
- roughly one `xsnp_hitm` event per packet. I believe this event means that the L3 cache served a hit from a cache line that was *m*odified in another core.

From the paper cited in #735 I would expect an L3 hitm to have >100 cycles of latency, and that could partly explain the whopping 665 cycles/packet that the benchmark is reporting. So I predict that if somebody is able to identify the source of that hitm and reduce its frequency then that will significantly increase the performance of this benchmark.
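
As a back-of-the-envelope check (the ~100 cycle hitm latency is the assumed lower bound from the #735 paper; the counts are taken from the report above):

```lua
-- Rough arithmetic: how much of the 665 cycles/packet could the hitm
-- events alone account for, assuming ~100 cycles per hitm?
local hitm_per_packet = 9410279 / 10000000    -- ~0.94 xsnp_hitm per packet
local hitm_latency    = 100                   -- cycles, assumed lower bound
print(hitm_per_packet * hitm_latency)         -- ~94 cycles/packet
-- i.e. a noticeable slice of the 665 cycles/packet, but not all of it
```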
Now if we had working PEBS support (#631) hooked into a profiler we could also have a report telling us which instruction is triggering that hitm and which data structure it is accessing at the time. This is firmly on my wishlist for the future.
The PMU is a powerful piece of hardware. We are only scratching the surface of how it can be used. This issue is meant to remind us of this fact and perhaps inspire the next step in our usage.
The idea I came up with is simple: run the `snabbmark basic1` benchmark many times, each time with a different set of PMU events, and each time with the `/packet` counts reported.

Script:

Output:
No interpretation for now :-). However, this seems like an approach that could be used to gather data for more interesting benchmarks (like `mp-ring` in #804) and that could also be analyzed for interesting patterns and differences with an R script (like #755) to help explain results and validate intuitions.
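
For reference, a rough sketch of the loop-over-event-sets idea (this is not the script referenced above; `pmu.profile` and its `(fn, events, aux)` signature are my reading of `lib/pmu.lua`, the event patterns are just examples, and `workload()` stands in for the `snabbmark basic1` loop):

```lua
-- Sketch: run the same workload once per group of PMU event patterns and
-- print a per-packet report for each run.
local pmu = require("lib.pmu")

local event_sets = {
   {"mem_load_uops_l3_hit_retired"},
   {"mem_load_uops_l3_miss_retired"},
   {"br_misp_retired"},                -- example pattern only
}

local npackets = 10e6

local function workload ()
   -- placeholder: process npackets packets, snabbmark basic1 style
end

for _, events in ipairs(event_sets) do
   print("## events:", table.concat(events, " "))
   -- profile() is assumed to run the function with these events enabled and
   -- print an EVENT / TOTAL / per-packet report like the ones above
   pmu.profile(workload, events, {packet = npackets})
end
```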