snabbco / snabb

Snabb: Simple and fast packet networking

NFV appears to allocate #951

Open wingo opened 8 years ago

wingo commented 8 years ago

When I run perf record -p PID against a snabb-nfv process and then look at the result with perf report, I see:

   6.09%  snabb    snabb              [.] lj_BC_TGETS
   5.62%  snabb    perf-546.map       [.] 0x000000000fa7face
   5.06%  snabb    perf-546.map       [.] 0x000000000fa7eaf5
   4.20%  snabb    snabb              [.] lj_meta_tget
   2.94%  snabb    snabb              [.] lj_cdata_index
   2.69%  snabb    snabb              [.] lj_alloc_free
   2.42%  snabb    snabb              [.] lj_BC_JLOOP
   2.33%  snabb    snabb              [.] lj_BC_CALL
   2.02%  snabb    snabb              [.] lj_tab_getstr
   2.00%  snabb    snabb              [.] lj_tab_get
   1.92%  snabb    snabb              [.] lj_ctype_getfieldq
   1.86%  snabb    snabb              [.] lj_BC_RET1
   1.84%  snabb    snabb              [.] gc_sweep
   1.69%  snabb    snabb              [.] lj_cconv_ct_ct
   1.65%  snabb    snabb              [.] lj_cf_ffi_meta___index
   1.64%  snabb    snabb              [.] lj_carith_op
   1.57%  snabb    libc-2.21.so       [.] __memmove_avx_unaligned
   1.29%  snabb    snabb              [.] lj_BC_FUNCC
   1.27%  snabb    snabb              [.] lj_alloc_malloc
   1.26%  snabb    snabb              [.] lj_cont_ra
   1.08%  snabb    snabb              [.] lj_BC_RET0
   1.05%  snabb    snabb              [.] lj_BC_UGET
   1.04%  snabb    snabb              [.] lj_tab_setstr
   1.04%  snabb    snabb              [.] cksum_generic
   0.93%  snabb    snabb              [.] lj_BC_JFUNCF
   0.90%  snabb    snabb              [.] release_unused_segments
   0.89%  snabb    snabb              [.] propagatemark
   0.88%  snabb    snabb              [.] lj_vm_exit_interp
   0.88%  snabb    snabb              [.] lj_vm_call_dispatch_f
   0.88%  snabb    snabb              [.] lj_meta_lookup
   0.86%  snabb    snabb              [.] lj_ctype_intern
   0.84%  snabb    snabb              [.] lj_BC_TGETV
   0.78%  snabb    snabb              [.] lj_tab_newkey
   0.71%  snabb    snabb              [.] lj_cont_dispatch

The gc_sweep and various lj_alloc entries indicate to me that the NFV is allocating somewhere. I am also surprised to see a non-compiled (interpreter bytecode) entry at the top. If we can trust this data, it indicates that there is a source of allocations in the NFV that needs to be eliminated in order to eliminate GC jitter.
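
For reference, here is a minimal sketch of how one might confirm per-call allocation from the Lua side, using only LuaJIT's standard collectgarbage("count") API (the measure_garbage helper and the wrapped function are hypothetical, not existing Snabb code):

    -- Hypothetical helper: report how many KB of GC-managed memory a call
    -- leaves behind.  collectgarbage("count") returns the current heap size
    -- in kilobytes, so growth across the call is newly allocated garbage.
    local function measure_garbage (f, ...)
       collectgarbage()                       -- settle the heap first
       local before = collectgarbage("count")
       f(...)
       local after = collectgarbage("count")
       return after - before                  -- KB allocated by this call
    end

    -- Usage (hypothetical suspect function):
    -- print(("pull() allocated %.2f KB"):format(measure_garbage(some_app.pull, some_app)))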

wingo commented 8 years ago

I note that when I run perf record on a lwaftr process, for comparison, I see something more like:

   7.32%  snabb    perf-1768.map      [.] 0x000000000bc998cd
   6.32%  snabb    perf-1768.map      [.] 0x000000000bc98fcd
   6.11%  snabb    perf-1768.map      [.] 0x000000000bc9cce6
   5.34%  snabb    perf-1768.map      [.] 0x000000000bc97782
   4.63%  snabb    snabb              [.] lj_BC_TGETS
   2.25%  snabb    snabb              [.] lj_tab_len
   1.97%  snabb    perf-1768.map      [.] 0x000000000bc896c6
   1.66%  snabb    snabb              [.] lj_meta_tget
   1.33%  snabb    snabb              [.] lj_tab_get
   1.16%  snabb    snabb              [.] lj_ctype_getfieldq
   0.99%  snabb    snabb              [.] lj_BC_JLOOP
   0.97%  snabb    snabb              [.] lj_cdata_index
   0.81%  snabb    snabb              [.] full_memory_barrier
   0.79%  snabb    snabb              [.] lj_tab_getinth
   0.74%  snabb    snabb              [.] lj_BC_CALL
   0.74%  snabb    snabb              [.] lj_tab_getstr
   0.59%  snabb    libm-2.21.so       [.] __ieee754_log_avx
   0.57%  snabb    perf-1768.map      [.] 0x000000000bca13ef
   0.56%  snabb    perf-1768.map      [.] 0x000000000bca13bf
   0.49%  snabb    snabb              [.] lj_cf_ffi_meta___index
   0.43%  snabb    [vdso]             [.] __vdso_clock_gettime
   0.43%  snabb    perf-1768.map      [.] 0x000000000bc9493c
   0.42%  snabb    perf-1768.map      [.] 0x000000000bca140b
   0.40%  snabb    snabb              [.] lj_BC_JFUNCF
   0.39%  snabb    perf-1768.map      [.] 0x000000000bca149e
   0.39%  snabb    snabb              [.] lj_vm_exit_interp
   0.38%  snabb    libc-2.21.so       [.] __memmove_avx_unaligned
   0.37%  snabb    snabb              [.] lj_cconv_ct_ct
   0.37%  snabb    perf-1768.map      [.] 0x000000000bca142e
   0.36%  snabb    perf-1768.map      [.] 0x000000000bc94931
   0.31%  snabb    perf-1768.map      [.] 0x000000000bca13a8
   0.31%  snabb    perf-1768.map      [.] 0x000000000bc9243c
   0.30%  snabb    snabb              [.] lj_vmeta_call
   0.30%  snabb    snabb              [.] lj_cont_ra

wingo commented 8 years ago

I don't know whether to trust this data. The lj_BC_* entries would seem to be bytecode handlers, which would mean that both processes are running interpreted a significant portion of the time!
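
To check that from the Lua side, one option is the standard LuaJIT introspection modules; a sketch, assuming we can run a snippet inside the same LuaJIT state (jit.status and jit.v are stock LuaJIT, nothing Snabb-specific):

    local jit = require("jit")
    -- jit.status() returns whether the compiler is enabled, followed by the
    -- active optimization flags.
    print("JIT enabled:", jit.status())

    -- jit.v logs one line per trace event (compilation, abort) to a file;
    -- frequent aborts in the hot path would explain time spent in the
    -- lj_BC_* interpreter handlers.
    require("jit.v").start("trace.log")
    -- ... run the workload, then inspect trace.log ...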

lukego commented 8 years ago

So! This is an awesome opportunity to apply our new Hydra-based benchmarking tools to a real performance problem. First time! Exciting :-)

Here is a setup that could work for us:

@domenkozar Could you hook up a setup like this, pretty please? :smile: If master is not suitable as a baseline at the moment because the matrix tests depend on new code, then we could pick whatever branch works.

wingo commented 8 years ago

Test plan sounds great :) It will be nice to be able to evaluate changes in a more scientific way. Looking forward to the benchmarking game :)

domenkozar commented 8 years ago

I've managed to prepare most of what's needed. I'm waiting for our existing matrix to build successfully, and then I'll create another one for this issue.

I expect to have the matrix ready tomorrow morning, then it's just a matter of generating reports.

domenkozar commented 8 years ago

Here we go: https://hydra.snabb.co/eval/2033

lukego commented 8 years ago

@domenkozar Awesome!

I downloaded the CSV from Hydra and made a report in RStudio: http://rpubs.com/lukego/192632.

I also sent the report source to you via PR snabblab/snabblab-nixos#50. Could you please hook that into the build? I suppose that we will want to incrementally add more reports and refinements to the matrix tests. Hydra would seem to support that quite well.

Here are my notes:

So @wingo, can you give more tips on how to reproduce? The ideal would be to add the lwaftr benchmark to the test matrix with help from @domenkozar. Otherwise we need to identify a better proxy for the problem you are seeing, e.g. switching to a newer QEMU version that shows problematic results in this test run.

This test matrix seems like it will be a fantastic tool for identifying bugs and incompatibilities. It is going to be interesting to prioritize which problems to dig into first...

lukego commented 8 years ago

@domenkozar Could we please add a couple more branches to this jobset? I'm thinking it would be neat to compare:

So the first step could be to add the snabbco/next and lukego/optimize branches (and perhaps other people's, e.g. wingo/optimize, if they create those branches first).

I realize that I could create my own Hydra jobset that runs the tests from the branches I care about, and that will probably become common practice for maintainers testing their own branches in the ways they most care about, but for now I figured it is nice to be social and work this out together in one place :).

wingo commented 8 years ago

This particular bug was not filed because of the bimodal performance I was seeing: it was filed because perf indicates that there are opportunities for optimization around allocation.

The bimodal performance bug is somewhat hard to reproduce. FWIW, this is the QEMU version I use:

    $ qemu-system-x86_64 --version
    QEMU emulator version 2.4.0, Copyright (c) 2003-2008 Fabrice Bellard

I invoke qemu like this:

    sudo numactl -m ${NUMA_NODE} taskset -c ${QEMU_CORE} qemu-system-x86_64 \
        --enable-kvm -name ${NAME} -drive file=${DISK},if=virtio \
        -m ${MEM} -cpu host \
        -fsdev local,security_model=passthrough,id=fsdev0,path=${SHARED_LOCATION} \
            -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=share,rombar=0 \
        -object memory-backend-file,prealloc=yes,id=mem,size=${MEM},mem-path=/dev/hugepages,share=on \
        -numa node,memdev=mem \
        -chardev socket,id=char1,path=${VHU_SOCK1},server \
            -netdev type=vhost-user,id=net0,chardev=char1 \
            -device virtio-net-pci,netdev=net0,addr=0x8,mac=52:54:00:00:00:01 \
        -chardev socket,id=char2,path=${VHU_SOCK2},server \
            -netdev type=vhost-user,id=net1,chardev=char2 \
            -device virtio-net-pci,netdev=net1,addr=0x9,mac=52:54:00:00:00:02 \
        -netdev type=tap,id=net2,ifname=${IFACE},vhost=on,script=${NETSCRIPT} \
            -device virtio-net-pci,netdev=net2,addr=0xa,mac=52:54:00:00:00:03 \
        -vnc :${VNC_DISPLAY} &

And then inside the VM I run the lwAFTR. Note that we are running these tests from an old branch because we needed some NFV fixes regarding negotiation and support for not enabling MRG_RXBUF. We are still working on porting forward to master. I would recommend not working on the bimodal performance bug until we have been able to merge forward; that should happen in the next week or so.

domenkozar commented 8 years ago

I've included the report: https://hydra.snabb.co/build/81154/download/2/report.html

lukego commented 8 years ago

@wingo True, I am using this issue to respond to a bimodal performance issue that was really raised on Slack rather than here.

On the topic of this specific bug: in your perf recording, are you capturing only the behavior while processing traffic, or also e.g. the interval spent spinning in the idle loop while waiting for the VM to start and connect? I see a risk of "optimizing the idle loop" if perf record is taking samples beyond the real test interval.
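
One way to keep the samples inside the real test interval is to bound the engine run itself; a sketch, assuming the workload is driven through Snabb's engine.main (the duration option is the standard core.app API; the surrounding setup is omitted):

    -- Run the engine for a fixed window once the VM is connected and traffic
    -- is flowing, so a perf record started just before this call samples
    -- mostly packet processing rather than startup or idle time.
    local engine = require("core.app")
    -- ... app network configured, vhost-user sockets connected (not shown) ...
    engine.main({duration = 30})   -- process traffic for 30 seconds, then return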

wingo commented 8 years ago

lukego: I attach to a running snabb nfv process when it is processing traffic, so I think those perf runs are measuring the right thing.