snabbco / snabb

Snabb: Simple and fast packet networking

NFV appears to allocate #951

Open wingo opened 8 years ago

wingo commented 8 years ago

When I run perf record -p PID against a snabb-nfv process and then look at the result with perf report, I see:

   6.09%  snabb    snabb              [.] lj_BC_TGETS
   5.62%  snabb    perf-546.map       [.] 0x000000000fa7face
   5.06%  snabb    perf-546.map       [.] 0x000000000fa7eaf5
   4.20%  snabb    snabb              [.] lj_meta_tget
   2.94%  snabb    snabb              [.] lj_cdata_index
   2.69%  snabb    snabb              [.] lj_alloc_free
   2.42%  snabb    snabb              [.] lj_BC_JLOOP
   2.33%  snabb    snabb              [.] lj_BC_CALL
   2.02%  snabb    snabb              [.] lj_tab_getstr
   2.00%  snabb    snabb              [.] lj_tab_get
   1.92%  snabb    snabb              [.] lj_ctype_getfieldq
   1.86%  snabb    snabb              [.] lj_BC_RET1
   1.84%  snabb    snabb              [.] gc_sweep
   1.69%  snabb    snabb              [.] lj_cconv_ct_ct
   1.65%  snabb    snabb              [.] lj_cf_ffi_meta___index
   1.64%  snabb    snabb              [.] lj_carith_op
   1.57%  snabb    libc-2.21.so       [.] __memmove_avx_unaligned
   1.29%  snabb    snabb              [.] lj_BC_FUNCC
   1.27%  snabb    snabb              [.] lj_alloc_malloc
   1.26%  snabb    snabb              [.] lj_cont_ra
   1.08%  snabb    snabb              [.] lj_BC_RET0
   1.05%  snabb    snabb              [.] lj_BC_UGET
   1.04%  snabb    snabb              [.] lj_tab_setstr
   1.04%  snabb    snabb              [.] cksum_generic
   0.93%  snabb    snabb              [.] lj_BC_JFUNCF
   0.90%  snabb    snabb              [.] release_unused_segments
   0.89%  snabb    snabb              [.] propagatemark
   0.88%  snabb    snabb              [.] lj_vm_exit_interp
   0.88%  snabb    snabb              [.] lj_vm_call_dispatch_f
   0.88%  snabb    snabb              [.] lj_meta_lookup
   0.86%  snabb    snabb              [.] lj_ctype_intern
   0.84%  snabb    snabb              [.] lj_BC_TGETV
   0.78%  snabb    snabb              [.] lj_tab_newkey
   0.71%  snabb    snabb              [.] lj_cont_dispatch

The gc_sweep and various lj_alloc entries indicate to me that the NFV is allocating somewhere. I am also surprised to see a non-compiled (interpreter bytecode) entry at the top. If we can trust this data, it indicates that there is a source of allocations in the NFV that needs to be eliminated in order to eliminate GC jitter.
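
For reference, here is a minimal sketch of how one might confirm per-call allocation from the Lua side, using only LuaJIT's standard collectgarbage("count") API (the measure_garbage helper and the wrapped function are hypothetical, not existing Snabb code):

    -- Hypothetical helper: report how many KB of GC-managed memory a call
    -- leaves behind.  collectgarbage("count") returns the current heap size
    -- in kilobytes, so growth across the call is newly allocated garbage.
    local function measure_garbage (f, ...)
       collectgarbage()                       -- settle the heap first
       local before = collectgarbage("count")
       f(...)
       local after = collectgarbage("count")
       return after - before                  -- KB allocated by this call
    end

    -- Usage (hypothetical suspect function):
    -- print(("pull() allocated %.2f KB"):format(measure_garbage(some_app.pull, some_app)))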

wingo commented 8 years ago

I note that when I run perf record on a lwaftr process, for comparison, I see something more like:

   7.32%  snabb    perf-1768.map      [.] 0x000000000bc998cd
   6.32%  snabb    perf-1768.map      [.] 0x000000000bc98fcd
   6.11%  snabb    perf-1768.map      [.] 0x000000000bc9cce6
   5.34%  snabb    perf-1768.map      [.] 0x000000000bc97782
   4.63%  snabb    snabb              [.] lj_BC_TGETS
   2.25%  snabb    snabb              [.] lj_tab_len
   1.97%  snabb    perf-1768.map      [.] 0x000000000bc896c6
   1.66%  snabb    snabb              [.] lj_meta_tget
   1.33%  snabb    snabb              [.] lj_tab_get
   1.16%  snabb    snabb              [.] lj_ctype_getfieldq
   0.99%  snabb    snabb              [.] lj_BC_JLOOP
   0.97%  snabb    snabb              [.] lj_cdata_index
   0.81%  snabb    snabb              [.] full_memory_barrier
   0.79%  snabb    snabb              [.] lj_tab_getinth
   0.74%  snabb    snabb              [.] lj_BC_CALL
   0.74%  snabb    snabb              [.] lj_tab_getstr
   0.59%  snabb    libm-2.21.so       [.] __ieee754_log_avx
   0.57%  snabb    perf-1768.map      [.] 0x000000000bca13ef
   0.56%  snabb    perf-1768.map      [.] 0x000000000bca13bf
   0.49%  snabb    snabb              [.] lj_cf_ffi_meta___index
   0.43%  snabb    [vdso]             [.] __vdso_clock_gettime
   0.43%  snabb    perf-1768.map      [.] 0x000000000bc9493c
   0.42%  snabb    perf-1768.map      [.] 0x000000000bca140b
   0.40%  snabb    snabb              [.] lj_BC_JFUNCF
   0.39%  snabb    perf-1768.map      [.] 0x000000000bca149e
   0.39%  snabb    snabb              [.] lj_vm_exit_interp
   0.38%  snabb    libc-2.21.so       [.] __memmove_avx_unaligned
   0.37%  snabb    snabb              [.] lj_cconv_ct_ct
   0.37%  snabb    perf-1768.map      [.] 0x000000000bca142e
   0.36%  snabb    perf-1768.map      [.] 0x000000000bc94931
   0.31%  snabb    perf-1768.map      [.] 0x000000000bca13a8
   0.31%  snabb    perf-1768.map      [.] 0x000000000bc9243c
   0.30%  snabb    snabb              [.] lj_vmeta_call
   0.30%  snabb    snabb              [.] lj_cont_ra

wingo commented 8 years ago

I don't know whether to trust this data. The lj_BC_* entries would seem to be bytecode handlers, which would mean that both processes are running interpreted a significant portion of the time!
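
To check that from the Lua side, one option is the standard LuaJIT introspection modules; a sketch, assuming we can run a snippet inside the same LuaJIT state (jit.status and jit.v are stock LuaJIT, nothing Snabb-specific):

    local jit = require("jit")
    -- jit.status() returns whether the compiler is enabled, followed by the
    -- active optimization flags.
    print("JIT enabled:", jit.status())

    -- jit.v logs one line per trace event (compilation, abort) to a file;
    -- frequent aborts in the hot path would explain time spent in the
    -- lj_BC_* interpreter handlers.
    require("jit.v").start("trace.log")
    -- ... run the workload, then inspect trace.log ...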

lukego commented 8 years ago

So! This is an awesome opportunity to apply our new Hydra-based benchmarking tools to a real performance problem. First time! Exciting :-)

Here is a setup that could work for us:

@domenkozar Could you hook up a setup like this, pretty please? :smile: If master is not suitable as a baseline at the moment because the matrix tests depend on new code, then we could pick whatever branch works.

wingo commented 8 years ago

Test plan sounds great :) It will be nice to be able to evaluate changes in a more scientific way. Looking forward to the benchmarking game :)

domenkozar commented 8 years ago

I've managed to prepare most of what's needed. I'm waiting for our existing matrix to build successfully, and then I'll create another one for this issue.

I expect to have the matrix ready tomorrow morning, then it's just a matter of generating reports.

domenkozar commented 8 years ago

Here we go: https://hydra.snabb.co/eval/2033

lukego commented 8 years ago

@domenkozar Awesome!

I downloaded the CSV from Hydra and made a report in RStudio: http://rpubs.com/lukego/192632.

I also sent the report source to you via PR snabblab/snabblab-nixos#50. Could you please hook that into the build? I suppose that we will want to incrementally add more reports and refinements to the matrix tests. Hydra would seem to support that quite well.

Here are my notes:

So @wingo, can you give more tips on how to reproduce? The ideal would be to add the lwaftr benchmark to the test matrix with help from @domenkozar. Otherwise we need to identify a better proxy for the problem you are seeing, e.g. switching to a newer QEMU version that shows problematic results in this test run.

This test matrix seems like it will be a fantastic tool for identifying bugs and incompatibilities. It is going to be interesting to prioritize which problems to dig into first...

lukego commented 8 years ago

@domenkozar Could we please add a couple more branches to this jobset? I'm thinking it would be neat to compare:

So the first step could be to add the snabbco/next and lukego/optimize branches (and perhaps other people's, e.g. wingo/optimize, if they create those branches first).

I realize that I could create my own Hydra jobset that runs the tests from the branches I care about, and that will probably become common practice for maintainers testing their own branches in the ways they most care about, but for now I figured it is nice to be social and work this out together in one place :).

wingo commented 8 years ago

This particular bug was not filed because of the bimodal performance I was seeing: it was filed because perf indicates that there are opportunities for optimization around allocation.

The bimodal performance bug is somewhat hard to reproduce. FWIW, this is the QEMU version I use:

    $ qemu-system-x86_64 --version
    QEMU emulator version 2.4.0, Copyright (c) 2003-2008 Fabrice Bellard

I invoke qemu like this:

    sudo numactl -m ${NUMA_NODE} taskset -c ${QEMU_CORE} qemu-system-x86_64 \
        --enable-kvm -name ${NAME} -drive file=${DISK},if=virtio \
        -m ${MEM} -cpu host \
        -fsdev local,security_model=passthrough,id=fsdev0,path=${SHARED_LOCATION} \
            -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=share,rombar=0 \
        -object memory-backend-file,prealloc=yes,id=mem,size=${MEM},mem-path=/dev/hugepages,share=on \
        -numa node,memdev=mem \
        -chardev socket,id=char1,path=${VHU_SOCK1},server \
            -netdev type=vhost-user,id=net0,chardev=char1 \
            -device virtio-net-pci,netdev=net0,addr=0x8,mac=52:54:00:00:00:01 \
        -chardev socket,id=char2,path=${VHU_SOCK2},server \
            -netdev type=vhost-user,id=net1,chardev=char2 \
            -device virtio-net-pci,netdev=net1,addr=0x9,mac=52:54:00:00:00:02 \
        -netdev type=tap,id=net2,ifname=${IFACE},vhost=on,script=${NETSCRIPT} \
            -device virtio-net-pci,netdev=net2,addr=0xa,mac=52:54:00:00:00:03 \
        -vnc :${VNC_DISPLAY} &

And then inside the VM I run the lwAFTR. Note that we are running these tests from an old branch because we needed some NFV fixes regarding negotiation and support for not enabling MRG_RXBUF. We are still working on porting forward to master. I would recommend not working on the bimodal performance bug until we have been able to merge forward; that should happen in the next week or so.

domenkozar commented 8 years ago

I've included the report: https://hydra.snabb.co/build/81154/download/2/report.html

lukego commented 8 years ago

@wingo True, I am using this issue to respond to a bimodal performance issue that was really raised on Slack rather than here.

On the topic of this specific bug: in your perf recording, are you capturing only the behavior while processing traffic, or also e.g. the interval spent spinning in the idle loop while waiting for the VM to start and connect? I see a risk of "optimizing the idle loop" if perf record is taking samples beyond the real test interval.
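
One way to keep the samples inside the real test interval is to bound the engine run itself; a sketch, assuming the workload is driven through Snabb's engine.main (the duration option is the standard core.app API; the surrounding setup is omitted):

    -- Run the engine for a fixed window once the VM is connected and traffic
    -- is flowing, so a perf record started just before this call samples
    -- mostly packet processing rather than startup or idle time.
    local engine = require("core.app")
    -- ... app network configured, vhost-user sockets connected (not shown) ...
    engine.main({duration = 30})   -- process traffic for 30 seconds, then return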

wingo commented 8 years ago

lukego: I attach to a running snabb nfv process when it is processing traffic, so I think those perf runs are measuring the right thing.