sflow / vpp-sflow

sFlow plugin for VPP
Apache License 2.0

Crash when enabling interface #9

Closed yoitszigi closed 1 month ago

yoitszigi commented 1 month ago

Hi all,

I tried compiling this for our VPP lab setup, but I’m running into a crash whenever I enable an interface. The interface is an X520-DA1 using DPDK, and I’m also using the Linux control plane if that’s relevant.

Here's the log:

Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]: received signal SIGSEGV, PC 0x7fb09d9eb43f, faulting address 0x2b
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]: Code:  8b 40 f8 89 c0 49 39 c4 72 14 eb 85 0f 1f 44 00 00 31 c0 89
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]: #0  0x00007fb09d9eb43f sflow_node_fn_icl + 0x252f
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]:      from /usr/lib/x86_64-linux-gnu/vpp_plugins/sflow_plugin.so
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]: #1  0x00007fb107309137 vlib_exit_with_status + 0x537
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]:      from /lib/x86_64-linux-gnu/libvlib.so.24.10
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]: #2  0x00007fb107295ba8 clib_calljmp + 0x18
Oct 07 20:49:54 vpp-ebn-lab-01.int.pdx.net.uk vpp[1551218]:      from /lib/x86_64-linux-gnu/libvppinfra.so.24.10
Oct 07 20:49:58 vpp-ebn-lab-01.int.pdx.net.uk systemd[1]: vpp.service: Main process exited, code=dumped, status=6/ABRT
Oct 07 20:49:58 vpp-ebn-lab-01.int.pdx.net.uk systemd[1]: vpp.service: Failed with result 'core-dump'.

VPP Version:

vpp# show version 
vpp v24.10-rc0~233-g2fb8d2f96 built by root on vpp-ebn-lab-01.int.pdx.net.uk at 2024-10-07T20:28:22

OS Version: Ubuntu 22.04.5 LTS

Linux vpp-ebn-lab-01.int.pdx.net.uk 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Let me know if anything else is needed from me.

stathis commented 1 month ago

Hi @yoitszigi, I've run into the exact same problem, but it seems to arise only with an odd number of worker cores. With 2, 4, 6, etc. it's fine. Can you try setting an even number of cores?

vpp# show version
vpp v25.02-rc0~28-g911c0fb23 built by root on XX at 2024-10-07T11:30:49

CPU is EPYC 8224P, network card MCX516A-CDAT, latest Debian 12.

yoitszigi commented 1 month ago

> Hi @yoitszigi, I've run into the exact same problem, but it seems to arise only with an odd number of worker cores. With 2, 4, 6, etc. it's fine. Can you try setting an even number of cores?
>
> vpp# show version
> vpp v25.02-rc0~28-g911c0fb23 built by root on XX at 2024-10-07T11:30:49
>
> CPU is EPYC 8224P, network card MCX516A-CDAT, latest Debian 12.

I tried this, but it still seems to crash with the workers running on even-numbered cores. Is your main thread still on core 0?

stathis commented 1 month ago

> I tried this, but it still seems to crash with the workers running on even-numbered cores. Is your main thread still on core 0?

/etc/default/grub:

GRUB_CMDLINE_LINUX="[..] isolcpus=12,13,14,15,16,17,18,19,20,21,22,23"

/etc/vpp/startup.conf:

main-core 12
corelist-workers 13-22

If I set corelist-workers 13-21 (9 cores) or 13-23 (11 cores), I get the same crash trace as you. If I lower it to 6 or 8 cores total, it starts up fine.

Edit: I meant an even number of cores, not even-numbered cores (e.g. core 12).

yoitszigi commented 1 month ago

> > I tried this, but it still seems to crash with the workers running on even-numbered cores. Is your main thread still on core 0?
>
> /etc/default/grub:
>
> GRUB_CMDLINE_LINUX="[..] isolcpus=12,13,14,15,16,17,18,19,20,21,22,23"
>
> /etc/vpp/startup.conf:
>
> main-core 12
> corelist-workers 13-22
>
> If I set corelist-workers 13-21 (9 cores) or 13-23 (11 cores), I get the same crash trace as you. If I lower it to 6 or 8 cores total, it starts up fine.
>
> Edit: I meant an even number of cores, not even-numbered cores (e.g. core 12).

Yup, it works completely fine if you use an even number of workers. If you use an odd number of workers, it crashes with that stack trace.

sflow commented 1 month ago

Thank you for testing. Reading the backtrace: the instruction at sflow_node_fn_icl + 0x252f looks like it might be somewhere deep inside the vlib_validate_buffer_enqueue_x4() step on this line: https://github.com/sflow/vpp-sflow/blob/86368e28b3d07a6a4dce1af0545f8a7c5b592ade/sflow/node.c#L287 but that may be way off if I have not compiled this the same way as you did. Did you just "make build" with the default settings? Any chance you can run it under gdb using "make debug" so that it stops and shows you exactly where the SIGSEGV happened?

This looks just like what happens when you access a field at offset 0x2b in a structure whose pointer is NULL. So that's the sort of thing we are looking for.
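
For illustration, a minimal standalone C example of that failure mode (the structure and field names are hypothetical, not taken from the plugin):

/* "field" happens to live at byte offset 0x2b within the structure. */
struct example
{
  char pad[0x2b];
  unsigned char field; /* offsetof (struct example, field) == 0x2b */
};

int
main (void)
{
  struct example *p = 0; /* NULL structure pointer */
  /* Reading a member through a NULL pointer faults at an address equal to
   * the member's offset, hence a small faulting address like the 0x2b in
   * the log above. */
  return p->field;
}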

pimvanpelt commented 1 month ago

Quick update: I can reproduce this with 5 workers and a production build:

vpp# sflow enable-disable TenGigabitEthernet130/0/0 
vpp# received signal SIGSEGV, PC 0x7f32e5f2834f, faulting address 0x3b
Code:  8b 40 f8 89 c0 48 39 c3 72 14 eb 85 0f 1f 44 00 00 31 c0 89
#0  0x00007f32e5f2834f sflow_process_samples + 0x78f
     from /home/pim/src/vpp/build-root/install-vpp-native/vpp/lib/x86_64-linux-gnu/vpp_plugins/sflow_plugin.so
#1  0x00007f334f23d1d7 vlib_process_bootstrap + 0x17
     from /home/pim/src/vpp/build-root/install-vpp-native/vpp/lib/x86_64-linux-gnu/libvlib.so.25.02
#2  0x00007f33510f5ba8 clib_calljmp + 0x18
     from /home/pim/src/vpp/build-root/install-vpp-native/vpp/lib/x86_64-linux-gnu/libvppinfra.so.25.02
make: *** [Makefile:683: run-release] Aborted

But it does not make a debug build crash:

DBGvpp# sflow enable-disable TenGigabitEthernet130/0/0
DBGvpp# [New Thread 0x7fff57a546c0 (LWP 289954)]
[Thread 0x7fff57a546c0 (LWP 289954) exited]
[New Thread 0x7fff57a546c0 (LWP 289955)]
[Thread 0x7fff57a546c0 (LWP 289955) exited]
[New Thread 0x7fff57a546c0 (LWP 289956)]
[Thread 0x7fff57a546c0 (LWP 289956) exited]

Those threads coming and going are a clue. I won't have time to work on this today or tomorrow (traveling in Italy).

pimvanpelt commented 1 month ago

I have a repro with a debug build:

DBGvpp# sflow en TenGigabitEthernet3/0/0

Thread 1 "vpp_main" received signal SIGSEGV, Segmentation fault.
0x00007fff8dadb53c in __vec_len (v=0x7fff480ddafe) at /home/pim/src/vpp/src/vppinfra/vec_bootstrap.h:129
129       return _vec_find (v)->len;
(gdb) bt
#0  0x00007fff8dadb53c in __vec_len (v=0x7fff480ddafe) at /home/pim/src/vpp/src/vppinfra/vec_bootstrap.h:129
#1  0x00007fff8dadd895 in update_counter_vector_combined (res=0x7fff9e83ef18, ifCtrs=0x7fff8d441de0, hw_if_index=1) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:88
#2  0x00007fff8dadcfeb in update_counters (smp=0x7fff8dafcb80 <sflow_main>, sfif=0x7fff9dc6ae6c) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:177
#3  0x00007fff8dadc6c9 in counter_polling_check (smp=0x7fff8dafcb80 <sflow_main>) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:287
#4  0x00007fff8dadc252 in sflow_process_samples (vm=0x7fff96a00740, node=0x7fff97139080, frame=0x0) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:459
#5  0x00007ffff7e9801d in vlib_process_bootstrap (_a=140735565884296) at /home/pim/src/vpp/src/vlib/main.c:1208
#6  0x00007ffff6d6e408 in clib_calljmp () at /home/pim/src/vpp/src/vppinfra/longjmp.S:123
#7  0x00007fff8d696b80 in ?? ()
#8  0x00007ffff7e97ab9 in vlib_process_startup (vm=0x13262bfa064af, p=0x48f6d79605, f=0x7fff96a00740) at /home/pim/src/vpp/src/vlib/main.c:1233
#9  0x00000037f6dc9e2c in ?? ()
#10 0x0000000000000004 in ?? ()
#11 0x00007fff9dc2eaa0 in ?? ()
#12 0x00007fff9dc2eaa0 in ?? ()
#13 0x00007fff971b1a38 in ?? ()
#14 0x0000000000000000 in ?? ()

I would take a look at update_counter_vector_combined(); I think the following patch may help:

--- a/src/plugins/sflow/sflow.c
+++ b/src/plugins/sflow/sflow.c
@@ -85,7 +85,7 @@ update_counter_vector_combined (stat_segment_data_t *res,
 {
   for (int th = 0; th < vec_len (res->simple_counter_vec); th++)
     {
-      for (int intf = 0; intf < vec_len (res->combined_counter_vec[intf]);
+      for (int intf = 0; intf < vec_len (res->combined_counter_vec[th]);
           intf++)
        {
          if (intf == hw_if_index)

The first vector (res->simple_counter_vec) is indexed by thread; the second loop should then iterate over res->combined_counter_vec[th], i.e. the combined counters for the thread with index th. With [intf] in the loop bound, vec_len () presumably ends up dereferencing whatever lies past the end of the per-thread vector once intf exceeds the thread count, which would explain the garbage pointer in the __vec_len frame above.
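
For clarity, a minimal sketch of the intended iteration order. The vector and field names (combined_counter_vec, packets, bytes) come from the diff and VPP's stat segment API; the helper itself is illustrative, not the plugin's actual code:

/* Illustrative only: sum the combined (packets/bytes) counters for one
 * interface across all worker threads.  res->combined_counter_vec is a
 * vector indexed by thread, and res->combined_counter_vec[th] is a vector
 * indexed by interface, so the inner loop bound must use [th], not [intf]. */
static vlib_counter_t
sum_combined_for_interface (stat_segment_data_t *res, u32 hw_if_index)
{
  vlib_counter_t total = { 0 };
  for (int th = 0; th < vec_len (res->combined_counter_vec); th++)
    {
      for (int intf = 0; intf < vec_len (res->combined_counter_vec[th]);
           intf++)
        {
          if (intf == hw_if_index)
            {
              total.packets += res->combined_counter_vec[th][intf].packets;
              total.bytes += res->combined_counter_vec[th][intf].bytes;
            }
        }
    }
  return total;
}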

Neil, can you please verify? I can no longer trigger a crash with this fix.

sflow commented 1 month ago

Verified. Thank you all for finding and fixing my bug while I slept. I could get used to this.

sflow commented 1 month ago

And FYI, the threads coming and going are expected. Those are for the VAPI queries, because you can't connect to the VAPI from the main thread. See issue #7 for details. In gdb you can turn off those notifications with:

(gdb) set print thread-events off
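
For context, the pattern is roughly the following (a sketch with placeholder names, not the plugin's actual code): each poll spawns a short-lived thread so the blocking VAPI call never runs on the VPP main thread, which is why gdb reports a new LWP appearing and exiting per query.

#include <pthread.h>

/* Placeholder: connect to the VAPI, run the query, disconnect. */
static void *
vapi_query_thread (void *arg)
{
  /* ... blocking VAPI work happens here, off the main thread ... */
  return NULL;
}

/* Called from the main/process thread for each poll. */
static void
run_vapi_query (void *query_ctx)
{
  pthread_t th;
  if (pthread_create (&th, NULL, vapi_query_thread, query_ctx) == 0)
    pthread_join (th, NULL); /* each query shows up as a new thread in gdb */
}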