Closed: yoitszigi closed this issue 1 month ago
Hi @yoitszigi , I've discovered the same exact problem but it seems to arise only when having an odd number of worker cores. With 2, 4, 6, etc it's fine. Can you try to set an even amount of cores?
vpp# show version
vpp v25.02-rc0~28-g911c0fb23 built by root on XX at 2024-10-07T11:30:49
CPU is EPYC 8224P, network card MCX516A-CDAT, latest Debian 12.
Tried this but seems to still crash with workers running on even cores. Is your main thread still on 0?
/etc/default/grub:
GRUB_CMDLINE_LINUX="[..] isolcpus=12,13,14,15,16,17,18,19,20,21,22,23"
/etc/vpp/startup.conf:
main-core 12
corelist-workers 13-22
If I set corelist-workers to 13-21 (9 cores) or 13-23 (11 cores), I get the same crash trace as you. If I lower it to 6 or 8 total cores, it starts up fine.
Edit: I meant an even number of cores, not even-numbered cores (e.g. core 12).
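For reference, VPP reads these settings from the cpu stanza of startup.conf; a minimal sketch assuming the same core layout as above:

```
cpu {
  main-core 12
  corelist-workers 13-22
}
```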
Yup, it works completely fine if you run an even number of workers. If you run an odd number of workers, it crashes with that stack trace.
Thank you for testing. Reading the backtrace: the instruction at sflow_node_fn_icl + 0x252f looks like it might be somewhere deep inside the vlib_validate_buffer_enqueue_x4() step on this line: https://github.com/sflow/vpp-sflow/blob/86368e28b3d07a6a4dce1af0545f8a7c5b592ade/sflow/node.c#L287 but that may be way off if I have not compiled this the same way as you did. Did you just "make build" with the default settings? Any chance you could run it under gdb using "make debug" so that it stops and shows you exactly where the SIGSEGV happened?
This looks just like what happens if you ask for a field in a structure that is at offset 0x2b, but the structure pointer is NULL. So that's the sort of thing we are looking for.
quick update - I can reproduce this with 5 workers and a production build:
vpp# sflow enable-disable TenGigabitEthernet130/0/0
vpp# received signal SIGSEGV, PC 0x7f32e5f2834f, faulting address 0x3b
Code: 8b 40 f8 89 c0 48 39 c3 72 14 eb 85 0f 1f 44 00 00 31 c0 89
#0 0x00007f32e5f2834f sflow_process_samples + 0x78f
from /home/pim/src/vpp/build-root/install-vpp-native/vpp/lib/x86_64-linux-gnu/vpp_plugins/sflow_plugin.so
#1 0x00007f334f23d1d7 vlib_process_bootstrap + 0x17
from /home/pim/src/vpp/build-root/install-vpp-native/vpp/lib/x86_64-linux-gnu/libvlib.so.25.02
#2 0x00007f33510f5ba8 clib_calljmp + 0x18
from /home/pim/src/vpp/build-root/install-vpp-native/vpp/lib/x86_64-linux-gnu/libvppinfra.so.25.02
make: *** [Makefile:683: run-release] Aborted
But it does not make a debug build crash:
DBGvpp# sflow enable-disable TenGigabitEthernet130/0/0
DBGvpp# [New Thread 0x7fff57a546c0 (LWP 289954)]
[Thread 0x7fff57a546c0 (LWP 289954) exited]
[New Thread 0x7fff57a546c0 (LWP 289955)]
[Thread 0x7fff57a546c0 (LWP 289955) exited]
[New Thread 0x7fff57a546c0 (LWP 289956)]
[Thread 0x7fff57a546c0 (LWP 289956) exited]
Those threads coming and going is a clue. I won't have time to work on this today/tomorrow (traveling in Italy).
I have a repro with debug build -
DBGvpp# sflow en TenGigabitEthernet3/0/0
Thread 1 "vpp_main" received signal SIGSEGV, Segmentation fault.
0x00007fff8dadb53c in __vec_len (v=0x7fff480ddafe) at /home/pim/src/vpp/src/vppinfra/vec_bootstrap.h:129
129 return _vec_find (v)->len;
(gdb) bt
#0 0x00007fff8dadb53c in __vec_len (v=0x7fff480ddafe) at /home/pim/src/vpp/src/vppinfra/vec_bootstrap.h:129
#1 0x00007fff8dadd895 in update_counter_vector_combined (res=0x7fff9e83ef18, ifCtrs=0x7fff8d441de0, hw_if_index=1) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:88
#2 0x00007fff8dadcfeb in update_counters (smp=0x7fff8dafcb80 <sflow_main>, sfif=0x7fff9dc6ae6c) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:177
#3 0x00007fff8dadc6c9 in counter_polling_check (smp=0x7fff8dafcb80 <sflow_main>) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:287
#4 0x00007fff8dadc252 in sflow_process_samples (vm=0x7fff96a00740, node=0x7fff97139080, frame=0x0) at /home/pim/src/vpp/src/plugins/sflow/sflow.c:459
#5 0x00007ffff7e9801d in vlib_process_bootstrap (_a=140735565884296) at /home/pim/src/vpp/src/vlib/main.c:1208
#6 0x00007ffff6d6e408 in clib_calljmp () at /home/pim/src/vpp/src/vppinfra/longjmp.S:123
#7 0x00007fff8d696b80 in ?? ()
#8 0x00007ffff7e97ab9 in vlib_process_startup (vm=0x13262bfa064af, p=0x48f6d79605, f=0x7fff96a00740) at /home/pim/src/vpp/src/vlib/main.c:1233
#9 0x00000037f6dc9e2c in ?? ()
#10 0x0000000000000004 in ?? ()
#11 0x00007fff9dc2eaa0 in ?? ()
#12 0x00007fff9dc2eaa0 in ?? ()
#13 0x00007fff971b1a38 in ?? ()
#14 0x0000000000000000 in ?? ()
I would take a look at update_counter_vector_combined(); I think the following patch may help:
--- a/src/plugins/sflow/sflow.c
+++ b/src/plugins/sflow/sflow.c
@@ -85,7 +85,7 @@ update_counter_vector_combined (stat_segment_data_t *res,
{
for (int th = 0; th < vec_len (res->simple_counter_vec); th++)
{
- for (int intf = 0; intf < vec_len (res->combined_counter_vec[intf]);
+ for (int intf = 0; intf < vec_len (res->combined_counter_vec[th]);
intf++)
{
if (intf == hw_if_index)
The outer loop (over res->simple_counter_vec) iterates threads, so the inner loop should index res->combined_counter_vec[th], i.e. the combined counters for the thread with index th.
Neil, can you please verify? I cannot trigger a crash with this fix.
Verified. Thank you all for finding and fixing my bug while I slept. I could get used to this.
And FYI, the threads coming and going are expected. Those are for the VAPI queries, because you can't connect to the VAPI from the main thread. See issue #7 for details. In gdb you can turn off those notifications with:
gdb> set print thread-events off
Hi all,
I tried compiling this for our VPP lab setup, but I’m running into a crash whenever I enable an interface. The interface is an X520-DA1 using DPDK, and I’m also using the Linux control plane if that’s relevant.
Here's the log:
VPP Version:
OS Version: Ubuntu 22.04.5 LTS
Let me know if anything else is needed from me.