oracle / bpftune

bpftune uses BPF to auto-tune Linux systems

Segmentation fault on Ubuntu 22.04.2 LTS #24

Closed andrey-admin closed 1 year ago

andrey-admin commented 1 year ago

Hello,

Got Segmentation fault (core dumped) when trying to run bpftune on Linux 5.19.0-1026-gcp kernel with:

[949460.456403] bpftune[82605]: segfault at 0 ip 00007f146066fde2 sp 00007fff34453140 error 4 in tcp_cong_tuner.so[7f146066f000+2000]
[949460.456415] Code: 45 a8 0f b6 40 40 83 f0 01 84 c0 0f 84 97 00 00 00 e8 c3 f7 ff ff 48 89 45 c8 48 8b 45 a8 48 8b 55 c8 48 89 50 48 48 8b 45 c8 <48> 8b 10 48 8b 45 a8 48 89 50 38 e8 8e f4 ff ff 48 8b 55 c8 48 8b

in dmesg.

How can I fix that?

Thanks.

alan-maguire commented 1 year ago

thanks for reporting - can you retry with the latest main branch? I ran into a segmentation fault on ubuntu and pushed a fix that resolved it. if it's still there, can you attach the stack associated with the core dump from gdb and i'll try and figure out what's going on.

andrey-admin commented 1 year ago

@alan-maguire same issue but with that in dmesg:

[953685.243427] bpftune[85946]: segfault at 0 ip 00007f29a426fde2 sp 00007ffc218d6980 error 4 in tcp_cong_tuner.so[7f29a426f000+2000]
[953685.243437] Code: 45 a8 0f b6 40 40 83 f0 01 84 c0 0f 84 97 00 00 00 e8 c3 f7 ff ff 48 89 45 c8 48 8b 45 a8 48 8b 55 c8 48 89 50 48 48 8b 45 c8 <48> 8b 10 48 8b 45 a8 48 89 50 38 e8 8e f4 ff ff 48 8b 55 c8 48 8b

andrey-admin commented 1 year ago

dump.zip

alan-maguire commented 1 year ago

thanks! i can't reproduce it so can you try running "gdb bpftune ", and once in gdb run "bt" to get a stack backtrace? you might need to run "sudo sysctl -w kernel.core_pattern=core.%f.%p" first to get core files of form core.bpftune..

andrey-admin commented 1 year ago

https://pastebin.com/76XQ30ac

alan-maguire commented 1 year ago

thanks; the crash is happening in bpftune_bpf_init(); would you be able to run "bpftune -ds" to see if we can see what is happening with bpf open/load/attach?

andrey-admin commented 1 year ago

https://pastebin.com/PbFnpyFi

alan-maguire commented 1 year ago

i suspect the issue is https://lore.kernel.org/bpf/20211008000309.43274-7-andrii@kernel.org/ where the bpf skeleton generation does not like the .rodata.cst16 section. It may be that a newer bpftool might help; i'm using bpftool 5.15 on ubuntu from the linux-tools package synced to the kernel version. however we may also be able to work around this; you could try making the following changes to tcp_cong_tuner.bpf.c and rebuilding:

diff --git a/src/tcp_cong_tuner.bpf.c b/src/tcp_cong_tuner.bpf.c
index 77957b3..ab6661a 100644
--- a/src/tcp_cong_tuner.bpf.c
+++ b/src/tcp_cong_tuner.bpf.c
@@ -40,7 +40,7 @@ static __always_inline bool retransmit_threshold(struct remote_host *remote_host,
 u32 segs_out, u32 total_retrans)
 {

that was enough to get rid of the .rodata.cst16 section (it's replaced with .rodata.str1.1 that bpftool can handle).

alan-maguire commented 1 year ago

patch got mangled but replaces

const char bbr[CONG_MAXNAME] = "bbr";

...with

static const char bbr[4] = "bbr";

...in the two places it is declared in tcp_cong_tuner.bpf.c
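The difference between the two declarations can be sketched in plain C. This is only an illustration compiled for the host, not the BPF object itself: the value 16 for CONG_MAXNAME is an assumption (the thread does not state it), and the helper names are hypothetical. It shows that both forms hold the same string, while the first reserves a zero-padded fixed-size constant and the second only the four bytes the string needs.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: 16 is an illustrative value; the thread does not give CONG_MAXNAME. */
#define CONG_MAXNAME 16

/* Padded form: "bbr" followed by 13 zero bytes, CONG_MAXNAME bytes in total. */
static const char bbr_padded[CONG_MAXNAME] = "bbr";

/* Exact form: 'b', 'b', 'r' and the terminating NUL, 4 bytes in total. */
static const char bbr_exact[4] = "bbr";

/* Hypothetical helpers that expose the object sizes. */
size_t bbr_padded_size(void) { return sizeof(bbr_padded); }
size_t bbr_exact_size(void)  { return sizeof(bbr_exact); }

/* Returns 1 when both arrays hold the same C string. */
int bbr_names_match(void)
{
    return strcmp(bbr_padded, bbr_exact) == 0;
}
```

Which ELF section the compiler picks for each constant depends on the compiler and flags, so the "objdump -h" check on the built object remains the reliable way to confirm the result.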

pavlinux commented 1 year ago

static volatile const char const bbr[4] = {'b', 'b', 'r', '\0'}; B-)

andrey-admin commented 1 year ago

Can you attach the patch, please?

alan-maguire commented 1 year ago

0001-avoid-.rodata.cst16-sections-in-tcp_cong_tuner.bpf.c.patch.gz

andrey-admin commented 1 year ago

Still segfaults.

#0 0x00007fa2a93a2de2 in init (tuner=0x558225754260) at tcp_cong_tuner.c:58
58 bpftuner_bpf_init(tcp_cong, tuner, NULL);
(gdb) bt
#0 0x00007fa2a93a2de2 in init (tuner=0x558225754260) at tcp_cong_tuner.c:58
#1 0x00007fa2a99efa88 in bpftuner_init (path=0x7fff10a1bfb0 "/usr/lib64/bpftune//tcp_cong_tuner.so") at libbpftune.c:655
#2 0x0000558225735e32 in init (library_dir=0x5582257374aa "/usr/lib64/bpftune/") at bpftune.c:199
#3 0x0000558225736541 in main (argc=2, argv=0x7fff10a1c478) at bpftune.c:391

andrey-admin commented 1 year ago

Last lines from -ds:

bpftune: libbpf: prog 'cong_retransmit': found data map 5 (tcp_cong.bss, sec 8, off 0) for insn 157
bpftune: libbpf: sec '.reltp_btf/tcp_retransmit_skb': relo #4: insn #179 against 'init_net'
bpftune: libbpf: prog 'cong_retransmit': found extern #0 'init_net' (sym 34) for insn #179
bpftune: libbpf: sec '.reltp_btf/tcp_retransmit_skb': relo #5: insn #182 against 'bpftune_init_net'
bpftune: libbpf: prog 'cong_retransmit': found data map 5 (tcp_cong.bss, sec 8, off 0) for insn 182
bpftune: libbpf: sec '.reltp_btf/tcp_retransmit_skb': relo #6: insn #190 against 'ring_buffer_map'
bpftune: libbpf: prog 'cong_retransmit': found map 0 (ring_buffer_map, sec 9, off 0) for insn #190
bpftune: libbpf: sec '.reliter/tcp': collecting relocation for section(5) 'iter/tcp'
bpftune: libbpf: sec '.reliter/tcp': relo #0: insn #37 against 'remote_host_map'
bpftune: libbpf: prog 'bpftune_cong_iter': found map 3 (remote_host_map, sec 9, off 80) for insn #37
bpftune: libbpf: sec '.reliter/tcp': relo #1: insn #52 against 'remote_host_map'
bpftune: libbpf: prog 'bpftune_cong_iter': found map 3 (remote_host_map, sec 9, off 80) for insn #52
bpftune: libbpf: sec '.reliter/tcp': relo #2: insn #57 against 'remote_host_map'
bpftune: libbpf: prog 'bpftune_cong_iter': found map 3 (remote_host_map, sec 9, off 80) for insn #57
bpftune: libbpf: sec '.reliter/tcp': relo #3: insn #135 against 'debug'
bpftune: libbpf: prog 'bpftune_cong_iter': found data map 5 (tcp_cong.bss, sec 8, off 0) for insn 135
bpftune: libbpf: sec '.reliter/tcp': relo #4: insn #139 against '.rodata'
bpftune: libbpf: prog 'bpftune_cong_iter': found data map 4 (tcp_cong.rodata, sec 11, off 0) for insn 139
bpftune: libbpf: failed to find skeleton map '.rodata.str1.1'
Segmentation fault (core dumped)

alan-maguire commented 1 year ago

can you check bpftool, clang versions ("bpftool --version", "clang --version")? ubuntu with bpftool v5.15 and clang v14 work fine for me, even with the .rodata.str1.1 sections.

andrey-admin commented 1 year ago

root@nginx-01:/usr/src/bpf/bpftune# bpftool --version
/usr/lib/linux-tools/5.19.0-1026-gcp/bpftool v7.0.0
using libbpf v1.0
features: libbpf_strict
root@nginx-01:/usr/src/bpf/bpftune# clang --version
Ubuntu clang version 14.0.0-1ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

andrey-admin commented 1 year ago

Machine - google cloud virtual server (n2d-highcpu-16)

alan-maguire commented 1 year ago

thanks; above look fine and similar to my setup so I'm puzzled why we're seeing different things. regardless i think i've fixed one of the issues here; when bpftune opens/loads/attaches bpf it uses macros and these need to return failure status otherwise we try to load a program that failed to open, or attach a program that failed to load. i've merged that in pr https://github.com/oracle-samples/bpftune/pull/26 so hopefully that should resolve the segmentation fault, but i don't yet have a good solution for the bpf loading failure.
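The point about the macros returning a failure status can be sketched as follows. This is not the bpftune code or the change from the PR; every name here (demo_ctx, demo_stage_run, demo_bpf_init) is hypothetical, and ESRCH is used purely as an example error value. The sketch only shows the control flow: each stage reports its status, and a failure stops the sequence so that a later stage never runs on something that failed to open or load.

```c
#include <errno.h>

/* Hypothetical stages standing in for BPF open, load and attach. */
enum demo_stage { DEMO_OPEN = 1, DEMO_LOAD, DEMO_ATTACH };

struct demo_ctx {
    int fail_at;     /* stage that should fail, 0 for none (test knob) */
    int last_stage;  /* last stage that was attempted */
};

/* Pretend to run one stage; returns 0 or a negative errno value. */
static int demo_stage_run(struct demo_ctx *ctx, enum demo_stage stage)
{
    ctx->last_stage = (int)stage;
    return ctx->fail_at == (int)stage ? -ESRCH : 0;
}

/* Run the stages in order and stop at the first failure. */
int demo_bpf_init(struct demo_ctx *ctx)
{
    int err;

    err = demo_stage_run(ctx, DEMO_OPEN);
    if (err)
        return err;            /* never load what failed to open */
    err = demo_stage_run(ctx, DEMO_LOAD);
    if (err)
        return err;            /* never attach what failed to load */
    return demo_stage_run(ctx, DEMO_ATTACH);
}
```

With fail_at set to DEMO_OPEN, demo_bpf_init() returns the error immediately and last_stage stays at DEMO_OPEN, which is the behaviour the caller needs in order to report the error instead of crashing.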

andrey-admin commented 1 year ago

Just pulled the repo and rebuilt bpftune - no change. Same segmentation fault. Last lines from -ds:

bpftune: libbpf: prog 'cong_retransmit': found map 0 (ring_buffer_map, sec 9, off 0) for insn #190
bpftune: libbpf: sec '.reliter/tcp': collecting relocation for section(5) 'iter/tcp'
bpftune: libbpf: sec '.reliter/tcp': relo #0: insn #37 against 'remote_host_map'
bpftune: libbpf: prog 'bpftune_cong_iter': found map 3 (remote_host_map, sec 9, off 80) for insn #37
bpftune: libbpf: sec '.reliter/tcp': relo #1: insn #52 against 'remote_host_map'
bpftune: libbpf: prog 'bpftune_cong_iter': found map 3 (remote_host_map, sec 9, off 80) for insn #52
bpftune: libbpf: sec '.reliter/tcp': relo #2: insn #57 against 'remote_host_map'
bpftune: libbpf: prog 'bpftune_cong_iter': found map 3 (remote_host_map, sec 9, off 80) for insn #57
bpftune: libbpf: sec '.reliter/tcp': relo #3: insn #135 against 'debug'
bpftune: libbpf: prog 'bpftune_cong_iter': found data map 5 (tcp_cong.bss, sec 8, off 0) for insn 135
bpftune: libbpf: sec '.reliter/tcp': relo #4: insn #139 against '.rodata'
bpftune: libbpf: prog 'bpftune_cong_iter': found data map 4 (tcp_cong.rodata, sec 11, off 0) for insn 139
bpftune: libbpf: failed to find skeleton map '.rodata.str1.1'

andrey-admin commented 1 year ago

gdb:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f8fd9adadec in init (tuner=0x55ec5431e260) at tcp_cong_tuner.c:58
58 err = bpftuner_bpf_init(tcp_cong, tuner, NULL);
(gdb) bt
#0 0x00007f8fd9adadec in init (tuner=0x55ec5431e260) at tcp_cong_tuner.c:58
#1 0x00007f8fda166a88 in bpftuner_init (path=0x7ffc34acd3f0 "/usr/lib64/bpftune//tcp_cong_tuner.so") at libbpftune.c:655
#2 0x000055ec542ffe32 in init (library_dir=0x55ec543014aa "/usr/lib64/bpftune/") at bpftune.c:199
#3 0x000055ec54300541 in main (argc=2, argv=0x7ffc34acd8b8) at bpftune.c:391

pavlinux commented 1 year ago
# gdb --args `which bpftune` -s;
(gdb) break  tcp_cong_tuner.c:58
(gdb) run

and then continue with the next, step, fin, step, ... commands;

andrey-admin commented 1 year ago

root@nginx-01:/usr/src/bpf/bpftune# gdb --args `which bpftune` -s;
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/bpftune...
(gdb) run
Starting program: /usr/sbin/bpftune -s
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bpftune: bpftune works fully
bpftune: bpftune supports per-netns policy (via netns cookie)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff784edec in init (tuner=0x555555575260) at tcp_cong_tuner.c:58
58 err = bpftuner_bpf_init(tcp_cong, tuner, NULL);

alan-maguire commented 1 year ago

https://github.com/oracle-samples/bpftune/pull/27 may help here i think; if you get a chance, would you mind rebuilding/retesting. thanks!

andrey-admin commented 1 year ago

Yeah, bpftune started OK, without any fault. Checking how it works.

Thanks!

alan-maguire commented 1 year ago

great, thanks for taking the time to work through this! i'm hoping to still get to the bottom of why the str sections cause issues at your end too; that will result in the associated congestion tuner not loading.

andrey-admin commented 1 year ago

Sorry, I forgot to check syslog after starting.

Jul 11 09:41:09 nginx-01 bpftune[12736]: bpftune works fully
Jul 11 09:41:09 nginx-01 bpftune[12736]: bpftune supports per-netns policy (via netns cookie)
Jul 11 09:41:09 nginx-01 bpftune[12736]: tcp_cong open bpf: No such process
Jul 11 09:41:09 nginx-01 bpftune[12736]: error initializing '/usr/lib64/bpftune//tcp_cong_tuner.so: No such process
Jul 11 09:41:09 nginx-01 bpftune[12736]: could not open /proc/sys/net/ipv6/neigh/default/gc_interval (netns fd 0) for reading: No such file or directory
Jul 11 09:41:09 nginx-01 bpftune[12736]: error reading tunable 'net.ipv6.neigh.default.gc_interval': No such file or directory
Jul 11 09:41:09 nginx-01 bpftune[12736]: error initializing '/usr/lib64/bpftune//neigh_table_tuner.so: No such file or directory
Jul 11 09:41:09 nginx-01 bpftune[12736]: could not open /proc/sys/net/ipv6/route/max_size (netns fd 0) for reading: No such file or directory
Jul 11 09:41:09 nginx-01 bpftune[12736]: error reading tunable 'net.ipv6.route.max_size': No such file or directory
Jul 11 09:41:09 nginx-01 bpftune[12736]: error initializing '/usr/lib64/bpftune//route_table_tuner.so: No such file or directory

But all the files are in place:

root@nginx-01:/usr/src/bpf/bpftune# ls -ld /usr/lib64/bpftune//tcp_cong_tuner.so /usr/lib64/bpftune//neigh_table_tuner.so /usr/lib64/bpftune//route_table_tuner.so
-rwxr-xr-x 1 root root 1626360 Jul 11 09:40 /usr/lib64/bpftune//neigh_table_tuner.so
-rwxr-xr-x 1 root root 1622040 Jul 11 09:40 /usr/lib64/bpftune//route_table_tuner.so
-rwxr-xr-x 1 root root 896456 Jul 11 09:40 /usr/lib64/bpftune//tcp_cong_tuner.so

alan-maguire commented 1 year ago

the "no such file or directory" comes from an ENOENT error; in the case of the neigh_table_tuner, what's missing are the ipv6 tunables. in the case of the tcp congestion tuner, the tuner is not there due to the issues with the string section; it's just that we don't fall over now and segfault. if ipv6 is disabled that probably explains the neigh table tuner issues.
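To illustrate where an ENOENT like this comes from, here is a minimal sketch in C of reading a sysctl-style file. It is not the bpftune implementation; demo_read_tunable is a hypothetical helper, and returning a negative errno value is just a common convention assumed here. When the file is absent (as the /proc/sys/net/ipv6 entries are in the log above), fopen() fails with errno set to ENOENT, which the C library renders as "No such file or directory".

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one integer from a sysctl-style file.
 * Returns 0 on success or a negative errno value on failure.
 */
int demo_read_tunable(const char *path, long *value)
{
    FILE *fp = fopen(path, "r");
    int err = 0;

    if (!fp)
        return -errno;          /* -ENOENT when the file does not exist */
    if (fscanf(fp, "%ld", value) != 1)
        err = -EINVAL;          /* file exists but holds no integer */
    fclose(fp);
    return err;
}
```

The shared object on disk is therefore not what is reported missing; the error is propagated from the tunable lookup up to the tuner initialization message.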

andrey-admin commented 1 year ago

So everything should be working properly? How can I check the status or some stats while bpftune is running as a daemon?

Can you fix those errors for configurations with IPv6 disabled, please?

And the line "tcp_cong open bpf: No such process" - is that OK too?

Thanks!

alan-maguire commented 1 year ago

i'm working on adding support for handling ipv6 disabled by making some tunables optional; should have a fix for this in the next few days. the tcp_cong_tuner issue is that bpf won't load due to the .rodata.str1.1 section being a problem on your system. i haven't been able to reproduce that but will try and fix it once i can.

andrey-admin commented 1 year ago

If you need any data from my system, just tell me how to collect it and I will.

Thanks.

alan-maguire commented 1 year ago

great, thanks!

alan-maguire commented 1 year ago

https://github.com/oracle-samples/bpftune/pull/30 should help for cases where ipv6 is disabled; it makes ipv6 tunables optional such that the tuner will not fail to load if its optional tunables are not found. still need to solve the tcp_cong_tuner issue.

andrey-admin commented 1 year ago

Now, after starting, syslog shows this:

Jul 12 08:44:29 nginx-11 bpftune[41197]: bpftune works fully
Jul 12 08:44:29 nginx-11 bpftune[41197]: bpftune supports per-netns policy (via netns cookie)
Jul 12 08:44:30 nginx-11 bpftune[41197]: tcp_cong open bpf: No such process
Jul 12 08:44:30 nginx-11 bpftune[41197]: error initializing '/usr/lib64/bpftune/tcp_cong_tuner.so: No such process
Jul 12 08:44:30 nginx-11 bpftune[41197]: could not open /proc/sys/net/ipv6/neigh/default/gc_interval (netns fd 0) for reading: No such file or directory
Jul 12 08:44:30 nginx-11 bpftune[41197]: could not open /proc/sys/net/ipv6/neigh/default/gc_stale_time (netns fd 0) for reading: No such file or directory
Jul 12 08:44:30 nginx-11 bpftune[41197]: could not open /proc/sys/net/ipv6/neigh/default/gc_thresh1 (netns fd 0) for reading: No such file or directory
Jul 12 08:44:30 nginx-11 bpftune[41197]: could not open /proc/sys/net/ipv6/neigh/default/gc_thresh2 (netns fd 0) for reading: No such file or directory
Jul 12 08:44:30 nginx-11 bpftune[41197]: could not open /proc/sys/net/ipv6/neigh/default/gc_thresh3 (netns fd 0) for reading: No such file or directory
Jul 12 08:44:30 nginx-11 bpftune[41197]: could not open /proc/sys/net/ipv6/route/max_size (netns fd 0) for reading: No such file or directory

alan-maguire commented 1 year ago

the above is all expected if ipv6 isn't enabled; the aim was to ensure the tuner kept going when it failed to find optional tunables. so if all went well, the neigh_table_tuner.so should still have loaded to tune v4 neighbour tables. prior to #30, a single tunable that was not found would cause the tuner not to load.
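The "keep going when an optional tunable is missing" behaviour can be sketched like this. It is a simplified illustration rather than the code from the PR: struct demo_tunable, its optional flag and demo_tunables_init are hypothetical names, and existence is checked with the POSIX access() call instead of actually reading a value.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical description of one tunable backed by a file path. */
struct demo_tunable {
    const char *path;
    int optional;   /* non-zero: a missing file is tolerated */
};

/*
 * Returns the number of tunables found, or a negative errno value
 * as soon as a required (non-optional) tunable cannot be accessed.
 */
int demo_tunables_init(const struct demo_tunable *t, size_t n)
{
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        if (access(t[i].path, F_OK) == 0) {
            found++;
            continue;
        }
        if (!t[i].optional)
            return -errno;   /* required tunable missing: fail */
        /* optional tunable missing: skip it and keep going */
    }
    return found;
}
```

With the IPv6 entries marked optional, a host without /proc/sys/net/ipv6 would still initialize the remaining tunables; with every entry required, the first missing file makes the whole initialization fail, which matches the behaviour described before the change.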

andrey-admin commented 1 year ago

Wow, it seems to be working now. Checking it in operation.

alan-maguire commented 1 year ago

great! the latest commit should have gotten rid of the .rodata.str section. you can check with "objdump -h src/tcp_cong_tuner.bpf.o"; no .rodata.str1.1 or .rodata.cst16 sections should be present (at least that's what i see).

andrey-admin commented 1 year ago

due to loss events for 10.164.3.28, specify 'bbr' congestion control algorithm

Is that OK? Do I need to do anything?

andrey-admin commented 1 year ago

Scenario 'specify bbr congestion control' occurred for tunable 'TCP congestion control' in global ns. Because loss rate has exceeded 1 percent for a connection, use bbr congestion control algorithm instead of default

and that?

alan-maguire commented 1 year ago

that's a sign it's working; the congestion tuner looks at tcp connections that experience loss and switches congestion control algorithm to one that performs better under loss conditions - bbr. see "man bpftune-tcp-cong" for details.
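The log message quoted earlier mentions a loss rate above 1 percent for a connection. As a rough sketch of such a check, and explicitly not the logic in tcp_cong_tuner.bpf.c, the comparison could look like the following. The function name is hypothetical; the parameter names segs_out and total_retrans are borrowed from the retransmit_threshold() signature visible in the diff header, and treating "retransmitted segments divided by segments sent" as the loss rate is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: true when retransmissions exceed 1% of the
 * segments sent. Integer arithmetic avoids floating point, and the
 * 64-bit widening avoids overflow in the multiplication.
 */
bool demo_loss_exceeds_one_percent(uint32_t segs_out, uint32_t total_retrans)
{
    if (segs_out == 0)
        return false;   /* nothing sent yet, no meaningful rate */
    return (uint64_t)total_retrans * 100 > (uint64_t)segs_out;
}
```

For example, 11 retransmissions out of 1000 segments is above the threshold, while exactly 10 out of 1000 is not.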

pavlinux commented 1 year ago

that's a sign it's working; the congestion tuner looks at tcp connections that experience loss and switches congestion control algorithm to one that performs better under loss conditions - bbr. see "man bpftune-tcp-cong" for details.

Could you implement an iteration over all the available algorithms? For example:

CONG_LIST="bbr, veno, reno, vegas, westwood, htcp, . . .";

foreach (algo in CONG_LIST)
    do
        set_as_main(algo);
        if (connection_quality < previous)
            continue;
        ....

As far as I know, westwood works better on Wi-Fi networks.

alan-maguire commented 1 year ago

yeah, i'm looking at seeing if we can incorporate reinforcement learning techniques in tuning in the future; exploring different policies rather than having a rigid approach would definitely be part of that.

alan-maguire commented 1 year ago

Segmentation fault resolved so closing this out