starwing / lua-protobuf

A Lua module to work with Google protobuf
MIT License
1.71k stars 388 forks source link

Segfaults with gcc 13 and `-fstrict-aliasing` #261

Closed gszr closed 3 months ago

gszr commented 4 months ago

Here are some notes of a segfault I have been facing with lua-protobuf in Kong Gateway, which I pinpointed to the optimization -fstrict-aliasing (part of -O2) in lua-protobuf -- long write-up, be warned : )

2024/03/06 12:50:52 [notice] 184162#0: *1 [kong] init.lua:589 declarative config loaded from /home/gs/code/work/issues/KAG-3949/kong.yml, context: init_worker_by_lua*
2024/03/06 12:50:52 [notice] 184162#0: *644 [kong] process.lua:217 Starting go-hello, context: ngx.timer
2024/03/06 12:50:55 [notice] 184161#0: signal 17 (SIGCHLD) received from 184162
2024/03/06 12:50:55 [alert] 184161#0: worker process 184162 exited on signal 11 (core dumped)
2024/03/06 12:50:55 [notice] 184161#0: start worker process 184186
2024/03/06 12:50:55 [notice] 184186#0: *1291 [kong] process.lua:217 Starting go-hello, context: ngx.timer

A segfault coming from the worker process. I fired up gdb and got the following:

Core was generated by `nginx: worker process                                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000772185e0dc7c in pb_nextfield () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
(gdb) bt
#0  0x0000772185e0dc7c in pb_nextfield () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#1  0x0000772185e1321e in lpb_setdeffields () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#2  0x0000772185e13518 in lpb_pushtypetable () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#3  0x0000772185e172e6 in lpbD_decode () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#4  0x0000772185e17352 in Lpb_decode () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#5  0x0000772186796b93 in lj_BC_FUNCC ()

So I saw it originated in the lua-protobuf C library and rebuilt lua-protobuf with -g2

$ CFLAGS="-O2 -g2" luarocks install lua-protobuf 0.5.0

I preserved optimization level 2 since that is the default the library is built with.

Restarted Kong and reran...

Core was generated by `nginx: worker process                                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  pb_nextfield (t=t@entry=0x5a68f7bdcfa0, pfield=pfield@entry=0x7ffffb7092f0)
    at /tmp/luarocks_lua-protobuf-0.5.0-1-4338781/lua-protobuf/pb.h:1220

warning: Source file is more recent than executable.
1220                if ((*pfield = e->value) != NULL)
(gdb) bt
#0  pb_nextfield (t=t@entry=0x5a68f7bdcfa0, pfield=pfield@entry=0x7ffffb7092f0)
    at /tmp/luarocks_lua-protobuf-0.5.0-1-4338781/lua-protobuf/pb.h:1220

So we can see that the segfault is caused by the assignment:

if ((*pfield = e->value) != NULL)

which happens on line 1220 of the pb.h file: https://github.com/starwing/lua-protobuf/blame/master/pb.h#L1220. Looking at the code around that line did not help much; if pfield itself was NULL, the segfault would have happened on line 1217 instead, so the pointer must be valid, but that region it points to isn't writeable.

Back to the compilation command, I tried a few other things Since the default is to build with -O2 , what happens if I do -O0?

$ CFLAGS="-g2" luarocks install lua-protobuf 0.5.0 master?
Installing https://luarocks.org/lua-protobuf-0.5.0-1.src.rock

lua-protobuf 0.5.0-1 depends on lua >= 5.1 (5.1-1 provided by VM)
gcc -g2 -fPIC -I/home/gs/code/work/kong/bazel-bin/build/kong-dev/openresty/luajit/include/luajit-2.1 -c pb.c -o pb.o
gcc -shared -o pb.so pb.o
lua-protobuf 0.5.0-1 is now installed in /home/gs/code/work/kong/bazel-bin/build/kong-dev (license: MIT)

... it now works. So it must be some optimization that gets enabled by the -O2 flag. I rerun with -O1 - it works; then I rerun with -O3 - and it also works -- (which is strange, since each level of optimization adds to the previous one)

Pursuing another line, I decided to test with other compilers: clang version 16.0.6 -> works gcc version 12.3.0 -> works (with same exact flags)

Back to compilation levels; we know that -O0 and -O1 work, but -O2 does not, so the issue must be triggered by a subset of optimizations introduced by -O2. I tried putting together a list of all individual optimizations done by -O1 and -O2; it turns out that such a list isn't really possible, because gcc does not expose all optimizations done by -Ox in individual flags (and there are no plans for doing so [2]). However, at least some (perhaps most) of those optimizations are exposed through individual -f flags [3], so I compiled the following list: https://gist.github.com/gszr/6164e79c596bd98d3a4fd170ff351611

I wrote a small script to compile the library with all these and "binary search" for at least one that triggers the issue (note on the flags: it is not enough to list all flags; at least -O1 must be specified so they take effect).

After a few rounds of removing / adding flags, I pinpointed the issue to the following optimization: -fstrict-aliasing. Which we can confirm with the following -- enabling all of -O1 + -fstrict-aliasing which belongs to -O2:

$ CC=/usr/bin/gcc CFLAGS="-O2 -fstrict-aliasing" luarocks install lua-protobuf 0.5.0

... and it breaks.

Could there be some other optimization that could also lead to the issue? To test this, I reran with -O2, but disabling specifically -fstrict-aliasing:

$ CC=/usr/bin/gcc CFLAGS="-O2 -fno-strict-aliasing" luarocks install lua-protobuf 0.5.0
... and things work normally, no segfaults : )

[1] https://gcc.gnu.org/wiki/FAQ#Is_-O1_.28-O2.2C-O3.2C_-Os_or_-Og.29_equivalent_to_individual_-foptimization_options.3F [2] https://gcc.gnu.org/legacy-ml/gcc-help/2009-10/msg00134.html [3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

gszr commented 4 months ago

If I compile with -fstrict-aliasing -Wstrict-aliasing=2, I get some useful warnings - see https://gist.github.com/gszr/84b12df6c5401b4e64fe5e5739405722.

Also, I've tested with this changeset, that refactors some of the "aliasing", and verified it fixes the issue: https://github.com/gszr/lua-protobuf/pull/1/files (I am not familiar with the library's code, so this was just a proof of concept to verify some assumptions)

starwing commented 4 months ago

It seems that caused by the lack of Lua stack space, you could try the latest HEAD version of git to see whether it fixed.

starwing commented 4 months ago

If I compile with -fstrict-aliasing -Wstrict-aliasing=2, I get some useful warnings - see https://gist.github.com/gszr/84b12df6c5401b4e64fe5e5739405722.

Also, I've tested with this changeset, that refactors some of the "aliasing", and verified it fixes the issue: https://github.com/gszr/lua-protobuf/pull/1/files (I am not familiar with the library's code, so this was just a proof of concept to verify some assumptions)

Oh, I see this fix, so it's caused by the optimization of gcc? a pr is welcome :-)

gszr commented 4 months ago

@starwing - can you take a look at the PR? FWIW, we might want to disable the -fstrict-aliasing, since that same pattern I remvoed in the PR appears in other places. Doing -O2 without disabling strict-aliasing seems unsafe, and the segfault can happen in other places.

starwing commented 4 months ago

@gszr it seems the gcc feature, not bug. To compete with clang for performance ... I will accept this pr

starwing commented 4 months ago

I have made a new commit, please try it out and see whether the warnings still exist.

gszr commented 4 months ago

@starwing - I will give this a shot soon and report back.

gszr commented 4 months ago

@starwing I can confirm that fixes the specific aliasing issue I reported. No segfaults with the same testing. Also, it'd be a good exercise to go through other aliasing issues as reported by https://gist.github.com/gszr/84b12df6c5401b4e64fe5e5739405722 and fix them. These result in undefined behavior if the -fstrict-aliasing is on.

gszr commented 4 months ago

Also, could you draft a new patch release of the library?

starwing commented 4 months ago

I have made a new commit which fix the strict aliasing issue, could you check it out and test whether it works?

starwing commented 3 months ago

@gszr new version drafted.

gszr commented 3 months ago

@starwing The issue is gone (https://gist.github.com/gszr/05b43652556e0a1f8dd570806a8e914e) - thanks for the fix and new version (could you also update the GH releases page? :)