Closed gszr closed 3 months ago
If I compile with -fstrict-aliasing -Wstrict-aliasing=2
, I get some useful warnings - see https://gist.github.com/gszr/84b12df6c5401b4e64fe5e5739405722.
Also, I've tested with this changeset, that refactors some of the "aliasing", and verified it fixes the issue: https://github.com/gszr/lua-protobuf/pull/1/files (I am not familiar with the library's code, so this was just a proof of concept to verify some assumptions)
It seems that caused by the lack of Lua stack space, you could try the latest HEAD version of git to see whether it fixed.
If I compile with
-fstrict-aliasing -Wstrict-aliasing=2
, I get some useful warnings - see https://gist.github.com/gszr/84b12df6c5401b4e64fe5e5739405722.Also, I've tested with this changeset, that refactors some of the "aliasing", and verified it fixes the issue: https://github.com/gszr/lua-protobuf/pull/1/files (I am not familiar with the library's code, so this was just a proof of concept to verify some assumptions)
Oh, I see this fix, so it's caused by the optimization of gcc? a pr is welcome :-)
@starwing - can you take a look at the PR?
FWIW, we might want to disable the -fstrict-aliasing
, since that same pattern I remvoed in the PR appears in other places. Doing -O2
without disabling strict-aliasing seems unsafe, and the segfault can happen in other places.
@gszr it seems the gcc feature, not bug. To compete with clang for performance ... I will accept this pr
I have made a new commit, please try it out and see whether the warnings still exist.
@starwing - I will give this a shot soon and report back.
@starwing I can confirm that fixes the specific aliasing issue I reported. No segfaults with the same testing.
Also, it'd be a good exercise to go through other aliasing issues as reported by https://gist.github.com/gszr/84b12df6c5401b4e64fe5e5739405722 and fix them. These result in undefined behavior if the -fstrict-aliasing
is on.
Also, could you draft a new patch release of the library?
I have made a new commit which fix the strict aliasing issue, could you check it out and test whether it works?
@gszr new version drafted.
@starwing The issue is gone (https://gist.github.com/gszr/05b43652556e0a1f8dd570806a8e914e) - thanks for the fix and new version (could you also update the GH releases page? :)
Here are some notes of a segfault I have been facing with lua-protobuf in Kong Gateway, which I pinpointed to the optimization
-fstrict-aliasing
(part of-O2
) in lua-protobuf -- long write-up, be warned : )A segfault coming from the worker process. I fired up gdb and got the following:
So I saw it originated in the lua-protobuf C library and rebuilt lua-protobuf with
-g2
I preserved optimization level 2 since that is the default the library is built with.
Restarted Kong and reran...
So we can see that the segfault is caused by the assignment:
which happens on line 1220 of the pb.h file: https://github.com/starwing/lua-protobuf/blame/master/pb.h#L1220. Looking at the code around that line did not help much; if
pfield
itself wasNULL
, the segfault would have happened on line 1217 instead, so the pointer must be valid, but that region it points to isn't writeable.Back to the compilation command, I tried a few other things Since the default is to build with
-O2
, what happens if I do-O0
?... it now works. So it must be some optimization that gets enabled by the
-O2
flag. I rerun with-O1
- it works; then I rerun with-O3
- and it also works -- (which is strange, since each level of optimization adds to the previous one)Pursuing another line, I decided to test with other compilers: clang version 16.0.6 -> works gcc version 12.3.0 -> works (with same exact flags)
Back to compilation levels; we know that
-O0
and-O1
work, but -O2 does not, so the issue must be triggered by a subset of optimizations introduced by-O2
. I tried putting together a list of all individual optimizations done by-O1
and-O2
; it turns out that such a list isn't really possible, because gcc does not expose all optimizations done by -Ox in individual flags (and there are no plans for doing so [2]). However, at least some (perhaps most) of those optimizations are exposed through individual -f flags [3], so I compiled the following list: https://gist.github.com/gszr/6164e79c596bd98d3a4fd170ff351611I wrote a small script to compile the library with all these and "binary search" for at least one that triggers the issue (note on the flags: it is not enough to list all flags; at least -O1 must be specified so they take effect).
After a few rounds of removing / adding flags, I pinpointed the issue to the following optimization:
-fstrict-aliasing
. Which we can confirm with the following -- enabling all of-O1
+-fstrict-aliasing
which belongs to-O2
:... and it breaks.
Could there be some other optimization that could also lead to the issue? To test this, I reran with
-O2
, but disabling specifically-fstrict-aliasing
:[1] https://gcc.gnu.org/wiki/FAQ#Is_-O1_.28-O2.2C-O3.2C_-Os_or_-Og.29_equivalent_to_individual_-foptimization_options.3F [2] https://gcc.gnu.org/legacy-ml/gcc-help/2009-10/msg00134.html [3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html