troglobit / finit

Fast init for Linux. Cookies included
https://troglobit.com/projects/finit/
MIT License
632 stars 63 forks source link

finit does not work when compiled by gcc-11.2/glibc-2.34 #216

Closed liuming50 closed 2 years ago

liuming50 commented 2 years ago

We observed a issue that seems finit does not work if we compile it with gcc-11.2/glibc-2.34, the problem is the plugin hook index seems is messed up for some reason.

For instance, if a plugin is registered as HOOK_BASEFS_UP, when the hook function being called at runtime, it's actually called by plugin_run_hooks(HOOK_MOUNT_ERROR).

The test is based on the latest finit commit: f32e4d4b3013ec3bf7130e5a16a4c63a226a906b [ Update changelog and bump version for v4.2 release ]

The compiling flags we use for gcc (in Yocto):

arm-poky-linux-gnueabi-gcc  -march=armv7-a -mfpu=neon -mfloat-abi=hard -U_FORTIFY_SOURCE -D_XOPEN_SOURCE=600 -D_BSD_SOURCE -D_GNU_SOURCE -D_DEFAULT_SOURCE -D_LIBITE_LITE  -W -Wall -Wextra -Wno-unused-parameter -std=gnu99 -O2 -pipe

The problem not exist if we compile finit with gcc-7.3/glibc-2.27, same compiling flags.

I have a ugly workaround as attached: 0001-Fix-a-hook-index-issue.patch, with it applied, then the plugin hook calling seems correct, but after that, the svc/pid conditions files not created in /var/run/finit/cond/.

liuming50 commented 2 years ago

0001-Fix-a-hook-index-issue.txt

troglobit commented 2 years ago

Interesting, I'm debugging a similar issue with hooks on reboot, right now! Also with very new toolchain (from https://toolchains.bootlin.com/)

I'll have to check if I see the same over here, and if so figure out why your patch helps ... but in my case it looks like memory corruption because at reboot the HOOK_SHUTDOWN is called, and for some reason the netlink.so plugin has a callback for this hook on address '0x1' so obviously it segfaults?? I do not see this for 64-bit targets (We have x86_64 and Aarch64 at Westermo), only for our 32-bit targets (Arm v5/v6 and PowerPC 32-bit).

If I add a canary before HOOK_MAX_NUM, like this, it works:

diff --git a/src/plugin.h b/src/plugin.h
index 83d9891..38f16e3 100644
--- a/src/plugin.h
+++ b/src/plugin.h
@@ -82,6 +82,7 @@
                                                                \
        /* Shutdown hooks, runlevel [06] */                     \
        CHOOSE(HOOK_SHUTDOWN,        "hook/sys/shutdown"),      \
+       CHOOSE(HOOK_CANARY,          "nop"),                    \
        CHOOSE(HOOK_MAX_NUM,         "nop")                     \
 }

presumably because the corruption writes to the canary instead. Need more gdb time to isolate this.

I'll have a look at your vector of approach and see if I can find something. But you could perhaps help out with a bisect from your end?

troglobit commented 2 years ago

OK, I've tried reverting back to Finit v4.1, with the same results -- so either this is an even older regression (I don't wan't to attempt Finit v4.0 or older, at least not yet), or it's a toolchain bug. I'll rebuild everything with the Bootlin stable toolchain, which has GCC 10.3, slightly older binutils and gdb.

liuming50 commented 2 years ago

OK, I've tried reverting back to Finit v4.1, with the same results -- so either this is an even older regression (I don't wan't to attempt Finit v4.0 or older, at least not yet), or it's a toolchain bug. I'll rebuild everything with the Bootlin stable toolchain, which has GCC 10.3, slightly older binutils and gdb.

Yes, I believe this issue relates to toolchain, I have tried finit 4.2 with gcc-7 as described above, the issue did not show up with gcc-7.

troglobit commented 2 years ago

Definitely toolchain, I may have isolated this to GLIBC now, potentially GLIBC v2.34, related to dlopen() and initialization of the plugins. In plugins.c:load_one() I can see dlopen() calling the constructor of each plugin, which registers fine and with proper contents of it's plugin_t, but on returning from dlopen() and dumping the contents of the plugin_t hooks the contents have been corrupted on some plugins.

liuming50 commented 2 years ago

Definitely toolchain, I may have isolated this to GLIBC now, potentially GLIBC v2.34, related to dlopen() and initialization of the plugins. In plugins.c:load_one() I can see dlopen() calling the constructor of each plugin, which registers fine and with proper contents of it's plugin_t, but on returning from dlopen() and dumping the contents of the plugin_t hooks the contents have been corrupted on some plugins.

Good finding! Shall we report this to glibc or have you checked if it's a known issue already for glibc? I will also do some investigation on this.

troglobit commented 2 years ago

I reverted back to a gcc 9 toolchain, with older binutils, but still no luck (same glibc 2.34) so I started playing around in glibc when I suddenly remembered: glibc 2.34 introduces the new 64-bit time_t to workaround the UNIX 2038 bug on 32-bit targets. And as I mentioned previously I could only reproduce this on 32-bit targets ...

... sure enough, turns out finit was compiled with the correct flags to get 64-bit time_t, but not the plugins! So they were writing happy data to the wrong offsets :disappointed:

Could you check if the following patch fixes the issue also for you guys? Then we can perhaps close this and move on to more productive things in life.

diff --git a/plugins/Makefile.am b/plugins/Makefile.am
index bd50c59..4f2c719 100644
--- a/plugins/Makefile.am
+++ b/plugins/Makefile.am
@@ -3,7 +3,7 @@ AM_LDFLAGS          = -module -avoid-version -shared
 AM_CFLAGS           = -W -Wall -Wextra -Wno-unused-parameter -std=gnu99
 AM_CPPFLAGS         = -I$(top_srcdir)/src -U_FORTIFY_SOURCE
 AM_CPPFLAGS        += -D_XOPEN_SOURCE=600 -D_BSD_SOURCE -D_GNU_SOURCE -D_DEFAULT_SOURCE
-AM_CPPFLAGS        += $(lite_CFLAGS)
+AM_CPPFLAGS        += $(lite_CFLAGS) $(uev_CFLAGS)

 if STATIC
 noinst_LTLIBRARIES  = libplug.la
troglobit commented 2 years ago

(Just to clarify, at this point I see no problem with GCC 11.2 or GLIBC 2.34, so no need to report anything upstream to them.)

troglobit commented 2 years ago

@liuming50 It would help a LOT if you could confirm that the above patch to plugins/Makefile.am solves your problem too. It has solved all our problems at Westermo, and I would like to do another release asap for this nasty bug,

liuming50 commented 2 years ago

@liuming50 It would help a LOT if you could confirm that the above patch to plugins/Makefile.am solves your problem too. It has solved all our problems at Westermo, and I would like to do another release asap for this nasty bug,

Hi, @troglobit

Sorry, I thought I had replied through my gmail client, but apparently my comments not sent.

It fixes my problems too, with the latest finit commit: ebdc10d1c592937504c807122665b1a64069c348 [ Bump version for 4.3 development cycle ]

all our plugins seems working well! Great work!

troglobit commented 2 years ago

Thank you, that makes me very happy to hear! 😅

I'll be starting the release prep in a day or two. Thanks again for reporting this!