Closed liuming50 closed 2 years ago
Interesting, I'm debugging a similar issue with hooks on reboot, right now! Also with very new toolchain (from https://toolchains.bootlin.com/)
I'll have to check if I see the same over here, and if so figure out why your patch helps ... but in my case it looks like memory corruption because at reboot the HOOK_SHUTDOWN is called, and for some reason the netlink.so plugin has a callback for this hook on address '0x1' so obviously it segfaults?? I do not see this for 64-bit targets (We have x86_64 and Aarch64 at Westermo), only for our 32-bit targets (Arm v5/v6 and PowerPC 32-bit).
If I add a canary before HOOK_MAX_NUM, like this, it works:
diff --git a/src/plugin.h b/src/plugin.h
index 83d9891..38f16e3 100644
--- a/src/plugin.h
+++ b/src/plugin.h
@@ -82,6 +82,7 @@
\
/* Shutdown hooks, runlevel [06] */ \
CHOOSE(HOOK_SHUTDOWN, "hook/sys/shutdown"), \
+ CHOOSE(HOOK_CANARY, "nop"), \
CHOOSE(HOOK_MAX_NUM, "nop") \
}
presumably because the corruption writes to the canary instead. Need more gdb time to isolate this.
I'll have a look at your vector of approach and see if I can find something. But you could perhaps help out with a bisect from your end?
OK, I've tried reverting back to Finit v4.1, with the same results -- so either this is an even older regression (I don't wan't to attempt Finit v4.0 or older, at least not yet), or it's a toolchain bug. I'll rebuild everything with the Bootlin stable toolchain, which has GCC 10.3, slightly older binutils and gdb.
OK, I've tried reverting back to Finit v4.1, with the same results -- so either this is an even older regression (I don't wan't to attempt Finit v4.0 or older, at least not yet), or it's a toolchain bug. I'll rebuild everything with the Bootlin stable toolchain, which has GCC 10.3, slightly older binutils and gdb.
Yes, I believe this issue relates to toolchain, I have tried finit 4.2 with gcc-7 as described above, the issue did not show up with gcc-7.
Definitely toolchain, I may have isolated this to GLIBC now, potentially GLIBC v2.34, related to dlopen() and initialization of the plugins. In plugins.c:load_one() I can see dlopen() calling the constructor of each plugin, which registers fine and with proper contents of it's plugin_t, but on returning from dlopen() and dumping the contents of the plugin_t hooks the contents have been corrupted on some plugins.
Definitely toolchain, I may have isolated this to GLIBC now, potentially GLIBC v2.34, related to dlopen() and initialization of the plugins. In plugins.c:load_one() I can see dlopen() calling the constructor of each plugin, which registers fine and with proper contents of it's plugin_t, but on returning from dlopen() and dumping the contents of the plugin_t hooks the contents have been corrupted on some plugins.
Good finding! Shall we report this to glibc or have you checked if it's a known issue already for glibc? I will also do some investigation on this.
I reverted back to a gcc 9 toolchain, with older binutils, but still no luck (same glibc 2.34) so I started playing around in glibc when I suddenly remembered: glibc 2.34 introduces the new 64-bit time_t to workaround the UNIX 2038 bug on 32-bit targets. And as I mentioned previously I could only reproduce this on 32-bit targets ...
... sure enough, turns out finit was compiled with the correct flags to get 64-bit time_t, but not the plugins! So they were writing happy data to the wrong offsets :disappointed:
Could you check if the following patch fixes the issue also for you guys? Then we can perhaps close this and move on to more productive things in life.
diff --git a/plugins/Makefile.am b/plugins/Makefile.am
index bd50c59..4f2c719 100644
--- a/plugins/Makefile.am
+++ b/plugins/Makefile.am
@@ -3,7 +3,7 @@ AM_LDFLAGS = -module -avoid-version -shared
AM_CFLAGS = -W -Wall -Wextra -Wno-unused-parameter -std=gnu99
AM_CPPFLAGS = -I$(top_srcdir)/src -U_FORTIFY_SOURCE
AM_CPPFLAGS += -D_XOPEN_SOURCE=600 -D_BSD_SOURCE -D_GNU_SOURCE -D_DEFAULT_SOURCE
-AM_CPPFLAGS += $(lite_CFLAGS)
+AM_CPPFLAGS += $(lite_CFLAGS) $(uev_CFLAGS)
if STATIC
noinst_LTLIBRARIES = libplug.la
(Just to clarify, at this point I see no problem with GCC 11.2 or GLIBC 2.34, so no need to report anything upstream to them.)
@liuming50 It would help a LOT if you could confirm that the above patch to plugins/Makefile.am
solves your problem too. It has solved all our problems at Westermo, and I would like to do another release asap for this nasty bug,
@liuming50 It would help a LOT if you could confirm that the above patch to
plugins/Makefile.am
solves your problem too. It has solved all our problems at Westermo, and I would like to do another release asap for this nasty bug,
Hi, @troglobit
Sorry, I thought I had replied through my gmail client, but apparently my comments not sent.
It fixes my problems too, with the latest finit commit: ebdc10d1c592937504c807122665b1a64069c348 [ Bump version for 4.3 development cycle ]
all our plugins seems working well! Great work!
Thank you, that makes me very happy to hear! 😅
I'll be starting the release prep in a day or two. Thanks again for reporting this!
We observed a issue that seems finit does not work if we compile it with gcc-11.2/glibc-2.34, the problem is the plugin hook index seems is messed up for some reason.
For instance, if a plugin is registered as HOOK_BASEFS_UP, when the hook function being called at runtime, it's actually called by plugin_run_hooks(HOOK_MOUNT_ERROR).
The test is based on the latest finit commit: f32e4d4b3013ec3bf7130e5a16a4c63a226a906b [ Update changelog and bump version for v4.2 release ]
The compiling flags we use for gcc (in Yocto):
The problem not exist if we compile finit with gcc-7.3/glibc-2.27, same compiling flags.
I have a ugly workaround as attached: 0001-Fix-a-hook-index-issue.patch, with it applied, then the plugin hook calling seems correct, but after that, the svc/pid conditions files not created in /var/run/finit/cond/.