Closed Fryguy closed 1 month ago
So as mentioned on the other issue, I can't repro, even with 3.0.3 in Docker.
But it's a Debian base on ARM64, not Red Hat, so it's hard to tell what kind of patches may have been applied to Ruby, whether the bug is x86 only, or whether it depends on some specific system tuning.
The "RSTRING_PTR is returning NULL!!" message makes me wonder if perhaps overcommit is disabled on that system? That would cause a huge alloc to return NULL rather than a valid pointer. Just a guess.
Looks like overcommit is enabled
$ cat /proc/sys/vm/overcommit_memory
0
$ cat /proc/sys/vm/overcommit_ratio
50
$ cat /proc/sys/vm/overcommit_kbytes
0
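For readers unfamiliar with these settings: a value of 0 in /proc/sys/vm/overcommit_memory selects the kernel's default heuristic overcommit, 1 always overcommits, and 2 disables it (strict accounting). A minimal Ruby sketch for interpreting the value follows; the helper name overcommit_mode is hypothetical and only there for illustration.

```ruby
# Hypothetical helper: maps the contents of /proc/sys/vm/overcommit_memory
# to a human-readable description of the kernel's overcommit policy.
def overcommit_mode(raw_value)
  case raw_value.to_s.strip
  when "0" then :heuristic # default: overcommit allowed, obvious overcommits refused
  when "1" then :always    # never refuse an allocation
  when "2" then :never     # strict accounting, large allocations can fail
  else :unknown
  end
end

path = "/proc/sys/vm/overcommit_memory"
# Only meaningful on Linux; skipped silently elsewhere.
puts "overcommit mode: #{overcommit_mode(File.read(path))}" if File.exist?(path)
```

With the output shown above (0), this reports the heuristic mode, so overcommit is not disabled.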
Yeah, probably isn't that. I'm trying to get a reproducer in Docker using centos:8, but I'm running into all sorts of issues (e.g. can't resolve mirrorlist.centos.org), not being familiar at all with the RH/CentOS family of distros :/
Oh yeah, centos 8 is a pain because they just went into the vault. Unfortunately, we have to cut a new version of an older release, and that's what that release is on...it wasn't until cutting the release that we found this segfault.
I'm also trying to get a docker image - here's what I got so far, but it passes just fine (using ubi8)
FROM registry.access.redhat.com/ubi8/ubi
RUN dnf -y module enable ruby:3.0 && \
dnf -y install ruby ruby-devel gcc make
RUN gem install msgpack -v 1.7.2
ADD segfault.rb /
ENTRYPOINT ["ruby", "segfault.rb"]
Ok, so I managed to get Ruby 3.0.4 installed from centos 8 after quite a lot of pain, but still doesn't reproduce for me (still on ARM64 though).
Not sure if this helps, but looking at the generated Makefiles for the failing 1.7.2 and the working 1.7.3 I noticed this:
diff --git a/opt/manageiq/manageiq-gemset/gems/msgpack-1.7.2/ext/msgpack/Makefile b/opt/manageiq/manageiq-gemset/gems/msgpack-1.7.3/ext/msgpack/Makefile
index 69a2a0c..9106c87 100644
--- a/opt/manageiq/manageiq-gemset/gems/msgpack-1.7.2/ext/msgpack/Makefile
+++ b/opt/manageiq/manageiq-gemset/gems/msgpack-1.7.3/ext/msgpack/Makefile
@@ -32,8 +32,8 @@ rubygemsdir = $(DESTDIR)/usr/share/rubygems
vendorarchdir = $(DESTDIR)/usr/lib64/ruby/vendor_ruby
vendorlibdir = $(vendordir)
vendordir = $(DESTDIR)/usr/share/ruby/vendor_ruby
-sitearchdir = $(DESTDIR)./.gem.20241007-2975-tj4odq
-sitelibdir = $(DESTDIR)./.gem.20241007-2975-tj4odq
+sitearchdir = $(DESTDIR)./.gem.20241015-14952-yysfs2
+sitelibdir = $(DESTDIR)./.gem.20241015-14952-yysfs2
sitedir = $(DESTDIR)/usr/local/share/ruby/site_ruby
rubyarchdir = $(rubyarchprefix)
rubylibdir = $(rubylibprefix)
@@ -84,7 +84,7 @@ debugflags = -ggdb3
warnflags = -Wall -Wextra -Wdeprecated-declarations -Wduplicated-cond -Wimplicit-function-declaration -Wimplicit-int -Wmisleading-indentation -Wpointer-arith -Wwrite-strings -Wimplicit-fallthrough=0 -Wmissing-noreturn -Wno-cast-function-type -Wno-constant-logical-operand -Wno-long-long -Wno-missing-field-initializers -Wno-overlength-strings -Wno-packed-bitfield-compat -Wno-parentheses-equality -Wno-self-assign -Wno-tautological-compare -Wno-unused-parameter -Wno-unused-value -Wsuggest-attribute=format -Wsuggest-attribute=noreturn -Wunused-variable
cppflags =
CCDLFLAGS = -fPIC
-CFLAGS = $(CCDLFLAGS) -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC -fvisibility=hidden -I.. -Wall -O3 -std=gnu99 -ggdb3 -DHASH_ASET_DEDUPE=1 -DSTR_UMINUS_DEDUPE_FROZEN=1 $(ARCH_FLAG)
+CFLAGS = $(CCDLFLAGS) -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fPIC -fvisibility=hidden -I.. -Wall -O3 -std=gnu99 -ggdb3 -DRB_ENC_INTERNED_STR_NULL_CHECK=1 -DHASH_ASET_DEDUPE=1 -DSTR_UMINUS_DEDUPE_FROZEN=1 $(ARCH_FLAG)
INCFLAGS = -I. -I$(arch_hdrdir) -I$(hdrdir)/ruby/backward -I$(hdrdir) -I$(srcdir)
DEFS =
CPPFLAGS = -DHAVE_RB_ENC_INTERNED_STR -DHAVE_RB_PROC_CALL_WITH_BLOCK $(DEFS) $(cppflags)
@@ -110,7 +110,6 @@ sitearch = $(arch)
ruby_version = 3.0.0
ruby = $(bindir)/$(RUBY_BASE_NAME)
RUBY = $(ruby)
-BUILTRUBY = $(bindir)/$(RUBY_BASE_NAME)
ruby_headers = $(hdrdir)/ruby.h $(hdrdir)/ruby/backward.h $(hdrdir)/ruby/ruby.h $(hdrdir)/ruby/defines.h $(hdrdir)/ruby/missing.h $(hdrdir)/ruby/intern.h $(hdrdir)/ruby/st.h $(hdrdir)/ruby/subst.h $(arch_hdrdir)/ruby/config.h
RM = rm -f
That -DRB_ENC_INTERNED_STR_NULL_CHECK=1 feels suspicious. Not sure what causes that to appear.
RB_ENC_INTERNED_STR_NULL_CHECK
Oh yeah, I was suspecting that, and that will do it.
This flag is for a known Ruby 3.0 bug. We check the Ruby version for it; it was fixed in Ruby 3.0.5: https://github.com/msgpack/msgpack-ruby/blob/6bbaa97600430c438675540e1f970d61ce5ccd9e/ext/msgpack/extconf.rb#L18
If somehow your generated Makefile didn't set this flag, then an empty string combined with the freeze: true flag will run into exactly the bug you describe.
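To make the build-time decision concrete, here is a rough sketch of a version gate of this kind. It is not the actual msgpack extconf.rb. The helper names are hypothetical, and the exact boundary (Ruby 3.0.x before 3.0.5) is an assumption taken from the statement above that the bug was fixed in 3.0.5.

```ruby
require "rubygems" # for Gem::Version (normally already loaded)

# Hypothetical helper, for illustration only.
# Assumption: the bug affects Ruby 3.0.x releases older than 3.0.5.
def needs_interned_str_null_check?(ruby_version)
  v = Gem::Version.new(ruby_version)
  v >= Gem::Version.new("3.0.0") && v < Gem::Version.new("3.0.5")
end

# Returns the extra compiler flags a build against the given Ruby would get.
def extra_cflags(ruby_version)
  if needs_interned_str_null_check?(ruby_version)
    ["-DRB_ENC_INTERNED_STR_NULL_CHECK=1"]
  else
    []
  end
end

puts extra_cflags("3.0.4").inspect # build on an affected Ruby: workaround compiled in
puts extra_cflags("3.0.7").inspect # build on a fixed Ruby: workaround omitted
```

The key point is that the decision is made once, when the extension is compiled, using the Ruby doing the build, not the Ruby that later loads the compiled extension.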
Awesome this helps a lot - It's very likely this, though I'm not sure why the gem we built is missing this for one but not the other...I can dig into this, but I'm suspecting this is the reason.
@byroot Thank you so much for digging into this with me...we understand what's happening now so I'm going to close this.
Our application has 2 deployable form factors, containerized and a virtual machine appliance. Our build process first builds our application and its dependencies into rpms from within a container env (which is ubi8) and those rpms are pushed to a yum repository. Later we install those rpms into the final deployable container image (which is ubi8) as well as a virtual machine appliance image (which is centos8-stream).
So, the weird part here is that centos8-stream is EOL, but Red Hat has been continuing to keep ubi8 up to date, and they are now out of sync. centos8-stream ships with Ruby 3.0.4 and ubi8 ships with Ruby 3.0.7. Thus, when we built msgpack it was built against Ruby 3.0.7, but then when we deployed in the appliance it was running with Ruby 3.0.4.
When I manually built 1.7.3 on the appliance during this investigation, what I actually did was align the Ruby versions unknowingly, which is why it worked. 1.7.3 wasn't the fix at all. In fact, I uninstalled the 1.7.2 gem, and just rebuilt it again on the appliance, and suddenly 1.7.2 worked, because, again, I just aligned the Ruby version.
It really was just a perfect storm where the Ruby versions just happened to straddle this exact bug 😭 .
Again thank you for helping us dig in!
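For anyone shipping prebuilt native extensions in a similar way, a guard along these lines could surface this kind of build/runtime mismatch early. This is only a sketch: the helper name is hypothetical, and how the build-time Ruby version gets recorded (for example, written to a file during packaging) is left as an assumption.

```ruby
require "rubygems" # for Gem::Version (normally already loaded)

# Hypothetical helper: returns a warning string when the Ruby that compiled a
# native extension differs from the Ruby that is running it, otherwise nil.
def build_runtime_mismatch(built_with, running = RUBY_VERSION)
  return nil if Gem::Version.new(built_with) == Gem::Version.new(running)

  "native extension built against Ruby #{built_with} but running on Ruby #{running}"
end

# The situation from this issue: built on ubi8, deployed on centos8-stream.
warning = build_runtime_mismatch("3.0.7", "3.0.4")
warn warning if warning
```

A strict equality check is deliberately conservative. Patch releases are normally compatible, but as this issue shows, a build-time workaround can depend on the exact patch level.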
when we built msgpack it was built against Ruby 3.0.7, but then when we deployed in the appliance it was running with Ruby 3.0.4.
Ouch. Yeah, that hurt.
I came here from https://github.com/msgpack/msgpack-ruby/pull/368 just to find out how the episode of the odd segfault ended :D
Kudos to both of you for dedication on a single bug and for sharing this!
Extracting the conversation from https://github.com/msgpack/msgpack-ruby/pull/368#issuecomment-2414673285, we are getting a segfault that seems to have been fixed by #368 (v1.7.3), however I'm concerned that PR didn't actually fix it, or that something else is going on, so I'm opening this issue to continue the investigation.
This was only happening in our CentOS 8 VM with Red Hat's Ruby 3.0.4, and we couldn't get it to happen on any other system, so it's been difficult to track down. After upgrading to msgpack 1.7.3, the problem goes away.
The smallest reproducer I have so far is:
Full stack trace:
Some system info: