Closed flavorjones closed 1 year ago
Interesting: when I install 3.2.0-rc1 using ruby-build
then that version of Ruby is able to run this gem fine. So maybe this is a problem with how ruby/setup-ruby
is building 3.2.0-rc1?
The _rb_cObject
should be in libruby, but libruby isn't mentioned in the above error output. I guess it has something to do with this commit https://github.com/ruby/ruby/commit/50d81bf . On Macos we use the same trick like on Linux, that unresolved symbols are automatically resolved to already loaded libraries. Therefore the nokogiri.bundle
doesn't link to libruby explicit, but is still resolved at runtime.
Looks like if I build ruby with --enable-shared
then this happens at runtime. When I build without that flag, ruby loads the bundle fine.
Here was a similar issue with nokogiri on Macos: https://github.com/sparklemotion/nokogiri/issues/2063#issuecomment-687187602
Wow, thanks for refreshing my memory, I have very little recollection of doing that work. :shrug:
So at this point, I think the 3.2 bundle is fine ... do you agree?
I want to understand better what's changed about the Ruby build system, but as long as --disable-shared
(which is implicit by default) is set, users should be fine. Do you agree?
Tagging @kateinoigakukun in case he spots something that's wrong with either the rake-compiler-dock build process or Ruby's makefiles.
@flavorjones Unfortunately, an extension library compiled with a --enable-shared
'ed ruby is not compatible with a --disable-shared
'ed ruby, and vice versa.
I think the compatibility has not been guaranteed but it just worked so far. The bundle_loader
change changed the situation, so we can't assume that fat gems are portable among rubies configured with different options.
@kateinoigakukun Thanks for confirming.
@larskanis Do you know why the 3.2.0-rc1 ruby binaries installed by ruby/setup-ruby
aren't compatible? I looked in the repo but nothing jumped out at me.
@flavorjones Which ruby binaries aren't compatible with what? Do you have a CI link?
@larskanis Darwin 3.2.0-rc1 binaries aren't compatible with the extensions being built by rake-compiler-dock.
CI link: https://github.com/sparklemotion/nokogiri/actions/runs/3735058984/jobs/6350173407
The CI config is:
cruby-x86_64-darwin-install:
needs: ["cruby-package"]
strategy:
fail-fast: false
matrix:
ruby: ["2.7", "3.0", "3.1", "3.2.0-rc1"]
runs-on: macos-latest
steps:
- uses: actions/checkout@v3
with:
submodules: true
- uses: ruby/setup-ruby@v1
with:
ruby-version: "${{matrix.ruby}}"
- uses: actions/download-artifact@v3
with:
name: cruby-x86_64-darwin-gem
path: gems
- run: ./scripts/test-gem-install gems
and it's passing for the other Ruby versions.
https://github.com/rake-compiler/rake-compiler-dock/pull/90 reproduces this error in rake-compiler-dock's CI pipeline
@larskanis I confirmed that the rubies built by ruby-builder (and downloadable from here) are built with --enable-shared
.
Not sure what you think we should do here. I just pushed #91 to see what will break if we set --enable-shared
on darwin.
Unfortunately, an extension library compiled with a
--enable-shared
'ed ruby is not compatible with a--disable-shared
'ed ruby, and vice versa
@kateinoigakukun This change may have unintended consequences. Over the past two years, nokogiri precompiled for darwin has been downloaded 12,062,659 times, so I'm very worried that if we can no longer ship precompiled binaries for darwin that this is going to be a burden on both users and gem maintainers.
Is the recommended workaround to ship two bundles for darwin? One for rubies configured with --enable-shared
and one for rubies configured with --disable-shared
?
@flavorjones I've committed a patch to restore the compatibility nokogiri has been expecting so far: "An extension built with --disable-shared Ruby is loadable from --enable-shared Ruby". https://github.com/ruby/ruby/commit/c5eefb7f37db2865891298dd1a1e60dff09560ad
I think it would repair your build and you don't need to distribute two dylibs for darwin now :)
@kateinoigakukun Thank you! I will run that change through the rake-compiler-dock test suite!
Here's how I tested: I'm still using 3.2.0-rc1
as the cross-compiling Ruby, but added append_ldflags("-Wl,-flat_namespace")
to Nokogiri's extconf.rb
.
The good news is that a gem precompiled with that flag does run on darwin both with a "enable-shared ruby" and a "disable-shared ruby".
However, the bad news is that the libxml2 and libxslt symbols are no longer being resolved from within the bundle -- they are resolving against system libraries! I see these warnings:
WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
These warnings are generated by comparing the version of libxml2 present in LIBXML_DOTTED_VERSION
at compile time against the value present in the global variable xmlParserVersion
at runtime.
They should always match, but in gems built this way, xmlParserVersion
is being resolved from system libraries.
Looking at the symbols present in the bundle, the libxml2 symbols are properly in the text (code) section, denoted with a "T":
/opt/osxcross/target/bin/arm64-apple-darwin-nm nokogiri.bundle | fgrep xmlParserVersion
00000000000c21d0 T ___xmlParserVersion
00000000002aaa10 D _xmlParserVersion
@kateinoigakukun Do you understand why the bundle isn't being used as the primary location for resolving symbols? Should I play with -fvisibility=hidden
and __attribute__((visibility("default")))
?
I've pushed the workaround mentioned in my previous comment here, in case anyone needs to reproduce:
https://github.com/sparklemotion/nokogiri/pull/2732
I'll link to the specific warnings when CI gets there.
Hmm, very strange, it works fine in the actions runner: https://github.com/sparklemotion/nokogiri/actions/runs/3767005588/jobs/6404353192#step:5:36
Maybe this is close enough for me to cut a release candidate of Nokogiri and get feedback from people.
OK, I'm not sure why I can't reproduce this in github runners, but I've reproduced now the symbol-loading error on three different macs.
When I run vmmap
on a ruby
process, I can see a /usr/lib/libxml2.2.dylib
entry which I believe is mapped to /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/usr/lib/
(the xcode CLT libraries). (This is also true for 3.1 rubies, so nothing new there.)
But I believe what is new is that now function calls from the nokogiri.bundle are no longer resolved to symbols that exist in the bundle, but to symbols in the ruby process. This makes sense given the explanation of the -flat_namespace
change.
I'm not sure how to get unblocked from here.
OK, I made some progress with trying -load_hidden
when I add the static libraries for libxml2 and libxslt to the bundle. I might be unblocked.
See https://github.com/sparklemotion/nokogiri/pull/2732/commits/1714e7786b46f3b8533f81c38594d72e5e94054d for what seems to work.
Well, -load_hidden
doesn't completely work, either -- it means all symbols are hidden and that breaks the ABI of the extensions which some other gems (like nokogiri-xmlsec
) depend on (nokogiri-xmlsec
needs to use the same libxml2 library that nokogiri uses).
@kateinoigakukun This all seems really bad. See some conversation with another nokogiri maintainer at https://github.com/sparklemotion/nokogiri/pull/2732 where we're proposing to try to undo these changes when we build the extension.
@kateinoigakukun Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace
is not a sufficient solution.
I reverted katei's fix with -flat_namespace
from Ruby 3.2.0. I continue to discuss how to fix dynamic_lookup problem toward Ruby 3.2.1.
Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.
Thank you for clarifying the problem. What I am wondering now is that is the problem is really the problem?
I fixed the files as below and its result is as the same as flat_namespace. Therefore if a gem cause a problem with flat_namespace, it doesn't work on Linux.
foo_version(): 1
init()
function_in_main()
foo_version(): 1
diff --git a/Makefile b/Makefile
index 9f6f601..8df8164 100644
--- a/Makefile
+++ b/Makefile
@@ -10,29 +10,29 @@ endif
.PHONY: all clean
-all: main main_shared extension.bundle
+all: main main_shared extension.so
-main_shared: main.c libmain.dylib libfoo1.dylib
+main_shared: main.c libmain.so libfoo1.so
$(CC) -o $@ -DENABLE_SHARED=1 main.c -L. -lmain -lfoo1
-main: main.c libfoo1.dylib
- $(CC) -o $@ main.c -L. -lfoo1
+main: main.c libfoo1.so
+ $(CC) -o $@ main.c -L. -rdynamic -ldl -lfoo1
-libmain.dylib: main.c libfoo1.dylib
- $(CC) -dynamiclib -DLIBMAIN=1 -o $@ main.c -L. -lfoo1
+libmain.so: main.c libfoo1.so
+ $(CC) -shared -fPIC -DLIBMAIN=1 -o $@ main.c -L. -ldl -lfoo1
-libfoo1.dylib: foo.c
- $(CC) -dynamiclib -DFOO_VERSION=1 -o $@ $<
+libfoo1.so: foo.c
+ $(CC) -shared -fPIC -DFOO_VERSION=1 -o $@ $<
-extension.bundle: extension.c libfoo2.a
- $(CC) -o $@ -bundle $(NAMESPACE_OPTS) extension.c libfoo2.a
+extension.so: extension.c libfoo2.a
+ $(CC) -o $@ -shared -fPIC extension.c libfoo2.a
foo2.o: foo.c
- $(CC) -c -o $@ -DFOO_VERSION=2 $^
+ $(CC) -c -fPIC -o $@ -DFOO_VERSION=2 $^
libfoo2.a: foo2.o
$(RM) $@
ar -cr $@ $^
clean:
- $(RM) main main_shared *.bundle *.dylib *.o *.a
+ $(RM) main main_shared *.bundle *.so *.o *.a
diff --git a/main.c b/main.c
index 30ba3fe..946fe9c 100644
--- a/main.c
+++ b/main.c
@@ -14,7 +14,7 @@ static
#endif
int main_implementation() {
// Load the library.
- void *handle = dlopen("extension.bundle", RTLD_LAZY | RTLD_GLOBAL);
+ void *handle = dlopen("extension.so", RTLD_LAZY | RTLD_GLOBAL);
if (!handle) {
fprintf(stderr, "dlopen(): %s", dlerror());
return 1;
Thank you all :pray: Will have a closer look next month.
@nurse Thanks for asking this question. Although I don't have time today to show that this works as expected on Linux, Nokogiri depends on this behavior working (calls from within the extension should resolve to symbols in static libraries linked) and I should be able to demonstrate that in the next day or two.
Ah - it just occurred to me that this may not be a problem on Linux unless libxml2 has been loaded.
The problem on Darwin is that XCode CLT appears to always load libxml2 into Ruby processes (which you can see if you run vmmap
on any running Ruby process), at least for Ruby binaries built on the machine (via ruby-build or similar).
The answer may be that we hide those symbols, which seems to work the way we need, but might break gems that try to link against Nokogiri. I'm ok with breaking that compatibility.
@nurse You're right, of course: I had to work around symbol collision a few years back in https://github.com/sparklemotion/nokogiri/pull/2106
I think the big problem here now is that, on Darwin, the Ruby process seems very likely to have loaded the system libxml2 (if Ruby was built with XCode CLT), and so suddenly all of libxml2 and libxslt are colliding with already-loaded symbols.
The two options I know of to fix symbol resolution on Darwin are described here: https://github.com/rake-compiler/rake-compiler-dock/pull/92#issuecomment-1366000890
I've written up a brief summary of how Nokogiri is working around this at nokogiri/2022-12-darwin-symbol-resolution.md at main · sparklemotion/nokogiri · GitHub
OK, after shipping an RC for nokogiri native gems, and working on sqlite3-ruby native gems, it looks like everybody who precompiles for Darwin and Ruby 3.2 will need to add something like this to their extconf:
if cross_build? &&
darwin? &&
RbConfig::CONFIG["ruby_version"] >= "3.2"
append_ldflags("-Wl,-flat_namespace")
end
@kateinoigakukun Do you have any ideas on how to make this situation better, potentially in a future Ruby 3.2.1 release?
@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS
in the rake-compiler rbconfig.rb
files?
@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the
DLDFLAGS
in the rake-compilerrbconfig.rb
files?
Yes, I think so. We shouldn't demand these lines.
@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the
DLDFLAGS
in the rake-compilerrbconfig.rb
files?Yes, I think so. We shouldn't demand these lines.
OK, will put together a PR
If we merge #94, then I think we should cut a release soon.
Then I can confidently document how to deal with global symbol resolution in https://github.com/flavorjones/ruby-c-extensions-explained. (I think we need to start recommending that everyone build their extension (and all libraries that are statically linked into it) with -fvisibility=hidden
and then only export the Init_*
function, like I did in https://github.com/sparklemotion/sqlite3-ruby/pull/371.)
I've merged #94 and I think we can cut a release. I'm building another rc for sqlite3-ruby based on images generated from https://github.com/rake-compiler/rake-compiler-dock/commit/ce619f2ee5e71fa787d0768ca74d85fb35017e20 and they look good.
@larskanis I think you're still the only owner of the dockerhub account, but if there is anything I can do to help with a release, please let me know?
@flavorjones As far as I can see, there is no way to add other contributors to the docker hub account. The only way is to convert "larskanis" to an organization, which doesn't feel like a good solution. Instead I think it makes sense to create a new organization and publish the images there? Maybe the name "rake-compiler" like on Github? Then we can change the name in the gem as well.
Nevertheless I can do the release and publish the images on the "larskanis" account.
I did some tests on my side (with fxruby, libusb, ffi and ruby-pg) and the latest images worked well. I found only one issue with RbConfig::CONFIG['host']
, which returns x64-w64-mingw32
on ruby-3.2 on x64-mingw-ucrt instead of the x86_64-w64-mingw32
of ruby-3.1 and so it differs from the compiler name prefix. But I found that the host
variable is already different on the linux images, so that I changed the code in fxruby to work with the latest images.
@larskanis I'd like to again propose that we use ghcr.io instead of dockerhub as the future home of these images. I started this work to demonstrate it could work with the weekly snapshots. We could similarly automate creation of the versioned images (and we can cache layers to ensure that image diffs stay small between patch releases).
@larskanis Would it be helpful if I did the work to cut over to ghcr.io to be the primary registry for RCD images so that I'm able to cut a release? You could then sync those images to dockerhub for convenience for folks. I should have some time this week to do that if you're not able to cut a release.
I didn't work with ghcr.io so far, so I'd be happy if you could prepare it. I don't care too much about the platform we use to distribute our images. So if ghcr.io fulfills our needs, I think we can switch the gem to it with no need to mirror the images on docker hub.
See #95! :heart:
Closing this, I think we have a path forward for Ruby 3.2.
@larskanis I'm hoping you can help me out here, I'm not sure what's going wrong in the toolchain.
I've built a version of Nokogiri that precompiles Ruby 3.2 shared objects using the docker images published here. (The pipeline to build and test Nokogiri is here if you're curious, and the PR is here).
For both
x86_64-darwin
andarm64-darwin
(the latter I run manually on my dev machine), I see this runtime error:You can see this in the CI logs here. You can download this gem yourself if you like from the "Artifacts" section of this pipeline run:
Comparing the 3.2 dylib and the 3.1 dylib (which still works), both have
_rb_cObject
listed asU
ndefined, so I'm not actually sure the problem is in how the shared object is built. Do you have any ideas?