rake-compiler / rake-compiler-dock

Easy to use and reliable cross compiler environment for building Windows, Linux, Mac and JRuby binary gems.
MIT License
78 stars 30 forks source link

[Ruby 3.2] having runtime issues on darwin #87

Closed flavorjones closed 1 year ago

flavorjones commented 1 year ago

@larskanis I'm hoping you can help me out here, I'm not sure what's going wrong in the toolchain.

I've built a version of Nokogiri that precompiles Ruby 3.2 shared objects using the docker images published here. (The pipeline to build and test Nokogiri is here if you're curious, and the PR is here).

For both x86_64-darwin and arm64-darwin (the latter I run manually on my dev machine), I see this runtime error:

/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/extension.rb:7:in `require_relative': dlopen(/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle, 0x0009): Symbol not found: (_rb_cObject) (LoadError)
  Referenced from: '/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle'
  Expected in: '/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/bin/ruby' - /Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle

You can see this in the CI logs here. You can download this gem yourself if you like from the "Artifacts" section of this pipeline run:

Comparing the 3.2 dylib and the 3.1 dylib (which still works), both have _rb_cObject listed as Undefined, so I'm not actually sure the problem is in how the shared object is built. Do you have any ideas?

flavorjones commented 1 year ago

Interesting: when I install 3.2.0-rc1 using ruby-build then that version of Ruby is able to run this gem fine. So maybe this is a problem with how ruby/setup-ruby is building 3.2.0-rc1?

larskanis commented 1 year ago

The _rb_cObject should be in libruby, but libruby isn't mentioned in the above error output. I guess it has something to do with this commit https://github.com/ruby/ruby/commit/50d81bf . On Macos we use the same trick like on Linux, that unresolved symbols are automatically resolved to already loaded libraries. Therefore the nokogiri.bundle doesn't link to libruby explicit, but is still resolved at runtime.

flavorjones commented 1 year ago

Looks like if I build ruby with --enable-shared then this happens at runtime. When I build without that flag, ruby loads the bundle fine.

larskanis commented 1 year ago

Here was a similar issue with nokogiri on Macos: https://github.com/sparklemotion/nokogiri/issues/2063#issuecomment-687187602

flavorjones commented 1 year ago

Wow, thanks for refreshing my memory, I have very little recollection of doing that work. :shrug:

So at this point, I think the 3.2 bundle is fine ... do you agree?

I want to understand better what's changed about the Ruby build system, but as long as --disable-shared (which is implicit by default) is set, users should be fine. Do you agree?

flavorjones commented 1 year ago

Tagging @kateinoigakukun in case he spots something that's wrong with either the rake-compiler-dock build process or Ruby's makefiles.

kateinoigakukun commented 1 year ago

@flavorjones Unfortunately, an extension library compiled with a --enable-shared'ed ruby is not compatible with a --disable-shared'ed ruby, and vice versa.

I think the compatibility has not been guaranteed but it just worked so far. The bundle_loader change changed the situation, so we can't assume that fat gems are portable among rubies configured with different options.

flavorjones commented 1 year ago

@kateinoigakukun Thanks for confirming.

@larskanis Do you know why the 3.2.0-rc1 ruby binaries installed by ruby/setup-ruby aren't compatible? I looked in the repo but nothing jumped out at me.

larskanis commented 1 year ago

@flavorjones Which ruby binaries aren't compatible with what? Do you have a CI link?

flavorjones commented 1 year ago

@larskanis Darwin 3.2.0-rc1 binaries aren't compatible with the extensions being built by rake-compiler-dock.

CI link: https://github.com/sparklemotion/nokogiri/actions/runs/3735058984/jobs/6350173407

The CI config is:

  cruby-x86_64-darwin-install:
    needs: ["cruby-package"]
    strategy:
      fail-fast: false
      matrix:
        ruby: ["2.7", "3.0", "3.1", "3.2.0-rc1"]
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: "${{matrix.ruby}}"
      - uses: actions/download-artifact@v3
        with:
          name: cruby-x86_64-darwin-gem
          path: gems
      - run: ./scripts/test-gem-install gems

and it's passing for the other Ruby versions.

flavorjones commented 1 year ago

https://github.com/rake-compiler/rake-compiler-dock/pull/90 reproduces this error in rake-compiler-dock's CI pipeline

flavorjones commented 1 year ago

@larskanis I confirmed that the rubies built by ruby-builder (and downloadable from here) are built with --enable-shared.

Not sure what you think we should do here. I just pushed #91 to see what will break if we set --enable-shared on darwin.

flavorjones commented 1 year ago

Unfortunately, an extension library compiled with a --enable-shared'ed ruby is not compatible with a --disable-shared'ed ruby, and vice versa

@kateinoigakukun This change may have unintended consequences. Over the past two years, nokogiri precompiled for darwin has been downloaded 12,062,659 times, so I'm very worried that if we can no longer ship precompiled binaries for darwin that this is going to be a burden on both users and gem maintainers.

Is the recommended workaround to ship two bundles for darwin? One for rubies configured with --enable-shared and one for rubies configured with --disable-shared?

kateinoigakukun commented 1 year ago

@flavorjones I've committed a patch to restore the compatibility nokogiri has been expecting so far: "An extension built with --disable-shared Ruby is loadable from --enable-shared Ruby". https://github.com/ruby/ruby/commit/c5eefb7f37db2865891298dd1a1e60dff09560ad

I think it would repair your build and you don't need to distribute two dylibs for darwin now :)

flavorjones commented 1 year ago

@kateinoigakukun Thank you! I will run that change through the rake-compiler-dock test suite!

flavorjones commented 1 year ago

Here's how I tested: I'm still using 3.2.0-rc1 as the cross-compiling Ruby, but added append_ldflags("-Wl,-flat_namespace") to Nokogiri's extconf.rb.

The good news is that a gem precompiled with that flag does run on darwin both with a "enable-shared ruby" and a "disable-shared ruby".

However, the bad news is that the libxml2 and libxslt symbols are no longer being resolved from within the bundle -- they are resolving against system libraries! I see these warnings:

WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
Here's the complete output from `nokogiri -v` ``` $ nokogiri -v WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13 WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35 # Nokogiri (1.14.0.dev) --- warnings: - Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13 - Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35 nokogiri: version: 1.14.0.dev cppflags: - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri" - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri/include" - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri/include/libxml2" ldflags: [] ruby: version: 3.2.0 platform: arm64-darwin22 gem_platform: arm64-darwin-22 description: ruby 3.2.0dev (2022-12-06T05:55:05Z v3_2_0_rc1 81e274c990) [arm64-darwin22] engine: ruby libxml: source: packaged precompiled: true patches: - 0001-Remove-script-macro-support.patch - 0002-Update-entities-to-remove-handling-of-ssi.patch - 0003-libxml2.la-is-in-top_builddir.patch - '0009-allow-wildcard-namespaces.patch' libxml2_path: "/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri" memory_management: ruby iconv_enabled: true compiled: 2.10.3 loaded: 2.9.13 libxslt: source: packaged precompiled: true patches: - 0001-update-automake-files-for-arm64.patch datetime_enabled: true compiled: 1.1.37 loaded: 1.1.35 other_libraries: zlib: 1.2.13 libiconv: '1.17' libgumbo: 1.0.0-nokogiri ```

These warnings are generated by comparing the version of libxml2 present in LIBXML_DOTTED_VERSION at compile time against the value present in the global variable xmlParserVersion at runtime.

They should always match, but in gems built this way, xmlParserVersion is being resolved from system libraries.

Looking at the symbols present in the bundle, the libxml2 symbols are properly in the text (code) section, denoted with a "T":

/opt/osxcross/target/bin/arm64-apple-darwin-nm nokogiri.bundle | fgrep xmlParserVersion
00000000000c21d0 T ___xmlParserVersion
00000000002aaa10 D _xmlParserVersion

@kateinoigakukun Do you understand why the bundle isn't being used as the primary location for resolving symbols? Should I play with -fvisibility=hidden and __attribute__((visibility("default")))?

flavorjones commented 1 year ago

I've pushed the workaround mentioned in my previous comment here, in case anyone needs to reproduce:

https://github.com/sparklemotion/nokogiri/pull/2732

I'll link to the specific warnings when CI gets there.

flavorjones commented 1 year ago

Hmm, very strange, it works fine in the actions runner: https://github.com/sparklemotion/nokogiri/actions/runs/3767005588/jobs/6404353192#step:5:36

Maybe this is close enough for me to cut a release candidate of Nokogiri and get feedback from people.

flavorjones commented 1 year ago

OK, I'm not sure why I can't reproduce this in github runners, but I've reproduced now the symbol-loading error on three different macs.

When I run vmmap on a ruby process, I can see a /usr/lib/libxml2.2.dylib entry which I believe is mapped to /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/usr/lib/ (the xcode CLT libraries). (This is also true for 3.1 rubies, so nothing new there.)

But I believe what is new is that now function calls from the nokogiri.bundle are no longer resolved to symbols that exist in the bundle, but to symbols in the ruby process. This makes sense given the explanation of the -flat_namespace change.

I'm not sure how to get unblocked from here.

flavorjones commented 1 year ago

OK, I made some progress with trying -load_hidden when I add the static libraries for libxml2 and libxslt to the bundle. I might be unblocked.

See https://github.com/sparklemotion/nokogiri/pull/2732/commits/1714e7786b46f3b8533f81c38594d72e5e94054d for what seems to work.

flavorjones commented 1 year ago

Well, -load_hidden doesn't completely work, either -- it means all symbols are hidden and that breaks the ABI of the extensions which some other gems (like nokogiri-xmlsec) depend on (nokogiri-xmlsec needs to use the same libxml2 library that nokogiri uses).

@kateinoigakukun This all seems really bad. See some conversation with another nokogiri maintainer at https://github.com/sparklemotion/nokogiri/pull/2732 where we're proposing to try to undo these changes when we build the extension.

flavorjones commented 1 year ago

@kateinoigakukun Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.

nurse commented 1 year ago

I reverted katei's fix with -flat_namespace from Ruby 3.2.0. I continue to discuss how to fix dynamic_lookup problem toward Ruby 3.2.1.

Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.

Thank you for clarifying the problem. What I am wondering now is that is the problem is really the problem?

I fixed the files as below and its result is as the same as flat_namespace. Therefore if a gem cause a problem with flat_namespace, it doesn't work on Linux.

foo_version(): 1
init()
function_in_main()
foo_version(): 1
diff --git a/Makefile b/Makefile
index 9f6f601..8df8164 100644
--- a/Makefile
+++ b/Makefile
@@ -10,29 +10,29 @@ endif

 .PHONY: all clean

-all: main main_shared extension.bundle
+all: main main_shared extension.so

-main_shared: main.c libmain.dylib libfoo1.dylib
+main_shared: main.c libmain.so libfoo1.so
        $(CC) -o $@ -DENABLE_SHARED=1 main.c -L. -lmain -lfoo1

-main: main.c libfoo1.dylib
-       $(CC) -o $@ main.c -L. -lfoo1
+main: main.c libfoo1.so
+       $(CC) -o $@ main.c -L. -rdynamic -ldl -lfoo1

-libmain.dylib: main.c libfoo1.dylib
-       $(CC) -dynamiclib -DLIBMAIN=1 -o $@ main.c -L. -lfoo1
+libmain.so: main.c libfoo1.so
+       $(CC) -shared -fPIC -DLIBMAIN=1 -o $@ main.c -L. -ldl -lfoo1

-libfoo1.dylib: foo.c
-       $(CC) -dynamiclib -DFOO_VERSION=1 -o $@ $<
+libfoo1.so: foo.c
+       $(CC) -shared -fPIC -DFOO_VERSION=1 -o $@ $<

-extension.bundle: extension.c libfoo2.a
-       $(CC) -o $@ -bundle $(NAMESPACE_OPTS) extension.c libfoo2.a
+extension.so: extension.c libfoo2.a
+       $(CC) -o $@ -shared -fPIC extension.c libfoo2.a

 foo2.o: foo.c
-       $(CC) -c -o $@ -DFOO_VERSION=2 $^
+       $(CC) -c -fPIC -o $@ -DFOO_VERSION=2 $^

 libfoo2.a: foo2.o
        $(RM) $@
        ar -cr $@ $^

 clean:
-       $(RM) main main_shared *.bundle *.dylib *.o *.a
+       $(RM) main main_shared *.bundle *.so *.o *.a
diff --git a/main.c b/main.c
index 30ba3fe..946fe9c 100644
--- a/main.c
+++ b/main.c
@@ -14,7 +14,7 @@ static
 #endif
 int main_implementation() {
     // Load the library.
-    void *handle = dlopen("extension.bundle", RTLD_LAZY | RTLD_GLOBAL);
+    void *handle = dlopen("extension.so", RTLD_LAZY | RTLD_GLOBAL);
     if (!handle) {
         fprintf(stderr, "dlopen(): %s", dlerror());
         return 1;
kateinoigakukun commented 1 year ago

Thank you all :pray: Will have a closer look next month.

flavorjones commented 1 year ago

@nurse Thanks for asking this question. Although I don't have time today to show that this works as expected on Linux, Nokogiri depends on this behavior working (calls from within the extension should resolve to symbols in static libraries linked) and I should be able to demonstrate that in the next day or two.

flavorjones commented 1 year ago

Ah - it just occurred to me that this may not be a problem on Linux unless libxml2 has been loaded.

The problem on Darwin is that XCode CLT appears to always load libxml2 into Ruby processes (which you can see if you run vmmap on any running Ruby process), at least for Ruby binaries built on the machine (via ruby-build or similar).

The answer may be that we hide those symbols, which seems to work the way we need, but might break gems that try to link against Nokogiri. I'm ok with breaking that compatibility.

flavorjones commented 1 year ago

@nurse You're right, of course: I had to work around symbol collision a few years back in https://github.com/sparklemotion/nokogiri/pull/2106

I think the big problem here now is that, on Darwin, the Ruby process seems very likely to have loaded the system libxml2 (if Ruby was built with XCode CLT), and so suddenly all of libxml2 and libxslt are colliding with already-loaded symbols.

The two options I know of to fix symbol resolution on Darwin are described here: https://github.com/rake-compiler/rake-compiler-dock/pull/92#issuecomment-1366000890

flavorjones commented 1 year ago

I've written up a brief summary of how Nokogiri is working around this at nokogiri/2022-12-darwin-symbol-resolution.md at main · sparklemotion/nokogiri · GitHub

flavorjones commented 1 year ago

OK, after shipping an RC for nokogiri native gems, and working on sqlite3-ruby native gems, it looks like everybody who precompiles for Darwin and Ruby 3.2 will need to add something like this to their extconf:

          if cross_build? &&
             darwin? &&
             RbConfig::CONFIG["ruby_version"] >= "3.2"
            append_ldflags("-Wl,-flat_namespace")
          end

@kateinoigakukun Do you have any ideas on how to make this situation better, potentially in a future Ruby 3.2.1 release?

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

larskanis commented 1 year ago

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

Yes, I think so. We shouldn't demand these lines.

flavorjones commented 1 year ago

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

Yes, I think so. We shouldn't demand these lines.

OK, will put together a PR

flavorjones commented 1 year ago

If we merge #94, then I think we should cut a release soon.

Then I can confidently document how to deal with global symbol resolution in https://github.com/flavorjones/ruby-c-extensions-explained. (I think we need to start recommending that everyone build their extension (and all libraries that are statically linked into it) with -fvisibility=hidden and then only export the Init_* function, like I did in https://github.com/sparklemotion/sqlite3-ruby/pull/371.)

flavorjones commented 1 year ago

I've merged #94 and I think we can cut a release. I'm building another rc for sqlite3-ruby based on images generated from https://github.com/rake-compiler/rake-compiler-dock/commit/ce619f2ee5e71fa787d0768ca74d85fb35017e20 and they look good.

@larskanis I think you're still the only owner of the dockerhub account, but if there is anything I can do to help with a release, please let me know?

larskanis commented 1 year ago

@flavorjones As far as I can see, there is no way to add other contributors to the docker hub account. The only way is to convert "larskanis" to an organization, which doesn't feel like a good solution. Instead I think it makes sense to create a new organization and publish the images there? Maybe the name "rake-compiler" like on Github? Then we can change the name in the gem as well.

larskanis commented 1 year ago

Nevertheless I can do the release and publish the images on the "larskanis" account.

I did some tests on my side (with fxruby, libusb, ffi and ruby-pg) and the latest images worked well. I found only one issue with RbConfig::CONFIG['host'], which returns x64-w64-mingw32 on ruby-3.2 on x64-mingw-ucrt instead of the x86_64-w64-mingw32 of ruby-3.1 and so it differs from the compiler name prefix. But I found that the host variable is already different on the linux images, so that I changed the code in fxruby to work with the latest images.

flavorjones commented 1 year ago

@larskanis I'd like to again propose that we use ghcr.io instead of dockerhub as the future home of these images. I started this work to demonstrate it could work with the weekly snapshots. We could similarly automate creation of the versioned images (and we can cache layers to ensure that image diffs stay small between patch releases).

flavorjones commented 1 year ago

@larskanis Would it be helpful if I did the work to cut over to ghcr.io to be the primary registry for RCD images so that I'm able to cut a release? You could then sync those images to dockerhub for convenience for folks. I should have some time this week to do that if you're not able to cut a release.

larskanis commented 1 year ago

I didn't work with ghcr.io so far, so I'd be happy if you could prepare it. I don't care too much about the platform we use to distribute our images. So if ghcr.io fulfills our needs, I think we can switch the gem to it with no need to mirror the images on docker hub.

flavorjones commented 1 year ago

See #95! :heart:

flavorjones commented 1 year ago

Closing this, I think we have a path forward for Ruby 3.2.