mozilla / rust-android-gradle

Apache License 2.0
1.03k stars 67 forks source link

NDK version compatibility #105

Open HonzaR opened 2 years ago

HonzaR commented 2 years ago

Hi everyone,

We've been facing a new runtime issue on one of our open source projects, when we switched from NDK version 22.1.7171670 to a newer version 23.1.7779620. Only emulator x86_64 devices suffer from this failure. I've tried to reproduce the failure on your sample app project, but I wasn't able to make the project be built, as there is some problem with Android Studio run configurations. So I better ask here for help. Thank you in advance. Any help is really appreciated. Below is a list of next information, which can be useful:

ncalexan commented 2 years ago

Hi @HonzaR, I have limited time to help with this but I'll suggest where I would start.

I wonder if you somehow have different symbol exports in your x86 and x86_64 libraries. You could dig into this using llvm-objdump. I am aware that Rust has had a few different issues with symbols and exports over the years, so it could be that you're hitting some weirdness with the toolchains (Rust and the underlying LLVM toolchain) that's causing symbols to be exported differently. I don't see anything obviously funky with that symbol. Are you confident that other symbols are resolved, or is this the first JNI access?

If you can point me at a CI build with APKs for the different architectures, I could perhaps take this line of inquiry one step further. Good luck!

HonzaR commented 2 years ago

Hi @ncalexan, thank you for your quick response. As for your questions:

different symbol exports in your x86 and x86_64 libraries

Yes, that is something which we suspect to be the problem.

other symbols are resolved, or is this the first JNI access?

Several other symbols seem to work as expected. I can confirm that I can surely reproduce the problem with these methods:

is this the first JNI access?

No, I think there are some previous JNI calls to our SDK Rust code.

If you can point me at a CI build with APKs for the different architectures, I could perhaps take this line of inquiry one step further.

That would be really great, as I have no experience with the mentioned llvm-objdump tool. This link leads the lib.zip (latest binaries with the desired NDK version 23.1.7779620) - it contains all our supported cpu architectures binaries; you can check the x86 and x86_64 SDK differences.

Thank you very much for your time!

ncalexan commented 2 years ago

I can confirm that the libraries in your archive have no symbols exported; that's never going to succeed:

$ ~/.mozbuild/android-sdk-macosx//ndk/23.1.7779620/toolchains/llvm/prebuilt/darwin-x86_64/bin/llvm-objdump lib/x86_64/libzcashwalletsdk.so --syms

lib/x86_64/libzcashwalletsdk.so:    file format elf64-x86-64

SYMBOL TABLE:
$ ~/.mozbuild/android-sdk-macosx//ndk/23.1.7779620/toolchains/llvm/prebuilt/darwin-x86_64/bin/llvm-objdump lib/x86/libzcashwalletsdk.so --syms

lib/x86/libzcashwalletsdk.so:   file format elf32-i386

SYMBOL TABLE:

As an experiment, could you add doNotStrip or the modern equivalent and see if that addresses your issue? If yes, then at least we've correctly identified that it's symbol export that needs to be fixed. What I don't understand is how this could differ between x86 and x86_64 -- or even how it ever worked for earlier NDK versions.

However, we don't want to disable stripping; we want to control that process. And that I don't really know how to do, either at the native toolchain level (with Rust, LLVM, etc) or at the Android toolchain level (with Gradle). Reading about this, it rather looks like the right approach will be to disable Android-level stripping entirely and then manage stripping (really, LTO and symbol exposure) at the Rust level. And that seems to be possible with linker scripts, etc.

So: first, verify that it's symbol export causing this issue. Assuming it is, I would figure out how to use -fvisibility=hidden or the Rust equivalent. And then I would export only JNI_OnLoad and use RegisterNatives to only have to arrange one symbol export. Looking further, it rather looks like you could do this with a single --version-script argument. Arranging for JVM* symbols to be public and everything else to be private might be a sensible default for rust-android-gralde to arrange. But I'm confused as to why more people are not seeing this.

In any case, let me know if this is symbol visibility or something else. Good luck!

ncalexan commented 2 years ago

Oh -- can you also verify that some JNI natives are being invoked? That doesn't make sense to me, but here we are. If that is happening, then all of this symbol visibility is not the issue, and perhaps there's some wrinkle about byte[] that we don't understand?

HonzaR commented 2 years ago

Hi @ncalexan,

Thank you for your very valuable thoughts. I've tested the not stripping command and, unfortunately, no luck there. The failure is still the same:

UnsatisfiedLinkError: No implementation found for java.lang.String[][] cash.z.ecc.android.sdk.tool.DerivationTool.deriveUnifiedViewingKeysFromSeed(byte[], int, int)

I've tested it with ANDROID_NDK_VERSION=23.2.8568313 and these variants of not stripping feature: doNotStrip("*/x86_64/*.so"), jniLibs.keepDebugSymbols.add("*/x86_64/*.so"), jniLibs.keepDebugSymbols.add("**/*.so") I also checked the result of calling these methods to be sure it succeeded. Here I linked the resulted binaries without stripping.

Here I linked the debug APK, which contains the binaries above, if that can be useful too.

The important thing is that, when I broader analysed the JNI calls, it showed up that we're not able to call any of our JNI methods from DerivationTool.kt, nor from RustBackend.kt. It also showed up that we have a problem with loading the library according to this message, which appears before the one above:

[NativeLibraryLoader:17](#) Error while loading native library: dlopen failed: cannot locate symbol "__extenddftf2" referenced by "/data/app/~~5rfGO0A-SE-3Z7VkOAb3hw==/cash.z.ecc.android.sdk.demoapp.mainnet-dXDGdSOeib8Wghg7pctu2w==/lib/x86_64/libzcashwalletsdk.so"...

What do you think about it? Thank you very much!

ncalexan commented 2 years ago

Fascinating. It sure looks like you're getting bitten by this NDK bug. The linked discussion, in particular this comment suggests that you're seeing something rather like #75. In that ticket, we started to rewrite -lgcc transparently. It kind of looks like you need a similar workaround: either ignoring these symbols entirely or (more likely) adding whatever additional linker flag/clang runtime library provides these symbols.

HonzaR commented 2 years ago

It's great to have some guidelines to solve this problem. Thank you very much! We'll check it out and let you know.

HonzaR commented 2 years ago

Hi @ncalexan, we've been talking about this in our team. And this is just a thought, but can this problem be solved by the rust-android plugin. It there a change to have the problem covered by this plugin, so other developers won't face it too? Thank you for your thoughts.

ncalexan commented 2 years ago

Hi @ncalexan, we've been talking about this in our team. And this is just a thought, but can this problem be solved by the rust-android plugin. It there a change to have the problem covered by this plugin, so other developers won't face it too? Thank you for your thoughts.

If you can identify and explain to me what the problem actually is, I'm definitely open to solving this at the plugin level. That's what I was suggesting in https://github.com/mozilla/rust-android-gradle/issues/105#issuecomment-1159047120, specifically: "Arranging for JVM* [JNI*] symbols to be public and everything else to be private might be a sensible default for rust-android-gradle to arrange. But I'm confused as to why more people are not seeing this."

ncalexan commented 2 years ago

Note for self: Mozilla uses JNA, so this exact symbol approach won't work for our internal uses. But generally, this approach makes some sense to me.

ccjernigan commented 2 years ago

As to why others might not be reporting it: It only impacts developers with Intel machines using AVDs. I use an M1 Mac, so I didn't see it. Firebase Test Lab doesn't support API 31, so CI didn't pick it up. Even people on Intel machines often test on physical ARM devices. So the team didn't catch this until several weeks after it was merged to our main branch.

daira commented 1 year ago

If I'm reading correctly, the root cause is (from https://github.com/mozilla/rust-android-gradle/issues/75#issuecomment-964368185):

The problem is NDK 23.1.7779620 has stopped distributing libgcc.

We can statically link libgcc I guess, but seriously, who at Google thought it was a good idea to stop distributing one of the essential runtime libraries in the NDK? It's absolutely guaranteed to cause breakage.

This is why we originally decided to link to libgcc dynamically, despite linking the C++ standard library statically: https://github.com/zcash/zcash/issues/2513#issuecomment-313361129

ncalexan commented 1 year ago

If I'm reading correctly, the root cause is (from #75 (comment)):

The problem is NDK 23.1.7779620 has stopped distributing libgcc.

We can statically link libgcc I guess, but seriously, who at Google thought it was a good idea to stop distributing one of the essential runtime libraries in the NDK? It's absolutely guaranteed to cause breakage.

This is why we originally decided to link to libgcc dynamically, despite linking the C++ standard library statically: zcash/zcash#2513 (comment)

Yes, I think this is the root cause. To paper over issues with Rust and the NDK, however, this plugin transparently rewrites -lgcc. Now it seems that's not enough.

I will note that a Firefox engineer recommended Rust 1.68 as a work-around: see https://bugzilla.mozilla.org/show_bug.cgi?id=1820876#c4. I don't know if this addresses the issue, but it might. If it does I'd appreciate an update saying as much!

daira commented 1 year ago

It's very likely that rewriting -lgcc is enough, but that a stray -lgcc is getting through to our build process because we're somehow invoking the linker independently of the wrapper. Which would not be the plug-in's fault (probably).

str4d commented 1 year ago

I think the -lgcc rewriting may be a red herring; that problem was fixed in upstream Rust in https://github.com/rust-lang/rust/pull/85806 (IDK what version it rolled out in, but it was merged in June 2021).

It also showed up that we have a problem with loading the library according to this message, which appears before the one above:

[NativeLibraryLoader:17](#) Error while loading native library: dlopen failed: cannot locate symbol "__extenddftf2" referenced by "/data/app/~~5rfGO0A-SE-3Z7VkOAb3hw==/cash.z.ecc.android.sdk.demoapp.mainnet-dXDGdSOeib8Wghg7pctu2w==/lib/x86_64/libzcashwalletsdk.so"...

I think this is the same issue that Mozilla is now running into (https://github.com/mozilla/application-services/issues/5436). That issue currently points the finger at sqlcipher; we don't use that, but we do use the rusqlite crate, so the common denominator may be SQLite.

str4d commented 1 year ago

Okay, so I think the core issue here is that libgcc is being used in our stack for two things:

Android removed libgcc and replaced it with libunwind, which is a replacement for the first use case but not the second.

I will note that a Firefox engineer recommended Rust 1.68 as a work-around: see https://bugzilla.mozilla.org/show_bug.cgi?id=1820876#c4. I don't know if this addresses the issue, but it might. If it does I'd appreciate an update saying as much!

https://github.com/termux/termux-packages/issues/14576 has what appears to be an identical problem to us: using rusqlite and getting undefined symbol: __extenddftf2. The issue notes that the problem occurred with Rust 1.66.1, but was not reproducible with 1.67.0. So that does suggest a newer Rust might help. However, https://github.com/mozilla/application-services/issues/5436 says they did try Rust 1.68 and still had missing symbols (though it doesn't make clear what symbols were missing in that particular case of the test matrix).

EDIT: It's not clear whether the termux-packages issue was tested on x86_64-linux-android, so that may also be why they saw it as fixed in 1.67 (because this issue doesn't occur for aarch64-linux-android).

str4d commented 1 year ago

As to why others might not be reporting it: It only impacts developers with Intel machines using AVDs. I use an M1 Mac, so I didn't see it. Firebase Test Lab doesn't support API 31, so CI didn't pick it up. Even people on Intel machines often test on physical ARM devices. So the team didn't catch this until several weeks after it was merged to our main branch.

This lines up with https://github.com/mozilla/application-services/issues/5436#issuecomment-1472219256, which says that the __extenddftf2 symbol is exported by Rust for aarch64-linux-android but not for x86_64-linux-android.

str4d commented 1 year ago

Aaand I now see that the Mozilla issue points to Rust as the other culprit here: __extenddftf2 is not provided by Rust compiler-builtins because it involves floating-point types that are not supported by Rust. That doesn't explain why they do export it for aarch64-linux-android though.

str4d commented 1 year ago

SQLite defines a LONGDOUBLE_TYPE type as long double (and falls back to sqlite_int64 on platforms that don't support it). The main places it appears to be used are:

The C language only requires long double to be at least as precise as double; SQLite handles this by disabling tests that require 80-bit long doubles (and using alternative code for sqlite3IntFloatCompare).

str4d commented 1 year ago

Possibly relevent SQLite forum discussions:

str4d commented 1 year ago

I've opened https://github.com/zcash/librustzcash/issues/800 because I think this problem is not a bug with rust-android-gradle specifically. If we learn more there that suggests otherwise, I'll comment again here.

ncalexan commented 1 year ago

I've opened zcash/librustzcash#800 because I think this problem is not a bug with rust-android-gradle specifically. If we learn more there that suggests otherwise, I'll comment again here.

Thanks for all of the investigation here (and sorry to be late to this party).

aiongg commented 1 year ago

Hi. I'm currently facing this issue on my project. I read through this thread and the various linked threads, and don't really have a clear picture of where this issue is coming from or how to work around it (other than using the x86 emulator). Is this something that the plugin can ultimately take care of by linking to appropriate libraries as needed, or this needs to be solved in some other repo - NDK, rust, rusqlite, or...? Is there a workaround for the time being? I did not understand the workaround in the Firefox repo, or how I would apply that to my project.

For now I'm just using the x86 emulator on Windows, or an arm64 emulator on Mac, but I'd like to be able to use the usual x86_64 emulator if possible.

hanaTsuk1 commented 6 months ago

@aiongg Try this to see if it can solve your problem https://github.com/breez/c-breez/issues/553#issuecomment-1731477289

str4d commented 5 months ago

Prompted by the above, I ended up solving this in our stack by writing a build script that leverages the environment variables set by the rust-android-gradle plugin to automate the library detection. An ANDROID_NDK_HOME environment variable is only needed when building the Rust code directly from the CLI, and in that case it is the only environment variable needed. https://github.com/Electric-Coin-Company/zcash-android-wallet-sdk/blob/88058c63461f2808efc953af70db726b9f36f9b9/backend-lib/build.rs