pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

arm64-v8a not compiling due to libpytorch_jni.so #51020

Open CodeSammich opened 3 years ago

CodeSammich commented 3 years ago

❓ Questions and Help

I'm currently adapting the Native Android C++ with Custom Ops guide to run my own model on React Native for Android on the nightly PyTorch 1.8.0 build.

When I run something like npx react-native run-android to compile, it compiles successfully for the first kind of CPU architecture (e.g. armeabi-v7a):

Build pytorch_native_armeabi-v7a
[1/2] Building CXX object CMakeFiles/pytorch_native.dir/src/main/cpp/pytorch_native.cpp.o
[2/2] Linking CXX shared library ../../../../build/intermediates/cmake/debug/obj/armeabi-v7a/libpytorch_native.so

but then it fails for the second architecture (e.g. arm64-v8a):

Build pytorch_native_arm64-v8a
[1/2] Building CXX object CMakeFiles/pytorch_native.dir/src/main/cpp/pytorch_native.cpp.o
[2/2] Linking CXX shared library ../../../../build/intermediates/cmake/debug/obj/arm64-v8a/libpytorch_native.so
FAILED:
: && /Users/Sigma/Library/Android/sdk/ndk/22.0.7026061/toolchains/llvm/prebuilt/darwin-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --gcc-toolchain=/Users/Sigma/Library/Android/sdk/ndk/22.0.7026061/toolchains/llvm/prebuilt/darwin-x86_64 --sysroot=/Users/Sigma/Library/Android/sdk/ndk/22.0.7026061/toolchains/llvm/prebuilt/darwin-x86_64/sysroot -fPIC -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -O0 -fno-limit-debug-info -Wl,--exclude-libs,libgcc.a -Wl,--exclude-libs,libgcc_real.a -Wl,--exclude-libs,libatomic.a -Wl,--build-id=sha1 -Wl,--no-rosegment -Wl,--fatal-warnings -Wl,--no-undefined -Qunused-arguments -shared -Wl,-soname,libpytorch_native.so -o ../../../../build/intermediates/cmake/debug/obj/arm64-v8a/libpytorch_native.so CMakeFiles/pytorch_native.dir/src/main/cpp/pytorch_native.cpp.o -L/Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a -lpytorch_jni -lfbjni -llog -latomic -lm && :
ld: error: found local symbol '_edata' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
ld: error: found local symbol 'end' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
ld: error: found local symbol 'bss_end' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
ld: error: found local symbol '_bss_end' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
ld: error: found local symbol '__bss_start' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
ld: error: found local symbol '_end' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
ld: error: found local symbol '__bss_start' in global part of symbol table in file /Users/Sigma/recycleNN/android/app/build/pytorch_android-1.8.0-SNAPSHOT.aar/jni/arm64-v8a/libpytorch_jni.so
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

It seems like certain compiled variables inside libpytorch_jni.so are being added at link time multiple times, causing the conflict. Does that mean I have to compile for only one architecture at a time, or am I missing something entirely? Since the native build extraction extracts all 4 at the same time, I figure they must be able to compile separately just fine.

Thank you!

EDIT: It seems it's just arm64-v8a inherently that has this problem, not necessarily due to conflicts. I've cleaned my Android build files and filtered the NDK to just compile for arm64-v8a and it still fails on the same error. All other architectures build properly.
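
For anyone trying to reproduce this, restricting the build to a single ABI can be sketched roughly as follows. This is a minimal fragment, assuming a Groovy app/build.gradle that uses the standard Android Gradle Plugin DSL; the file location is an assumption and the rest of the build configuration is omitted.

```groovy
// app/build.gradle (assumed location; merge into the existing android block)
android {
    defaultConfig {
        ndk {
            // Build native code only for arm64-v8a to isolate the failing ABI.
            // Swap in "armeabi-v7a", "x86" or "x86_64" to confirm the others still link.
            abiFilters "arm64-v8a"
        }
    }
}
```

Cleaning the build outputs before switching ABIs avoids stale intermediates from a previous architecture.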

cc @malfet @seemethere @walterddr

code4lala commented 2 years ago

I found that building libpytorch_jni.so with the -fuse-ld=lld linker option makes it work. Tested with pytorch-1.10.2, android-ndk-r22b, cross-compiling for arm64.

Edit the file android/pytorch_android/CMakeLists.txt and add one line: target_link_libraries(${PYTORCH_JNI_TARGET} -fuse-ld=lld). Then follow https://github.com/pytorch/pytorch/tree/master/android, run scripts/build_pytorch_android.sh, and extract the shared libraries from the AAR file. With that, it works.

TimPushkin commented 1 year ago

I've built myself a PyTorch 1.13 AAR (from the 1.13 git tag) with scripts/build_pytorch_android.sh (no arguments provided), using NDK 21e as recommended in the build instructions for Android, and got this error.

I see there are fixes suggested, but some should probably be merged or added to the building instructions.

TimPushkin commented 1 year ago

I encounter the same error when using the official builds of 1.12.2 for Android (the latest available at the moment). It occurs in both pytorch_android and pytorch_android_lite.

TimPushkin commented 1 year ago

Manually building a PyTorch AAR with NDK 23+, which uses LLD by default, instead of the recommended NDK 21.x solves the issue. To do this, one has to do the following:

  1. Install NDK 23.x and specify it in PyTorch's Gradle files (I used 23.1.7779620).
  2. Because the NDK stopped bundling a Vulkan wrapper as of v23, one has to manually copy the common directory from KhronosGroup/Vulkan-Tools, git tag v1.2.161 (the version matters for building PyTorch successfully; this one is the latest that worked for me) into $ANDROID_NDK/sources/third_party/vulkan/src.

The above should be enough to get a working build, but I personally upgraded PyTorch's Android Gradle Plugin to 7.3.1 (since 7.3.x bundles NDK 23.1.7779620 by default) instead of specifying the NDK version directly just in case there is some incompatibility between the newer NDK and the older AGP. This requires some additional tweaks though:

milo1000 commented 1 year ago

@TimPushkin Thanks for your effort, you saved me a lot of pain. Apparently nobody cares that the official builds (and snapshots) are broken.

TimPushkin commented 1 year ago

The issue persists in the 1.13 release published on Maven

TimPushkin commented 1 year ago

@malfet I am sorry for the direct tag, but is there any chance this will be fixed in the upcoming releases? LibTorch cannot be linked against on the arm64-v8a ABI with LLD, which has been the default linker on Android for two years now. Both the official releases and the instructions for building from source produce unlinkable results.

milo1000 commented 1 year ago

The fact that the issue remains means no one bothers to do the simplest integration testing. How hard is it to compile properly just once? One must build manually or do some voodoo to get this working. This is ridiculous.

malfet commented 1 year ago

@milo1000 if you have a PR to fix the problem, please do not hesitate to propose one. @agunapal you've worked on 1.13 Android release. Have you encountered this issue? @TimPushkin no worries about the ping, let me try to find some time to look into the problem. Just to clarify: does this happen to maven packages, or during the source build?

TimPushkin commented 1 year ago

@malfet Thanks! This happens both with the Maven packages and with the AAR I build from source with the build_pytorch_android script.

milo1000 commented 1 year ago

@malfet I went through the procedure described by @TimPushkin in this comment (https://github.com/pytorch/pytorch/issues/51020#issuecomment-1336405310). Don't get me wrong, I cherish your effort, but such bugs make your whole great job pointless.

agunapal commented 1 year ago

@milo1000 I have built and published the libraries for 1.13. I did not see this error. Let me try running this on 1.13.1 and get back to you

TimPushkin commented 1 year ago

@agunapal To be clear, it seems you need to use a sufficiently recent NDK in the test app, one where LLD is the default linker. According to the NDK release notes, that means NDK r22b or later. I personally get these errors on NDK r23 (which is the default on current Android Gradle Plugin versions), and I believe I also tested r25 (which is the latest):

ld: error: found local symbol '__bss_end__' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so
ld: error: found local symbol '__bss_start' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so
ld: error: found local symbol '_end' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so
ld: error: found local symbol '_edata' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so
ld: error: found local symbol '__bss_start__' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so
ld: error: found local symbol '_bss_end__' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so
ld: error: found local symbol '__end__' in global part of symbol table in file _deps/torch-src/jni/arm64-v8a/libpytorch_jni_lite.so

Also, there seems to be no 1.13.1 release published on the official Maven, so I've tested this on 1.12.2 and 1.13.0 official builds from there.

dipu0 commented 1 year ago

I am writing to request your assistance in resolving an error I am encountering while using D2Go on Android. This is my final-year project and the submission is next week, so I badly need your assistance to run my custom model on Android.

I have used PyTorch version 1.13.0 for training, and I am now attempting to use D2Go on Android. However, I have encountered an error, and I am not sure how to resolve it. I came across your GitHub profile and noticed that you have experience working with D2Go, so I am hoping that you can provide me with guidance on how to fix this error.

The error message I am receiving is:

FATAL EXCEPTION: main
Process: org.pytorch.demo.objectdetection, PID: 31163
java.lang.UnsatisfiedLinkError: dalvik.system.PathClassLoader[DexPathList[[zip file "/data/app/org.pytorch.demo.objectdetection-5BnM-_V8v6oKj-1tFHt8xQ==/base.apk"],nativeLibraryDirectories=[/data/app/org.pytorch.demo.objectdetection-5BnM-_V8v6oKj-1tFHt8xQ==/lib/arm64, /data/app/org.pytorch.demo.objectdetection-5BnM-_V8v6oKj-1tFHt8xQ==/base.apk!/lib/arm64-v8a, /system/lib64]]] couldn't find "libpytorch_jni.so"

In build.gradle I have updated the versions as well:

implementation 'org.pytorch:pytorch_android_lite:1.13.0'
implementation 'org.pytorch:pytorch_android_torchvision_lite:1.13.0'
implementation 'org.pytorch:torchvision_ops:0.14.0'

TimPushkin commented 1 year ago

@dipu0 I'm not sure, but I think the error in your message is raised because the app tries to load libpytorch_jni.so, which does not exist in the org.pytorch:pytorch_android_lite package; the library there is libpytorch_jni_lite.so.

agunapal commented 1 year ago

@TimPushkin Can you please try with Android NDK r19c?

@dipu0 Seems like the .aar file didn't get uploaded for some reason. Can you please use the previous version of PyTorch.

dipu0 commented 1 year ago

@agunapal

@dipu0 Seems like the .aar file didn't get uploaded for some reason. Can you please use the previous version of PyTorch.

I tried to train with an older torch version, but it gives an error. I also tried to use an older version in the Android build.gradle, and that did not work either.

TimPushkin commented 1 year ago

@agunapal Yes, my app compiles with NDK r19c and, just for the record, with r21e as well. I figure this is because these NDKs don't use LLD by default. But I would prefer not to be stuck with 2+ year old NDKs.

TimPushkin commented 1 year ago

@agunapal Is there any news regarding this?

agunapal commented 1 year ago

@TimPushkin I published the 1.13.1 Android binaries a few days ago.

TimPushkin commented 1 year ago

@agunapal Just tried them and I get the same linker errors

agunapal commented 1 year ago

@TimPushkin do you mean you are not able to build from source or do the binaries not work?

TimPushkin commented 1 year ago

I tried the 1.13.1 AAR published on Maven

agunapal commented 1 year ago

@TimPushkin Could you please paste the error you are seeing?

TimPushkin commented 1 year ago

@agunapal Sure, it is the same as the one I posted in my comment above.

AreaScout commented 1 year ago

Same error for me. The binaries should be recompiled with the -fvisibility=hidden compiler switch, but then you have to manually choose which functions you want to be visible :(

The error is "ld: error: found local symbol '__end__' in global part of symbol table in file", but only with NDKs newer than 21.4; everything I tested from 22.x onward failed.

RG

TimPushkin commented 1 year ago

@agunapal Any updates on the issue?

mmohammadi9812 commented 1 year ago

Any news on this issue?

flyskywhy commented 1 year ago

For me, when both 21.4.7075529/ and 22.1.7171670/ are present in the android-sdk/ndk/ folder, "com.android.tools.build:gradle:3.5.3" uses 22.1.7171670 to compile and hits this error:

ld: error: found local symbol '__end__' in global part of symbol table in file ../../../../build/pytorch_android_lite-1.12.2.aar/jni/arm64-v8a/libpytorch_jni_lite.so

After deleting android-sdk/ndk/22.1.7171670/, "com.android.tools.build:gradle:3.5.3" uses 21.4.7075529/ and everything is fine.

Qinlong275 commented 8 months ago

Nice bro, I hit the same error, and setting ndkVersion "21.4.7075529" makes it OK.
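
The ndkVersion workaround mentioned here can be sketched as below. This is only a minimal fragment, assuming a Groovy app/build.gradle and that NDK 21.4.7075529 is already installed through the SDK manager; the file location is an assumption. It sidesteps the linker error by staying on an NDK whose default linker is not LLD, and it does not fix the prebuilt library itself.

```groovy
// app/build.gradle (assumed location; merge into the existing android block)
android {
    // Pin the NDK so the Android Gradle Plugin does not pick up a newer
    // side-by-side install (for example 22.1.7171670), which fails to link
    // against the prebuilt arm64-v8a libpytorch_jni_lite.so.
    ndkVersion "21.4.7075529"
}
```

Pinning the version in the build file has the same effect as deleting the newer NDK folder, but it leaves other projects on the machine free to use that NDK.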

dvagala commented 4 months ago

Any updates?