termux / termux-packages

A package build system for Termux.
https://termux.dev

[Bug]: Neovim cannot load native Lua modules that are not linked with LuaJIT #22328

Closed s-cerevisiae closed 15 hours ago

s-cerevisiae commented 2 days ago

Problem description

Neovim is unable to load lua modules created with mlua if the module is not specifically linked with LuaJIT. require-ing the module throws an error saying that dlopen cannot locate lua_gettop (or some other lua symbols).

This doesn't happen when the same code is run with luajit instead of nvim.

What steps will reproduce the bug?

  1. Install Neovim
  2. Create a lua module with mlua (Like this one but change the lua feature to "luajit")
  3. Build the module with cargo build
  4. cd into ./target/debug
  5. ln -s lib*.so my_module.so
  6. nvim --clean --cmd "lua require('my_module')"
  7. See the error output

What is the expected behavior?

The module is loaded and the function is accessible.

If the library is instead built with RUSTFLAGS="-C link-args=-lluajit" cargo build it works as expected, but this is not required on any other platforms and it makes cross compiling very difficult.

System information

Termux Variables:
TERMUX_API_VERSION=0.50.1
TERMUX_APK_RELEASE=F_DROID
TERMUX_APP_PACKAGE_MANAGER=apt
TERMUX_APP_PID=8828
TERMUX_IS_DEBUGGABLE_BUILD=0
TERMUX_MAIN_PACKAGE_FORMAT=debian
TERMUX_VERSION=0.118.0
TERMUX__USER_ID=0
Packages CPU architecture:
aarch64
Subscribed repositories:
# sources.list
deb https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable main
# root-repo (sources.list.d/root.list)
deb https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-root root stable
Updatable packages:
All packages up to date
termux-tools version:
1.44.3
Android version:
14
Kernel build information:
Linux localhost 5.15.104-android13-8-00010-g1dc49517c375-ab11006012 #1 SMP PREEMPT Wed Oct 25 06:11:20 UTC 2023 aarch64 Android
Device manufacturer:
Xiaomi
Device model:
23078RKD5C
LD Variables:
LD_LIBRARY_PATH=
LD_PRELOAD=/data/data/com.termux/files/usr/lib/libtermux-exec.so
Installed termux plugins:
com.termux.api versionCode:51
com.termux.widget versionCode:13
com.termux.styling versionCode:32
TomJo2000 commented 2 days ago

That definitely is a quandary. I've cross referenced our build script for Neovim against the ones used by Arch^1, Alpine^2, OpenBSD^3 and Void^4.

We aren't doing anything out of the ordinary for the build options.

Our Neovim is compiled with luajit as the lua interpreter, but so are all of the other four. The OpenBSD makefile even mentions cross compatibility with Lua 5.1 modules explicitly.

We do not package mlua, so this issue slipped through the cracks. If you could provide any additional help in identifying the root cause of the issue, for example a minimal reproducible example of the issue, that would be greatly appreciated.

s-cerevisiae commented 2 days ago

@TomJo2000 Thanks for your quick reply! I've set up a minimal reproduction here, feel free to try it out.

This problem was initially discovered by trying to build native modules for a neovim plugin and use it on Termux (corresponding issue). It was later reported to the mlua repo but similarly the author didn't know what to do with it.

I've tried on a few different combinations of runtimes and platforms and finally the problem seems to only occur on Neovim + Termux. I don't know enough about Lua to write a module in C as proposed by the mlua author, but personally I don't think mlua is the faulty one (since it runs normally with luajit).

TomJo2000 commented 2 days ago

My best guess is that this is some sort of linker issue. nm shows a bunch of undefined symbols, most of them Lua related. I'm guessing luajit provides these in the way the module expects, while nvim does not.

// nm -uC ./target/debug/libreprod.so
                 U __cxa_atexit
                 U __cxa_finalize
                 U __errno
                 U __register_atfork
                 U __sF
                 U abort
                 U calloc
                 U clock_gettime
                 U close
                 U dl_iterate_phdr
                 U fflush
                 U fprintf
                 U free
                 U fstat
                 U fwrite
                 U getcwd
                 U getenv
                 U getpid
                 U lseek64
                 U luaL_callmeta
                 U luaL_error
                 U luaL_getmetafield
                 U luaL_ref
                 U lua_checkstack
                 U lua_close
                 U lua_concat
                 U lua_createtable
                 U lua_error
                 U lua_gc
                 U lua_getallocf
                 U lua_getinfo
                 U lua_getmetatable
                 U lua_getstack
                 // There's the symbol from the initial bug report
                 U lua_gettop
                 U lua_insert
                 U lua_isnumber
                 U lua_isstring
                 U lua_newthread
                 U lua_newuserdata
                 U lua_next
                 U lua_pcall
                 U lua_pushboolean
                 U lua_pushcclosure
                 U lua_pushfstring
                 U lua_pushinteger
                 U lua_pushlightuserdata
                 U lua_pushlstring
                 U lua_pushnil
                 U lua_pushnumber
                 U lua_pushthread
                 U lua_pushvalue
                 U lua_rawequal
                 U lua_rawget
                 U lua_rawset
                 U lua_remove
                 U lua_replace
                 U lua_setmetatable
                 U lua_settable
                 U lua_settop
                 U lua_toboolean
                 U lua_tolstring
                 U lua_tonumber
                 U lua_topointer
                 U lua_tothread
                 U lua_touserdata
                 U lua_type
                 U lua_typename
                 U lua_xmove
                 U malloc
                 U memcmp
                 U memcpy
                 U memmove
                 U memset
                 U mmap
                 U munmap
                 U open
                 U posix_memalign
                 U pthread_getspecific
                 U pthread_key_create
                 U pthread_key_delete
                 U pthread_rwlock_rdlock
                 U pthread_rwlock_unlock
                 U pthread_rwlock_wrlock
                 U pthread_setspecific
                 U read
                 U readlink
                 U realloc
                 U realpath
                 U sched_yield
                 U stat
                 U strerror_r
                 U strlen
                 U syscall
                 U write
                 U writev
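
To pick out just the Lua-related entries from a listing like the one above, the `nm -u` output can be filtered. This is only a minimal sketch; the helper name `undefined_lua_symbols` is made up for illustration and is not part of any tool mentioned in this thread.

```shell
# Hypothetical helper: read `nm -u` output on stdin and print only the
# undefined symbols that belong to the Lua C API (lua_* and luaL_*).
undefined_lua_symbols() {
    awk '$1 == "U" && $2 ~ /^lua(L)?_/ { print $2 }'
}

# Example (path taken from the listing above):
#   nm -u ./target/debug/libreprod.so | undefined_lua_symbols
```

Every symbol this prints has to be provided by something already loaded in the process (or by a library the module itself depends on) when the module is opened.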

I did also validate the module against Lua 5.1, which did work as expected.

I'll see if I can get any useful information out of straceing the ./with_neovim example on my PC, which does work.

TomJo2000 commented 2 days ago

(Image: strace output comparison. PC on the left, Termux on the right.)

Here's the full text for both if you feel like dissecting them.

PC (Arch Linux):

```
wait4(-1, Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.01s
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 52818
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7ca0a897e1d0}, {sa_handler=0x57d13c1d51e0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7ca0a897e1d0}, 8) = 0
ioctl(2, TIOCGWINSZ, {ws_row=58, ws_col=240, ws_xpixel=1920, ws_ypixel=1044}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=52818, si_uid=1000, si_status=0, si_utime=1 /* 0.01 s */, si_stime=1 /* 0.01 s */} ---
wait4(-1, 0x7ffc875f2290, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]}) = 0
newfstatat(AT_FDCWD, ".", {st_mode=S_IFDIR|0755, st_size=240, ...}, 0) = 0
newfstatat(AT_FDCWD, "/usr/local/sbin/nvim", 0x7ffc875f2d40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr/local/bin/nvim", 0x7ffc875f2d40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr/bin/nvim", {st_mode=S_IFREG|0755, st_size=4200472, ...}, 0) = 0
newfstatat(AT_FDCWD, "/usr/bin/nvim", {st_mode=S_IFREG|0755, st_size=4200472, ...}, 0) = 0
geteuid() = 1000
getegid() = 1000
getuid() = 1000
getgid() = 1000
access("/usr/bin/nvim", X_OK) = 0
newfstatat(AT_FDCWD, "/usr/bin/nvim", {st_mode=S_IFREG|0755, st_size=4200472, ...}, 0) = 0
geteuid() = 1000
getegid() = 1000
getuid() = 1000
getgid() = 1000
access("/usr/bin/nvim", R_OK) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT TERM CHLD], [], 8) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ca0a88d0e50) = 52820
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x57d13c1d51e0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7ca0a897e1d0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7ca0a897e1d0}, 8) = 0
wait4(-1, hello, world!
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 52820
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7ca0a897e1d0}, {sa_handler=0x57d13c1d51e0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7ca0a897e1d0}, 8) = 0
ioctl(2, TIOCGWINSZ, {ws_row=58, ws_col=240, ws_xpixel=1920, ws_ypixel=1044}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=52820, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffc875f2290, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]}) = 0
read(255, "", 52) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0) = ?
+++ exited with 0 +++
```

Termux:

```
wait4(-1, Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23667
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0x58f2994ecc, sa_mask=[], sa_flags=0}, 8) = 0
ioctl(2, TIOCGWINSZ, {ws_row=58, ws_col=240, ws_xpixel=1920, ws_ypixel=1044}) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], [CHLD RTMIN], 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23667, si_uid=10215, si_status=0, si_utime=3 /* 0.03 s */, si_stime=1 /* 0.01 s */} ---
wait4(-1, 0x7fcf9a6524, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[RTMIN]}) = 0
newfstatat(AT_FDCWD, ".", {st_mode=S_IFDIR|0700, st_size=3452, ...}, 0) = 0
newfstatat(AT_FDCWD, "/data/data/com.termux/files/usr/bin/nvim", {st_mode=S_IFREG|0700, st_size=4709192, ...}, 0) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [RTMIN], 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT TERM CHLD RTMIN], [RTMIN], 8) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7005443508) = 23671
rt_sigprocmask(SIG_SETMASK, [RTMIN], [INT TERM CHLD RTMIN], 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD RTMIN], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], [CHLD RTMIN], 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD RTMIN], [RTMIN], 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x58f2994ecc, sa_mask=[], sa_flags=0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
wait4(-1, E5113: Error while calling lua chunk: error loading module 'reprod' from file './target/debug/libreprod.so': dlopen failed: cannot locate symbol "lua_type" referenced by "/data/data/com.termux/files/home/git/nvim-mlua-reprod/target/debug/libreprod.so"...
stack traceback:
    [C]: at 0x7619b27dd4
    [C]: in function 'require'
    load.lua:3: in main chunk
[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 23671
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0x58f2994ecc, sa_mask=[], sa_flags=0}, 8) = 0
ioctl(2, TIOCGWINSZ, {ws_row=58, ws_col=240, ws_xpixel=1920, ws_ypixel=1044}) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], [CHLD RTMIN], 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23671, si_uid=10215, si_status=1, si_utime=1 /* 0.01 s */, si_stime=1 /* 0.01 s */} ---
wait4(-1, 0x7fcf9a6524, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[RTMIN]}) = 0
read(255, "", 60) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD RTMIN], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], [CHLD RTMIN], 8) = 0
mprotect(0x70052e8000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x70052e8000, 4096, PROT_READ) = 0
exit_group(1) = ?
+++ exited with 1 +++
```

To state the obvious, Termux isn't making an access("/data/data/com.termux/files/usr/bin/nvim", X_OK) call, but that still doesn't explain why the module fails to load.

pipcet commented 2 days ago

Does

patchelf --add-needed /data/data/com.termux/files/usr/lib/libluajit.so ./target/debug/libreprod.so

make things work for you?

TomJo2000 commented 2 days ago

As expected, yes it does, since it makes the module pull in libluajit.so, which provides those symbols.

pipcet commented 2 days ago

I think this is a difference between the Android and proper Linux linkers/loaders.

LD_PRELOAD=/data/data/com.termux/files/usr/lib/libluajit.so nvim -l load.lua also works here. I vaguely recall that the Android linker misbehaves when trying to resolve symbols exported by the executable itself, but I don't recall the details, so maybe that is what's going on here.

I think there are two workarounds:

  1. make cargo/mlua generate the DT_NEEDED ELF entry for libluajit.so. This would be traditional and I don't understand why it isn't being done, so there's probably a good reason. Worst case, we can use patchelf...
  2. make nvim dlopen the luajit library in addition to linking against it dynamically, or call itself using an LD_PRELOAD wrapper

I think this is really in the category of "lua expects Linux behavior, Android deviates from it".

TomJo2000 commented 2 days ago

Okay, good to know that I'm not just missing something very obvious here.

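I guess the questions that remain are:

* Is this issue specific to SO modules produced by `mlua`, in which case this is a bug in `mlua`?
* Does this affect linking against "off the shelf" shared libraries not directly linked against by `nvim`, in which case this is a bug in our Neovim package?

These two aren't mutually exclusive.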

It's probably not even fair to call this a bug in mlua if it works fine in the typical desktop context.

pipcet commented 2 days ago

Okay, good to know that I'm not just missing something very obvious here.

I guess the questions that remain are;

* Is this issue specific to SO modules produced by `mlua`, in which case this is a bug in `mlua`.

I'm afraid I don't know which other packages include shared libraries for use with lua, but it would be a good idea to find one and check whether there's a DT_NEEDED entry in the ELF header.

* Does this affect linking against "off the shelf" shared libraries not directly linked against by `nvim`, in which case this is a bug in our Neovim package.

I don't think it's a bug in the neovim package.

These two aren't mutually exclusive.

It's probably not even fair to call this a bug in mlua if it works fine in the typical desktop context.

I think the issue can be worked around in mlua, but I don't think it's fair to call it a bug, at this point.
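
Checking a module for a DT_NEEDED entry, as suggested earlier in this comment, can be done with `readelf -d`. A small sketch, assuming GNU-style `readelf` output; `needed_libs` is a hypothetical helper name used only for this example.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and print the name
# of every library listed as DT_NEEDED, one per line.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# Example:
#   readelf -d ./target/debug/libreprod.so | needed_libs
# If libluajit.so is missing from the output, the module relies on the
# process that loads it to provide the Lua symbols.
```

After running the `patchelf --add-needed` command from earlier in the thread, `libluajit.so` should show up in this list.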

sylirre commented 2 days ago

This issue originates from https://github.com/android-ndk/ndk/issues/201, also stated in https://github.com/termux/termux-packages/wiki/Common-porting-problems. This is not really specific to Termux but to Android OS in general.

Symbol visibility when opening shared libraries using dlopen() works differently. On a normal linker, when an executable linking against a shared library libA dlopen():s another shared library libB, the symbols of libA are exposed to libB without libB needing to link against libA explicitly. This does not work with the Android linker, which can break plug-in systems where the main executable dlopen():s a plug-in which does not explicitly link against some shared libraries already linked to by the executable.

TomJo2000 commented 2 days ago

So it's a bug in neither package and the symbols are still missing. Thanks Android.

TomJo2000 commented 2 days ago

This issue originates from android/ndk#201, also stated in https://github.com/termux/termux-packages/wiki/Common-porting-problems. This is not really specific to Termux but to Android OS in general.

Is there anything we can do to manually "expose" the symbols?

Oh wow, that is one to one the issue we're having here...

> [!NOTE]
> - Symbol visibility when opening shared libraries using `dlopen()` works differently. On a normal linker, when an executable linking against a shared library libA dlopen():s another shared library libB, the symbols of libA are exposed to libB without libB needing to link against libA explicitly. This does not work with the Android linker, which can break plug-in systems where the main executable dlopen():s a plug-in which does not explicitly link against some shared libraries already linked to by the executable. See [the relevant NDK issue](https://github.com/android-ndk/ndk/issues/201) for more information.

https://github.com/termux/termux-packages/wiki/Common-porting-problems#android-dynamic-linker
sylirre commented 2 days ago

The solution is simple: force link it with the library providing the necessary symbols.

This issue affects native extensions for all scripting languages (that's why, for Python, it is often suggested to specify LDFLAGS="-lpython3.12" before the pip command).

TomJo2000 commented 2 days ago

Welp, guess you'll need to link the library or use @pipcet's patchelf workaround.

Is there any specific rationale for why the Android linker limits symbol visibility like this? Or just vague "security reasons".

khvzak commented 2 days ago

The solution is simple: force link it with the library providing the necessary symbols.

This issue affects native extensions for all scripting languages (that's why, for Python, it is often suggested to specify LDFLAGS="-lpython3.12" before the pip command).

Rust/Lua module for Android can be forcibly linked with LuaJIT by compiling with:

RUSTFLAGS="-C link-args=-L/path/to/lib -C link-args=-lluajit" cargo build

or setting corresponding options in .cargo/config.toml.
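
As a rough illustration of the `.cargo/config.toml` route, the snippet below writes a per-project config carrying the same flags. It is a sketch under assumptions: the target triple `aarch64-linux-android` is a guess based on the aarch64 device reported above, `/path/to/lib` stays a placeholder exactly as in the command line, and `write_cargo_config` is a hypothetical helper name.

```shell
# Hypothetical helper: write a per-project Cargo config that links the module
# against LuaJIT whenever it is built for the (assumed) Android target.
#   $1 = project directory, $2 = directory containing libluajit.so
write_cargo_config() {
    project_dir=$1
    lib_dir=$2
    mkdir -p "$project_dir/.cargo"
    cat > "$project_dir/.cargo/config.toml" <<EOF
[target.aarch64-linux-android]
rustflags = ["-C", "link-args=-L$lib_dir", "-C", "link-args=-lluajit"]
EOF
}

# Example:
#   write_cargo_config . /path/to/lib
```

Scoping the flags to one target keeps builds for other platforms unaffected, which matters here because the extra link step is only needed on Android.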

s-cerevisiae commented 2 days ago

Ok fine, even this exact problem has been reported before (#6383) and I've only found it through the ndk issue.

Thanks for your effort, maybe this can be closed as duplicate now. I'm glad that there are several workarounds to this problem and I'll try to find which one is feasible for cross compiling in a CI (as requested in https://github.com/Saghen/blink.cmp/issues/145)

s-cerevisiae commented 2 days ago

I'd like to ask if it's a good idea (or even permitted) to download the libluajit.so artifact from Termux in CI? If not, what should be the best practice to link the library with it?

truboxl commented 2 days ago

https://github.com/mlua-rs/mlua/blob/4891a6ac10e152625073335ad0703a6e68aa36fc/mlua-sys/build/main_inner.rs#L33-L34 It seems that enabling the module feature stops libluajit.so from being added as DT_NEEDED. Removing the #[cfg... line adds it as DT_NEEDED successfully.

s-cerevisiae commented 2 days ago

by enabling feature module this stops adding libluajit.so as DT_NEEDED.

Yes, that's the intended behavior of module feature. It ships lua headers in that mode and assumes related symbols are available at runtime, so that the resulting module doesn't depend on a particular lua runtime.

So I'd like to know if I choose to link the module to luajit instead (in any of the ways suggested above), is there a good way to build it in CI environment? Do I need to get a copy of libluajit.so from Termux repository?

truboxl commented 2 days ago

You can build by cloning this repo, ./scripts/run-docker.sh ./build-package.sh libluajit. A .deb file in output folder will be generated. But you are basically tying yourself to Termux.

s-cerevisiae commented 2 days ago

Yes, that's why I'm generally against distributing binaries. Maybe it makes most sense to set rustflags in the cargo config and not provide CI, so that people can build on their own devices? It still needs RUSTC_BOOTSTRAP since the project is using unstable features and Termux does not provide nightly toolchains though...

truboxl commented 2 days ago

Termux does not provide nightly toolchains though...

There is https://github.com/termux-user-repository/tur/tree/master/tur/rustc-nightly

Install via pkg install tur-repo, pkg install rustc-nightly

pipcet commented 1 day ago

So I think the best available workaround is to make nvim a wrapper which appends "libluajit.so" to LD_PRELOAD, then calls the real nvim. Calling dlopen("libluajit.so", RTLD_GLOBAL|RTLD_NOW) from neovim doesn't appear to fix the problem, probably due to deliberate Android breakage.
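
A wrapper along those lines might look roughly like the sketch below. The library path assumes the default Termux prefix, the location of the real binary (`libexec/nvim-real`) is purely hypothetical, and `append_preload` is a made-up helper name. Appending rather than overwriting matters because Termux already sets LD_PRELOAD to libtermux-exec.so.

```shell
#!/bin/sh
# Sketch of an nvim wrapper. The prefix below is the default Termux one.
PREFIX=${PREFIX:-/data/data/com.termux/files/usr}

# Hypothetical helper: print an LD_PRELOAD value with one library appended,
# keeping whatever is already set (entries are separated by colons).
append_preload() {
    current=$1
    extra=$2
    if [ -z "$current" ]; then
        printf '%s\n' "$extra"
    else
        printf '%s:%s\n' "$current" "$extra"
    fi
}

# A real wrapper would end with something like the following
# (the nvim-real path is an assumption, not an existing file):
#   LD_PRELOAD=$(append_preload "${LD_PRELOAD:-}" "$PREFIX/lib/libluajit.so")
#   export LD_PRELOAD
#   exec "$PREFIX/libexec/nvim-real" "$@"
```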