remote-android / redroid-doc

redroid (Remote-Android) is a multi-arch, GPU enabled, Android in Cloud solution. Track issues / docs here

Seeking diagnostic advice for container boot failure #758

Open firemeteorxx opened 5 days ago

firemeteorxx commented 5 days ago

Not a real bug report, just seeking advice. This would suit a discussion forum better, if there is one.

I'm trying to set up a redroid docker instance nested inside an unprivileged LXC container.

The nested docker appears to work on basic images like hello-world && busybox. However, when it comes to the redroid image, the container disappears immediately without any log, which makes it very hard to diagnose.

I can confirm that the binder_linux && ashmem_linux modules are built and loaded as I have an Anbox session running in the host environment. Those /dev/*binder files have been exposed to the LXC container.
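A quick way to double-check the device exposure from inside the LXC container is sketched below. It only tests that each path matching /dev/*binder is a character device; the helper name check_char_dev is made up for illustration, and passing this check does not prove that the devices are usable from the nested docker container.

```shell
#!/bin/sh
# Sketch: verify that the binder device nodes are visible as character devices.
# check_char_dev is a hypothetical helper name, not part of redroid or docker.
check_char_dev() {
    if [ -c "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not a character device: $1"
        return 1
    fi
}

# If the glob matches nothing, the literal pattern is reported as missing.
for dev in /dev/*binder; do
    check_char_dev "$dev" || true
done
```

Running the same loop on the host, in the LXC container, and in a busybox docker container started with the same device options would show at which layer the nodes disappear, if they do.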

The redroid image silently disappears without any error found in docker daemon log and client log (both in debug mode). docker logs gives nothing. dmesg -T gives nothing either. The only error I can find is the 'Exited (129)' from docker ps -a. But I have no idea how to go further with this info.

I also attempted to override the entry point of the image with something like /bin/sh or /bin/yes. Maybe I didn't do it correctly, but these made no difference. A similar entry-point override appears to work with the busybox image, though...

I'm kind of lost and would like some suggestions on how to diagnose this further...

firemeteorxx commented 4 days ago

I made some progress by injecting a static busybox binary into the image and got a shell to look around.
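One possible way to do such an injection is sketched below; it is not necessarily the exact procedure used here. The image name is the one from this thread, while the busybox path and the container name are hypothetical placeholders. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: start the redroid image with an injected static busybox as entry point.
# BUSYBOX and NAME are hypothetical placeholders; adjust them before a real run.
IMAGE="redroid/redroid:11.0.0-latest"
BUSYBOX="./busybox"          # assumed path to a statically linked busybox binary
NAME="redroid-debug"         # hypothetical container name
DRY_RUN="${DRY_RUN:-1}"      # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

inject_busybox_shell() {
    # Create, but do not start, a container whose entry point is the injected binary.
    run docker create -it --privileged --name "$NAME" --entrypoint /busybox "$IMAGE" sh
    # Copy the static binary into the stopped container's filesystem.
    run docker cp "$BUSYBOX" "$NAME:/busybox"
    # Start the container attached, so the busybox shell is interactive.
    run docker start -ai "$NAME"
}

inject_busybox_shell
```

A static binary matters because it does not depend on the image's dynamic linker, so it can run even when nothing else in the image does.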

It turns out that because the /apex directory is empty, none of the binaries in /system/bin (except init) can be executed: the dynamic linker they depend on is missing.

aa236d9aad5b:/ # /system/bin/bootstrap/linker64 /bin/file /system/bin/init                                                                                                                   
/system/bin/init: ELF shared object, 64-bit LSB x86-64, dynamic (/system/bin/bootstrap/linker64), for Android 30, BuildID=c387547d5d208352a4363599a59c36db, stripped
aa236d9aad5b:/ # /system/bin/bootstrap/linker64 /bin/file /bin/toybox                                                                                                                        
/bin/toybox: ELF shared object, 64-bit LSB x86-64, dynamic (/system/bin/linker64), for Android 30, BuildID=7a9c65865018dd2cd0c4f7eb47555169, stripped
aa236d9aad5b:/ # /system/bin/bootstrap/linker64 /bin/ls -l /system/bin/bootstrap/                                                                                                            
total 3016
-rwxr-xr-x 1 root shell 1578556 2024-05-27 14:41 linker
-rwxr-xr-x 1 root shell 1503960 2024-05-27 14:41 linker64
lrwxrwxrwx 1 root shell       6 2024-05-27 14:42 linker_asan -> linker
lrwxrwxrwx 1 root shell       8 2024-05-27 14:42 linker_asan64 -> linker64
1|aa236d9aad5b:/ # /system/bin/bootstrap/linker64 /bin/ls -l /system/bin/link                                                                                                                
linker         linker64       linker_asan    linker_asan64  linkerconfig
1|aa236d9aad5b:/ # /system/bin/bootstrap/linker64 /bin/ls -l /system/bin/linker*                                                                                                             
lrwxrwxrwx 1 root shell     36 2024-05-27 14:42 /system/bin/linker -> /apex/com.android.runtime/bin/linker
lrwxrwxrwx 1 root shell     38 2024-05-27 14:42 /system/bin/linker64 -> /apex/com.android.runtime/bin/linker64
lrwxrwxrwx 1 root shell     36 2024-05-27 14:42 /system/bin/linker_asan -> /apex/com.android.runtime/bin/linker
lrwxrwxrwx 1 root shell     38 2024-05-27 14:42 /system/bin/linker_asan64 -> /apex/com.android.runtime/bin/linker64
-rwxr-xr-x 1 root shell 962328 2024-05-27 14:41 /system/bin/linkerconfig

The init process uses a special linker in /system/bin/bootstrap.
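The dangling linker symlinks can also be found mechanically. The sketch below uses only shell built-in tests, so it should work from a static busybox shell; the function name find_dangling_links is made up for illustration.

```shell
#!/bin/sh
# Sketch: report symlinks whose target does not exist.
# find_dangling_links is a hypothetical helper name.
find_dangling_links() {
    for f in "$@"; do
        # -L is true for a symlink; -e follows the link and fails if the target is missing.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "dangling: $f"
        fi
    done
}

# Paths as seen in the listing above; prints nothing if no link is dangling.
find_dangling_links /system/bin/linker*
```

With an empty /apex, the four linker symlinks in /system/bin would be expected to show up as dangling, while linkerconfig, being a regular file, would not.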

I tried to execute it manually and it immediately exited with code 6, probably because it expects to run as PID 1.

If I do exec /init, this reproduces the symptom I see with a direct docker run -it --privileged redroid/redroid:11.0.0-latest, i.e. an exit code of 129.

Sounds like I'll need to dig into the AOSP source code for the meaning of error code 129.
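Before digging into the source, it may help to note that by shell convention an exit status above 128 usually means "terminated by signal (status - 128)", so 129 would map to signal 1, which is SIGHUP on Linux. A minimal sketch of that decoding follows; the function name is made up. Whether this is the actual cause is an assumption, although reboot(2) does document that a restart requested inside a non-initial PID namespace terminates that namespace's init with SIGHUP.

```shell
#!/bin/sh
# Sketch: interpret an exit status using the "128 + signal number" convention.
# describe_exit_status is a hypothetical helper name.
describe_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "status $status: likely terminated by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "status $status: ordinary exit code"
    fi
}

describe_exit_status 129
describe_exit_status 6
```

The convention is only a heuristic: a program is free to call exit(129) itself, so the result should be treated as a hint rather than proof.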

firemeteorxx commented 4 days ago

Figured out why the init log was missing: /dev/kmsg had not been made available. It appears that the boot sequence fails at the uevent socket creation step...

[18127.664725] init: mkdir("/dev/pts", 0755) failed File exists
[18127.664734] init: chmod("/proc/cmdline", 0440) failed Operation not permitted
[18127.664737] init: mount("sysfs", "/sys", "sysfs", 0, NULL) failed Device or resource busy
[18127.664740] init: mount("selinuxfs", "/sys/fs/selinux", "selinuxfs", 0, NULL) failed No such file or directory
[18127.664742] init: mknod("/dev/kmsg", S_IFCHR | 0600, makedev(1, 11)) failed File exists
[18127.664744] init: mknod("/dev/kmsg_debug", S_IFCHR | 0622, makedev(1, 11)) failed Operation not permitted
[18127.664746] init: mknod("/dev/random", S_IFCHR | 0666, makedev(1, 8)) failed File exists
[18127.664749] init: mknod("/dev/urandom", S_IFCHR | 0666, makedev(1, 9)) failed File exists
[18127.664751] init: mknod("/dev/ptmx", S_IFCHR | 0666, makedev(5, 2)) failed File exists
[18127.664753] init: mknod("/dev/null", S_IFCHR | 0666, makedev(1, 3)) failed File exists
[18127.664755] init: init first stage started!
[18127.664782] init: Unable to open /lib/modules, skipping module loading.
[18127.664819] init: [libfs_mgr]ReadFstabFromDt(): failed to read fstab from dt
[18127.664847] init: [libfs_mgr]ReadDefaultFstab(): failed to find device default fstab
[18127.664849] init: Failed to fstab for first stage mount
[18127.664863] init: Using Android DT directory /proc/device-tree/firmware/android/
[18127.664892] init: Could not open uevent socket
[18127.665100] init: InitFatalReboot: signal 6
[18127.668590] init: #00 pc 00000000000edcfd  /system/bin/init (android::init::InitFatalReboot(int)+205)
[18127.668595] init: #01 pc 000000000006f430  /system/bin/init (android::init::InitAborter(char const*)+32)
[18127.668600] init: #02 pc 0000000000019cfc  /system/lib64/libbase.so (android::base::SetAborter(std::__1::function<void (char const*)>&&)::$_3::__invoke(char const*)+60)
[18127.668603] init: #03 pc 00000000000192a0  /system/lib64/libbase.so (android::base::LogMessage::~LogMessage()+368)
[18127.668606] init: #04 pc 00000000000f78d0  /system/bin/init (android::init::UeventListener::UeventListener(unsigned long)+208)
[18127.668609] init: #05 pc 00000000000ac784  /system/bin/init (android::init::BlockDevInitializer::BlockDevInitializer()+52)
[18127.668611] init: #06 pc 00000000000a6298  /system/bin/init (android::init::FirstStageMount::Create()+504)
[18127.668614] init: #07 pc 00000000000aa075  /system/bin/init (android::init::DoFirstStageMount()+165)
[18127.668616] init: #08 pc 00000000000a1d7e  /system/bin/init (android::init::FirstStageMain(int, char**)+5550)
[18127.668619] init: #09 pc 000000000003b0ce  /system/bin/init (main+142)
[18127.668621] init: #10 pc 0000000000050bf8  /system/lib64/bootstrap/libc.so (__libc_init+104)
[18127.668624] init: Reboot ending, jumping to kernel

firemeteorxx commented 4 days ago

Here is the offender: SO_RCVBUFFORCE requires the CAP_NET_ADMIN capability... I don't know why, but the buf_sz here is hard-coded to 16MB, which seems very large. The default buffer size is roughly 400KB.

115     /* Only if SO_RCVBUF was not effective, try SO_RCVBUFFORCE. Generally, we
116      * want to avoid SO_RCVBUFFORCE, because it generates SELinux denials in
117      * case we don't have CAP_NET_ADMIN. This is the case, for example, for
118      * healthd. */
119     if (buf_sz_readback < 2 * buf_sz) {
120         if (setsockopt(s, SOL_SOCKET, SO_RCVBUFFORCE, &buf_sz, sizeof(buf_sz)) < 0) {
121             close(s);
122             return -1;
123         }
124     }