google / android-emulator-container-scripts

1.88k stars 262 forks source link

Pre-built Android 11 Emulator consistently crashes partway through test run #271

Open kelly987654 opened 3 years ago

kelly987654 commented 3 years ago

Using the below start command, the emulator in the docker container consistently crashes partway through our test runs (not on the same test). The docker container works fine for a while, then the tests lose access to the emulator, and the container status shows as 'unhealthy'. The qemu process on both the Linux VM host and within the docker container show 100% cpu usage, and the emulator is no longer accessible via adb.

System info:

Start command:

DOCKER_IMAGE=us-docker.pkg.dev/android-emulator-268719/images/30-google-x64-no-metrics:latest

PORT=5555

container_id=$(docker run -d \
      --device /dev/kvm \
      --restart unless-stopped \
        --publish $PORT:5555/tcp \
          -e EMULATOR_PARAMS="-partition-size 8192 -no-boot-anim" \
            $DOCKER_IMAGE)

Logs immediately after crash:

kernel: [  961.430096] rcu: INFO: 
kernel: [  961.430632] sched: RT thrrcu_preempt self-detected stall on CPUottling activated
kernel: [  961.439453] rcu:     3-...!: (1 GPs behind) idle=dd2/1/0x4000000000000002 softirq=66875/66877 fqs=0 last_accelerate: 64b2/84c6, Nonlazy posted: ..D
kernel: [  961.449848]  (t=8120 jiffies g=125393 q=29)
kernel: [  961.453230] rcu: rcu_preempt kthread starved for 8121 jiffies! g125393 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
kernel: [  961.461412] rcu: RCU grace-period kthread stack dump:
kernel: [  961.465561] rcu_preempt     R  running task    13976    10      2 0x80004000
kernel: [  961.471210] Call Trace:
kernel: [  961.473711]  __schedule+0x3f4/0x6f0
kernel: [  961.476399]  schedule+0x81/0xb0
kernel: [  961.478704]  schedule_timeout+0xb5/0x150
kernel: [  961.481810]  ? __prandom_timer+0x8/0x8
kernel: [  961.484838]  rcu_gp_kthread+0x69d/0xe70
kernel: [  961.487836]  ? usb_stor_control_thread+0x8/0x8
kernel: [  961.491247]  kthread+0x180/0x190
kernel: [  961.493649]  ? usb_stor_control_thread+0x8/0x8
kernel: [  961.496937]  ? kthread_worker_fn.cfi_jt+0x8/0x8
kernel: [  961.500240]  ret_from_fork+0x35/0x40
kernel: [  961.503141] NMI backtrace for cpu 3
kernel: [  961.505832] CPU: 3 PID: 414 Comm: adbd Tainted: G           O      5.4.61-android11-2-00064-g4271ad6e8ade-ab6991359 #1
kernel: [  961.513669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
kernel: [  961.522394] Call Trace:
kernel: [  961.524961]  <IRQ>
kernel: [  961.526731]  dump_stack+0x95/0xe5
kernel: [  961.529490]  nmi_trigger_cpumask_backtrace+0x109/0x190
kernel: [  961.533418]  rcu_dump_cpu_stacks+0xcd/0xe0
kernel: [  961.536506]  rcu_sched_clock_irq+0x635/0xbd0
kernel: [  961.539872]  ? __raise_softirq_irqoff+0xab/0xd0
kernel: [  961.543408]  update_process_times+0x61/0xa0
kernel: [  961.546642]  tick_sched_timer+0x91/0xe0
kernel: [  961.549748]  ? alarmtimer_fired+0x8/0x8
kernel: [  961.552634]  __run_hrtimer+0x1fc/0x3f0
kernel: [  961.555805]  ? __raise_softirq_irqoff+0xab/0xd0
kernel: [  961.559302]  hrtimer_interrupt+0x1a2/0x450
kernel: [  961.562378]  ? kvm_sched_clock_read+0x19/0x40
kernel: [  961.565864]  ? clockevents_register_device.cfi_jt+0x8/0x8
kernel: [  961.569845]  smp_apic_timer_interrupt+0x16c/0x390
kernel: [  961.573312]  apic_timer_interrupt+0xf/0x20
kernel: [  961.576351]  </IRQ>
pokowaka commented 3 years ago

Tracking in: https://issuetracker.google.com/issues/199772019, so we can get some more folks to look at this.

kelly987654 commented 3 years ago

Here's another example. Tests were running fine for a while, then it got stuck (I can tell just from watching the Linux VM CPU, because it goes from normal spikes to a flat-line at 60% with all usage from the qemu process, Docker container shows 99% all from the qemu process).

From adb log immediately after recovery (no logs during the stuck period):

`09-13 16:42:27.190   757   757 I wpa_supplicant: wlan0: CTRL-EVENT-BEACON-LOSS 
09-13 16:42:27.226   548   548 W Looper  : Slow delivery took 269752ms main h=android.app.PropertyInvalidatedCache$AutoCorker$1 c=null m=0
09-13 16:42:27.226   548   548 W Looper  : Slow dispatch took 109ms main h=android.app.PropertyInvalidatedCache$AutoCorker$1 c=null m=0
09-13 16:42:27.249   548   596 I libprocessgroup: Successfully killed process cgroup uid 10157 pid 10604 in 271882ms
09-13 16:42:27.254   305   305 I Zygote  : Process 10604 exited due to signal 9 (Killed)
09-13 16:42:27.264   513   513 E netmgr  : qemu_pipe_open_ns:62: Could not connect to the 'pipe:qemud:network' service: Invalid argument
09-13 16:42:27.264   513   513 E netmgr  : Failed to open QEMU pipe 'qemud:network': Invalid argument
09-13 16:42:27.281   548   633 W SensorService: Dropping 12 cached events (968/984) to save 28/28 new events. 0 events previously dropped
09-13 16:42:27.281   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.283   548   633 I chatty  : uid=1000(system) SensorService identical 24 lines
09-13 16:42:27.283   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.283   548   633 W SensorService: Dropping 5 cached events (984/984) to save 5/5 new events. 28 events previously dropped
09-13 16:42:27.289   548  1895 W system_server: Long monitor contention with owner Binder:548_16 (1892) at android.os.BinderProxy android.os.BinderProxy.getInstance(long, long)(BinderProxy.java:411) waiters=0 in android.os.BinderProxy android.os.BinderProxy.getInstance(long, long) for 310ms
09-13 16:42:27.302   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.302   548   633 I chatty  : uid=1000(system) SensorService identical 1 line
09-13 16:42:27.302   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.302   517   517 E wifi_forwarder: qemu_pipe_open_ns:62: Could not connect to the 'pipe:qemud:wififorward' service: Invalid argument
09-13 16:42:27.302   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.303   548   633 I chatty  : uid=1000(system) SensorService identical 4 lines
09-13 16:42:27.303   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.303   517   517 E wifi_forwarder: RemoteConnection failed to initialize: RemoteConnection failed to open pipe
09-13 16:42:27.303   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.304   548   633 I chatty  : uid=1000(system) SensorService identical 16 lines
09-13 16:42:27.304   548   633 W SensorService: Dropping 28 cached events (984/984) to save 28/28 new events. 28 events previously dropped
09-13 16:42:27.346   548   633 W SensorService: Dropping 1 cached events (984/984) to save 1/1 new events. 20 events previously dropped
09-13 16:37:55.665     0     0 I binder  : undelivered transaction 1524010, process died.
09-13 16:42:27.355     0     0 E rcu     : INFO: rcu_preempt detected stalls on CPUs/tasks:
09-13 16:42:27.361     0     0 W         : (detected by 3, t=67923 jiffies, g=525421, q=405)
09-13 16:42:27.366     0     0 E rcu     : All QSes seen, last rcu_preempt kthread activity 67924 (4295905529-4295837605), jiffies_till_next_fqs=1, root ->qsmask 0x0
09-13 16:42:27.377     0     0 I         : R  running task    12504   384      1 0x00000000
09-13 16:42:27.383     0     0 W Call Trace:  
09-13 16:42:27.385     0     0 W         : <IRQ>
09-13 16:42:27.387     0     0 W         : sched_show_task+0x1b9/0x1d0
09-13 16:42:27.391     0     0 W         : rcu_sched_clock_irq+0x9d8/0xbd0
09-13 16:42:27.395     0     0 W         : ? __raise_softirq_irqoff+0xab/0xd0
09-13 16:42:27.399     0     0 W         : update_process_times+0x61/0xa0
09-13 16:42:27.402     0     0 W         : tick_sched_timer+0x91/0xe0
09-13 16:42:27.405     0     0 W         : ? alarmtimer_fired+0x8/0x8
09-13 16:42:27.409     0     0 W         : __run_hrtimer+0x1fc/0x3f0
09-13 16:42:27.412     0     0 W         : ? __raise_softirq_irqoff+0xab/0xd0
09-13 16:42:27.415     0     0 W         : hrtimer_interrupt+0x1a2/0x450
09-13 16:42:27.419     0     0 W         : ? kvm_sched_clock_read+0x19/0x40
09-13 16:42:27.422     0     0 W         : ? clockevents_register_device.cfi_jt+0x8/0x8
09-13 16:42:27.427     0     0 W         : smp_apic_timer_interrupt+0x16c/0x390
09-13 16:42:27.431     0     0 W         : apic_timer_interrupt+0xf/0x20
09-13 16:42:27.434     0     0 W         : </IRQ>`
kelly987654 commented 3 years ago

@pokowaka - Thanks for looking at this. I have been testing today with the 29-google-x64-no-metrics:30.3.4 container and have not been able to reproduce the issue, so I am inclined to think it is something related to the Android 11 container. Is there any suggestion for something I could try? I saw the other comments, we are just using the generic command I mentioned, and nothing USB related.