panda-re / panda

Platform for Architecture-Neutral Dynamic Analysis
https://panda.re

PANDA hangs using network tap with QEMU vexpress-a9 ARM board #1244

Open wpence opened 2 years ago

wpence commented 2 years ago

I have been running into an issue where QEMU/PANDA appears to hang right after I issue the command to start my VM: there is no output and nothing happens. It only happens intermittently, so it may be some sort of race condition. I have narrowed the issue down to using a network tap with the vexpress-a9 ARM board. I am attaching a sample VM and a script that reproduce this behavior.

To reproduce this issue with vm_sample.tar.gz:

# first setup a network tap
sudo tunctl -t tap1 -u ${USER}
sudo ip link set tap1 up
sudo ip addr add 192.168.1.2/24 dev tap1
# extract sample VM
tar xvf vm_sample.tar.gz
# run PANDA, need sudo for net tap
sudo ./run.sh

I've crafted a simple bash script (run.sh) that starts the VM over and over again and exits when it fails to start. Attaching gdb to the PANDA pid when it hangs gives the following backtrace:

Attaching to process 40651
[New LWP 40653]
[New LWP 40655]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__lll_lock_wait (futex=futex@entry=0x7fab8fc91340 <qemu_global_mutex>, 
    private=0) at lowlevellock.c:52
52  lowlevellock.c: No such file or directory.
(gdb) bt
#0  __lll_lock_wait
    (futex=futex@entry=0x7fab8fc91340 <qemu_global_mutex>, private=0)
    at lowlevellock.c:52
#1  0x00007fab882490a3 in __GI___pthread_mutex_lock
    (mutex=0x7fab8fc91340 <qemu_global_mutex>)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fab8f2936bd in qemu_mutex_lock
    (mutex=mutex@entry=0x7fab8fc91340 <qemu_global_mutex>)
    at /home/user/panda/util/qemu-thread-posix.c:60
#3  0x00007fab8eed2e81 in qemu_mutex_lock_iothread ()
    at /home/user/panda/cpus.c:1459
#4  0x00007fab8f28ffd3 in os_host_main_loop_wait (timeout=<optimized out>)
    at /home/user/panda/util/main-loop.c:262
#5  main_loop_wait (nonblocking=<optimized out>)
    at /home/user/panda/util/main-loop.c:540
#6  0x00007fab8f004334 in main_loop () at /home/user/panda/vl.c:1971
#7  0x00007fab8f0095c8 in main_aux
    (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>, pmm=PANDA_NORMAL) at /home/user/panda/vl.c:5070
#8  0x00007fab8e926083 in __libc_start_main (main=
    0x55fc1ea37060 <main>, argc=27, argv=0x7ffe1298b038, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe1298b028)
    at ../csu/libc-start.c:308
#9  0x000055fc1ea3709e in _start ()
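
The backtrace above covers only the main thread, which is waiting on qemu_global_mutex; gdb also reported two other threads (LWP 40653 and 40655). A minimal sketch for dumping every thread's backtrace in one shot, which may help show which thread holds the lock. The function name and the GDB override variable are hypothetical; the gdb options and the "thread apply all bt" command are standard gdb.

```shell
# Hypothetical helper: print backtraces for all threads of a running process.
# GDB can be overridden to point at a different gdb binary (assumption for
# illustration; defaults to "gdb" on PATH).
dump_all_threads() {
    local pid="$1"
    # -batch exits after running the -ex command, so the process is detached
    # again once the backtraces are printed.
    "${GDB:-gdb}" -p "$pid" -batch -ex "thread apply all bt"
}
```

For the hang shown here this would be called as `sudo` root with the PANDA pid, for example `dump_all_threads 40651 > all_threads.txt`.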

For reference the PANDA command being used is:

sudo QEMU_AUDIO_DRV=none panda-system-arm -m 1024 -M vexpress-a9,secure=off -cpu cortex-a9 -dtb vexpress-v2p-ca9.dtb -kernel zImage.armel -initrd initramfz.eabi -append "console=ttyS0" -serial file:serial.log -display none -net nic -net tap,ifname=tap1,script=no

I'm getting the same behavior in both my local PANDA build and the latest pandare/panda:latest docker container image. I have found that this issue only happens when using the vexpress-a9 QEMU ARM machine. If I switch to -M virt or -M versatilepb, it seems to boot fine every time. It also only happens with the tap network device. For example, if I change the net config to -net nic -net user, it boots fine every time. Using either -netdev tap or -net tap seems to trigger this behavior on the vexpress-a9 board. It happens whether the filesystem is an initramfs or a -drive device. I have reproduced this with several different versions of Linux targeting vexpress-a9.
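
For readers without the attachment, a retry loop of the kind described for run.sh might look roughly like the sketch below. This is not the script from vm_sample.tar.gz. It assumes a boot counts as successful once a marker string appears in the serial log (the command above writes it to serial.log); the marker "login:", the timeout, and all function and variable names are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical retry loop; NOT the run.sh shipped in vm_sample.tar.gz.
# Assumption: a boot is successful once BOOT_MARKER shows up in the serial log.
SERIAL_LOG="${SERIAL_LOG:-serial.log}"
BOOT_MARKER="${BOOT_MARKER:-login:}"
BOOT_TIMEOUT="${BOOT_TIMEOUT:-120}"   # seconds to wait per attempt
HUNG_PID=""

# Poll the log once per second; return 0 if the marker appears, 1 on timeout.
wait_for_boot() {
    local log="$1" marker="$2" timeout="$3" waited=0
    while (( waited < timeout )); do
        if grep -qF -- "$marker" "$log" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 1
}

# Start the given command repeatedly. Stop at the first attempt that never
# prints the marker and do not kill that process, so gdb can attach to it.
run_until_hang() {
    local attempt=0 pid
    while true; do
        attempt=$(( attempt + 1 ))
        rm -f "$SERIAL_LOG"
        "$@" &
        pid=$!
        if wait_for_boot "$SERIAL_LOG" "$BOOT_MARKER" "$BOOT_TIMEOUT"; then
            echo "attempt $attempt: booted"
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
        else
            HUNG_PID="$pid"
            echo "attempt $attempt: no '$BOOT_MARKER' after ${BOOT_TIMEOUT}s, pid $pid not killed"
            return 1
        fi
    done
}
```

The loop would be invoked as `run_until_hang` followed by the panda-system-arm command line shown above (without the leading sudo), with the whole script run under sudo so the tap device can be opened. An attempt whose process exits early without printing the marker is also reported as a failure, so the pid should be checked before attaching gdb.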

zestrada commented 2 years ago

We have run into this same issue but the fix still has some race conditions: https://github.com/panda-re/panda/pull/1232

Not sure if your issue is arriving at the race condition in the same way, but one workaround we had was to add an lpj= kernel argument. Just observe a boot without the network tap to grab a sane value from the console.

wpence commented 2 years ago

Interesting, this is probably a duplicate of your issue then. If I boot with lpj=2912256 it seems to boot reliably.
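
The workaround discussed above can be sketched as a small helper that reads the calibrated value back from a console log and builds the -append string. This assumes a previous boot without the tap (for example with -net nic -net user) wrote the kernel's delay-calibration line, which ends in "(lpj=<number>)", to serial.log. The function names are hypothetical.

```shell
# Hypothetical helper: print the first "lpj=<number>" found in a console log.
extract_lpj() {
    grep -o 'lpj=[0-9][0-9]*' "$1" | head -n 1
}

# Build the kernel command line used in this issue with the lpj value added.
# "console=ttyS0" is taken from the command above; the rest is an assumption.
build_append() {
    printf 'console=ttyS0 %s' "$(extract_lpj "$1")"
}
```

The tap-enabled command would then use `-append "$(build_append serial.log)"` in place of `-append "console=ttyS0"`. If the log contains no lpj value, the result is just the original console argument with a trailing space, so it is worth checking the output of `extract_lpj serial.log` first.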