Replaying a recording fails with assert error on g_main_context_acquire or segfault

jstarink commented 11 months ago

Description

I am trying to get a basic record new or replay mechanism running. Following the documentation, I believe this should be either just a call to panda.run_replay or panda.record_cmd. However, there seems to be some nondeterminism in replaying the previously recorded traces with (Py)PANDA when doing so. Recording works fine, but replaying results in either a segfault or other assertion errors. This behavior seems consistent with both i386 and x86_64 images of both Linux and Windows. Am I doing something wrong or is this a bug?

How to reproduce

Run a new panda instance from docker hub

$ docker run --rm -it -v $(pwd):/local pandare/panda /bin/bash

Create a new basic test.py file in /local with the following contents:

from pandare import Panda
import os

panda = Panda(generic='x86_64')

@panda.queue_blocking
def driver():
    recording_name = "linux64recording"
    # Does a recording exist? If so, replay it. Otherwise, make a new recording.
    if os.path.isfile(f"{recording_name}-rr-snp"):
        print("Replaying")
        panda.run_replay(recording_name)
    else:
        print("Recording")
        panda.revert_sync('root')
        panda.record_cmd("whoami", recording_name=recording_name)

    panda.stop_run()

panda.run()

Run the script once and observe normal execution with output similar to the following:

root@5a4e2fa4f486:/local# python test.py
using generic x86_64
Downloading required file: https://www.dropbox.com/s/4avqfxqemd29i5j/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2?dl=1
/root/.panda/bionic-server-cloud 100%[========================================================>]   2.84G  22.2MB/s    in 2m 33s
Validating file hash
os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
[PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_sf2yccvkw,server,nowait -monitor unix:/tmp/pypanda_mxmdsta9o,server,nowait]
Replaying
loading snapshot
... done.
opening nondet log for read :   ./linux64recording-rr-nondet.log
./linux64recording-rr-nondet.log:     3253417 instrs total.
Replay completed successfully
Exiting cpu_handle_execption loop

Now run the script a second time, and notice it crashing. I have seen several variations of output for this, but the following two seem to be the most prevalent:

root@5a4e2fa4f486:/local# python test.py
using generic x86_64
os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
[PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_sifghicd1,server,nowait -monitor unix:/tmp/pypanda_m8xy0ak1w,server,nowait]
Replaying
loading snapshot
loading snapshot
rdev-iowatch-serial0: Illegal RAM offset ffc000
rdev-iowatch-serial0: error while loading state section id 4(ram)
Failed to load vmstate
Failed to start replay
rdev-iowatch-serial0: Unknown savevm section 4
Failed to load vmstate
Failed to start replay
free(): corrupted unsorted chunks
Segmentation fault (core dumped)

root@5a4e2fa4f486:/local# python test.py
using generic x86_64
os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
[PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_s079elggc,server,nowait -monitor unix:/tmp/pypanda_mdbfu5knm,server,nowait]
Replaying
loading snapshot
loading snapshot
... done.
opening nondet log for read :   ./linux64recording-rr-nondet.log
... done.
opening nondet log for read :   ./linux64recording-rr-nondet.log
./linux64recording-rr-nondet.log:     3253417 instrs total.
Replay completed successfully
Exiting cpu_handle_execption loop
**
GLib:ERROR:../../../glib/gmain.c:3375:g_main_context_acquire: assertion failed: (context->owner_count == 0)
Bail out! GLib:ERROR:../../../glib/gmain.c:3375:g_main_context_acquire: assertion failed: (context->owner_count == 0)
Aborted (core dumped)

Interestingly enough, the second output seems to indicate the replay completed successfully, but then still crashes afterwards.

Additional Info

Host OS: MANJARO 23.1.0 Kernel: 6.1.64 Docker Version: 24.0.7 panda-system-x86_64 --version: QEMU emulator version 2.9.1 (-dirty) CPU: Intel i7-11800H RAM: 16GB

AndrewFasano commented 11 months ago

I have a guess as to what's going wrong. When the recording exists, you call run_replay which runs the replay. Then you still hit the call to run which will try to resume the emulation from where things left off after the replay which will be something weird. If that's the issue, I think we should raise an error instead of crashing in this strange way.

Does the issue go away if you refactor things to avoid the call to run after run_replay, like this:

from pandare import Panda
import os

panda = Panda(generic='x86_64')
recording_name = "linux64recording"

if os.path.isfile(f"{recording_name}-rr-snp"):
    print("Replaying")
    panda.run_replay(recording_name)
else:
    @panda.queue_blocking
    def driver():
          print("Recording")
          panda.revert_sync('root')
          panda.record_cmd("whoami", recording_name=recording_name)
          panda.stop_run()
    panda.run()

jstarink commented 11 months ago

Perfect, that seems to have fixed the problem!

This makes me a bit confused about the naming of the available functions though. Considering that there are other run_xxx methods (such as run_serial_cmd) provided in the examples folder that are actually put into a panda.queue_blocking tagged function. Is run_replay substantially different from these other run methods? When should something be put in a queue and when shouldn't it be queued?

Also, indeed, a more descriptive error would've been nice :).

AndrewFasano commented 11 months ago

The run method runs a live guest while run_replay runs a replay, generally you want one or the other, not both.

I definitely see what you're saying where it's confusing how the run_serial_cmd method has a similar style name and being totally different! Open to suggestions for how we could improve these names in the future!

jstarink commented 11 months ago

@AndrewFasano Thanks for the explanation, that definitely clears it up!

I definitely see what you're saying where it's confusing how the run_serial_cmd method has a similar style name and being totally different! Open to suggestions for how we could improve these names in the future!

Some ideas come to mind:

I would consider distinguishing between running a PANDA analysis, and actually executing a program in the guest VM, perhaps using the run_ and exec_/execute_ prefixes.
I would also consider renaming run() to something like run_live_guest(), to make it more explicit and no confusion can happen.

panda-re / panda