panda-re / panda

Platform for Architecture-Neutral Dynamic Analysis
https://panda.re
Other
2.5k stars 479 forks source link

Replaying a recording fails with assert error on g_main_context_acquire or segfault #1391

Open jstarink opened 11 months ago

jstarink commented 11 months ago

Description

I am trying to get a basic record new or replay mechanism running. Following the documentation, I believe this should be either just a call to panda.run_replay or panda.record_cmd. However, there seems to be some nondeterminism in replaying the previously recorded traces with (Py)PANDA when doing so. Recording works fine, but replaying results in either a segfault or other assertion errors. This behavior seems consistent with both i386 and x86_64 images of both Linux and Windows. Am I doing something wrong or is this a bug?

How to reproduce

  1. Run a new panda instance from docker hub
    $ docker run --rm -it -v $(pwd):/local pandare/panda /bin/bash
  2. Create a new basic test.py file in /local with the following contents:

    from pandare import Panda
    import os
    
    panda = Panda(generic='x86_64')
    
    @panda.queue_blocking
    def driver():
        recording_name = "linux64recording"
        # Does a recording exist? If so, replay it. Otherwise, make a new recording.
        if os.path.isfile(f"{recording_name}-rr-snp"):
            print("Replaying")
            panda.run_replay(recording_name)
        else:
            print("Recording")
            panda.revert_sync('root')
            panda.record_cmd("whoami", recording_name=recording_name)
    
        panda.stop_run()
    
    panda.run()
  3. Run the script once and observe normal execution with output similar to the following:
    root@5a4e2fa4f486:/local# python test.py
    using generic x86_64
    Downloading required file: https://www.dropbox.com/s/4avqfxqemd29i5j/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2?dl=1
    /root/.panda/bionic-server-cloud 100%[========================================================>]   2.84G  22.2MB/s    in 2m 33s
    Validating file hash
    os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
    PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
    [PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_sf2yccvkw,server,nowait -monitor unix:/tmp/pypanda_mxmdsta9o,server,nowait]
    Replaying
    loading snapshot
    ... done.
    opening nondet log for read :   ./linux64recording-rr-nondet.log
    ./linux64recording-rr-nondet.log:     3253417 instrs total.
    Replay completed successfully
    Exiting cpu_handle_execption loop
  4. Now run the script a second time, and notice it crashing. I have seen several variations of output for this, but the following two seem to be the most prevalent:
    root@5a4e2fa4f486:/local# python test.py
    using generic x86_64
    os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
    PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
    [PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_sifghicd1,server,nowait -monitor unix:/tmp/pypanda_m8xy0ak1w,server,nowait]
    Replaying
    loading snapshot
    loading snapshot
    rdev-iowatch-serial0: Illegal RAM offset ffc000
    rdev-iowatch-serial0: error while loading state section id 4(ram)
    Failed to load vmstate
    Failed to start replay
    rdev-iowatch-serial0: Unknown savevm section 4
    Failed to load vmstate
    Failed to start replay
    free(): corrupted unsorted chunks
    Segmentation fault (core dumped)
    root@5a4e2fa4f486:/local# python test.py
    using generic x86_64
    os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
    PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
    [PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_s079elggc,server,nowait -monitor unix:/tmp/pypanda_mdbfu5knm,server,nowait]
    Replaying
    loading snapshot
    loading snapshot
    ... done.
    opening nondet log for read :   ./linux64recording-rr-nondet.log
    ... done.
    opening nondet log for read :   ./linux64recording-rr-nondet.log
    ./linux64recording-rr-nondet.log:     3253417 instrs total.
    Replay completed successfully
    Exiting cpu_handle_execption loop
    **
    GLib:ERROR:../../../glib/gmain.c:3375:g_main_context_acquire: assertion failed: (context->owner_count == 0)
    Bail out! GLib:ERROR:../../../glib/gmain.c:3375:g_main_context_acquire: assertion failed: (context->owner_count == 0)
    Aborted (core dumped)

Interestingly enough, the second output seems to indicate the replay completed successfully, but then still crashes afterwards.

Additional Info

Host OS: MANJARO 23.1.0 Kernel: 6.1.64 Docker Version: 24.0.7 panda-system-x86_64 --version: QEMU emulator version 2.9.1 (-dirty) CPU: Intel i7-11800H RAM: 16GB

AndrewFasano commented 11 months ago

I have a guess as to what's going wrong. When the recording exists, you call run_replay which runs the replay. Then you still hit the call to run which will try to resume the emulation from where things left off after the replay which will be something weird. If that's the issue, I think we should raise an error instead of crashing in this strange way.

Does the issue go away if you refactor things to avoid the call to run after run_replay, like this:

from pandare import Panda
import os

panda = Panda(generic='x86_64')
recording_name = "linux64recording"

if os.path.isfile(f"{recording_name}-rr-snp"):
    print("Replaying")
    panda.run_replay(recording_name)
else:
    @panda.queue_blocking
    def driver():
          print("Recording")
          panda.revert_sync('root')
          panda.record_cmd("whoami", recording_name=recording_name)
          panda.stop_run()
    panda.run()
jstarink commented 11 months ago

Perfect, that seems to have fixed the problem!

This makes me a bit confused about the naming of the available functions though. Considering that there are other run_xxx methods (such as run_serial_cmd) provided in the examples folder that are actually put into a panda.queue_blocking tagged function. Is run_replay substantially different from these other run methods? When should something be put in a queue and when shouldn't it be queued?

Also, indeed, a more descriptive error would've been nice :).

AndrewFasano commented 11 months ago

The run method runs a live guest while run_replay runs a replay, generally you want one or the other, not both.

I definitely see what you're saying where it's confusing how the run_serial_cmd method has a similar style name and being totally different! Open to suggestions for how we could improve these names in the future!

jstarink commented 11 months ago

@AndrewFasano Thanks for the explanation, that definitely clears it up!

I definitely see what you're saying where it's confusing how the run_serial_cmd method has a similar style name and being totally different! Open to suggestions for how we could improve these names in the future!

Some ideas come to mind: