thalium / icebox

Virtual Machine Introspection, Tracing & Debugging
MIT License

Ensuring resumed VM upon asynchronous termination/signaling of IceBox (python) #31

Closed pruzko closed 4 years ago

pruzko commented 4 years ago

Suppose we have an example inspired by the getting started post.

with vm.break_on(addr, print_hit):
    while True:
        vm.exec() # blocks here

The objective is to asynchronously stop the while True loop. This can be achieved using processes or threads.

A thread solution could look like this:

# master thread
stop = False # shared
t = start_worker_thread()
do_my_stuff()
stop = True
t.join()
# worker thread
...
with vm.break_on(addr, print_hit):
    while not stop:
        vm.resume()
        vm.wait_for(3000)     # timeout 3s

This solution does not work, because the wait_for function is not exported to Python, and looking at the sources it is apparent that the function blocks until a breakpoint is hit (which I can't assume will happen).

The solution with processes almost works:

# master process
p = start_worker_process()
do_my_stuff()
p.terminate()     # send a SIGTERM signal
p.join()
# worker process
...
def cleanup_routine(signum, frame):
    vm.release()
    exit(0)

signal.signal(signal.SIGTERM, cleanup_routine)    # set a SIGTERM handler
with vm.break_on(addr, print_hit):
    while True:
        vm.exec()    # blocks here

The worker process can receive SIGTERM in multiple scenarios. If it is received while waiting for a breakpoint, the VM is not blocked and everything works well. However, if the signal arrives during execution of a breakpoint callback function, the VM is paused and needs to be resumed (see cleanup_routine).

The problem with this solution is that when I override the handler for SIGTERM, the signal is never delivered. I also tried sending (various) signals with the kill command, but no luck. If the handler is not overridden, the library prints some logs and terminates (as expected).

The objective here is to stop the introspection and ensure the VM is running. The signals/threads/processes are only some failed attempts and maybe they are not necessary at all. Please, what should I do?

PS.: I use some synchronization primitives to ensure the worker has entered the while True loop before the master sends a signal. A solution where the master waits until the worker's callback function has finished would also be acceptable.

bamiaux commented 4 years ago

We could add

// stop vm execution, thread-safe
vm.interrupt()

// returns true if the vm is running
vm.is_running()

so you would be able to do

# main thread
while vm.is_running():
    vm.exec()

# and in another thread
vm.interrupt()
bamiaux commented 4 years ago

I'd need to modify state::exec, but I agree interrupting vm execution is currently missing and we need it in many scenarios

bamiaux commented 4 years ago

Untested branch: https://github.com/thalium/icebox/tree/vm_interrupt

pruzko commented 4 years ago

Thank you for the swift reply. vm.interrupt is a good idea.

I have installed and tested the vm_interrupt branch. src/icebox/icebox_py/__init__.py is missing the following export:

def interrupt(self):
    """Interrupt vm"""
    _icebox.interrupt()

I will add it in a PR.

However, the interrupt concept does not work yet. Consider the following example (that will be part of the PR as well):

import time
import threading
import icebox

class Worker(threading.Thread):
    def __init__(self):
        super().__init__()
        self.vm = None
        self.quit = False

    def run(self):
        try:
            self.vm = icebox.attach('Win10')
            proc = self.vm.processes.find_name('csrss.exe')
            addr = proc.symbols.address('ntdll!DbgPrint')
            phys_addr = proc.memory.physical_address(addr)

            with self.vm.break_on_physical(phys_addr, self.callback):
                print('Worker: Start')
                while not self.quit:
                    print('About to exec')
                    self.vm.exec()

                print('Worker: Done')
        finally:
            print('finally block')
            self.vm.resume()

    def stop(self):
        print('stopping')
        self.quit = True
        self.vm.interrupt()

    def callback(self):
        print('OK')

worker = Worker()
print('Master: Starting worker')
worker.start()               # blocks the whole process for some reason

for i in range(5):
    print(f'Sleeping {i}')
    time.sleep(1)

worker.stop()
worker.join()
print('Master: Done')

The interrupt indeed breaks the vm.exec() operation. The problem is that vm.exec() blocks both threads. I don't know how that is possible, but the for i in range(5) loop does not execute until you hit a breakpoint. A single iteration runs each time a breakpoint is hit, so if you trigger it 5 times, vm.interrupt() gets executed and everything works as expected.

It seems that the try_wait function is the troublesome one, but I can't find the cause. I was able to figure out that the following part of try_wait is blocking the other thread:

try_wait:
        while(!d.interrupted)
        {
            std::this_thread::yield();    // yielding does not seem to affect the problem, just saves CPU
            const auto ok = fdp::state_changed(d.core);
            if(!ok)
                continue;        // this gets executed
        ...

fdp::state_changed:
    const auto ret = FDP_GetStateChanged(core.shm_->ptr);
    if(!ret)
        return false;
    ...

FDP_GetStateChanged:
    ???

FDP_GetStateChanged uses spin-locks to protect some shared memory, but I have no idea why that would block the whole process.

Please, do you have any idea where to look next?

bamiaux commented 4 years ago

Thanks for the detailed bug report! I'll try to look at it, but the issue is probably around the Python bindings, where a native extension is blocking. Maybe there is something that can be done to release the Python interpreter

bamiaux commented 4 years ago

Probably https://docs.python.org/dev/c-api/init.html#releasing-the-gil-from-extension-code

bamiaux commented 4 years ago

I've pushed a tentative fix, where I release the python thread before waiting. It should explain your bug, but I am not able to check it until later next week

pruzko commented 4 years ago

The current solution crashes with SIGSEGV, because the constructor of struct Handle accesses a reference to core while you construct it with a shared pointer pointing to null.

I got it running as follows.

binding.cpp

1) Add a null check in the constructor of Handle:

Handle(const std::shared_ptr<core::Core>& c)
{
    core = c;
    if (c)
    {
        state::on_blocking_call(*c, [=](state::blocking_e blocking)
        {
            if(blocking == state::blocking_e::begin)
                thread_state = PyEval_SaveThread();
            else
                PyEval_RestoreThread(thread_state);
        });
    }
}

2) Replace the brace-initializations with explicit constructor calls:

new(handle) Handle{core} --> new(handle) Handle(core); // 2 places
// replacing new(handle) Handle{{}}
auto core = std::shared_ptr<core::Core>(nullptr);
new(handle) Handle(core);

state.cpp

  1. If you return false in try_wait you'll get a run-time exception (RuntimeError: error), so return true when interrupted:

    if(d.interrupted)
        return true;

It works pretty well, except for one bug that is a little out of my reach, I guess: breakpoints are removed while the VM is running if you leave the breakpoint context after an interrupt.

with self.vm.break_on_physical(addr, cb):
    self.vm.exec() # gets interrupted

The warning: INFO fdp: fdp::unset_breakpoint called on is_running vm

bamiaux commented 4 years ago

Yes, it was an untested quick&dirty patch, I will fix it properly soon. About breakpoint warnings, we probably need to pause the vm when we interrupt it so we can remove bps properly

bamiaux commented 4 years ago

Please try the latest vm_interrupt branch. It's still untested, so the try_wait fix may not work (but it will hopefully ^^)

bamiaux commented 4 years ago

You may still get weird behaviors inside python during breakpoint handling (def callback in your example), if you do not enable WinPE mode during windows boot and need to use page faults. I will fix it too eventually

pruzko commented 4 years ago

It works well, thanks a lot!