Proposal: Drakvuf test framework

mtarral commented 5 years ago

Hi,

I would like to propose a small test framework that I developed for testing Drakvuf. ATM, the drakvuf repo doesn't have an official test suite.

I developed mine in Python, based on pytest, so it's easy to extend and will feel familiar to Python developers.

Basically, it manages the Drakvuf process, collects each line of its stdout as JSON, and pushes it to a queue that the test can read to validate a behavior.

All of this while tracking the birth and death of a specific process in the guest. A basic test can look like this, asserting that the process completed its execution, and no crashes happened in the meantime:

def test_injection(ev_queue):
    for event in iter(ev_queue['queue'].get, None):
        assert(event['Plugin'] != 'crashmon')
        assert(event['Plugin'] != 'bsodmon')
    assert(event['completed'].is_set())
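
The event collection described above could be sketched roughly as follows. This is not the framework's actual code: `pump_events` is a hypothetical helper, and the returned dict only mimics the shape of the `ev_queue` fixture used in the test. It assumes the monitored command prints one JSON object per line, and as a simplification it sets `completed` when the command itself exits with status 0, rather than tracking a process in the guest.

```python
import json
import queue
import subprocess
import threading


def pump_events(cmd):
    """Run cmd, parse each stdout line as JSON and push it to a queue.

    Hypothetical helper. A None sentinel marks the end of the stream so
    that iter(q.get, None) terminates in the consuming test.
    """
    events = queue.Queue()
    completed = threading.Event()

    def reader():
        try:
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
            for line in proc.stdout:
                line = line.strip()
                if not line:
                    continue
                try:
                    events.put(json.loads(line))
                except json.JSONDecodeError:
                    # Skip non-JSON output such as debug messages.
                    continue
            # Simplification: "completed" means the command exited cleanly.
            if proc.wait() == 0:
                completed.set()
        finally:
            # Always unblock the consumer, even if the command failed to start.
            events.put(None)

    threading.Thread(target=reader, daemon=True).start()
    return {"queue": events, "completed": completed}
```

Running the reader in a background thread keeps the test free to consume events as they arrive instead of waiting for the process to finish.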

Example: test-suite-example

Furthermore, it could help reproduce the issues some of us have been experiencing with Windows and repeated testing:

cc @skvl, @icedevml you might be interested as well.

@tklengyel What do you think about this idea ?
If you would agree to include this in Drakvuf, would you prefer a test/ directory or a submodule with its own repo for this project ?

Thanks !

icedevml commented 5 years ago

I'm not sure what @tklengyel already has in his CI, but I suppose that these are just generic end-to-end tests for DRAKVUF. I support the idea of end-to-end testing of particular plugins.

We already have some systems at CERT.PL which are e2e tested against malware samples or just toy programs, and this is pretty good at detecting unexpected consequences of code changes.

tklengyel commented 5 years ago

@mtarral this is a worthwhile project but I would like to keep it in a separate repository

@icedevml my CI code is at https://github.com/tklengyel/drakvuf-ci

mtarral commented 5 years ago

@tklengyel so, the thing is, I have constraints on how I can contribute to open-source repositories in my professional environment.

And creating a new repo is part of these constraints, I just can't do that. :man_shrugging: So if you could create one, I could make a PR to push the code.

Thanks.

mtarral commented 4 years ago

I'm happy to publish my pytest-based test framework for Drakvuf ! :tada: https://github.com/mtarral/drakvuf_test_suite

This can be interesting for everyone working on the project and who wishes to build an automated test suite in an easy and flexible way. cc @skvl , @icedevml , @disaykin

An example run:

sudo ./venv/bin/pytest --domain win7 --profile /etc/libvmi/win7.json --inject-method createproc -k injection -x --log-level=INFO --log-file=pytest.log -v --count 200

mtarral commented 4 years ago

Also this test framework can generate the BSOD and the various app crashes I was referencing in https://github.com/tklengyel/drakvuf/issues/622#issuecomment-489105859

And no, I tested again today, the holy patch https://github.com/tklengyel/drakvuf/pull/708 hasn't fixed those issues :cry:

icedevml commented 4 years ago

@mtarral I'm interested in the crashes. As far as I understand, right now it is very hard to reproduce any single problem that occurs from time to time?

mtarral commented 4 years ago

@icedevml, the crashes are not deterministic. If you want to reproduce them, assign as many VCPUs as you can (4 in my case), because I think they are related to a race condition, and the likelihood of triggering the BSOD increases with the number of VCPUs.

Open taskmgr in the VM (I configured it as the default injection target process in conftest.py because it was very stable; it can be customized in the tests ofc), and run the injection test 5000 times.

.. --count 5000

It should trigger a BSOD at some point. Or, if you are lucky, an application crash (explorer.exe has stopped working).

Also tell me if you have any issue setting up the test suite, I might have skipped some steps that were natural to me as I wrote it :)

tklengyel commented 4 years ago

@mtarral does the crash happen during injection or just some time afterwards? What if you do no injection?

Generally speaking, injection is the most intrusive aspect of DRAKVUF since it actively hijacks the execution of a running thread, modifies the stack, etc. It might be worthwhile keeping track of which thread got hijacked when the crash occurs. It might be the case that injection just happened to grab a thread that was not in a good state (since it just grabs the first one that executes in the target process). You might also want to test with specifying the thread id directly instead of leaving it up to chance.

mtarral commented 4 years ago

@tklengyel I'm positive the injection is not at fault.

I developed the ansible "injection" method for this purpose. It is using the WinRM service to connect and execute a powershell script, so nothing intrusive, and I still have the BSOD (PAGE_FAULT_IN_NONPAGED_AREA)

tklengyel commented 4 years ago

Ah, so it's not drakvuf's injection you are testing with. Well, sounds like we might have another lingering remapped entry or similar issue during shutdown. That issue was so in line with the effect we are seeing that I wouldn't be surprised if we still have a corner case somewhere where it happens. Just have to find it. Perhaps a way to debug it would be to add some debug call to Xen to check the state of altp2m and verify that all remappings have been reversed, etc.

icedevml commented 4 years ago

@mtarral So if I understood correctly, you are running DRAKVUF on the same VM circa 2000 times? I'm pretty sure it doesn't exit perfectly cleanly every time, and this number of runs is large enough to trigger a problem even with a single leaked mapping.

I was not caring that much about this problem, because my primary workflow is about running a fresh VM per each DRAKVUF run.

Anyway, this testing approach looks very promising - I bet this would make it much easier to find bugs that were hard to encounter normally.

mtarral commented 4 years ago

@icedevml in this specific case, yes, I'm running and stopping the Drakvuf process thousands of times to evaluate the robustness and find hidden issues like this BSOD.

But this is just an example of using the test suite. The initial goal is to test whether the plugins are working as intended, and to write the test cases in Python.

For example, in the dumb test_injection.py, I'm iterating over each event generated by the plugins, and simply asserting that none of them is either crashmon or bsodmon.

You could add your own logic and test whether a given binary generates a set of events (file, process, registry) that you would expect.
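
A check of that kind could look like the sketch below. `missing_plugins` is a hypothetical helper, not part of the published test suite. It assumes the queue layout of the `test_injection` example (events with a `Plugin` key, terminated by `None`), and the plugin names used in the usage note are assumptions about what a given sample would trigger.

```python
def missing_plugins(ev_queue, expected):
    """Drain the queue and return the expected plugin names never seen.

    Hypothetical helper built on the ev_queue layout shown in
    test_injection: ev_queue['queue'] yields dicts with a 'Plugin' key
    and ends with a None sentinel.
    """
    seen = set()
    for event in iter(ev_queue["queue"].get, None):
        # Same sanity checks as test_injection: the guest must not crash.
        assert event["Plugin"] not in ("crashmon", "bsodmon")
        seen.add(event["Plugin"])
    return set(expected) - seen
```

A test could then assert that `missing_plugins(ev_queue, {"filetracer", "procmon", "regmon"})` is an empty set, which gives a readable failure message listing the plugins that produced no event.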

I hope it's more clear now :)

tklengyel commented 4 years ago

simply asserting that none of them is either crashmon or bsodmon

I think the culprit in your crashes is crashmon itself. It is quite easy to trigger a bsod with that plugin, like after 2-3 restarts the VM bluescreens with page fault in non-paged area. I have been running bsodmon in a loop of stop/restart and after 50 iterations I saw no issues.

mtarral commented 4 years ago

@tklengyel I tried disabling crashmon, and the guest doesn't crash anymore, at least for 600 tests, with Ansible injection.

But why would it make the guest BSOD ? Drakvuf is already listening for CR3 changes, and crashmon is just a callback in cr3_cb loop ?

It is quite easy to trigger a bsod with that plugin, like after 2-3 restarts the VM bluescreens with page fault in non-paged area

Why so ? :thinking:

tklengyel commented 4 years ago

Yea, I'm digging into it right now too. It didn't really make much sense to me either, CR3 events don't cause any issue like that by default. With just bsodmon running I never see a crash. But as soon as I also add crashmon - even if I modify crashmon to do nothing but print a line on the cr3 callback - I get the bluescreen eventually. I suspect there might be some memory corruption happening.

mtarral commented 4 years ago

even if I modify crashmon to do nothing but print a line on the cr3 callback

Tried the same thing, returning 0 in check_crashreporter callback, and still BSOD. I'm doing tests by commenting parts of the code in cr3_cb to see which one is responsible.

Maybe process_free_requests(drakvuf) ? I don't know.

tklengyel commented 4 years ago

Don't think so.. It's of course one of those heisenbugs that as soon as you enable ASAN or run it through valgrind it never happens :)

mtarral commented 4 years ago

I'm not able to run Drakvuf under Valgrind. It complains about unimplemented ioctls or syscalls, and libvmi fails to start.

Does it work for you ?

tklengyel commented 4 years ago

You need to use https://github.com/tklengyel/valgrind

tklengyel commented 4 years ago

Btw

Drakvuf is already listening for CR3 changes, and crashmon is just a callback in cr3_cb loop ?

That's no longer the case. There is no default cr3 listener anymore, so here the plugin specifically enables it just for its own callback.

tklengyel commented 4 years ago

So it definitely looks like the cr3 event is the culprit behind these bluescreens, but only when used with altp2m. I ran a plain cr3 event enable/disable loop dozens of times and had no issue. With altp2m it triggers usually within the first 10 tries. I now suspect this to be a deeper rooted issue, perhaps a stale TLB in the hardware itself that is brought on by the trapping of mov-to-cr3 and the pagetables being wiped at shutdown. If the hardware kept going with a stale tlb afterwards that would explain the issue. Should be easy to test by bumping the VPID after altp2m is disabled.

tklengyel commented 4 years ago

So I have a patch to Xen that explicitly requests a TLB flush every time the altp2m changes (PTE gets propagated or gfn is remapped) and also after it is deactivated. I've run a small test and so far I haven't seen a bluescreen. However, I'm about to board my flight so can't really test it much further for now:

https://github.com/tklengyel/xen/tree/altp2m_tlb_flush

tklengyel commented 4 years ago

Interestingly, with this https://github.com/tklengyel/libvmi/blob/altp2m_test/examples/event-example.c I can't trigger the bsod. It does pretty much the same thing as the crashmon+bsodmon plugins: it enables CR3 events and does altp2m remapping. With drakvuf I still see the crash. So the bug must still be somewhere else.. Perhaps if we get an unlucky shutdown when we just sent an event reply with a singlestep + altp2m switch request but pause the domain before it had a chance to process that reply, while at the same time we tear down altp2m?

tklengyel commented 4 years ago

I'm also fairly confident at this point that this isn't a stale TLB issue with Xen. I tried forcing a TLB flush after every altp2m switch, after every gfn-change and PTE propagation, and every time altp2m is enabled/disabled. No luck.

icedevml commented 4 years ago

I can confirm this is easily reproducible on my hardware.

xl destroy win7-60 && \
zfs rollback kuku1/vms/vm60@monero && \
xl restore lol.sav && \
drakvuf -t 5 -a crashmon -a bsodmon -r profiles/windows7-sp1.rekall.json -d win7-60

Inside the VM (12 VCPUs backed by real cores) I run a web Monero miner in Google Chrome. After DRAKVUF stops, Chrome almost always displays Aw snap!. It seems the most CPU-consuming process is usually the victim, and not necessarily the kernel.

...and crashmon alone is enough to trigger these problems.
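
For anyone who wants to automate this kind of stop/restart reproduction, a small driver could be sketched as below. `runs_until_failure` is a hypothetical helper; how to detect that the guest or a guest program has crashed is left to the caller through the `check` callable, since the thread does not describe an automated way to do that.

```python
import subprocess


def runs_until_failure(steps, check, max_runs=5):
    """Repeat a command sequence until check() reports a failure.

    steps: list of argv lists, run in order on each iteration.
    check: caller-supplied callable returning True once the guest is
           considered crashed (detection method is an assumption left
           to the caller).
    Returns the 1-based iteration at which the failure was seen, or
    None if max_runs iterations passed without one.
    """
    for run in range(1, max_runs + 1):
        for argv in steps:
            # check=True aborts the loop if any step itself fails.
            subprocess.run(argv, check=True)
        if check():
            return run
    return None
```

The four commands above would be passed as `steps`, for example starting with `["xl", "destroy", "win7-60"]`, so that every iteration starts from the same snapshot.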

tklengyel commented 4 years ago

@icedevml could you verify that this only happens if crashmon is enabled?

icedevml commented 4 years ago

@tklengyel -t 5 -a crashmon alone is enough to visibly crash Chrome in max. 5 repetitions of the command.

If both dkommon and crashmon are disabled (-t 5 -x dkommon -x crashmon) then everything works fine. And I guess the issue is somewhere in LibVMI because enabling the CR3 trap alone is enough to crash the VM, even when the plugin callbacks in cr3_cb are commented out.

tklengyel commented 4 years ago

Cool, at least we are making progress. The problem is unlikely to be LibVMI, this must be something in Xen, specifically around disabling the cr3 event when altp2m is in use.

icedevml commented 4 years ago

The following patch substantially reduces the likelihood of a Chrome crash. Still, it's not perfect: the crash happens after 10-15 runs instead of 2-3. I wonder if there is some race where the event gets disabled and dispatched at the same time? Especially given the fact that a higher number of vCPUs helps to trigger this issue.

TLDR: when DRAKVUF is interrupted then it waits for the next CR3 callback to hit, then disables cr3 event (inside cr3_cb) and does real shutdown.

diff --git a/src/libdrakvuf/private.h b/src/libdrakvuf/private.h
index d5e5ac6..cf2f243 100644
--- a/src/libdrakvuf/private.h
+++ b/src/libdrakvuf/private.h
@@ -201,6 +201,7 @@ struct drakvuf
     GHashTable* remove_traps;

     int interrupted;
+    int cr3_cleared;
     page_mode_t pm;
     unsigned int vcpus;
     uint64_t init_memsize;
diff --git a/src/libdrakvuf/vmi.c b/src/libdrakvuf/vmi.c
index 5a8c9be..c844c0b 100644
--- a/src/libdrakvuf/vmi.c
+++ b/src/libdrakvuf/vmi.c
@@ -606,6 +606,12 @@ event_response_t cr3_cb(vmi_instance_t vmi, vmi_event_t* event)
         PRINT_DEBUG("CR3 cb on vCPU %u: 0x%" PRIx64 "\n", event->vcpu_id, event->reg_event.value);
 #endif

+    if (drakvuf->interrupted)
+    {
+        control_cr3_trap(drakvuf, 0);
+        drakvuf->cr3_cleared = 1;
+    }
+
     event->x86_regs->cr3 = event->reg_event.value;

     GTimeVal timestamp;
@@ -1317,7 +1323,7 @@ void drakvuf_loop(drakvuf_t drakvuf)
     drakvuf->interrupted = 0;
     drakvuf_force_resume(drakvuf);

-    while (!drakvuf->interrupted)
+    while (!drakvuf->interrupted || !drakvuf->cr3_cleared)
         drakvuf_poll(drakvuf, 1000);

     vmi_pause_vm(drakvuf->vmi);
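
The exit condition introduced by this patch can be illustrated with a toy model. This is only a sketch of the control flow in the diff, not DRAKVUF code: `LoopModel` and its attributes are hypothetical stand-ins for the `interrupted` and `cr3_cleared` fields and for `control_cr3_trap(drakvuf, 0)`.

```python
class LoopModel:
    """Toy model of the patched drakvuf_loop() exit condition."""

    def __init__(self):
        self.interrupted = False
        self.cr3_cleared = False
        self.cr3_trap_enabled = True

    def cr3_cb(self):
        # Mirrors the block added to cr3_cb: once interrupted, the trap
        # is disabled from inside the callback itself.
        if self.interrupted:
            self.cr3_trap_enabled = False
            self.cr3_cleared = True

    def keep_polling(self):
        # Mirrors: while (!interrupted || !cr3_cleared)
        return not self.interrupted or not self.cr3_cleared
```

In this model, setting `interrupted` alone does not end the loop: `keep_polling()` stays true until one more `cr3_cb()` call has disabled the trap, which is the ordering the TLDR describes.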

tklengyel commented 4 years ago

I might have a fix.. Can you guys give https://github.com/tklengyel/drakvuf/tree/bsod_fix a try? Make sure you install the LibVMI version tagged in the submodule (https://github.com/tklengyel/libvmi/tree/unmask_fix).

icedevml commented 4 years ago

Seems to be more stable, but the kernel/program still crashes after a few dozen DRAKVUF start-stop cycles.

tklengyel commented 4 years ago

Hm, yea, on my rig it got up to 747 stop/restarts before doing a bsod.

icedevml commented 4 years ago

A hint is to use some simple yet heavy load inside the VM and to have as many vCPUs as possible. The issue is then much easier to reproduce.

I wonder if it would be possible to catch the issue in a debugger. As far as I have seen, if the VM is heavily loaded then chances are high that the CPU-heavy application will crash on an access violation, and not the kernel. This may give some hint about what's happening. I will check.

icedevml commented 4 years ago

I wrote a test program:

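#include <windows.h>
#include <cstdio>

// Worker: burns CPU by hammering a counter shared by all threads.
// The unsynchronized access is deliberate; this is only a load generator.
DWORD WINAPI myThread(LPVOID lpParameter)
{
    unsigned int& myCounter = *((unsigned int*)lpParameter);

    while (1) {
        myCounter++;

        for (unsigned int i = 0; i < myCounter; i++) {
            myCounter += i;
        }
    }

    return 0;
}

int main() {
    unsigned int myCounter = 0;
    DWORD myThreadID;

    // Spawn 20 busy threads that all share the same counter.
    for (int i = 0; i < 20; i++)
        CreateThread(0, 0, myThread, &myCounter, 0, &myThreadID);

    // Keep printing so the main thread stays busy as well.
    while (1) {
        printf("%u\n", myCounter);
    }

    return 0;
}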

and for now I run this instead of Chrome to ensure it runs in a more predictable way. I have caught a crash of my program inside x32dbg, a few times in the same place, which looks like it is right after the x64 "heaven's gate". (This is with the bsod_fix branch)

crash

So now let's recompile it into x64 maybe...

icedevml commented 4 years ago

I went back to the master branch of DRAKVUF and recompiled the test program for x64. Right now I get an access violation at random places inside the loop in the myThread function.

Is there no execute permission? Or is something messed up with the memory mapping?

tklengyel commented 4 years ago

Is there no execute permission? Or something is messed with memory mapping?

It certainly looks like something along those lines. But after a close the altp2m views are destroyed and the default view is never touched. We only ever change mappings/permissions in the altp2m views. So there isn't anything there in terms of permission issues. The shadow pages are also removed. I also added a TLB flush in Xen at every conceivable point (including after enabling/disabling CR3 events) and it didn't make a difference, so I don't think there is an issue there either. I've also been able to reproduce the issue with just LibVMI, but only if altp2m is used with an active remapping and CR3 events are enabled. With just altp2m being used with a non-default view + CR3 there is no issue. With just CR3 there is no issue. This also doesn't seem to be KPTI related, as I'm running an old Win7 where PCID is not used.

One interesting aspect I wasn't able to verify 100% is that, to me, the issue only seemed to appear if the shadow page that will be used for remapping was added after the altp2m view was created and switched to. If the shadow page is added to the physmap before altp2m is activated, I haven't seen a bsod.

icedevml commented 4 years ago

@mtarral Check this out:

https://github.com/icedevml/xen/tree/RELEASE-4.13.0-patch-vmevent

it's the current stable Xen with three smart patches by @tklengyel. Seems like we've finally figured out what's wrong. Running the latest DRAKVUF and LibVMI with the above-mentioned Xen version, I no longer experience guest VM crashes when running DRAKVUF with -a bsodmon or -a crashmon.

TLDR there was a race condition which was triggering on LibVMI destruction

discussion -> https://github.com/libvmi/libvmi/issues/899

manorit2001 commented 3 years ago

@tklengyel is there any update on this testing framework btw? ( like including it in ci )

mtarral commented 2 years ago

@manorit2001 sorry for the late feedback. This test framework assumes you have Xen running with a VM, which is impossible at this point in Github Actions since, for example, it doesn't support nested virtualization.

The only CI that supports it is Travis, and we use it for running KVM-VMI build tests: https://app.travis-ci.com/github/KVM-VMI/kvm-vmi

I hope this helps.

manorit2001 commented 2 years ago

@manorit2001 sorry for the late feedback.

no worries

The only CI that supports it is Travis, and we use it for running KVM-VMI build tests: https://app.travis-ci.com/github/KVM-VMI/kvm-vmi

doesn't jenkins support it? coz I think that some e2e tests are run in drakvuf as well as drakvuf-sandbox which should depend on xen

mtarral commented 2 years ago

doesn't jenkins support it? coz I think that some e2e tests are run in drakvuf as well as drakvuf-sandbox which should depend on xen

You are right ! I totally forgot that there is a custom Jenkins worker running for every PR here.

So, yes, it could be integrated into the test suite with custom tests based on some sample execution, and expected plugin output.

tklengyel / drakvuf

Proposal: Drakvuf test framework #662