skvl commented 2 years ago

In #1450, several "Failed to read address" messages from the syscalls plug-in were noticed. They started to occur after libusermode was changed to inject MmCopyVirtualMemory instead of page faults.

I have researched the error and found several architectural issues.

Issue 1. If two or more plug-ins inject function calls, undefined behavior occurs.

The scenario:

Two plug-ins hook the same function, e.g. S1.
On an int3 event, libdrakvuf/vmi.c :: int3_cb calls the callback of plug-in P1, then that of P2.
P1 injects a call to some function F1. This changes info->regs (RIP is set to the address of F1 and other registers hold the arguments) and pushes some values onto the stack.
Then P2 injects a call to F2. The registers and the stack are changed again.
When F2 completes, P2 finishes its job and restores the state.
However, the state it restores is the one P1 set up for the F1 injection.
P1, meanwhile, keeps waiting for its expected state (PID, TID, RSP).
This could break the VM state.
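
The scenario above can be reproduced in a toy model. This is only a sketch and not DRAKVUF code: the `InjectingPlugin` class, the register dictionary, and the `0x20` stack adjustment are all hypothetical stand-ins for what a real injection does to info->regs.

```python
# Toy model of Issue 1; not DRAKVUF code. All names here are hypothetical.
ORIGINAL = {"rip": "S1", "rsp": 0x1000, "rcx": "orig_arg"}


class InjectingPlugin:
    def __init__(self, name, target):
        self.name = name
        self.target = target
        self.saved = None

    def on_breakpoint(self, regs):
        # Save whatever the shared register state holds right now...
        self.saved = dict(regs)
        # ...then redirect execution to the injected function.
        regs["rip"] = self.target
        regs["rcx"] = f"{self.name}_arg"
        regs["rsp"] -= 0x20  # stand-in for values pushed onto the stack

    def on_injection_done(self, regs):
        # Restore the state captured in on_breakpoint().
        regs.clear()
        regs.update(self.saved)


regs = dict(ORIGINAL)
p1 = InjectingPlugin("P1", "F1")
p2 = InjectingPlugin("P2", "F2")

# The dispatcher calls both callbacks on the same int3 event.
p1.on_breakpoint(regs)
p2.on_breakpoint(regs)  # P2 snapshots P1's injection state, then overwrites it

# F2 returns and P2 "restores" the guest.
p2.on_injection_done(regs)
print(regs)
```

In this model the state P2 restores has RIP pointing at F1 and a shifted RSP, so the guest never gets back to the registers it had when S1 was hit.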

Issue 2. If one plug-in injects a function call, other plug-ins could read invalid state.

The scenario:

Two plug-ins hook the same function, e.g. S1.
On an int3 event, libdrakvuf/vmi.c :: int3_cb calls the callback of plug-in P1, then that of P2.
P1 injects a call to F1. This changes info->regs.
P2 then reads the VM's state and therefore sees the modified registers.
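
A minimal sketch of this scenario, again as a toy model rather than DRAKVUF code. The memory map and function names are hypothetical, and the snapshot taken before the callbacks run is shown only for contrast; it is an assumption, not a description of any actual patch.

```python
# Toy model of Issue 2; not DRAKVUF code. Names and values are hypothetical.
import copy

GUEST_MEMORY = {0x2000: "syscall argument"}  # the only mapped address in this model


def p1_inject(regs):
    # P1 sets up arguments for the injected function F1.
    regs["rip"] = "F1"
    regs["rdx"] = 0x77C87930  # an argument for F1, not for the hooked function S1


def p2_read_arg(regs):
    # P2 treats RDX as an argument of S1 and dereferences it.
    return GUEST_MEMORY.get(regs["rdx"], "Failed to read address")


live = {"rip": "S1", "rdx": 0x2000}
snapshot = copy.deepcopy(live)  # taken before any callback runs

p1_inject(live)
stale_read = p2_read_arg(live)      # sees P1's injection arguments
clean_read = p2_read_arg(snapshot)  # sees the state at the breakpoint
print(stale_read, "|", clean_read)
```

Reading from the live registers fails because RDX no longer belongs to S1, while the same read against the pre-injection snapshot succeeds.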

The example trace:

# callback for USERHOOK

1655200441.922093 [VMI] [39241] Enter callback for BREAKPOINT
1655200441.922131 [USERHOOK] [hook_dll:576] [39241] [2532:2924:0xfffff88004c23a68] Enter
1655200441.968551 [USERHOOK] Found DLL which is worth processing 77bd0000: \Windows\System32\ntdll.dll
1655200441.969080 [USERHOOK] Start processing this dll_meta
1655200441.969239 Breakpoint VA 0xfffff80002bfa1b0 -> PA 0x2bfa1b0
1655200441.969367 [USERHOOK] Export info accessible OK 77cd6000
1655200441.969467 [USERHOOK] Export info accessible OK 77cd7000
1655200441.969569 [USERHOOK] Export info accessible OK 77cd8000
1655200441.969671 [USERHOOK] Export info accessible OK 77cd9000
1655200441.969772 [USERHOOK] Export info accessible OK 77cda000
1655200441.969870 [USERHOOK] Export info accessible OK 77cdb000
1655200441.969998 [USERHOOK] Export info accessible OK 77cdc000
1655200441.970107 [USERHOOK] Export info accessible OK 77cdd000
1655200441.970213 [USERHOOK] Export info accessible OK 77cde000
1655200441.970320 [USERHOOK] Export info accessible OK 77cdf000
1655200441.970427 [USERHOOK] Export info accessible OK 77ce0000
1655200441.970539 [USERHOOK] Export info accessible OK 77ce1000
1655200441.970643 [USERHOOK] Export info accessible OK 77ce2000
1655200441.970763 [USERHOOK] Export info accessible OK 77ce3000
1655200441.970874 [USERHOOK] Export info accessible OK 77ce4000
1655200441.971000 [USERHOOK] Export info accessible OK 77ce5000
1655200441.971107 [USERHOOK] Export info accessible OK 77ce6000
1655200441.974875 [USERHOOK] Trap page not accessible, inject copy memory 77ce7000. rcx=0xfffffa8002936640 rdx=0x77c87930 r8=0xfffffa8002936640 r9=0xfffff88004c23a60
1655200441.974892 [USERHOOK] [hook_dll:605] [39241] [2532:2924:0xfffff88004c23a08] Exit: perform hooking with status 128
1655200441.974899 [VMI] [39241] Exit callback for BREAKPOINT

# callback for SYSCALL

1655200441.974906 [VMI] [39241] Enter callback for BREAKPOINT
1655200441.975179 [39241] [2532:2924] Failed to read address (0x77c87930)
1655200441.975208 syscall EventID=39241 rcx=0xfffffa8002936640, rdx=0x77c87930, r8=0xfffffa8002936640, r9=0xfffff88004c23a60
1655200441.975342 [VMI] [39241] Exit callback for BREAKPOINT

Note that the failed address 0x77c87930 is the same as RDX after the injection of MmCopyVirtualMemory.

I have a patch for this in testing.

Issue 3. If one plug-in injects a function call, other plug-ins could get parasitic events on return from the injected function.

The scenario:

Two plug-ins hook the same function, e.g. S1.
On an int3 event, libdrakvuf/vmi.c :: int3_cb calls the callback of plug-in P1, then that of P2.
P1 injects a call to F1. This sets the return address to the entry point of S1.
If F1 is hooked by P2 or another plug-in, that call is logged as well.
P2 logs the hook event on S1.
When F1 returns, the hook event on S1 occurs again.
P2 logs the hook event on S1 a second time.

Thus one or more parasitic events could be logged.
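
The sequence can be sketched with a toy dispatcher. It is not DRAKVUF code: `HOOKS`, `run`, and both callbacks are hypothetical, and the model assumes F1 is hooked by P2 as well.

```python
# Toy model of Issue 3; not DRAKVUF code. All names are hypothetical.
log = []
state = {"injected": False, "pending": None}


def p1_callback(addr):
    # P1 injects F1 once; the injected call returns to S1's entry point.
    if addr == "S1" and not state["injected"]:
        state["injected"] = True
        state["pending"] = ("F1", "S1")  # (function to run, return address)


def p2_callback(addr):
    # P2 simply logs every hit it is told about.
    log.append(addr)


HOOKS = {"S1": [p1_callback, p2_callback], "F1": [p2_callback]}


def run(entry):
    addr = entry
    while addr is not None:
        for cb in HOOKS.get(addr, []):
            cb(addr)
        if state["pending"]:
            func, ret = state["pending"]
            state["pending"] = None
            for cb in HOOKS.get(func, []):  # F1 itself may be hooked
                cb(func)
            addr = ret  # F1 returns to S1, so the breakpoint fires again
        else:
            addr = None  # S1 runs normally; the model stops here


run("S1")
print(log)
```

The guest called S1 only once, yet P2's log ends up with three entries: the real hit on S1, the hit on the injected F1, and a second hit on S1 when F1 returns.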

A short example from my patched version:

# Enter "callback" of "libusermode"
1655367817.962046 [LIBDRAKVUF] Callback enter. Trap=0x6080007278a0
1655367817.962071 [USERHOOK] [map_view_of_section_ret_cb_2:389] [33123] [2600:3552:0xfffff880043f2a10] Enter
1655367817.962087 [USERHOOK] [hook_dll:617] [33123] [2600:3552:0xfffff880043f2a10] Enter
1655367817.962503 [USERHOOK] Continue processing this dll_meta. Trap=0x6080007278a0

# "libusermode" finishes the "DLL" and requests a register restore.
# The request sets "info->regs_modified". The "info->regs" is the register state on `MmCopyVirtualMemory` exit.
1655367817.988264 [USERHOOK] Hook RtlGetVersion (vaddr = 0x77de873a, dll_base = 0x77db0000, result = OK)
1655367817.988273 [USERHOOK] Done, flag DLL as hooked
1655367817.988741 [APIMON] DLL hooked - done

# Dump of info->regs.
1655367817.988767 Dump info->regs
1655367817.988775 rax:  0000000000000000
1655367817.988781 rcx:  fffff880043f26e8
1655367817.988787 rdx:  fffffffffffffd70
1655367817.988793 rbx:  fffffa8002d7ab50
1655367817.988798 rsp:  fffff880043f2a10
1655367817.988805 rbp:  fffff880043f2b60
1655367817.988811 rsi:  00000000001cdfe8
1655367817.988817 rdi:  fffff880043f2a88
1655367817.988822 r8:   0000000000000000
1655367817.988828 r9:   0000000000000000
1655367817.988833 r10:  0000000000000001
1655367817.988839 r11:  fffff880043f2a60
1655367817.988845 r12:  000000000014ef78
1655367817.988851 r13:  ffffffffffffffff
1655367817.988857 r14:  0000000000000002
1655367817.988862 r15:  000000000014ef88
1655367817.988868 rflags:       0000000000000282
1655367817.988874 dr6:  0000000000000000
1655367817.988880 dr7:  0000000000000400
1655367817.988885 rip:  fffff80002bfa1b0
1655367817.988892 cr0:  0000000080050031
1655367817.988898 cr2:  0000000077e4f75f
1655367817.988903 cr3:  00000000493bc000
1655367817.988909 cr4:  00000000000406f8

# Dump of info->regs_modified.
1655367817.988915 Dump info->regs_modified
1655367817.988920 rax:  000000000014ef88
1655367817.988927 rcx:  ffffffffffffffff
1655367817.988932 rdx:  00000000001ce010
1655367817.988938 rbx:  fffffa8002d7ab50
1655367817.988943 rsp:  fffff880043f2a68
1655367817.988949 rbp:  fffff880043f2b60
1655367817.988954 rsi:  00000000001cdfe8
1655367817.988960 rdi:  fffff880043f2a88
1655367817.988965 r8:   00000000001ce008
1655367817.988970 r9:   0000000000000002
1655367817.988976 r10:  fffff80002bfa1b0
1655367817.988981 r11:  fffff800028dc138
1655367817.988986 r12:  000000000014ef78
1655367817.988991 r13:  ffffffffffffffff
1655367817.988996 r14:  0000000000000002
1655367817.989001 r15:  000000000014ef88
1655367817.989007 rflags:       0000000000000246
1655367817.989012 dr6:  0000000000000000
1655367817.989018 dr7:  0000000000000400
1655367817.989023 rip:  fffff80002bfa1b0
1655367817.989029 cr0:  0000000080050031
1655367817.989036 cr2:  0000000077eb1000
1655367817.989059 cr3:  00000000493bc000
1655367817.989066 cr4:  00000000000406f8

# Exit "callback" of "libusermode"
1655367817.989071 [USERHOOK] [hook_dll:648] [33123] [2600:3552:0xfffff880043f2a10] Exit: perform hooking with status 0
1655367817.989081 [USERHOOK] [map_view_of_section_ret_cb_2:397] [33123] [2600:3552:0xfffff880043f2a10] Exit. Finish.
1655367817.989090 [LIBDRAKVUF] Callback exit. Trap=0x6080007278a0

# Enter "callback" of "syscall"
1655367817.989113 [LIBDRAKVUF] Callback enter. Trap=0x608000717ca0

# One could see "0xfffffffffffffd70" is the same as "rdx" on MmCopyVirtualMemory exit.
Failed to read address (0xfffffffffffffd70)
1655367817.989534 syscall EventUID=33123,Trap=0x608000717ca0
1655367817.989763 [LIBDRAKVUF] Callback exit. Trap=0x608000717ca0

This could be drawn like this (made with https://excalidraw.com/): [image: drakvuf-error]

So with the patch, Issue 2 is fixed but Issue 3 remains.

I'm looking for a better patch.

skvl commented 2 years ago

One possible solution I can imagine is to use MTF (monitor trap flag) to step one instruction past the entry point before injecting.

It looks complicated, though, and I have not verified it.

I would be glad if anybody could give me some advice on this issue.

tklengyel commented 2 years ago

IMHO the solution is pretty clearly that on a given vCPU a single injection ought to happen at a time. The first plugin that injects ought to prevent subsequent plugins from overriding the in-flight injection.
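
One way to picture this rule is a small per-vCPU guard that a plug-in must acquire before injecting. This is a hypothetical sketch: `InjectionGuard` and its methods are not part of libdrakvuf or libinjector.

```python
# Hypothetical per-vCPU injection guard; not part of libdrakvuf or libinjector.
class InjectionGuard:
    def __init__(self):
        self._owner = {}  # vCPU id -> name of the plug-in with an injection in flight

    def try_begin(self, vcpu, plugin):
        """Claim the vCPU for an injection; return False if it is already taken."""
        if vcpu in self._owner:
            return False
        self._owner[vcpu] = plugin
        return True

    def end(self, vcpu, plugin):
        """Release the vCPU; only the current owner may do so."""
        if self._owner.get(vcpu) != plugin:
            raise RuntimeError(f"{plugin} does not own the injection on vCPU {vcpu}")
        del self._owner[vcpu]


guard = InjectionGuard()
first = guard.try_begin(0, "P1")       # P1 starts injecting on vCPU 0
second = guard.try_begin(0, "P2")      # P2 must back off and retry later
other_vcpu = guard.try_begin(1, "P2")  # a different vCPU is unaffected
guard.end(0, "P1")
after_release = guard.try_begin(0, "P2")
print(first, second, other_vcpu, after_release)
```

A guard like this only keeps a second plug-in from overriding the in-flight injection; it does nothing about events that other plug-ins receive while the injected call runs.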

skvl commented 2 years ago

> IMHO the solution is pretty clearly that on a given vCPU a single injection ought to happen at a time. The first plugin that injects ought to prevent subsequent plugins from overriding the in-flight injection.

This is already done by the patch mentioned above, which fixes issues 1 and 2. The question is how to fix issue 3.

I think I would extend libinjector to track all injections in progress and prevent vmi.c from calling other plug-ins until the injection completes.

tklengyel commented 2 years ago

The idea always was that injection happens only during the setup phase of the VM and during runtime no injection should take place, as the state of the OS is no longer guaranteed to meet sane parameters. Disabling callbacks to other plugins while an injection is taking place may not be the best solution, as injection may take a long time and cause other plugins to miss events. For now with case 3 I think it's better to just have some duplicate events logged due to injection.

BonusPlay commented 2 years ago

> [...] during runtime no injection should take place [...]

but that's what we do in multiple plugins. We've been aware of this issue with @chivay and @kscieslinski for a long while, but the only solution we've been able to come up with would require a major refactor of both the drakvuf engine and the drakvuf plugins.

skvl commented 2 years ago

> > [...] during runtime no injection should take place [...]
>
> but that's what we do in multiple plugins. We've been aware of this issue with @chivay and @kscieslinski for a long while, but the only solution we've been able to come up with would require a major refactor of both the drakvuf engine and the drakvuf plugins.

I will try to fix this. It is already in progress.

skvl commented 2 years ago

> For now with case 3 I think it's better to just have some duplicate events logged due to injection.

I believe that duplicate events are very bad. Such events mask the real behavior of a sample.

So I would suggest another method: drop events in the context of the injecting PID:TID from the beginning of the injection until it finishes. All other events would be logged as is. Thus we get a much cleaner trace.

tklengyel commented 2 years ago

That's a possible solution but IMHO it would be better to keep it configurable instead of hard-coding something. My concern is that if there is a way for malware to trigger a behavior in drakvuf that skips logging events, then it would be tempting to try to trigger it to hide certain malware behavior. For example, if injection is needed to page memory back in and it won't issue any plugin callbacks until that is finished, well, now you have a pid:tid that goes dark as far as drakvuf is concerned. What if you have another tid that anticipates drakvuf's injection and hijacks it right after, making the thread totally dark as the drakvuf injection never returns? Would be a perfect anti-drakvuf setup. So I think any time injection is taking place on the VM more caution needs to be exercised, and blindly dropping events by itself is going to cause issues down the road. At least if it's left configurable, malware can't target a potentially vulnerable default configuration.
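
The PID:TID filter proposed earlier in the thread, made optional as suggested here, could look roughly like the sketch below. Everything in it is an assumption for illustration: the `InjectionEventFilter` class, the `drop_during_injection` switch, and the thread ID 3000 are hypothetical and not DRAKVUF API.

```python
# Hypothetical event filter; the class and option name are assumptions, not DRAKVUF API.
class InjectionEventFilter:
    def __init__(self, drop_during_injection):
        self.drop_during_injection = drop_during_injection  # configuration switch
        self._injecting = set()  # (pid, tid) pairs with an injection in flight

    def injection_started(self, pid, tid):
        self._injecting.add((pid, tid))

    def injection_finished(self, pid, tid):
        self._injecting.discard((pid, tid))

    def should_log(self, pid, tid):
        # Drop only events of the injecting thread, and only when enabled.
        return not (self.drop_during_injection and (pid, tid) in self._injecting)


def trace(drop_during_injection):
    flt = InjectionEventFilter(drop_during_injection)
    logged = []

    def event(pid, tid, name):
        if flt.should_log(pid, tid):
            logged.append((tid, name))

    event(2532, 2924, "S1")            # real hit on S1
    flt.injection_started(2532, 2924)  # a plug-in injects F1 on this thread
    event(2532, 2924, "F1")            # hit caused by the injected call
    event(2532, 3000, "S2")            # unrelated thread, always logged
    event(2532, 2924, "S1")            # parasitic hit on return from F1
    flt.injection_finished(2532, 2924)
    return logged


print(trace(True))
print(trace(False))
```

With the switch enabled, only the real S1 hit and the unrelated thread's event remain. With it disabled, the duplicates stay in the trace, which is the trade-off discussed above: the injecting thread is never dark, but its log is noisier.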

skvl commented 2 years ago

I agree.

tklengyel / drakvuf

Plug-ins intersection on injection breaks results #1469
