ionescu007 / SimpleVisor

SimpleVisor is a simple, portable, Intel VT-x hypervisor with two specific goals: using the least amount of assembly code (10 lines), and having the smallest amount of VMX-related code to support dynamic hyperjacking and unhyperjacking (that is, virtualizing the host state from within the host). It works on Windows and UEFI.
http://ionescu007.github.io/SimpleVisor/

Using SimpleVisor to run dynamic recompiled code emulating non-x86-64 code #19

Open hlide opened 7 years ago

hlide commented 7 years ago

I see no other place to discuss how it may be possible for SimpleVisor to run dynamically recompiled code emulating non-x86-64 code.

I have written an emulator which runs some PSP programs (based on a customized MIPS32 ISA), and I would like to extend some of what I have built to other guest ISAs.

The emulator uses an HLE (High Level Emulation) principle: user-land guest code is dynamically recompiled into native host code, while kernel- and hardware-related guest functionality is compiled directly from native host source, so there is no need to emulate any functionality or hardware at a lower level. Basically, a syscall calls a native host function instead of trying to emulate the guest instructions of the syscall step by step (a sketch follows).
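For illustration, a minimal sketch of what such an HLE syscall dispatch could look like, assuming a C dispatch table keyed by syscall number; all names and the syscall number below are hypothetical, not taken from the actual emulator:

```c
#include <stdint.h>

/* Hypothetical guest register context; the real layout is dynarec-defined. */
typedef struct GuestContext {
    uint32_t gpr[32];   /* MIPS32-style general-purpose registers */
    uint32_t pc;
} GuestContext;

/* Each guest syscall maps to a native host function (HLE): nothing is
   emulated at the instruction level inside the syscall. */
typedef void (*HleSyscall)(GuestContext *ctx);

static void hle_sceIoOpen(GuestContext *ctx)
{
    /* Implemented directly against the host OS, reading its arguments
       from the guest registers. */
    (void)ctx;
}

static const HleSyscall g_syscall_table[256] = {
    [0x27] = hle_sceIoOpen,  /* hypothetical syscall number */
};

/* Called when recompiled code reaches a guest syscall instruction. */
void dispatch_syscall(GuestContext *ctx, uint32_t nr)
{
    if (nr < 256 && g_syscall_table[nr])
        g_syscall_table[nr](ctx);
}
```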

The dynamic compiler emits x86-64 instructions and uses its own ABI, so to speak: up to 12 GPRs are available to the integer register allocator. By not trying to comply with the usual Windows 64-bit ABI, I can get faster emulation. The key point is that the chains of basic blocks are built entirely by the dynarec, so the usual ABI does not need to be saved and restored between host basic blocks (only when calling a syscall). Some details can be found here and there.

I have different fields I would like to address with a hypervisor like yours and check if it is possible:

1) To execute the chains of generated code inside a dedicated logical processor. A call to a guest syscall will exit to native Windows code to execute some functionality or to recompile new guest code. Ideally, I want the generated code to live inside the first virtual 4GB address range, to keep the ICache array with 32-bit function pointers as entries, since most ISAs I want to emulate have 32-bit pointers. I sometimes wonder whether this logical processor needs its own memory mapping or not. Is that possible with SimpleVisor, or would it be better to keep the same memory mapping as the running Windows program?

2) To get a perfect memory emulation which mimics the guest memory mapping. I used some Windows-specific tricks to get very fast memory access (see the sketch at the end of this comment). While that may be enough for emulating the PSP, it may not be for other architectures. Another possibility I can see is to use the FS segment when running generated code in its logical processor (not sure whether GS can also be used for another purpose as long as the logical processor doesn't exit, or can it?). This way, the FS segment could map the whole 4GB of guest memory (also called the dcache), and a simple MOV or MOVBE with an FS prefix could access this guest memory. If GS can also be used, it could map the huge icache (for each guest address, it maps either a potential recompiled basic block to jump into, or recompiler code that creates a new basic block) and would allow very fast execution of chains of host basic blocks.

For those two points, do you think SimpleVisor may help?
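As an aside, one plausible form of the "Windows-specific trick" mentioned in point 2 is reserving a contiguous 4GB window up front; the post does not say which trick was actually used, so this is only a sketch under that assumption:

```c
#include <windows.h>
#include <stdint.h>

/* Reserve a contiguous 4GB window for the guest address space and
   commit pages on demand, so a 32-bit guest address is just an
   offset from g_guest_base. */
static uint8_t *g_guest_base;

int guest_memory_init(void)
{
    g_guest_base = (uint8_t *)VirtualAlloc(NULL, 1ULL << 32,
                                           MEM_RESERVE, PAGE_NOACCESS);
    if (!g_guest_base)
        return -1;

    /* Commit, for example, the first 16MB as guest RAM. */
    return VirtualAlloc(g_guest_base, 16 << 20,
                        MEM_COMMIT, PAGE_READWRITE) ? 0 : -1;
}

/* If the FS base were later pointed at g_guest_base, the dynarec could
   emit MOV/MOVBE with an FS prefix and a 32-bit effective address. */
```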

Best regards.

hlide commented 7 years ago

After giving it some thought, it seems that what I'm looking for is mostly a safe way to reroute the FS and GS segments. Having a virtual memory layout different from the Windows process is neither desirable nor necessary.

When running the generated code with its own ABI, FS and GS will point at a 4GB guest data cache and a 4GB host basic-block cache, respectively. When exiting the generated code to execute external code, FS and GS must be restored to their Windows values. And should external events like interrupts happen, FS and GS should also be restored to their Windows values. I'm not fluent enough with hypervisors to determine whether this is possible, and how to implement it efficiently with, say, SimpleVisor.

The reasons why FS and GS may be perfect for emulating the guest mapping are:

1) the guest architectures I plan to work with - whether they use 32-bit or 64-bit registers - use 32-bit pointers, which means the 32 upper bits are ignored in most cases. See void* good_pointer = ...; void* still_good_pointer = 0x5AF89C7C00000000 ^ good_pointer;: good_pointer and still_good_pointer will point at the same place in guest memory. If I want to address this issue, I need to use an intermediate register to compute the 32-bit effective address (EA) before accessing memory: MOV RDX, [RSI + ofs_GPRbase]; ...; MOV ECX, EDX; MOVBE RAX, [RBX + RCX + imm], where RBX is the base of the guest data memory, RCX an intermediate register to hold the EA, RAX the allocated register to hold the read value, RDX another allocated register holding the guest base register for the load, and RSI a persistent register to hold the guest register context address. Now imagine there are several pointers in the basic block: the chances to optimize register usage are compromised. In fact, x86-64 can deal with a 32-bit EA itself by using the 0x67 prefix. You can issue the same sequence without the need for an intermediate register: MOV RDX, [RSI + ofs_GPRbase]; ...; MOVBE RAX, FS:[EDX + imm], where we replace RBX with FS as the base of the guest data memory. The EA computation directly takes EDX as a base and does a 32-bit zero extension at the end. That is exactly what the guest processor would do: ignore the 32 upper bits. And RDX can still contain random upper bits without any consequence when used as a base/index for load and store operations.

2) the same situation arises for the host basic-block cache. To chain basic blocks in the case of a guest indirect jump instruction: MOV RDX, [RSI + ofs_GPRtarget]; ...; MOV EBP, EDX; JMP [RDI + RBP], where RDI is the base of the host basic-block cache, RDX an allocated register to hold the guest target register for an indirect jump, RBP a special register to contain the jump target, and RSI a persistent register to hold the guest register context address. RBP is also used by the recompiler code when a basic block must be generated, so that it knows the guest address to recompile. If we map that cache into the GS segment, we obtain: MOV RDX, [RSI + ofs_GPRtarget]; ...; MOV EBP, EDX; JMP GS:DWORD PTR[EBP], where we replace RDI with GS. We need to keep EBP so the recompiler function can still get the guest address to recompile from it. If the target GPR is only used for an indirect jump in the basic block, a simpler version works: MOV EBP, [RSI + ofs_GPRtarget]; JMP GS:DWORD PTR[EBP] (see the encoding sketch below).
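For what it's worth, a minimal emitter sketch for the FS-based load above, with the encoding taken from the Intel SDM (the function and buffer names are made up):

```c
#include <stdint.h>
#include <string.h>

/* Emit: MOVBE RAX, FS:[EDX + disp32]
 *   64          FS segment override
 *   67          address-size override: 32-bit EA, zero-extends EDX
 *   48 0F 38 F0 REX.W + MOVBE r64, m64
 *   82          ModRM: mod=10 (disp32), reg=RAX, rm=EDX
 *   disp32      little-endian displacement                        */
static size_t emit_movbe_fs_load(uint8_t *p, int32_t disp)
{
    static const uint8_t op[] = { 0x64, 0x67, 0x48, 0x0F, 0x38, 0xF0, 0x82 };
    memcpy(p, op, sizeof(op));
    memcpy(p + sizeof(op), &disp, sizeof(disp));
    return sizeof(op) + sizeof(disp);
}
```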

Now some questions:

1) If an external event occurs, could SimpleVisor catch it and ensure that FS and GS are set properly, so the event can be handled by Windows correctly?
2) What is the best way to alter FS and GS safely, so that they point at the huge memory blocks only while generated code is executing?
3) Can SimpleVisor handle several virtual CPUs? Suppose we have 8 guest processors: do we need to create 8 virtual CPUs?
4) When using a Win32 thread to create a guest thread (remember, the emulator only emulates user-land guest code), is there a chance for all the guest threads to run in parallel, up to the number of real logical processors of the Intel CPU?

Best regards.

ionescu007 commented 7 years ago

Hi!

The short answer is: SimpleVisor would probably not be a good match for your goals. I think that the Bareflank project might be closer to what you need... That being said...

If your main goal is having FS/GS point to something else, what you really want is a custom LDT. Why not simply do that? On x86 you can do it from user mode, with documented APIs. On x64, you can use a driver.
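For context: in 64-bit mode the FS/GS bases used by 64-bit code come from MSRs rather than descriptor bases, so the driver route essentially comes down to WRMSR. A rough sketch of what such a driver's handler might do (the input structure is made up, and this deliberately ignores the context-switch problem raised later in the thread):

```c
#include <ntddk.h>
#include <intrin.h>

#define IA32_FS_BASE        0xC0000100  /* user-mode FS base */
#define IA32_KERNEL_GS_BASE 0xC0000102  /* becomes the user GS base after SWAPGS */

/* Hypothetical IOCTL payload: the flat base addresses to load. */
typedef struct _SET_FSGS_INPUT {
    ULONG64 FsBase;
    ULONG64 GsBase;
} SET_FSGS_INPUT;

static VOID SetUserFsGs(const SET_FSGS_INPUT *In)
{
    /* While in the kernel, GS points at the KPCR; the value that will
       become the user-mode GS base sits in IA32_KERNEL_GS_BASE until
       the SWAPGS on the return to ring 3. */
    __writemsr(IA32_FS_BASE, In->FsBase);
    __writemsr(IA32_KERNEL_GS_BASE, In->GsBase);
}
```

Note that nothing here stops the scheduler from reloading these bases on the next context switch, which is exactly the objection raised below.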

hlide commented 7 years ago

I'm only considering an x64 host machine. I'm not familiar with the idea of using a driver to do so (any pointer to a source showing how to alter the FS/GS segments through a driver?). If I must go through an IOCTL to save or restore the FS/GS segments, it sounds like an awfully slow method. I was hoping that a transparent use of VM exits to determine which FS/GS values to save/restore would be faster. Thanks anyway.
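(For scale, the IOCTL round trip in question is a single DeviceIoControl call from user mode; a sketch matching the hypothetical driver above, with a made-up device handle and control code:)

```c
#include <windows.h>
#include <winioctl.h>

/* Hypothetical control code matching the driver sketch above. */
#define IOCTL_SET_FSGS CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, \
                                METHOD_BUFFERED, FILE_ANY_ACCESS)

static BOOL SetFsGs(HANDLE Device, ULONGLONG FsBase, ULONGLONG GsBase)
{
    ULONGLONG In[2] = { FsBase, GsBase };
    DWORD Returned;
    return DeviceIoControl(Device, IOCTL_SET_FSGS, In, sizeof(In),
                           NULL, 0, &Returned, NULL);
}
```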

ionescu007 commented 7 years ago

Hi,

Are you familiar with the idea of an LDT?

A driver IOCTL is significantly faster than a VMEXIT.

Best regards, Alex Ionescu


buraktamturk commented 7 years ago

I don't think that manipulating the FS/GS segments from a driver would solve the issue, as they are likely to be restored (maybe from the ETHREAD structure?) on a task switch. Also, Windows uses them for exception handling, syscalls in WOW64 processes, etc.

What you need is a KVM alternative on Windows. I think KVM is what you're asking for (except it is for Linux): https://lwn.net/Articles/658511/

rianquinn commented 7 years ago

Have you taken a look at our hyperkernel? https://github.com/Bareflank/hyperkernel

It still needs a lot of work (the scheduler cannot preempt yet), but we can at least provide a very simple C/C++ environment in a guest. With what is there, you should be able to set up whatever environment you want. The scheduler should be more complete once I get the interrupt management code into the extended APIs, as I need a clock to finish the scheduler.

hlide commented 7 years ago

@ionescu007 Yes I am. I'm familiar with real and protected mode (though that was a long time ago), not with hypervisor mode. The LDT solution may not work the way you seem to think. Again, I'm only considering 64-bit user-land code: no 32-bit code and no WOW64 stuff.

Suppose I have a driver which allows me to save/restore FS/GS in a specific thread. I still need to protect this thread against interrupts (anything which can interrupt my user-land code to execute asynchronous Windows code that may need the original FS/GS values).

NOTE: I think I heard somewhere that FS is a NULL segment (selector 0) when running in a 64-bit thread, and that it is unconditionally set to the NULL selector whenever the thread is switched.

EDIT: found it in here

I have also noted some interesting additions to recent CPUs and operating systems related to this case. One is the FSGSBASE instruction set extension, starting with Intel Sandy Bridge and AMD Steamroller, which allows ring 3 code to directly access the fs and gs base registers. Another is the User-Mode Scheduling mechanism added in Windows 7, which allows pure user-mode thread scheduling.

I was once excited to try these mechanisms, but eventually failed. The Windows kernel does not seem to properly save/restore the fs register upon a context switch; AFAIK it just clears fs for a pure 64-bit application. For the gs register, if I set it to an arbitrary value, at some point the operating system (or maybe just the VS debugger) will complain that the register value is wrong and that the application is in a bad state.
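For reference, the FSGSBASE route mentioned in that quote uses compiler intrinsics like these (a minimal sketch; whether the OS preserves an arbitrary user-set base across context switches is exactly the failure described above):

```c
#include <intrin.h>   /* MSVC; on GCC/Clang use <immintrin.h> and -mfsgsbase */
#include <stdint.h>

/* Swap in a custom GS base around a run of generated code. Requires
   CPUID.(EAX=7,ECX=0):EBX.FSGSBASE[bit 0] and an OS that has set
   CR4.FSGSBASE; as described above, Windows may not preserve an
   arbitrary base across a context switch. */
static void run_with_custom_gs(void (*generated)(void), uint64_t gs_base)
{
    uint64_t saved = _readgsbase_u64();
    _writegsbase_u64(gs_base);
    generated();
    _writegsbase_u64(saved);
}
```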

hlide commented 7 years ago

@buraktamturk One year ago I found HAXM (the last post in that thread is mine) and a GitHub project with a sample showing how to use HAXM, but the sample was so simple that I couldn't tell whether I could use it properly, for lack of more information. It does indeed sound like a KVM-like solution, and it appears to be portable (Windows, macOS and Linux).

hlide commented 7 years ago

@rianquinn

I have. But it raises more questions. First, is there any version running with Windows as the host? Second, it appears to use MinGW-w64 and GCC: is that just for the driver, or would I need to compile my whole program with MinGW-w64 and GCC?

rianquinn commented 7 years ago

It supports Windows 8.1 and 10. We will have support for BSD and macOS soon. You build the driver with VS, and the hypervisor is built with Clang. We got rid of GCC in master.

hlide commented 7 years ago

@rianquinn I guess I can also post my question in https://github.com/Bareflank/hyperkernel?

My main question is: can a hypervisor intercept an external event, like an interrupt that needs to be handled by Windows, make sure the VM exit restores the FS/GS segments (MSRs), and then, when returning into the dynarec code, save them and set the specific values again? I suppose the dynarec code will execute in a virtual CPU, and any call to external code (that is, code needing the Windows environment) will exit this virtual CPU so Windows can execute the needed code; the hypervisor then resumes the virtual CPU. What I'm not sure about is how the virtual CPU works. Is it associated with one host logical processor? Is there a way to be sure that any external events (exceptions or interrupts) are handled outside the virtual CPU? And so on... I lack the kind of answers that would point to the best solution. (A sketch of the VMCS fields involved follows.)
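For reference, in VT-x the guest's FS/GS bases live in dedicated VMCS fields, so a hypervisor could in principle swap them around VM entry/exit; a sketch using the MSVC VMX intrinsic and the field encodings from the Intel SDM (the swap policy itself is the hypothetical part, not something SimpleVisor or Bareflank does out of the box):

```c
#include <intrin.h>

#define GUEST_FS_BASE 0x6806  /* VMCS field encodings, Intel SDM Vol. 3 */
#define GUEST_GS_BASE 0x6808

/* Hypothetical policy: before resuming the dynarec vCPU, point the
   guest's FS/GS bases at the dcache/icache windows; a VM exit that
   hands control back to Windows would write the saved bases back.
   Must run in VMX root mode with the vCPU's VMCS current. */
static void load_dynarec_segments(unsigned __int64 dcache,
                                  unsigned __int64 icache)
{
    __vmx_vmwrite(GUEST_FS_BASE, dcache);
    __vmx_vmwrite(GUEST_GS_BASE, icache);
}
```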

hlide commented 7 years ago

@rianquinn Oh, and about "the hypervisor is built with Clang": does that mean I still need to use something other than VS to compile the emulator program, assuming that what you call the hypervisor would be part of the emulator program?

rianquinn commented 7 years ago

Currently you build guest VMs with Clang as well. In the future we want to support PE/COFF too. We just started the hyperkernel, so there is a lot of work to do.

rianquinn commented 7 years ago

As for your other questions: each guest is given its own set of vCPUs, so Windows has a VMCS and so does the guest. You have complete control of all of the segments and registers of a vCPU, including interrupts.