Is vmxoff path really safe/correct?

ionescu007 commented 8 years ago

Hi,

I experimented with turning VMX off by using DPC routine targeted to each processor, with synchronization barrier. If DPCs were delivered as part of SYSTEM process (such as from idle thread, or from system thread), everything worked fine. However, if a process was currently running, and the DPC interrupted the process, many times the process crashes with invalid memory access, or sometimes a kernel driver itself crashes (such as win32kfull running on the context of dwm.exe), when accessing session memory.

These problems go away if doing VMX off the way you do it, by sticking with a single thread that is already running under SYSTEM process. However, even if you have a comment that suggests you saw some strangeness.

So therefore, I have two questions:

1) Are you sure that all the register state is correctly saved and restored? You only save integer registers, but not xmm0,1, 2, 3, 4, 5 on x64, which are volatile registers (and neither the other xmm registers). Why do you only save integer registers during vmm_exit and vmxoff restore? Shouldn't FPU registers be saved too?

Must vmx_launch and vmx_off be done in the same address space? In other words, will vmx_off restore the CR3 that was saved during vmx_launch? Because if so, I can understand the danger of capturing CR3 in system process (session -1), but then doing vmx_off in dwm.exe (session 0). The totally different CR3 will certainly cause a crash. Does this mean that it's only safe to do on/off from within SYSTEM process?

Thank you.

tandasat commented 8 years ago

Thank you for the through description of questions.

As for the second question, VMXOFF does not change a value of CR3. However, it does not mean that VMXON and VMXOFF can only be done on the SYSTEM process. As long as a value of CR3 after VMXOFF remains the same as that of before VM-exit, it will be fine.

In case of HyperPlatform, it simply sets the same CR3 value for both the SYSTEM process (VmcsField::kGuestCr3) and the VMM (VmcsField::kHostCr3), and does not restore the CR3 that was being used at the time of VMCALL after VMXOFF. For this reason, VMXOFF can only be safely performed from the SYSTEM process in current implementation. The following are changes of CR3 w.r.t. VMXOFF:

Before VMCALL (CR3 == VmcsField::kGuestCr3)
VMCALL causing VM-exit CR3 is overwritten by a value stored in VmcsField::kHostCr3 during transition between VMX non-root operation, a guest context, to VMX root operation mode, a VMM context (CR3 == VmcsField::kHostCr3)
VMXOFF (CR3 == VmcsField::kHostCr3)
After VMXOFF, a guest context (CR3 == VmcsField::kHostCr3)

So, if VMCALL leading to VMXOFF is executed with a different CR3 from that of VmcsField::kHostCr3, this would cause crashes, as an original CR3 is not going to be restored even after VMXOFF.

Does this answer your question clear enough?

As for the first question, I do not have an answer right now, but it sounds good pointing out. I will look into it and will reply here shortly.

Thank you,

tandasat commented 8 years ago

As for the restoring registers, you are right. I believe that the VMM should save and restore non-interger registers too if they can be modified by the VMM since they are not automatically saved and restored by a processor on VM-exit and VM-enter.

In case of HyperPlatform, it does not appear to be an issue because those unhandled registers are not modified by the VMM; or VM-exit does not happen under a situation where consistency of those registers is required. For this reason, I would hold off updates unless an concrete issue is seen.

Thank you,

ionescu007 commented 8 years ago

Hi,

Thank you for the explanation on CR3. That makes a lot of sense. It looks like the requirement for SYSTEM could go away, if, after the vmcall, the CR3 would be captured, and overwritten in the kGuestCr3 field before the vmxoff. But I don't think it's worth doing that.

As for non-integer registers, on x64, the calling convention does support the compiler auto-generating function calls which use xmm registers. This will happen automatically in certain cases, in other cases, one must use __vectorcall as the calling convention to force this. Because of this, it is certainly possible for someone to have a vector function that is using volatile XMM registers, and hit a VMM Exit. Because you are using Visual Studio and because Windows Kernel x64 allows XMM registers, I think it's technically possible for hyperplatform.sys to end up using XMM registers -- for example if a memcpy becomes inlined + vectorized.

So I think this is taking a risk that certain optimization settings/further code might hit. It may be OK now, but not if someone adds more routines without realizing them.

Thanks for your insight!

tandasat commented 8 years ago

Thank you for sharing your knowledge on use of XMM registers. I was unaware of most of them.

I agree with your thoughts on potential risk of not fixing it. While the issue may not hit at this time, it would be tough to figure out a bug is caused by XMM registers for someone who are not aware of it. Given the fact that this project is meant to be a platform, fix should be considered. For now, I have created #5 for reference and potential future work.

I appreciate your thorough risk assessment and suggestion.

rianquinn commented 8 years ago

while working on bareflank.org, I ran into a similar issue with XMM, as I do believe they need to be addressed. The old vmxcpu code never saved these registers because it only ran on 32bit, which did not included XMM registers in the ABI, which is why we never included them in the MoRE hypervisor (as it's based on vmxcpu). On Bareflank, we use Libc++ which touches XMM and FPU registers, so these absolutely need to be saved in our case. Our current solution is to use the fxsave and fxrstor functions. In our state save area. Figured this info might help you guys in the future if you choose to fix this problem. Here is our PR that adds this code if you want an example

https://github.com/Bareflank/hypervisor/pull/123

ionescu007 commented 8 years ago

I highly recommend you use XSAVE/XRESTORE if available, otherwise you risk to miss certain registers and have hard to debug issues :)

Assuming LibC++ might use such registers (if you start using AES-NI or AVX).

tandasat commented 8 years ago

Thanks for sharing your findings. While I am not entirely familiar with the extended states, looks like an XSAVE feature set is more comprehensive ones, as far as I can see from this slides from Intel.

I originally thought a chance of hitting this issue was small and negligible, but it does not seem convincing anymore since Libc++, which one of the most common libraries, is affected. Libc++ may not going to be used inside HyperPlatform, but I would not surprise if other libraries triggered the same issue.

I will study the extended states first (ie, supported processor generations, difference between XSAVE, -OPT, -S and -C, and any implications) and then will work on implementation for fix.

rianquinn commented 8 years ago

Thanks for the advice, yeah we will be using XSAVE as well. The PR that is there will be updated with a couple of additional fixes as I found a nasty issue with -O3 as GCC uses SSE a lot with the heavy optimizations enabled. The problem is, GCC assumes the stack is 16 byte aligned, and so we were seeing segfaults as a result. The other thing I was missing was a restore on "promote" when coming out of VMX-non root. I should have my fixes in today for these.

tandasat commented 8 years ago

As far as preservation of x87, SSE and AVX are concerned (and that assumption is fine I think--need better understanding), the simplest way seems to use XSAVE and XRESTORE. It will require only checks for:

CPUID function 1:ECX.XSAVE[bit 26] == 1 to check support of the XSAVE
CPUID function 0DH, sub-function 0 to check if AVX state is supported

If either of them is not supported, using FXSAVE and FXRSTOR is the cleanest solution as they "save and restore the states of the XMM registers and the MXCSR register (SSE state), along with x87 state" (FXSAVE AND FXRSTOR INSTRUCTIONS), or RtlCaptureContext() as they seem to save SSE state too.

Some notes below.

Terms and Basics (XSAVE-SUPPORTED FEATURES AND STATE-COMPONENT BITMAPS)

Each processor state managed by the "XSAVE feature set" is called state. For example, x87 state for the x87 FPU execution environment, SSE state for the streaming SIMD extensions.
States are referred to as "state components," and managed by "state-component bitmaps."
A state-component bitmap comprises 64 bits; each bit in such a bitmap corresponds to a single state component. For example, bit 0 corresponds to the x87 state and bits in the range 62:10 are not currently defined.
- The XSAVE feature set takes state-component bitmaps from in EDX:EAX, and that value is called the "instruction mask,"
- If the bit corresponding to a state component is clear in XCR0, instructions in the XSAVE feature set will not operate on that state component, regardless of the value of the instruction mask
"user state components" can be managed by the entire XSAVE feature set, and XCR0 contains a state-component bitmap for them
"supervisor state components" can be managed only by XSAVES and XRSTORS, and IA32_XSS MSR contains a state-component bitmap for them
- All the state components corresponding to bits in the range 9:0 are user state components, except PT state
The "requested-feature bitmap (RFBM)" is a bitmap to be actually specify which components should be saved or restored.
- for user state components, RFBM == EDX:EAX & XCR0
- for supervisor state components, RFBM == EDX:EAX & (XCR0 | IA32_XSS)
The XSAVE feature set allows saving and loading processor state from a region of memory called an "XSAVE area".

Availability (ENUMERATION OF CPU SUPPORT FOR XSAVE INSTRUCTIONS AND XSAVESUPPORTED FEATURES)

If CPUID function 1:ECX.XSAVE[bit 26] == 0, no support of the XSAVE feature set at all
If CPUID function 1:ECX.XSAVE[bit 26] == 1,
- at least XGETBV, XRSTOR, XSAVE, and XSETBV are supported, and
- CR4.OSXSAVE can be set to 1
CPUID function 0DH, sub-function 0 returns
- supported "components," and
- a size of an XSAVE area for all the user state components supported by this processor
CPUID function 0DH, sub-function 1 returns availability of
- XSAVEOPT, which is simply an optimized version of XSAVE
- XSAVEC, which uses "compaction extensions" (ie, uses the "compacted format" for the extended region of the an XSAVE area) that are supposedly efficient than standard one (the "standard format").
- XSAVES, which saves the "supervisor state components" apart from the "user state components".

Data structures (XSAVE AREA)

The "legacy region" comprises the 512 bytes starting at the area’s base address. It is used to manage the state components for x87 state and SSE state
The "XSAVE header" comprises the 64 bytes starting at an offset of 512 bytes from the area’s base address
The "extended region" starts at an offset of 576 bytes from the area’s base address. It is used to manage the state components other than those for x87 state and SSE state

rianquinn commented 8 years ago

I am writing the code right now so I'll update the PR to bareflank that has at least what we are working on. With any luck I should have ours fixed by tonight. The approach we are taking is assuming that XSAVE is available as we only support SandyBridge and above. Once you have XSAVE, you can get the size of the save area by doing a

mov rax, 0x0D
cpuid
mov <state_save_size>, rcx

One thing I learned (took a bit to debug) was that xrstor will GPF if you don't zero out this memory. Since we have multi-core support, we do a lot of work on the stack to keep things thread-safe, and so not clearing the state save area was causing bits in the state save area to randomly trigger GPFs on restore. Also don't forget to make sure the state save area is 64byte aligned.

The harder part is how to handle XSETBV and CPUID. There does not seem to be a way to fake the result of XGETBV. At least, a call to XSETBV always traps, so we can ignore writes that disable bits we are using (will likely add MPX support in the future). Not really sure how best to handle XGETBV, but hopefully Windows (as Linux is ok) just enables all of the XCR0 bits and leaves it alone.

rianquinn commented 8 years ago

Ok, I updated the PR. It's not done yet as I still have to plum in the rest of the entry points, and I need to add the XSAVE logic to the exit handler, but this at least shows one way to use it

https://github.com/rianquinn/hypervisor/blob/compiler_flags/bfvmm/src/misc/src/execute_entry.asm#L88

ionescu007 commented 8 years ago

Hi,

Note that for Windows, there are APIs such as RtlGetProcessorExtendedFeatures and KeSaveExtendedProcessorState/KeRestoreExtendedProcessorState which should take care of most of this.

tandasat commented 8 years ago

Thank you for APIs, Alex. I did not realize XSTATE_MASK_GSSE meant the AVX state. Those functions are very helpful. I will write tests and update code shortly.

@rianquinn, thanks for sharing gotchas and code. Thanks to API provided by Windows, I am likely to be able to avoid direct use of the XSAVE feature set instructions. As for XCR0, if I recall correctly, I hardly saw VM-exit via XSETBV on Windows, perhaps, only at the time of system shutdown or something under a non-regular situation. So it may not be a big issue on Windows too.

tandasat commented 8 years ago

I was going to use APIs, but KeSaveExtendedProcessorState() could be used only when IRQL <= DISPATCH_LEVE, while VM-exit could be executed at any IRQL. I will use xsave/xrestore directly once availability of those instructions and AVX support is validated with RtlGetEnabledExtendedFeatures() (for some reasons, I did not find declaration of the API and had to write a prototype of it manually).

@rianquinn, you may be able to use intrinsics functions for xsave/xrestore etc depeding on compiler's support. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_xsave

At least MSVC supports them well enough for my porpose. https://msdn.microsoft.com/en-us/library/hh977022.aspx

rianquinn commented 8 years ago

Sadly in my case, the entry point for both the Kernel -> C++ and the entry point for the exit handler need to start with a simple ASM block as I have to control each register access very carefully, so I have a bit of hand written ASM to pull that off. But yeah... One thing that I will likely work on at some point is porting some of the intrinsics code that we do have to the Intel calls, as GCC does support these, and reducing the amount of ASM code we have would be a good thing. The code I linked above not only has to run xsave, but it also has to do MS64 ABI to System V ABI conversions (not in there yet), so some of that ASM work that you see is preperation for that code.

tandasat commented 8 years ago

I have not faced an issue and found documents w.r.t zero-ed out memory on XRSTOR either. Instead, I had to debug a bit to noticed that XSAVE/XRSTOR raises an exception when CR0.TS[bit 3] is 1. I only saw this situation on x86 Windows for now. As workaround, code like below worked.

  Cr0 cr0 = { __readcr0() };
  cr0.fields.ts = 0;
  __writecr0(cr0.all);
  _xsave(stack->processor_data->xsave_area, stack->processor_data->xsave_inst_mask);

It is kind of ugly though.

rianquinn commented 8 years ago

In our case, we only support 64bit, so we don't have to worry about TS as it has no purpose on x86_64 since task switching is not available. That being said, Windows doesn't use x86 tasks, so you should be ok disabling it.

ionescu007 commented 8 years ago

Actually Windows (like Linux and OS X) uses tasks for Double Faults and NMIs, the latter which are recoverable.

Additionally, video mode switching is implemented, in some cases, with the help of INT 10 / VIDEO ROM code running in V8086 mode, with a special TSS that has an IOPM to allow direct I/O access. All of the above may react badly to turning of the TS bit.

Satoshi: IRQL requirements don't mean much when running in VMM context. For example, even if you enter at PASSIVE_LEVEL, you are technically at HIGH_LEVEL for all intents and purposes, as interrupts are disabled. So by that analogy, you should be calling almost zero Windows APIs, except if they are documented to run at HIGH_LEVEL.

Best regards, Alex Ionescu

On Mon, Jul 4, 2016 at 11:16 AM, Rian Quinn notifications@github.com wrote:

In our case, we only support 64bit, so we don't have to worry about TS as it has no purpose on x86_64 since task switching is not available. That being said, Windows doesn't use x86 tasks, so you should be ok disabling it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tandasat/HyperPlatform/issues/3#issuecomment-230336589, or mute the thread https://github.com/notifications/unsubscribe/AFxIeDHfyJGM0aI6IzjxLQJtLdc3WP4wks5qSU33gaJpZM4HtbQh .

tandasat commented 8 years ago

Thank you for the note, Alex.

I do not think I understand why you cannot call API when interrupts are disabled. The eflags.IF is cleared when VM-exit happened but the IF only affects hardware interrupt, and exceptions can still occur. I tested that even page-in was processed fine in the VMM-context if IRQL is PASSIVE_LEVEL. To my knowledge, the IRQL requirement mostly stems from if page-in can be processed--in other words, if the process can enter wait state--, and interrupts are irrelevant.

I bet that I am missing something and appreciate if you could explain a bit more about why disabling interrupts is technically the same as being IRQL==HIGH_LEVEL.

tandasat commented 8 years ago

A test program attached. ConsoleApplication1.zip

ionescu007 commented 8 years ago

Hi Satoshi,

Wow -- I cannot believe you were crazy enough to try page-in from VMM context! Let me explain to you why VMM context == HIGH_LEVEL :-)

1) Is it safe for you to be context switched by the OS while in the middle of VMM mode? Of course not... So you are at least at DISPATCH_LEVEL. Is it safe for you to "wait" on an object while at VMM mode? Of course not -- you would be context switched to another thread/idle thread which would now be running as VMM Host!!!

2) Is it safe/OK for you to receive DPCs while in the middle of VMM mode? Again, of course not. Another reason why you are at least at DISPATCH_LEVEL. Could you receive a DPC, even if you wanted to? Nope -- receiving a DPC requires an interrupt, and IF is off, so Local APIC will never deliver it

3) Will you receive any Device Interrupts? Nope, because EFLAGS IF is off. Would you want to be interrupted in the middle of VMM mode? Also nope. So you are at least at MAX_DIRQL.

4) Will you receive the clock interrupt? Nope (also why you hit a CLOCK WATCHDOG BSOD sometimes)... So you are at least at CLOCK_LEVEL.

5) Will you receive IPIs? Nope, because IF is off, so LAPIC will never send them. You also probably don't want to be running IPI while inside VMM host... So you are at least at IPI_LEVEL.

6) Technically because you are not in the middle of handling an IPI, but rather you've disabled interrupts completely, you are at IPI_LEVEL + 1, aka HIGH_LEVEL.

In other words, if you call, for example, ExAllocatePoolWithTag, and this is PAGED POOL, you can get unlucky and this will require page-in which requires blocking your thread, and now, some other thread will run in VMM host mode... Sure, you can get lucky and control will come back to you, but this is insane... If you request NON PAGED POOL, it will "appear to work"... And then in one situation, a TLB flush will be required, which sends an IPI... Which can't be delivered... And so it will hang. Etc., etc...

Hope this makes sense.

Best regards, Alex Ionescu

On Mon, Jul 4, 2016 at 9:21 PM, Satoshi Tanda notifications@github.com wrote:

Thank you for the note, Alex.

I do not think I understand why you cannot call API when interrupts are disabled. The eflags.IF is cleared when VM-exit happened but the IF only affects hardware interrupt, and exceptions can still occur. I tested that even page-in was processed fine in the VMM-context if IRQL is PASSIVE_LEVEL. To my knowledge, the IRQL requirement mostly stems from if page-in can be processed--in other words, if the process can enter wait state--, and interrupts are irrelevant.

I bet that I am missing something and appreciate if you could explain a bit more about why disabling interrupts is technically the same as being IRQL==HIGH_LEVEL.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tandasat/HyperPlatform/issues/3#issuecomment-230384767, or mute the thread https://github.com/notifications/unsubscribe/AFxIeLDFbUDm4AYlTbJIfPnugLNrwIxTks5qSdu6gaJpZM4HtbQh .

tandasat commented 8 years ago

Wow, thank you for the clear explanation. I understand what you meant now.

I knew that VMM context should not be context switched and that it did not receive the clock interrupt and IPI, but was not able to think of a connection between IPI and interrupt. It is obvious that IPI relies on a hardware interrupt. And if IPI did not work, TLB flush would not work neither.

While I started to write an initial version of VMM, I noticed that calling API from VMM-context made the VMM noticeably unstable or caused funny symptoms like this even if I obeyed requirements of IRQL. So I decided to avoid using them as much as possible, but I was not able to explain well how it could cause trouble. The scenario of ExAllocatePoolWithTag + NonPagedPool is a such good example.

By the way, the test for paging-in was needed to make my question clearer to ask you for explanation. Besides, such experiments are always fun :)

rianquinn commented 8 years ago

By the way, the test for paging-in was needed to make my question clearer to ask you for explanation. Besides, such experiments are always fun :)

Second that!!!

ionescu007 commented 8 years ago

Oh yeah, I meant "good crazy", not "stupid crazy" ;-)

Best regards, Alex Ionescu

On Tue, Jul 5, 2016 at 9:57 AM, Rian Quinn notifications@github.com wrote:

By the way, the test for paging-in was needed to make my question clearer to ask you for explanation. Besides, such experiments are always fun :)

Second that!!!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tandasat/HyperPlatform/issues/3#issuecomment-230537288, or mute the thread https://github.com/notifications/unsubscribe/AFxIeLc4xH5PjiBEmg0Ig-D_pQmujB5hks5qSozkgaJpZM4HtbQh .

ionescu007 commented 8 years ago

By the way @tandasat there is new Windows 10 API: KeGetEffectiveIrql() which will return HIGH_LEVEL at all times while in VMM mode ;-) Proving my point :)

tandasat commented 8 years ago

That is beautiful :) I should probably document that fact.

tandasat / HyperPlatform