intel / haxm

Intel® Hardware Accelerated Execution Manager (Intel® HAXM)
BSD 3-Clause "New" or "Revised" License

Permit MMIO exits to bypass the emulation. #164

Open cracyc opened 5 years ago

cracyc commented 5 years ago

It appears to be impossible for a program to handle an MMIO exit in user space rather than round-tripping to the kernel every time. If the registers are reloaded after HAX_EXIT_FAST_MMIO, they will likely then be corrupted in em_emulate_insn. I also tried to use HAX_EPT_PERM_NONE, but it doesn't seem to support small memory regions.

raphaelning commented 5 years ago

I guess what you want is something like the macOS Hypervisor.framework API, which allows user space to access most VMCS registers, so the QEMU HVF accelerator (which runs in user space) can fetch the MMIO instruction, emulate it, and then resume the guest from the next instruction.

The HAXM API provides a higher-level abstraction by design, which is why HAX_EXIT_FAST_MMIO isn't suitable for your purpose (it assumes most of the instruction emulation work is done in the kernel).

HAX_EXIT_PAGEFAULT (assuming the MMIO region can be protected with HAX_RAM_PERM_NONE) might be a better alternative. However, right now the GPA protection mechanism is limited to non-MMIO regions, and even if it worked on MMIO regions, you would still need to make additional "round trips" to the kernel:

  1. One for reading the vCPU register state to get the current instruction pointer (RIP), before you can fetch the MMIO instruction.
  2. One for writing the vCPU register state to make RIP point to the next instruction, after you emulate the MMIO instruction.

With the HAXM API, you can't read/write the guest RIP alone, but have to sync a larger set of vCPU registers (again that's by design), so these round trips will be very costly.

In fact, if the instruction only accesses one MMIO address (which is the most common case), I think the HAX_EXIT_FAST_MMIO approach already guarantees the minimum number of kernel/user context switches (kernel => user => kernel), with pretty low overhead.
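
To make the cost concrete, here is a rough sketch of what such a user-space handler would have to do if HAX_EXIT_PAGEFAULT ever worked on MMIO regions. This is hypothetical: HAX_VCPU_GET_REGS/HAX_VCPU_SET_REGS are the register-sync ioctls being discussed, and fetch_and_emulate_insn() is a placeholder for a user-space instruction emulator.

/* Hypothetical sketch only: HAX_EXIT_PAGEFAULT is not delivered for MMIO
 * regions today. It just shows where the two extra register-sync round
 * trips would sit in the exit dispatch. */
case HAX_EXIT_PAGEFAULT: {
    struct vcpu_state_t regs;
    DWORD bytes;
    int insn_len;

    /* Round trip 1: read the vCPU state to learn CS:RIP before decoding. */
    if (!DeviceIoControl(hVCPU, HAX_VCPU_GET_REGS, NULL, 0,
                         &regs, sizeof(regs), &bytes, NULL))
        break;

    /* Decode and emulate the faulting MMIO instruction in user space. */
    insn_len = fetch_and_emulate_insn(&regs);

    /* Round trip 2: write the state back with RIP advanced past it. */
    regs._rip += insn_len;
    if (!DeviceIoControl(hVCPU, HAX_VCPU_SET_REGS, &regs, sizeof(regs),
                         NULL, 0, &bytes, NULL))
        break;
    break;
}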

cracyc commented 5 years ago

To be clear, I'm trying to emulate VGA, which requires an MMIO region 64K in size. HAX_EXIT_FAST_MMIO requires a round trip for every VRAM read and write, and there will be thousands of them while a frame is drawn, then none for a long period. Try a VGA program that uses unchained mode, such as Doom, in QEMU with HAXM and you'll see the performance is horrid, far slower than emulation. So I was trying to do it like dosemu, which exits KVM at the first VRAM access and emulates until it appears the program is finished. As you say, HAX_EXIT_PAGEFAULT would likely be the best solution, but it appears that it has a 2MB granularity, which is far too large for the 64K VGA VRAM window and would put the entire VM address space inside a single 2MB region when running in real mode.

raphaelning commented 5 years ago

[...] so I was trying to do it like dosemu, which exits KVM at the first VRAM access and emulates until it appears the program is finished.

I see, so you don't need to sync vCPU state very often. Is "kvm" a typo? I wonder if dosemu uses KVM at all, since it predates KVM. And I'm curious whether the KVM API (KVM_EXIT_MMIO, etc.) meets your requirements, since it's similar to HAXM's.

As you say, HAX_EXIT_PAGEFAULT would likely be the best solution, but it appears that it has a 2MB granularity, which is far too large for the 64K VGA VRAM window and would put the entire VM address space inside a single 2MB region when running in real mode.

True. The 2MB granularity stems from the fact that HAXM divides each RAM block (host buffer backing guest memory) into 2MB chunks:

https://github.com/intel/haxm/blob/0d3922d8a64da41487ecab7ae19150ce838d6085/core/include/memory.h#L37-L38

Theoretically, it's possible to switch to 4KB chunks, but the last time we tried that, we ran into some stability issues. Maybe you could try to make it work?
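
For reference, the chunk size comes down to a single constant pair, roughly like this (paraphrased from the linked lines; check the header for the exact form). Switching to 4KB chunks would essentially mean lowering HAX_CHUNK_SHIFT from 21 to 12, at the cost of 512 times as many chunks per RAM block:

/* Approximate contents of the linked lines in core/include/memory.h:
 * each RAM block is carved into chunks of this size. */
#define HAX_CHUNK_SHIFT 21
#define HAX_CHUNK_SIZE  (1U << HAX_CHUNK_SHIFT)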

cracyc commented 5 years ago

Is "kvm" a typo? I wonder if dosemu uses KVM at all, since it predates KVM. And I'm curious whether KVM API (KVM_EXIT_MMIO, etc.) meets your requirements, since it's similar to HAXM.

Dosemu2 supports KVM, but the original dosemu presumably had the same problem, just with v86 mode instead. I think dosemu2 uses v86 mode in KVM for compatibility with the original dosemu, and then they use regular page faults to trap VGA access. That makes it basically impossible to support any programs which need their own paging.

Maybe you could try to make it work?

Well, I can take a stab at it, I suppose.

krytarowski commented 5 years ago

@cracyc I wanted to port/update dosemu support for NetBSD some time ago, but I faced rather legacy code that needed modernization with regard to the interfaces it used (e.g. switching from sigcontext to mcontext). We have also dropped v86 support from the NetBSD kernel, and dosemu was using it.

Has this situation changed? Is dosemu(2) switching to HAXM now? KVM restricts users to Linux, while HAXM now works on 4 major operating systems (including NetBSD). Unfortunately, I previously had to give up on my porting efforts, as they demanded too many generic improvements beyond adding compatibility code.

@raphaelning I got a report from @polprog that OpenBSD in HAXM is terribly slow... is this VGA-related?

cracyc commented 5 years ago

Not that I know of, what I'm working on is a bit different.

raphaelning commented 5 years ago

@raphaelning I got a report from @polprog that OpenBSD in HAXM is terribly slow... is this VGA-related?

I have no idea. Other desktop OSes seem to run smoothly. Since VGA rendering is probably done with MMIO, you may be able to identify the bottleneck by observing exits to QEMU of type HAX_EXIT_FAST_MMIO.
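
If you can rebuild QEMU, a crude way to check is to count those exits in the vCPU run loop, e.g. with a helper like this called from the HAX_EXIT_FAST_MMIO case (the helper and its output format are my own invention, just for illustration):

/* Illustration only: call this from the HAX_EXIT_FAST_MMIO case of the
 * vCPU run loop to see whether fast-MMIO exits dominate. */
#include <stdint.h>
#include <stdio.h>

static uint64_t mmio_exit_count;

static void count_fast_mmio_exit(void)
{
    if (++mmio_exit_count % 100000 == 0)
        fprintf(stderr, "haxm: %llu fast-MMIO exits so far\n",
                (unsigned long long)mmio_exit_count);
}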

krytarowski commented 5 years ago

@raphaelning NetBSD with the X Window System is also unusable due to slowness... I will file a dedicated PR for it in the future once all the booting issues are solved.

cracyc commented 5 years ago

If it's using basic VGA mode (4-plane/16-color), it's (at least from my experimentation) going to be very slow. It appears the SVGA Cirrus emulation in linear framebuffer mode doesn't take an MMIO exit for VRAM writes, which is far faster.

polprog commented 5 years ago

From what I have noticed, the apparent VGA slowness is actually general emulation slowness caused by extremely verbose logging. On NetBSD, it is triggered by setting the following debug level variable to 0 (that is not the default value!).

int default_hax_log_level = 0; in platforms/netbsd/hax_wrapper.c L47

krytarowski commented 5 years ago

From what I have noticed, the apparent VGA slowness is actually general emulation slowness caused by extremely verbose logging. On NetBSD, it is triggered by setting the following debug level variable to 0 (that is not the default value!).

int default_hax_log_level = 0; in platforms/netbsd/hax_wrapper.c L47

Are you sure that the performance bottlenecks disappear if we set default_hax_log_level to 0? I don't see many messages in dmesg(8) myself...

krytarowski commented 5 years ago

OK, @polprog explained off-list that adding extra logging reduces performance, but reverting to HEAD brings it back to the current state. Unfortunately, this doesn't solve our primary issue with the MMIO bottleneck.

polprog commented 5 years ago

ad "off-list explanation": Setting the log level to 0 makes it print every message, including hax_debug() calls - those printf-s are blocking so they stall the whole VM for the time they print the message, and there's several of them per VM exit, so they slow things down.

To add a little context: when I was debugging, I would change that value back and forth as I needed to see more or less info in dmesg(8), and the VM would run either slowly or quickly.

My test box is not the fastest, and with the log level set to zero (most verbose) I could literally see QEMU's BIOS print its messages line by line, as if it were a 9600 bps terminal! That also caused longer kernel load times, just as if the emulated CPU clock were a couple of times slower. This is not a bug, since implementing those debug messages in a non-blocking way would be plain over-engineering, and they are suppressed in most cases anyway.

This is proof that it's not MMIO-related; I just feel that it should be mentioned :)

krytarowski commented 5 years ago

Do you mean that your report of OpenBSD slowness was caused by the debug level and not by an MMIO bottleneck?

polprog commented 5 years ago

Yes, exactly.

leecher1337 commented 4 years ago

I'm facing the same problem: one MMIO exit for every write operation in the VGA area is incredibly slow, so in fact it's unusable. The dosemu approach of writing a whole emulator just for VGA emulation seems like a ridiculous amount of work and way too complicated.

No, it's not log-level related; the round trips ARE costly, and there are thousands of them per second!

What I have tried, and it at least improves the situation a bit, is to attach the VGA area to a page, check on every HAXM exit whether it has changed (marking already-read bytes), and write the changed values to the emulated VGA. The performance is at least somewhere near usable, but there is still the problem of detecting changes (I tried the GetWriteWatch() API, but it doesn't fire when a memory area gets modified by the HAXM guest more than once; I can provide a simple test application as proof).
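
For clarity, the comparison approach boils down to something like this sketch, run once per HAXM exit (all names are made up; vga_emulate_write() stands for whatever forwards the change to the emulated adapter):

/* Sketch of the compare-on-exit approach; all names here are made up.
 * vga_page is the host buffer mapped into the guest at 0xA0000, shadow is
 * a private copy used to detect what the guest wrote. A write that stores
 * the same value as before is invisible to this check, which is the
 * "missed write" problem mentioned later in the thread. */
#include <windows.h>

#define VGA_WINDOW_SIZE 0x10000

extern void vga_emulate_write(DWORD addr, BYTE val);   /* placeholder */

static BYTE shadow[VGA_WINDOW_SIZE];

static void vga_sync_on_exit(const BYTE *vga_page)
{
    DWORD i;

    for (i = 0; i < VGA_WINDOW_SIZE; i++) {
        if (vga_page[i] != shadow[i]) {
            shadow[i] = vga_page[i];
            vga_emulate_write(0xA0000 + i, vga_page[i]);
        }
    }
}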

KVM has a feature called "coalesced MMIO", which collects MMIO writes into a ring buffer that can be read and flushed by user space. Still, I'm not convinced by this concept, as it still has one exit per write, which is still costly when there are thousands of them per second.
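
For reference, the user-space side of KVM's coalesced MMIO works roughly as follows (structure and ioctl names from the KVM UAPI as far as I remember them, so double-check against linux/kvm.h; handle_mmio_write() is a placeholder): the kernel appends each write to a ring in a page shared with user space, and user space drains the ring on its next exit.

/* Rough sketch of the user-space side of KVM coalesced MMIO (see
 * linux/kvm.h for the authoritative definitions). The ring lives in a
 * page mapped after the kvm_run area of the vCPU. */
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

extern void handle_mmio_write(uint64_t gpa, const void *data, unsigned len);   /* placeholder */

static void register_vga_zone(int vm_fd)
{
    struct kvm_coalesced_mmio_zone zone = {
        .addr = 0xA0000,    /* VGA window */
        .size = 0x10000,
    };
    ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);
}

static void drain_coalesced_ring(struct kvm_coalesced_mmio_ring *ring)
{
    /* The ring occupies one page; compute how many entries fit in it. */
    const uint32_t max = (4096 - sizeof(*ring)) / sizeof(struct kvm_coalesced_mmio);

    while (ring->first != ring->last) {
        struct kvm_coalesced_mmio *e = &ring->coalesced_mmio[ring->first];
        handle_mmio_write(e->phys_addr, e->data, e->len);
        ring->first = (ring->first + 1) % max;
    }
}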

Now I had the following idea; maybe someone can tell me if it would work and whether it can be implemented in HAXM that way (a sketch of step 3 follows the list):

  1. VirtualAlloc a PAGE_READWRITE memory page in user space (host).
  2. Attach the page to the VGA area at segment A000, BUT mark it WRITE-ONLY in the guest EPT (currently I cannot tell HAXM to mark a page write-only, why?). Add that page to a linked list of pages that need to be checked in step 3.
  3. On every VM exit, check the dirty bit of the EPT pages attached to VGA. If one is marked dirty, make an exit to user space and notify the host that it should check the VGA area for modifications and sync it to the emulated adapter. After the call to user space, clear the dirty bit in the page table.
  4. If the VM attempts a read operation (fortunately, this is less likely than a write), an exception would be generated which can in turn be translated into an HAX_EXIT_FAST_MMIO call, so reads work like a normal MMIO area. There isn't much we can do here; reads cannot be coalesced like the writes.
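
Step 3 would boil down to something like the following sketch (purely conceptual; every name here is hypothetical, HAXM exposes no such interface today, and it assumes EPT accessed/dirty flags are enabled so that bit 9 of a leaf entry is the dirty bit per the Intel SDM):

/* Purely conceptual sketch of step 3; every name here is hypothetical. */
#include <stdint.h>

#define EPT_DIRTY (1ULL << 9)

static int vga_pages_dirty(uint64_t *ept_leaves, int count)
{
    int i, dirty = 0;

    for (i = 0; i < count; i++) {
        if (ept_leaves[i] & EPT_DIRTY) {
            ept_leaves[i] &= ~EPT_DIRTY;    /* re-arm for the next check */
            dirty = 1;
        }
    }
    /* The caller would INVEPT the affected mappings and, if anything was
     * dirty, exit to user space so the host can sync the VGA window. */
    return dirty;
}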

Would that work, and is someone here knowledgeable enough to enhance HAXM this way?

leecher1337 commented 4 years ago

I tried implementing coalesced MMIO writes in HAXM for evaluation. I got a performance gain of up to 50% (depending, of course, on the tested application), but as expected, the video performance is still unacceptable. Should anybody be interested in the experimental coalesced MMIO implementation, please tell me and I will commit it to my repository accordingly. Coalesced MMIO can be turned on via a flag in my patch, so it shouldn't break compatibility.

The comparison approach works at least somewhat acceptably (even though you have a 1/255 chance of missing a write), but it fails to detect reads, as the Intel SDM says you can't have write-only pages in EPT, d'oh :-(

frymezim commented 4 years ago

Maybe @AlexAltea can help you with this.

AlexAltea commented 4 years ago

@leecher1337 Sorry for the late reply, I missed these GitHub notifications until @frymezim pinged me.

I'd be happy to look over your patch. Coalesced MMIO sounds like an awesome feature, and I'd be happy to add support for it in QEMU's HAXM backend.

leecher1337 commented 4 years ago

Just for comparison purposes, I also tried running QEMU with HAXM (and installing MS-DOS into it), and it suffers from exactly the same performance problem. I also read about Intel vGPU, but the documentation states: "Legacy VGA is not supported in the vGPU device model", so this is also a dead end. So I'm running out of ideas now.

I will create a branch with the coalesced MMIO support for your review. Shall I create a pull request for it too, or just drop a link here for review?

AlexAltea commented 4 years ago

@leecher1337 Feel free to drop a link for now! Whether you want to submit a pull request is up to you (and up to the Intel team, if that could be merged).

leecher1337 commented 4 years ago

Hi, here is a first draft. I tried to port it to the new HAXM release, as my current working copy is 6 months old, so I hope I merged the code correctly. Feel free to comment on it:

https://github.com/leecher1337/haxm/commit/5d6a603438f9c1b571794b3b6a93b3fe20191bb7
https://github.com/leecher1337/haxm/commit/3c68566d09498534076dc0783c5cc69bd7f93255

/ We have the following assumption on coalesced writes:

Here is an example of the exit handling from the NTVDMx64 HAXM implementation:

        case HAX_EXIT_FAST_MMIO:
            hax_handle_fastmmio((struct hax_fastmmio *)iobuf);
            break;
        case HAX_EXIT_COALESCED_MMIO:
            hax_handle_fastmmio_coalesced(*(struct hax_coalesced_mmio **)iobuf);
            break;
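
/* Drain every MMIO access the kernel coalesced since the last exit. */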
void hax_handle_fastmmio_coalesced(struct hax_coalesced_mmio *coal)
{
    uint32_t i;

    for (i=0; i<coal->size; i++)
        hax_handle_fastmmio_op(&coal->mmio[i]);
    coal->size = 0;
}

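/* Perform a single MMIO access against NTVDM's simulated address space
   (the sas_* helpers), for both coalesced and ordinary fast-MMIO exits. */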
void hax_handle_fastmmio_op(struct hax_fastmmio *hft)
{
    switch (hft->direction)
    {
    case 0: /* Read */
        switch (hft->size)
        {
        case 1: *((PBYTE)&hft->value) = sas_PR8((DWORD)hft->gpa); break;
        case 2: *((PWORD)&hft->value) = sas_PR16((DWORD)hft->gpa); break;
        case 4: *((PDWORD)&hft->value) = sas_PR32((DWORD)hft->gpa); break;
        default: haxmvm_panic("Seems hax_handle_fastmmio also gets size %d", hft->size);
        }
        break;
    case 1: /* Write */
        switch (hft->size)
        {
        case 1: sas_PW8((DWORD)hft->gpa, (BYTE)hft->value); break;
        case 2: sas_PW16((DWORD)hft->gpa, (WORD)hft->value); break;
        case 4: sas_PW32((DWORD)hft->gpa, (DWORD)hft->value); break;
        default: haxmvm_panic("Seems hax_handle_fastmmio also writes size %d", hft->size);
        }
        break;
    case 2: /* gpa -> gpa2 memcpy */
        sas_PRWS((DWORD)hft->gpa, (DWORD)hft->gpa2, hft->size);
        break;
    }
}

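/* Handle one fast-MMIO exit. REP string instructions (F3 A4/A5/AA/AB) that
   target the MMIO area are emulated as a whole block, updating the guest
   registers afterwards, so they don't cost one exit per element. */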
void hax_handle_fastmmio(struct hax_fastmmio *hft)
{
    UCHAR *pCmd = ((UCHAR *)Sim32GetVDMPointer(
                   (state._cs.selector << 16) | state._eip,
                   1, ISPESET));

    if (*pCmd == 0xF3)
    {
        BOOL bHandled = FALSE;
        DWORD bytes;

        switch (hft->direction)
        {
        case 0: /* Read */
            switch (pCmd[1])
            {
            case 0xA4:  /* Move (E)CX bytes from DS:[(E)SI] to ES:[(E)DI].*/
                if (getDF()) break; // Not implemented yet
                sas_PRWS((DWORD)hft->gpa, RMSEGOFFTOLIN(getES(), getEDI()), getECX());
                setEDI(getEDI() + getECX());
                setESI(getESI() + getECX());
                setECX(0);
                bHandled = TRUE;
                break;

            case 0xA5:  /* Move (E)CX words from DS:[(E)SI] to ES:[(E)DI].*/
                if (getDF()) break; // Not implemented yet
                sas_PRWS((DWORD)hft->gpa, RMSEGOFFTOLIN(getES(), getEDI()), getECX() * hft->size);
                setEDI(getEDI() + getECX() * hft->size);
                setESI(getESI() + getECX() * hft->size);
                setECX(0);
                bHandled = TRUE;
                break;
            }
            break;
        case 1: /* Write */
            switch (pCmd[1])
            {
            case 0xA4:  /* Move (E)CX bytes from DS:[(E)SI] to ES:[(E)DI].*/
            case 0xA5:  /* Move (E)CX words from DS:[(E)SI] to ES:[(E)DI].*/
                if (getDF()) break; // Not implemented yet
                sas_PRWS((DWORD)RMSEGOFFTOLIN(getDS(), getESI()), (DWORD)hft->gpa, getECX() * hft->size);
                setEDI(getEDI() + getECX() * hft->size);
                setESI(getESI() + getECX() * hft->size);
                setECX(0);
                bHandled = TRUE;
                break;
            case 0xAA:  /* Fill (E)CX bytes at ES:[(E)DI] with AL. */
                sas_fills((DWORD)hft->gpa, getAL(), getECX());
                setECX(0);
                bHandled = TRUE;
                break;
            case 0xAB:  /* Fill (E)CX words at ES:[(E)DI] with AX. */
                switch (hft->size)
                {
                case 2: 
                    sas_fillsw((DWORD)hft->gpa, getAX(), getECX());
                    setECX(0);
                    bHandled = TRUE;
                    break;
                case 4: 
                    sas_fillsdw((DWORD)hft->gpa, getEAX(), getECX());
                    setECX(0);
                    bHandled = TRUE;
                    break;
                }
                break;
            }
            break;
        }
        if (bHandled)
        {
            /* Write the updated register state back to the vCPU. */
            if (!DeviceIoControl(hVCPU, HAX_VCPU_SET_REGS, &state, sizeof(state),
                                 NULL, 0, &bytes, NULL))
                haxmvm_panic("HAX_VCPU_SET_REGS failed");
            return;
        }
    }

    hax_handle_fastmmio_op(hft);
}

I hope that explains its usage.