jam3st / vfio-hd4600

Boot screen does not display on monitor #1

Closed zomabies closed 4 years ago

zomabies commented 4 years ago

First of all, thank you for making the patches that make this possible.

I found this repo through a comment saying that passthrough to a UEFI q35 guest is possible: https://old.reddit.com/r/VFIO/comments/ib2idf/passthrough_of_igp/g1wr5d4/ That led me to a post that shows the steps: https://bbs.archlinux.org/viewtopic.php?pid=1870700#p1870700

I also found a similar solution, but the link it provides is dead: https://forums.unraid.net/topic/61504-intel-hd-4600-passthrough-problem/?tab=comments#comment-610819

Here are my host details:

The IGP's HDMI output is plugged into the monitor.

I applied the kernel and QEMU patches and used the provided OVMF BIOS before creating the virtual machine.

In Linux, I can see the display on the monitor once the kernel driver has taken over, but I cannot see the boot screen. In Windows, the display does not show at all.

The monitor does not show the boot logo in either Linux or Windows.

More info on how I performed the steps.

Output of `lspci -xxx -s 0:2`:

```
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00: 86 80 12 04 03 00 90 00 06 00 00 03 00 00 00 00
10: 04 00 40 f7 00 00 00 00 0c 00 00 d0 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 58 14 00 d0
30: 00 00 00 00 90 00 00 00 00 00 00 00 0b 01 00 00
40: 09 00 0c 01 6d a0 04 62 d0 00 44 56 00 00 00 00
50: 11 02 00 00 39 00 00 00 00 00 00 00 01 00 20 cb
60: 00 00 02 01 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 05 d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 13 00 06 03 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 01 a4 22 00 03 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 80 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 06 00 18 e0 be c1
```

NVS regions, taken from dmesg:

```
[ 0.100103] PM: Registering ACPI NVS region [mem 0x9defb000-0x9df01fff] (28672 bytes)
[ 0.100103] PM: Registering ACPI NVS region [mem 0xc1ac9000-0xc1c0cfff] (1327104 bytes)
```

I picked the second region because `0xc1bee018` is in this range.

In vfio_iommu_type1.c.patch:

```diff
+//#error "Fix the line below"
+#define _MY_BIOS_STOLEM_MEMORY_ADDRESS_ 0xc1ac9000
+   ret = iommu_map(domain->domain, _MY_BIOS_STOLEM_MEMORY_ADDRESS_, _MY_BIOS_STOLEM_MEMORY_ADDRESS_, 0x10000000, IOMMU_READ | IOMMU_WRITE);
+   if (ret)
+       goto out_detach;
+#undef _MY_BIOS_STOLEM_MEMORY_ADDRESS_
```

I set `_MY_BIOS_STOLEM_MEMORY_ADDRESS_` to `0xc1ac9000`, which is the start of the second NVS region. Then I changed `0x8000000` (128MB) to `0x10000000` (256MB), because I found that this argument is the size parameter in the header (iommu.h) and I have 256MB of graphics memory allocated in the BIOS. (Or should this value stay unchanged regardless of how much memory is allocated?)
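The choice of the second NVS region can be reproduced from the dump itself. The sketch below is only an illustration: it assumes the `0xc1bee018` value quoted above is the little-endian dword at config-space offset `0xfc` (the last four bytes of the `f0:` row), and the helper names `read_le32` and `region_containing` are made up for this example.

```python
# Last row of the `lspci -xxx` dump above (config-space offsets 0xf0..0xff).
ROW_F0 = "00 00 00 00 00 00 00 00 00 00 06 00 18 e0 be c1"

# ACPI NVS regions from dmesg, as inclusive (start, end) pairs.
NVS_REGIONS = [
    (0x9DEFB000, 0x9DF01FFF),
    (0xC1AC9000, 0xC1C0CFFF),
]


def read_le32(row_hex, row_base, offset):
    """Read a little-endian 32-bit value at `offset` from one dump row."""
    data = bytes.fromhex(row_hex)  # whitespace between bytes is ignored
    start = offset - row_base
    return int.from_bytes(data[start:start + 4], "little")


def region_containing(addr, regions):
    """Return the first (start, end) region that contains addr, or None."""
    for start, end in regions:
        if start <= addr <= end:
            return (start, end)
    return None


# Assumption: the dword at 0xfc is the address the poster refers to.
addr = read_le32(ROW_F0, 0xF0, 0xFC)
region = region_containing(addr, NVS_REGIONS)

print("address:", hex(addr))
print("page   :", hex(addr & 0xFFFFF000), "offset:", hex(addr & 0xFFF))
print("region :", [hex(x) for x in region] if region else None)
```

The in-page offset this computes (`0x18`) lines up with the `OFFSET is 0x00000018` and `MASK is 0xfffff000` lines that the patched kernel prints later in this thread.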
Then I used the Arch Build System to build both the kernel and QEMU. After patching and installing the customised versions, I created the machine through `virt-manager`.

**dmesg log**, initial boot before creating the VM: https://hastebin.com/afamobezic.log

**libvirt xml** (Windows 8.1): https://hastebin.com/azorigolog.xml

**QEMU log** when launched: https://hastebin.com/edozisewub.log

This QEMU log shows that the patch is working:

```
address is 00000018
MMAP 34 returned for offset 90000000000 ret 0x7f244c27d000 IntelGraphicsMe
```

Further troubleshooting

I have tried both UEFI only and CSM enabled (legacy video OpROM) on the host. Neither makes any difference.

UEFI only (CSM Disabled):

The error shows up only on the first launch; subsequent launches do not show it (from journalctl):

kernel: DMAR: DRHD: handling fault status reg 3
kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault ad>
kernel: DMAR: DRHD: handling fault status reg 3
kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault ad>
kernel: DMAR: DRHD: handling fault status reg 3
kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault ad>
kernel: DMAR: DRHD: handling fault status reg 3
kernel: initializing vfio_pci_igd_init
kernel: MASK is 0xfffff000
kernel: OFFSET is 0x00000018
kernel: vfio-pci 0000:00:03.0: enabling device (0000 -> 0002)

Subsequent launch:

[ 5853.549050] initializing vfio_pci_igd_init
[ 5853.549065] MASK is 0xfffff000
[ 5853.549066] OFFSET is 0x00000018

UEFI with CSM (legacy video OpRom):

Kernel log, which does not show the error on the first VM launch:

[  255.143573] vfio-pci 0000:00:02.0: enabling device (0002 -> 0003)
[  255.246299] initializing vfio_pci_igd_init
[  255.246314] MASK is 0xfffff000
[  255.246315] OFFSET is 0x00000018
[  256.790151] virbr0: port 2(vnet0) entered learning state
[  258.838059] virbr0: port 2(vnet0) entered forwarding state
[  258.838060] virbr0: topology change detected, propagating

Is there a prerequisite before compiling and installing the patched versions, such as making BIOS changes? Or have I missed some steps?

Thanks.

jam3st commented 4 years ago

Are you using https://github.com/jam3st/vfio-hd4600/blob/master/ovmf/OVMF_HD4600.fd as your BIOS for QEMU (and have you set your PC BIOS to UEFI only)?

zomabies commented 4 years ago

Yes, I'm using https://github.com/jam3st/vfio-hd4600/blob/master/ovmf/OVMF_HD4600.fd as the BIOS and PC BIOS is UEFI only.

jam3st commented 4 years ago

Can you try with a vanilla 5.9 kernel?

zomabies commented 4 years ago

I have tried a vanilla 5.9 kernel from here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tag/?h=v5.9 (the only change is the name, to avoid a conflict with the installed version).

After booting that kernel, I ran the virtual machine and the same issue is still present: the DMAR error in dmesg and no boot screen.

So I changed the address to 0xC1AC8000 in an attempt to fix it. Unfortunately, the issue still persists.

+#define _MY_BIOS_STOLEM_MEMORY_ADDRESS_ 0xC1AC8000 //(0xC1AC9000 - start of NVS)
+   ret = iommu_map(domain->domain, _MY_BIOS_STOLEM_MEMORY_ADDRESS_, _MY_BIOS_STOLEM_MEMORY_ADDRESS_, 0x10000000 /*256MB*/,  IOMMU_READ | IOMMU_WRITE);
+   if (ret)
+       goto out_detach;
+#undef _MY_BIOS_STOLEM_MEMORY_ADDRESS_

The logs below are from a run after the address was changed.

QEMU log: https://hastebin.com/niwanojaho.log

Full dmesg log: https://hastebin.com/isuqiyazok.log

At line 978 of the full dmesg, the same issue occurs, but now the fault address is shown:

[  300.452792] DMAR: DRHD: handling fault status reg 3
[  300.452795] DMAR: [DMA Read] Request device [00:02.0] PASID ffffffff fault addr cb940000 [fault reason 06] PTE Read access is not set
[  300.453292] DMAR: DRHD: handling fault status reg 2
[  300.453293] DMAR: [DMA Read] Request device [00:02.0] PASID ffffffff fault addr cb940000 [fault reason 06] PTE Read access is not set
[  300.453294] DMAR: DRHD: handling fault status reg 2
[  300.453295] DMAR: [DMA Read] Request device [00:02.0] PASID ffffffff fault addr cb940000 [fault reason 06] PTE Read access is not set
[  300.453297] DMAR: DRHD: handling fault status reg 2
[  300.821112] initializing vfio_pci_igd_init
[  300.821128] MASK is 0xfffff000
[  300.821129] OFFSET is 0x00000018
[  302.391358] virbr0: port 2(vnet0) entered learning state
[  304.525054] virbr0: port 2(vnet0) entered forwarding state
[  304.525063] virbr0: topology change detected, propagating
[  310.762788] kauditd_printk_skb: 16 callbacks suppressed
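To compare fault addresses across runs, the relevant fields can be pulled out of lines like the ones above. This is a minimal sketch that assumes the exact message layout shown in this log (the wording of DMAR messages may differ between kernel versions); `FAULT_RE` and `parse_fault` are names invented for the example.

```python
import re

# Pattern for the DMAR fault line format seen in the log above (an assumption:
# other kernel versions may word this message differently).
FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"PASID (?P<pasid>[0-9a-f]+) fault addr (?P<addr>[0-9a-f]+) "
    r"\[fault reason (?P<reason>\d+)\] (?P<text>.*)"
)


def parse_fault(line):
    """Return the fault fields as a dict, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "op": m["op"],
        "device": m["dev"],
        "addr": int(m["addr"], 16),
        "reason": int(m["reason"]),
        "text": m["text"],
    }


LINE = (
    "[  300.452795] DMAR: [DMA Read] Request device [00:02.0] PASID ffffffff "
    "fault addr cb940000 [fault reason 06] PTE Read access is not set"
)

fault = parse_fault(LINE)
print(fault["op"], fault["device"], hex(fault["addr"]), fault["reason"])
```

Lines that journalctl truncated (ending in `fault ad>`) do not match the pattern and yield `None`, so the full dmesg output is the better input.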

I have a question about the calculation. I took the size from the graphics memory, which is 256M, and converted it to bytes in hex. If I do not change the size (from 0x8000000 in the original patch), I get lots of DMAR errors and cannot see the kernel boot log in Linux.

Is this the correct way to do it? Below is the output where I got the size.

00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06) (prog-if 00 [VGA controller])
    Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
    Flags: fast devsel, IRQ 16, IOMMU group 2
    Memory at f7400000 (64-bit, non-prefetchable) [size=4M]
    Memory at d0000000 (64-bit, prefetchable) [size=256M] <- picked this size
    I/O ports at f000 [size=64]
    Capabilities: <access denied>
    Kernel driver in use: vfio-pci
    Kernel modules: i915
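The conversion described above is plain arithmetic and can be checked quickly. The snippet is only a sketch of that arithmetic (the helper name `mib_to_hex` is made up); it says nothing about which size the patch actually needs.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_hex(mib):
    """Convert a size in MiB to a hexadecimal byte count."""
    return hex(mib * MIB)


print("256M ->", mib_to_hex(256))  # size of the prefetchable BAR shown above
print("128M ->", mib_to_hex(128))  # size used in the original patch
```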

By the way, I noticed that in your previous comment you said not to use 5.8. Is there a reason, given that your commit mentions it is for 5.8?

jam3st commented 4 years ago

The 5.7 kernels (not 5.8) needed some other things removed. I checked what I am using, and I suggest that you try this:

+#define _MY_BIOS_STOLEM_MEMORY_ADDRESS_ 0xCB000000 //(0xC1AC9000 - start of NVS)
+   ret = iommu_map(domain->domain, _MY_BIOS_STOLEM_MEMORY_ADDRESS_, _MY_BIOS_STOLEM_MEMORY_ADDRESS_, 0x8000000 /**/,  IOMMU_READ | IOMMU_WRITE);
+   if (ret)
+       goto out_detach;
+#undef _MY_BIOS_STOLEM_MEMORY_ADDRESS_

Notice that this isn't the 256MB used by the HD4600, it is some hardcoded address that device accesses for some undocumented reason. The stolen memory is allocated from pages that are correctly handled by the IOMMU (at least that is what the windows driver does).
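As a quick arithmetic check of the suggestion above, the sketch below computes the window that base `0xCB000000` with size `0x8000000` would cover and tests whether the fault address from the earlier DMAR log (`cb940000`) lies inside it. It only does address arithmetic; it cannot show whether the kernel actually installs the mapping. The function names are invented for the example.

```python
def window(base, size):
    """Return the half-open address range [start, end) of a mapping."""
    return base, base + size


def covers(base, size, addr):
    """True if addr lies inside the mapping of `size` bytes at `base`."""
    start, end = window(base, size)
    return start <= addr < end


BASE = 0xCB000000        # address from the suggested patch
SIZE = 0x8000000         # 128 MiB, as in the original patch
FAULT_ADDR = 0xCB940000  # fault address from the earlier DMAR log

start, end = window(BASE, SIZE)
print("window:", hex(start), "to", hex(end))
print("fault address inside window:", covers(BASE, SIZE, FAULT_ADDR))
```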

Also you say that you don't see the tiano core boot logo? That is a bit strange.

zomabies commented 4 years ago

I have changed the address as you suggested. After booting the updated 5.9 kernel, I noticed that no DMAR error occurs when starting the machine. I guess that error is solved?

This is the kernel log after the change: https://paste.ee/p/zIlzP (the first VM boot is on line 987, the second on line 1028).

> Also you say that you don't see the tiano core boot logo? That is a bit strange.

Yes, I did not see the tiano core boot logo on the monitor for both Linux and Windows.

On Linux, neither the logo nor the GRUB menu shows up; the display only appears once the kernel messages start. On Windows, I do not see anything after it has loaded, and I have to force a shutdown. I also tried Windows 10 to check whether the problem was with the previous installation media.

The QEMU Logs:

jam3st commented 4 years ago

I was never really sure what rombar means, but I think that with rombar=1 you are mapping the ROM into the PCI BAR for the device. Try rombar=0; I am not sure whether that will help. Can you try running without an OS and see whether the screen gets cleared slowly by the BIOS?

Are you sure you are patching qemu?

Are you using an HDMI cable? If not, then use the HDMI output.

zomabies commented 4 years ago

I have run it without an OS, using plain QEMU to start the machine. rombar=0 does not affect the outcome. The screen does not display anything (the monitor shows its "check input" message).

Full dmesg Log: https://paste.ee/p/DhThY

This is the command line used to start it. I have to use sudo to be able to pass the PCI device through.

sudo /usr/bin/qemu-system-x86_64 \
    -d trace:vfio_pci_igd_opregion_enabled \
    -L /home/username/Desktop/libvirtd/ \
    -drive if=pflash,format=raw,readonly,file=/home/username/Desktop/libvirtd/OVMF_HD4600.fd \
    -machine q35,accel=kvm,usb=off,vmport=off \
    -enable-kvm \
    -m 2048 \
    -smp 4,sockets=4,cores=1,threads=1 \
    -cpu host \
    -boot menu=on,strict=on \
    -nic none \
    -vga none \
    -device vfio-pci,host=0000:00:02.0,addr=0x2,x-vga=off,rombar=0,x-igd-opregion=on

Log:

qemu-system-x86_64: -device vfio-pci,host=0000:00:02.0,addr=0x2,x-vga=off,rombar=0,x-igd-opregion=on: IGD device 0000:00:02.0 cannot support legacy mode due to existing devices at address 1f.0
address is 00000018
MMAP 20 returned for offset 90000000000 ret 0x7f38e3d09000 IntelGraphicsMe
1564@1603038335.361301:vfio_pci_igd_opregion_enabled 0000:00:02.0

In the QEMU monitor, I listed the PCI devices:

```
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) info pci
  Bus 0, device 0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
  Bus 0, device 2, function 0:
    VGA controller: PCI device 8086:0412
      PCI subsystem 1458:d000
      IRQ 11, pin A
      BAR0: 64 bit memory at 0x810000000 [0x8103fffff].
      BAR2: 64 bit prefetchable memory at 0x800000000 [0x80fffffff].
      BAR4: I/O at 0x6040 [0x607f].
      id ""
  Bus 0, device 31, function 0:
    ISA bridge: PCI device 8086:2918
      PCI subsystem 1af4:1100
      id ""
  Bus 0, device 31, function 2:
    SATA controller: PCI device 8086:2922
      PCI subsystem 1af4:1100
      IRQ 10, pin A
      BAR4: I/O at 0x6080 [0x609f].
      BAR5: 32 bit memory at 0x90000000 [0x90000fff].
      id ""
  Bus 0, device 31, function 3:
    SMBus: PCI device 8086:2930
      PCI subsystem 1af4:1100
      IRQ 10, pin A
      BAR4: I/O at 0x6000 [0x603f].
      id ""
```

I noticed that the VGA controller has a different subsystem vendor ID: it is `1458`, while the others are `1af4`.
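One sanity check on the listing above is to compare the guest BAR sizes with the sizes the host `lspci` output reported earlier in the thread (4M and 256M). The sketch below just subtracts the inclusive range ends printed by `info pci`; the dictionary and helper name are illustrative, not part of any tool.

```python
MIB = 1024 * 1024

# Inclusive [start, end] ranges of the VGA controller's memory BARs,
# copied from the `info pci` listing above.
GUEST_BARS = {
    "BAR0": (0x810000000, 0x8103FFFFF),
    "BAR2": (0x800000000, 0x80FFFFFFF),
}


def bar_size(start, end):
    """Size in bytes of a BAR given its inclusive start and end addresses."""
    return end - start + 1


sizes_mib = {name: bar_size(*rng) // MIB for name, rng in GUEST_BARS.items()}
for name, mib in sizes_mib.items():
    print(name, mib, "MiB")
```

Matching sizes only confirm that the BARs are exposed to the guest at full size; they do not say anything about why the boot screen stays blank.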

> Are you sure you are patching qemu?

I have applied the patches, and I can verify that QEMU is patched because I see these messages when it starts:

address is 00000018
MMAP 20 returned for offset 90000000000 ret 0x7f38e3d09000 IntelGraphicsMe

> Are you using an HDMI cable? If not, then use the HDMI output.

Yes, I'm using an HDMI cable, and it is connected to a monitor.

Further info:

I added `-chardev stdio,id=debug -device isa-debugcon,iobase=0x402,chardev=debug` to the arguments to get more debug output. I am not sure whether this debug information helps.

sudo /usr/bin/qemu-system-x86_64 \
    -d trace:vfio_pci_igd_opregion_enabled \
    -L /home/username/Desktop/libvirtd/ \
    -drive if=pflash,format=raw,readonly,file=/home/username/Desktop/libvirtd/OVMF_HD4600.fd \
    -machine q35,accel=kvm,usb=off,vmport=off \
    -enable-kvm \
    -m 2048 \
    -smp 4,sockets=4,cores=1,threads=1 \
    -cpu host \
    -boot menu=on,strict=on \
    -nic none \
    -vga none \
    -chardev stdio,id=debug \
    -device isa-debugcon,iobase=0x402,chardev=debug \
    -device vfio-pci,host=0000:00:02.0,addr=0x2,x-vga=off,rombar=0,x-igd-opregion=on

The log with the added debug: https://paste.ee/p/GEOjl

jam3st commented 4 years ago

Does your BIOS boot up using UEFI or CSM?

zomabies commented 4 years ago

UEFI only

jam3st commented 4 years ago

It is getting very late here. The only thing I can think of is to give you a kernel config, but before that I scanned your dmesg and noticed this:

[ 0.154197] pci 0000:00:02.0: vgaarb: setting as boot VGA device
[ 0.154197] pci 0000:00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[ 0.154201] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 0.154202] pci 0000:00:02.0: vgaarb: no bridge control possible
[ 0.154202] pci 0000:01:00.0: vgaarb: bridge control possible
[ 0.154203] pci 0000:01:00.0: vgaarb: overriding boot device
[ 0.154203] vgaarb: loaded

Can you try disabling VGA_ARB? That means you will have no graphics at all on the console. Everything else says that it is working.

zomabies commented 4 years ago

Hi, I have disabled VGA_ARB and here are the results. In each run I started QEMU to check the VM display, but unfortunately it still displays nothing. The QEMU command line is unchanged.

UEFI only

After booting the changed kernel, it gets stuck on the GRUB loading screen. The host system seems to have booted, but there is no display. I have to force a reboot and boot the backup kernel. journalctl log: https://paste.ee/p/8ljl7

With CSM

If I enable CSM, however, I can see the display and log in to the host system. Kernel log: https://paste.ee/p/HYZhK


So I looked at the code and tried to stop the IGD from being added by vgaarb, since I cannot see the host screen with UEFI only.

diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index 5180c5687ee5..ffedadfa758e 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -658,6 +658,12 @@ static bool vga_arbiter_add_pci_device(struct pci_dev *pdev)
    struct pci_dev *bridge;
    u16 cmd;

+   /* Hacks: Patch out i915 from vga */
+   if (pdev->device == 0x0412) {
+       vgaarb_info(&pdev->dev, "Ignoring VGA device\n");
+       return false;
+   }
+
    /* Only deal with VGA class devices */
    if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
        return false;

Then I switched back to UEFI only, and I can see the host screen. The log: https://paste.ee/p/08CPX#s=0&l=401 (on line 401)

[    0.158794] pci 0000:00:02.0: vgaarb: Ignoring VGA device
[    0.158794] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.158794] pci 0000:01:00.0: vgaarb: bridge control possible
[    0.158794] pci 0000:01:00.0: vgaarb: setting as boot device
[    0.158794] vgaarb: loaded

I ran QEMU, and the output of the VM remains blank.


https://paste.ee/p/A6a7b - the kernel config used for the build (config taken from https://www.archlinux.org/packages/core/x86_64/linux/). I had to turn off VGA_SWITCHEROO as well, since VGA_ARB cannot be disabled while it is enabled.

Bios setting of the initial display (selected nvidia): https://i.imgur.com/fFPj7Mh.png

By the way, can you post your kernel parameters and the QEMU command? I want to check whether I missed something. Thanks.