intel / thunderbolt-software-user-space

Thunderbolt Attached Device Needs Rescan #52

Open Queuecumber opened 6 years ago

Queuecumber commented 6 years ago

First off, I'm almost sure this isn't the best place to report this issue, but I hope someone here can point me in the right direction, so sorry in advance.

I use a Thunderbolt 3 attached NVIDIA GPU on my laptop (Dell XPS 15 9560) running Ubuntu 17.10. After resuming from sleep, the external GPU has all kinds of problems. It shows up in lspci, but if I try to use it, for example by running nvidia-smi, the card is not detected, and passing it through to a VM just gives colored noise on the monitor. After some googling I found that running

$ sudo sh -c "echo 1 > /sys/bus/pci/rescan"

after resuming from sleep fixes the issue. I can (probably) set this up to run automatically after resume, but I bet this is some sort of bug. Now the really weird thing is that this happens if the laptop has ever slept, not just when it sleeps with the GPU attached, and replugging the Thunderbolt cable doesn't help. In other words, if I cold boot the laptop undocked, do some things while still undocked, sleep the laptop, then wake it and dock, the GPU is still unusable unless I reboot or run the above command.

Anyway, any thoughts would be appreciated, and please let me know the right place to report this bug.
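
For reference, automating the rescan after resume could look roughly like the following systemd sleep hook. This is a minimal sketch, not a fix for the underlying bug: the function name rescan_on_resume and the RESCAN_PATH override are assumptions made for illustration, and the script would need to be installed as an executable file under /lib/systemd/system-sleep. systemd runs such hooks with "pre" or "post" as the first argument.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /lib/systemd/system-sleep/pci-rescan (file name is an assumption).
# systemd calls each hook with $1 = "pre" or "post" and $2 = the sleep type.

rescan_on_resume() {
    # RESCAN_PATH is an illustrative override for testing; default is the real sysfs node.
    rescan_path="${RESCAN_PATH:-/sys/bus/pci/rescan}"
    case "$1" in
        post)
            # Only act after wake-up; same effect as the manual rescan command.
            echo 1 > "$rescan_path"
            ;;
    esac
}

rescan_on_resume "$@"
```

Hooks run as root, so no sudo is needed inside the script. Doing nothing on "pre" keeps the suspend path untouched.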

ybernat commented 6 years ago

@Queuecumber

  1. The kernel version can be helpful for debugging this.
  2. It's interesting to hear that passing the exGPU through to a VM generally works :)

@westeri Do you want to take a look?

westeri commented 6 years ago

There should be no need for running rescan manually.

Can you try the following kernel patch?

https://bugzilla.kernel.org/attachment.cgi?id=273919

westeri commented 6 years ago

Before doing that, do you have a distro kernel or some custom build? In the latter case you need to have the following in your kernel .config:

CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
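
One way to verify these options on a given kernel is to grep its config file. The sketch below assumes the config is available as a plain-text file; check_hotplug_config is a hypothetical helper name, and the /boot/config-$(uname -r) path in the usage comment is a common distro convention, not something every system provides.

```shell
#!/bin/sh
# Report whether the PCI hotplug options are built in (=y) in a kernel config file.
# check_hotplug_config is a hypothetical helper name used for illustration.

check_hotplug_config() {
    config_file="$1"
    status=0
    for opt in CONFIG_HOTPLUG_PCI CONFIG_HOTPLUG_PCI_ACPI; do
        # Anchor on "=y" at end of line so CONFIG_HOTPLUG_PCI does not match the _ACPI option.
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing or not =y"
            status=1
        fi
    done
    return "$status"
}

# Example usage (path is an assumption; custom builds can point at their own .config):
#   check_hotplug_config "/boot/config-$(uname -r)"
```

The function returns non-zero if either option is absent, which makes it easy to call from a build script.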

Queuecumber commented 6 years ago

@ybernat Right, so I definitely should have told you the kernel version, my mistake. I am using 4.14.4 with the ACS override patch applied. I was actually wondering how often this use case comes up, but the short story is that as long as you have ACS (or the patch applied), PCI passthrough of Thunderbolt devices works exactly as it does with normal PCI devices. In the Windows VM I actually get very good performance for gaming.

@westeri I can add that patch to my auto build script, and I think I have those flags enabled already. I build the kernel myself with the ACS override patch (see https://queuecumber.gitlab.io/linux-acs-override/), which works as long as GitLab CI is in a good mood. It should be trivial to add that patch to my pipeline. Can you tell me what it does?

westeri commented 6 years ago

The patch adds some debugging (in case hotplug still does not work) and, in addition to that, it will also scan PCI function 0 in case there is nothing on the "ACPI slot". I found that at least some Dell systems need this in order to find the PCIe switch upstream port.

Queuecumber commented 6 years ago

Just wanted to let you guys know I haven't forgotten about this. I've been trying to fix my kernel build CI. I don't know what happened to GitLab lately, but their CI has barely been functioning for me for a few months now.

Queuecumber commented 6 years ago

@westeri I tested your patch and it didn't seem to solve the problem on its own. Here is my procedure.

First: For testing the problem, I prefer to boot my VM with the exGPU connected to it in passthrough mode. When the bug is happening, I see colored noise on the monitor for that GPU. For the below procedure, the phrase "PCIe passthrough works" means that external monitor output was normal, and the phrase "PCIe passthrough is broken" means I saw the colored noise.

Test procedure

  1. Remove my rescan hack script from /lib/systemd/system-sleep
  2. Reboot into 4.14.4
  3. Confirm PCIe passthrough works
  4. Sleep and wake
  5. Confirm PCIe passthrough is broken
  6. Reboot into 4.15.3 w/ patch
  7. Confirm PCIe passthrough works
  8. Sleep and wake
  9. Find that PCIe passthrough is broken
  10. Confirm patch was applied in kernel code (spot checking lines from the patch with lines from the source I built from)
  11. Manually rescan (echo 1 | sudo tee /sys/bus/pci/rescan)
  12. Confirm that PCIe passthrough works

Let me know how I can provide the debug info your patch outputs, in case it's helpful.

westeri commented 6 years ago

Sorry for the delay - I was on vacation.

So we first need to get the host part working. Can you reproduce the issue with my patch applied, but without any VM stuff, using just the steps to reproduce the missing PCI devices? Then send me the full dmesg and the output of 'lspci -vv' before and after it breaks.
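
Collecting the requested logs before and after the breakage could be scripted along these lines. This is only a sketch: capture_state, the label argument, and the output file names are hypothetical choices for illustration, and both commands are best run as root so that dmesg is readable and lspci does not hide capabilities.

```shell
#!/bin/sh
# capture_state is a hypothetical helper; the label is typically "before" or "after".

capture_state() {
    label="$1"
    out_dir="${2:-.}"
    mkdir -p "$out_dir"
    # Full kernel log, including any debug output added by the patch.
    dmesg > "$out_dir/dmesg-$label.txt"
    # Verbose PCI listing for comparison across the suspend/resume cycle.
    lspci -vv > "$out_dir/lspci-$label.txt"
}

# Example usage (as root; the directory is an arbitrary choice):
#   capture_state before /tmp/tbt-debug
#   ... suspend, resume, reproduce the missing devices ...
#   capture_state after /tmp/tbt-debug
#   diff -u /tmp/tbt-debug/lspci-before.txt /tmp/tbt-debug/lspci-after.txt
```

Keeping the two snapshots in separate files makes it straightforward to attach them to the report or diff them locally.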