pop-os / nvidia-graphics-drivers

Pop!_OS NVIDIA Graphics Drivers

Text & Icon corruption after resume from suspend on Nvidia #133

Open Rick1029 opened 2 years ago

Rick1029 commented 2 years ago

After resuming from suspend on Nvidia graphics, text and icons are badly corrupted about 60% of the time, and a shell restart is required to fix it. This seems to happen independent of any other factor, and it has only ever occurred for me on Nvidia graphics. Only GNOME and GNOME extensions are affected; all apps I've seen work normally. See the example photos below.

Pop 21.10
Gnome 40.4.0
X11 (have not tried Wayland)
Cosmic version 1
nvidia-driver-495: Installed: 495.46-0ubuntu0.21.10.1, Candidate: 495.46-0ubuntu0.21.10.1

Login screen after resume. image

Desktop in workspace overview. image

Photo of arcmenu. The Cosmic launchers have the same corruption but I can't screenshot them as they disappear when I click to take the screenshot. image

Portion of the status bar. image

Some elements in the calendar are fine. image

wagnerck commented 2 years ago

I've been documenting this problem after encountering it via tickets, customer calls, etc. It's also happening on my personal work laptop (an oryp6). It can occur both after suspend+resume and when switching users, and it appears to occur with both the 470 and 510 Nvidia binary drivers. It seems to only happen in full Nvidia graphics mode, and not in Hybrid.

The workaround when it happens is to press Alt + F2 to bring up the GNOME command prompt, press "r", and then press Enter. This restarts the GNOME shell but leaves running applications intact.

I've been tagging related tickets with the nvidia-gnome-suspend tag, and this includes the following tickets: 50264, 56831, 58336 (all oryp8), 56571 (galp5), 58582 (kudu6), as well as one person running Pop!OS on third-party hardware who reported the issue via the chat system (https://chat.pop-os.org/pop-os/pl/3hzx5iufsbg9pmi77o1c1iaqyw). There's definitely more but those are the ones I've got saved.

This is not a problem limited to Pop!OS either. The same problem is documented in other tracking systems and forums, and here's a small number of the many threads and reports:

The system76-power daemon does appear to be doing the right thing, at least according to Nvidia's power management documents: http://us.download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html

Per the source at (https://github.com/pop-os/system76-power/blob/master/src/graphics.rs) we're using NVreg_PreserveVideoMemoryAllocations=1 on the nvidia kernel module on S3 systems (like my oryp6), and we're using NVreg_EnableS0ixPowerManagement=1 instead on S0iX systems (like the oryp8 and kudu6, I believe). We're also using NVreg_DynamicPowerManagement=0x02 in Hybrid and Compute modes, which allows us to power down the discrete GPU when it's not in use, but this is not enabled in the Nvidia graphics mode where the corruption problem occurs, so it shouldn't affect this problem.
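
One way to confirm which of these options the loaded module actually picked up is to read them back from the driver. The sketch below is not taken from system76-power; it assumes the Nvidia driver exposes its settings in /proc/driver/nvidia/params as "Name: value" lines, with the NVreg_ prefix dropped. The function name nvidia_param is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the value of one Nvidia module parameter.
# Assumption: the params file contains lines such as
#   PreserveVideoMemoryAllocations: 1
# nvidia_param is a hypothetical helper, not part of any Nvidia or Pop!_OS tool.
nvidia_param() {
    local name="$1"
    local params_file="${2:-/proc/driver/nvidia/params}"
    # Split each line on ": " and print the value for the matching key.
    awk -F': ' -v key="$name" '$1 == key { print $2 }' "$params_file"
}

# Only query the real file when the Nvidia driver is loaded.
if [ -r /proc/driver/nvidia/params ]; then
    echo "PreserveVideoMemoryAllocations: $(nvidia_param PreserveVideoMemoryAllocations)"
    echo "EnableS0ixPowerManagement: $(nvidia_param EnableS0ixPowerManagement)"
fi
```

The second argument lets the function read a saved copy of the file, which can help when comparing values before and after changing graphics modes.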

While testing in Nvidia graphics mode, I commented out the options nvidia NVreg_PreserveVideoMemoryAllocations=1 line in /etc/modprobe.d/system76-power.conf, rebuilt the boot image, and restarted, and for several days now I have not experienced the corruption problem after suspending or switching users. This seems to contradict the documentation from Nvidia, where that module option should prevent the problem instead.

I do not have a system with S0iX support to test with, so I don't know if disabling NVreg_EnableS0ixPowerManagement would do something similar on that equipment. The problem may actually have a different underlying cause in those cases.

Some additional details:

[edit] The kudu6 may be an S3 suspend laptop, but I don't have one to test with. The command cat /sys/power/mem_sleep will show if that's the case; if it says [s2idle] shallow it's an S0iX system, and if it says s2idle [deep] then it's an S3 system.
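
The check described above can be sketched as a small script. It only looks at which mode the kernel has marked with square brackets and maps it the way this thread does ([deep] as S3, [s2idle] as S0iX). The function name classify_mem_sleep is made up for illustration, and an active s2idle mode on its own does not prove the hardware reaches S0iX residency.

```shell
#!/usr/bin/env bash
# Classify the active suspend mode from the contents of /sys/power/mem_sleep.
# The kernel marks the active mode with square brackets, e.g. "s2idle [deep]".
# classify_mem_sleep is a hypothetical helper name.
classify_mem_sleep() {
    local contents="$1"
    case "$contents" in
        *"[deep]"*)   echo "S3" ;;
        *"[s2idle]"*) echo "S0iX" ;;
        *)            echo "unknown" ;;
    esac
}

# Only read the real file when it exists (it may be absent in containers).
if [ -r /sys/power/mem_sleep ]; then
    classify_mem_sleep "$(cat /sys/power/mem_sleep)"
fi
```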

If the kudu6 is an S3 system then the fix that works on my oryp6 may work on the Kudu as well. Note that changing graphics modes will rewrite the /etc/modprobe.d/system76-power.conf file, so that line will have to be commented out again and the boot image rebuilt with update-initramfs.
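
Since the line has to be commented out again after every graphics mode change, the edit can be scripted. This is a minimal sketch, assuming GNU sed and assuming the option sits on its own "options nvidia" line as described above; the function name disable_preserve_option is made up for illustration.

```shell
#!/usr/bin/env bash
# Comment out the PreserveVideoMemoryAllocations option in a modprobe config file.
# Assumption: the option is on its own "options nvidia ..." line.
# disable_preserve_option is a hypothetical helper name.
disable_preserve_option() {
    local conf="$1"
    # Prefix the matching line with '#'. Lines that are already commented
    # do not match the pattern, so running this twice is harmless.
    sed -i -E 's/^([[:space:]]*options[[:space:]]+nvidia[[:space:]]+NVreg_PreserveVideoMemoryAllocations=1.*)$/#\1/' "$conf"
}
```

On a real system this would be run with root privileges against /etc/modprobe.d/system76-power.conf, followed by rebuilding the boot image and rebooting.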

[more edit] Per customer testing, the kudu6 is an S0iX system like the oryp8 is.

peterHoburg commented 2 years ago

I own a Kudu6, and I am having this issue. 58582 is my ticket, and I thought I could chime in.

When running cat /sys/power/mem_sleep I get [s2idle]. Yes, that is the entire output.

I ended up following the intel guide found here to confirm that the kudu6 IS S0ix.

I would be happy to disable the NVreg_EnableS0ixPowerManagement flag and rebuild the boot image. Unfortunately, I don't know how to rebuild the image, so some instruction would be really helpful, or just point me to the makefile/docs.

wagnerck commented 2 years ago

@peterHoburg I've responded in the ticket conversation.

For anyone else reading this issue, rebuilding the boot image is just a matter of running sudo update-initramfs -c -k all after making the manual changes to the config files, and then rebooting.

peterHoburg commented 2 years ago

This proposed fix does not work on the kudu6.

It was easy to reboot and undo these changes.

wagnerck commented 2 years ago

@peterHoburg Thank you for sharing that information. We'll continue to work on this via the service ticket as well.

wagnerck commented 2 years ago

New Nvidia 510 series driver is out: https://www.nvidia.com/download/driverResults.aspx/187162/en-us

From the v510.60.02 release notes:

Fixed a regression that could cause OpenGL applications to hang or render incorrectly after suspend/resume cycles or VT-switches

This is not yet in the packaging system so any immediate testing will require using the Nvidia installer, which we don't recommend to end users.

wagnerck commented 2 years ago

Initial testing on an oryp6 with the unpackaged Nvidia drivers has v510.60.02 performing identically to the previous revision v510.54. The same fix to system76-power.conf works as well. These new drivers may have a fix for S0iX systems but we'll need to wait until we have proper packaging to test safely.

wagnerck commented 2 years ago

Additional testing with packaged v510.60.02 drivers from a staging repo (https://github.com/pop-os/nvidia-graphics-drivers/tree/nvidia-510.60.02) shows exactly the same problem on an oryp6. Going to look into having someone test on an S0iX system using the same staging repo.

wagnerck commented 2 years ago

Adding both Pop!OS testing repositories, linux-5.17.1 and nvidia-510.60.02, appears to have resolved the issue on an oryp6 with S3 suspend. With options nvidia NVreg_PreserveVideoMemoryAllocations=1 enabled, the system suspends and resumes without the GNOME corruption. I'm doing additional testing on this system, and will be looking into additional testing on an S0iX suspend system.

wagnerck commented 2 years ago

Additional testing shows that the GNOME corruption still occurs, but only after a significantly longer suspend time (maybe 30m, possibly less). Back to the drawing board...

wagnerck commented 2 years ago

Initial testing shows that the corruption after suspend+resume still occurs with the Pop!OS v22.04 beta and driver 510.60.02. Continuing testing.

n3m0-22 commented 2 years ago

On the gaze15 with 22.04 this only affects encrypted installs; a non-encrypted install works fine.

wagnerck commented 2 years ago

@n3m0-22 That is extremely interesting to hear. That's going to make troubleshooting it a little harder because we can't really un-encrypt a drive to try to replicate it, but it gives us some additional things to look at.

SUPERCILEX commented 2 years ago

FYI I'm on an XPS unencrypted and the issue still occurs.

thor314 commented 2 years ago

Running a System76 Oryx pro w/ PopOS in nvidia mode also produces this issue on resume from suspend.

wagnerck commented 2 years ago

Initial testing suggests that the v515 Nvidia drivers may resolve this issue, at least on an S3 suspend mode laptop like an oryp6.

We do not recommend downloading the drivers directly from the Nvidia site, as they are not packaged specifically for the OS, will be harder to uninstall if something goes wrong, and may have other unintended effects. We have a testing repository with pre-release packaging, and folks who want to test this for themselves can do so with the following instructions. Please only do so at your own risk on non-critical systems, as these drivers have not been extensively tested.

Run these commands in the terminal, one at a time:

cd ~/Downloads
git clone https://github.com/pop-os/pop
sudo ./pop/scripts/apt add nvidia-515.48.07
sudo apt update
sudo apt purge ~nnvidia
sudo apt install nvidia-driver-515

If the system is a hybrid graphics laptop, run the following as well:

system76-power graphics nvidia

Either way, then reboot the system.

Eventually, the testing repo will be deleted, which will result in apt and/or the Pop!Shop complaining that it can't find it, which may prevent system updates from going through. The repository can be most easily removed by opening up the Pop!Shop, clicking on the gear in the upper right to open the "Repoman" tool, and then removing the "Pop Development Branch nvidia-515.48.07" repository. There are additional details about "Repoman" here, along with screenshots.

shkm commented 2 years ago

Thanks for this, @cwsystem76. Just wanted to report that there's no change for me on my (custom) desktop and Wayland. Seems to work for some, as reported on the Nvidia forums.

wagnerck commented 2 years ago

The nvidia-515.48.07 staging repo was removed over the weekend, possibly because it's going to be going live this week. If you added it and are now getting errors, you can remove the repo via the Pop!Shop as described previously.

wagnerck commented 2 years ago

The package nvidia-driver-515 should be available in the main repos now.

That said, it appears this problem is only partially resolved: switching users will still cause the problem to come back, although it's a bit different. Some GNOME apps (notably the Settings app and GNOME Terminal) will be all black until GNOME is reset with the Alt-F2 shortcut. Switching TTYs does not appear to trigger the problem.

rstanuwijaya commented 1 year ago

Is there a simple way to update from the nvidia-510 driver to the nvidia-515 driver, now that it is officially available?

leviport commented 1 year ago

Pop-shop > installed. You'll see an install button for it.

Apacelus commented 1 year ago

So I had the same issue as in the initial comment, on a non-System76 system: image
Driver version: 525.85.05

I added options nvidia NVreg_PreserveVideoMemoryAllocations=1 to /etc/modprobe.d/system76-power.conf, rebuilt my initramfs, rebooted and now the system wakes up from sleep without any issues.

System is clean, I did a refresh about 2 months ago.

tavinus commented 1 year ago

The fix above from @Apacelus worked for me, even though I am on X11 (instead of Wayland).
Driver version: 525.85.05

image

Just had to

sudo nano /etc/modprobe.d/system76-power.conf

Add this line to the end of the file and save/exit (Ctrl + X)

options nvidia NVreg_PreserveVideoMemoryAllocations=1

Then rebuild initramfs with

sudo update-initramfs -c -k all

Then reboot and it should be fixed.
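
The manual edit above can also be sketched as an idempotent script, so that running it again (for example after an update or a graphics mode change rewrites the file) does not add a duplicate line. The function name ensure_preserve_option is made up for illustration; the option line itself is the one given above.

```shell
#!/usr/bin/env bash
# Append the PreserveVideoMemoryAllocations option to a modprobe config file,
# but only if that exact line is not already present.
# ensure_preserve_option is a hypothetical helper name.
ensure_preserve_option() {
    local conf="$1"
    local line="options nvidia NVreg_PreserveVideoMemoryAllocations=1"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        # If the file is non-empty and lacks a trailing newline, add one first
        # so the new option does not get glued onto the last existing line.
        if [ -s "$conf" ] && [ -n "$(tail -c1 "$conf")" ]; then
            printf '\n' >> "$conf"
        fi
        printf '%s\n' "$line" >> "$conf"
    fi
}
```

On a real system this would be run with root privileges against /etc/modprobe.d/system76-power.conf, followed by the same update-initramfs and reboot steps.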

The only problem is that this mod will probably disappear in future updates (or will it not?). Edit: Just updated to 525.89.02 and the configuration remained.

No more image corruption after waking up.
Thanks heaps!
This was driving me crazy.

serkanerip commented 1 month ago

I'm having the same issue; see the screenshots. I'm going to try @Apacelus's fix, but wanted to share that the issue still remains.

OS: Pop!_OS 22.04 LTS
GNOME: 42.9
Windowing System: X11
Nvidia Driver: 550.67

Screenshot from 2024-05-28 23-05-37 Screenshot from 2024-05-28 23-06-13