rockowitz / ddcutil

Control monitor settings using DDC/CI and USB
http://www.ddcutil.com
GNU General Public License v2.0

Monitor resets many times on boot after new udev rule installed #428

Closed pallaswept closed 1 month ago

pallaswept commented 4 months ago

I recently upgraded my system (Tumbleweed) and upon rebooting, noticed a disconcerting number of resets from my second monitor. By this, I mean, there would be text or graphics on screen showing the boot process, and the screen would blink off and then back on again (showing the resolution change OSD as if it just changed modes) over and over rapidly, until it settled at my desktop.

I tracked the problem down to the new udev rule installed by openSUSE's package, which has just had the updates from 2.0.0 through 2.1.4 applied. In trying to find the source, it seems that it is the example file from here. (A similar rule is also present in the docs.)

As best I can tell, what I'm seeing is a monitor reset for every USB HID device attached to my system. There are fewer when I unplug a few of them. Perhaps this is related to the monitor (a Viewsonic XG2703-GS) having an onboard USB hub, with several HID devices (two mice, which appear as several separate HID devices, each) ... The monitor otherwise works with DDC, I use it to alter settings without using the buttons on the side... I'm hesitant to experiment with this, though, as the monitor can not be enjoying being 'strobed' like that. I've gone ahead and commented out the udev rule for the time being.

As instructed, I ran sudo ddcutil interrogate --verbose but ahh... are you sure you want it that verbose? It's ~5000 lines and has some rather fine detail of what I'm doing right now, so I thought I should confirm that is what you really need, before I paste a giant self-doxxing spam message ;) Am I doing it wrong?

Sorry I am a little late to report this, it's taken me a while to pin down the source of the fault.

Edit: For future travellers who might have similar issues: I stumbled across this workaround on reddit just now:

systemctl --user edit plasma-powerdevil.service

# Add this:

[Service]
Environment="POWERDEVIL_NO_DDCUTIL=1"

# Restart the service:

systemctl --user restart plasma-powerdevil.service

A search after seeing this shows me that to find this, you have to be so desperate to be reading the source code for powerdevil, so I'm reposting here, so that others might know how to disable DDC in KDE until it's a bit more stable/less dangerous to hardware.

w8jcik commented 4 months ago

For me it is obvious why the author wants so many details: to avoid asking a few dozen questions. Just attach the report instead of pasting it directly, and then its length is not an issue.

rockowitz commented 4 months ago

As @w8jcik has pointed out, the interrogate command output is voluminous because it has to handle an extreme variety of physical and software configurations. It short-circuits what otherwise could be a long back and forth of specific questions. Just attach the report or submit it in some other way that does not entail pasting it inline. Pasting it inline makes issues unreadable.

Also, please run sudo ddcutil usbenvironment --verbose and submit the output as an attachment. This explores the system's USB configuration.

Monitors that use USB for communication with their virtual control panel, and which adhere to the relevant specifications, are rare. Once you've run sudo ddcutil environment --verbose, try erasing or renaming file /usr/lib/udev/rules.d/60-ddcutil-usb.rules.

pallaswept commented 4 months ago

Hi there, and thanks for your help!

I must apologise, I did not want to seem to make a fuss, so I was trying to be subtle... But maybe I was too subtle. I do understand and agree why a detailed and long log may be needed, but this one is so detailed that it included such details as real names, banking details, etc. I still feel confident that this is not your intention.

I thought that I must be doing it wrong, and that the expectation was that I run the commands with my applications closed, so that this info is not collected from them - I tried this as soon as I was able, but that didn't work, either. I rebooted as soon as I could, to try again, but again, I find info from the logs it has collected, which is private and personal, and I really can't post it online... I'm sorry! I tried!

The logs that contain the sensitive info, only seem to go back a couple of days, I could maybe try again then? My availability is very low right now, so I may fail in that endeavour. Apologies for this, I am doing my best to help out now while I still can. Please let me know if there's any other useful info I could provide.

rockowitz commented 4 months ago

Please identify the logs that contain sensitive information, along with the lines that concern you (with the sensitive information redacted, of course).

pallaswept commented 4 months ago

/var/log/messages and journalctl each contained information from both my browser and pipewire (which contained info from my browser and a couple other apps).

Anyway, the personal stuff has rolled off those, so I'll get you those fresh logs as soon as I can shut down, which should be sometime today. Thanks for your help and patience!

pallaswept commented 4 months ago

Reading through these compiled logs, I noticed a pattern, and I've managed to isolate the cause (repeatably). This is a conflict between nvidia-settings and ddcutil.

Running the commands you've given actually breaks nvidia-settings ability to set fan speeds, until I restart, and this (after some digging) made me realise that, attempting to read the temp with nvidia-smi, and set the fan speed on the card using nvidia-settings at boot, when the udev rule is calling ddcutil, is the trigger for the behaviour I reported.

Fixing this by removing the USB udev rule has removed not only the 'new' flickering (10-20 resets) on startup, but also some other flashing (5 resets) that has been a thing for at least a year and that I had read was the fault of the nvidia driver/X11/sddm. I know better now. Now at boot I see it modeset exactly three times: once for the console, once for the DM (sddm on X11), and once for the DE (Wayland). As it should be. So I don't think this is actually new; the new udev rule just kicked it past 'this is mildly annoying' to 'this is potentially dangerous'.

I tried the suggested kernel params. Although nvidia apparently patched the bug that was the original reason for these params, so they seem to be obsolete, I tried anyway. No luck.

I'll keep trying to get a clean log to you but hopefully this helps give you something 'tangible'. Thanks again for your patience.

rockowitz commented 4 months ago

Thanks for the update. Do you know how nvidia-settings is invoked during initialization? Moving the udev rule later in execution order, i.e. by changing its name to 69-ddcutil-usb.rules, might address the problem.

I will consider not automatically installing 60-ddcutil-usb.rules, but instead leaving it for the user to install. As noted, it is relevant only in the highly unusual case of a monitor with which ddcutil can communicate using USB.

pallaswept commented 4 months ago

nvidia-settings here was invoked from a systemd service. I disabled the service to confirm the conflict in my earlier testing. I had to reboot today, so while I was doing that, I tried delaying the service, hopefully allowing the ddcutil rule to run first, but I still saw a bunch of flickering, which actually kicked in at the moment the mode is usually set for the console, courtesy of the nvidia driver's fbdev option.

It reset the monitor a dozen times or so, culminating (possibly after the 10 second delay I put on the service) in kwin (or was it plasma?) freezing up the system and having to hit the power button. The kernel responded to the power button press (I didn't have to hold the switch in), so it seems that the kernel was alive, but certainly, none of my input was working (eg couldn't switch TTY, ctrl+alt+, etc).

Related: Since a while ago (a few months), I've had a weird but seemingly cosmetic-only issue with my machine, where when I shutdown/rebooted, and I would normally see a broadcast message notifying me of the shutdown; here, the line above the message, would have some seemingly-random number of @ symbols. That problem has also disappeared with the removal of the usb rule. I initially did not consider these issues might be related but now, I do.

It seems I have various means of exposing conflicts between the nvidia driver and ddcutil. This round of tests has me looking more at the framebuffer console part of the driver (courtesy of weird characters in the console, and the timing of the flickering at boot), and as far as I'm aware, it is still considered experimental, so that checks out. I'd kinda like to try it again without that option set, but I am not game to punish my machine like that again. When you are able to repro I'm sure you'll understand my hesitance, it's got a real "oof, that can't be healthy" vibe to it.

But maybe, it might be a bug for nvidia to fix. But we both know how long that can take, so for now, I've just disabled the rule. Thinking of a long term solution, I wonder if maybe it might be more desirable to have the rule only run if there is no nvidia GPU present in the machine? Edit: Come to think of it, the interrogate command broke stuff, too, so maybe such a rough approach is not accurate enough to avoid problems.

rockowitz commented 4 months ago

Offhand, checking for an nvidia GPU is not straightforward. However, looking for /sys/module/nvidia is. chkusbmon could terminate without checking any hiddev device if the nvidia driver is loaded.
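A minimal sketch of that check, written in Python rather than ddcutil's C and not taken from the ddcutil source. The function name `nvidia_driver_loaded` and the `sysfs_root` parameter are hypothetical and exist only so the check can be pointed at a test directory. The one fact relied on is that a loaded kernel module shows up as a directory under /sys/module.

```python
import os


def nvidia_driver_loaded(sysfs_root="/sys"):
    """Return True if a directory for the nvidia kernel module exists.

    sysfs_root is a hypothetical parameter for testing; on a real
    system it stays at its default of /sys.
    """
    return os.path.isdir(os.path.join(sysfs_root, "module", "nvidia"))


if __name__ == "__main__":
    if nvidia_driver_loaded():
        print("nvidia module loaded: a chkusbmon-style check could exit early")
    else:
        print("nvidia module not loaded: hiddev probing could proceed")
```

This only detects that the module is loaded. It says nothing about whether the nvidia driver is actually driving the display in question.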

JL2210 commented 3 months ago

Similar issue here. Screen flashes multiple times on boot with Optimus turned off on my laptop. Can also be triggered on a TTY after boot using sudo udevadm trigger -s usbmisc. Not present after disabling the usb udev rule.

usbenvironment and interrogate also trigger it. USB Environment: http://0x0.st/XfwW.txt Interrogate: http://0x0.st/Xfw4.txt

Using the listed workaround stops the flashing on boot but not when running ddcutil directly.

I just stumbled upon this by chance because a variety of commands trigger the problem. Including but not limited to:

rockowitz commented 2 months ago

Udev rule /usr/lib/udev/rules.d/60-ddcutil-usb.rules is no longer installed. It can be installed by the user in those very rare cases where it is actually useful.

EDIT: corrected usb rules file name.

pallaswept commented 2 months ago

Thanks @rockowitz . You are doing an excellent impression of that random person in Nebraska

rockowitz commented 2 months ago

@pallaswept I must admit that I had to google "random person in Nebraska" - I don't believe I ever heard/read the phrase before. That's a very stylish way to give a compliment, and much appreciated.

JL2210 commented 2 months ago

The underlying issue still seems to occur for me if I don't have POWERDEVIL_NO_DDCUTIL=1 set in the service. To be honest I don't know why powerdevil is even using ddcutil since I'm on a laptop.

rockowitz commented 2 months ago

@pallaswept To confirm, you're saying that the problem continues to occur even with udev rule 60-ddcutil-usb.rules removed from all of /usr/lib/udev/rules.d, /usr/local/lib/udev/rules.d, and /etc/udev/rules.d?

Re Powerdevil using libddcutil even though you're on a laptop: to avoid it, Powerdevil would have to check beforehand that there is only a laptop display, and handle the case where an external display is connected.

pallaswept commented 2 months ago

I think you maybe meant to tag @JL2210 ?

They said:

usbenvironment and interrogate also trigger it. Using the listed workaround stops the flashing on boot but not when running ddcutil directly.

I do also notice this:

Running the commands you've given actually breaks nvidia-settings ability to set fan speeds, until I restart

Since some of kwin/powerdevil's automatic executions of ddcutil were still happening (I don't know what it does. I had all the brightness controls disabled. There are lots of reports since they started doing this, of people with monitors that have the wrong brightness set at boot or after sleep or something, so kwin still uses ddcutil for other stuff, too), and I know some executions of ddcutil are capable of a clash with the nvidia driver which breaks fan control on the card, I also disabled it in kwin, with POWERDEVIL_NO_DDCUTIL=1, just to be sure.

Edit: I just want to say, to be clear, I am not flinging any mud at ddcutil here. It's very apparent to me that the nvidia driver is at fault.

JL2210 commented 2 months ago

Yeah, I deleted all of those files by hand. I did have a file named /etc/udev/rules.d/60-ddcutil-usb.rules that was empty so that the rule in /usr/lib would never be loaded (since my package manager manages it).
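For anyone wanting to check the same thing on their own system, here is a small sketch that looks for copies of the rule file in the three directories named earlier in this thread. The function name `find_rule_copies` is made up for illustration. The masking trick described above works because udev gives same-named files in /etc/udev/rules.d priority over those under /usr, so an empty file there overrides the packaged rule.

```python
import os

# The three directories named earlier in this thread.
RULE_DIRS = (
    "/etc/udev/rules.d",
    "/usr/local/lib/udev/rules.d",
    "/usr/lib/udev/rules.d",
)


def find_rule_copies(name="60-ddcutil-usb.rules", dirs=RULE_DIRS):
    """Return (path, size_in_bytes) for every copy of the rule file found."""
    found = []
    for d in dirs:
        path = os.path.join(d, name)
        if os.path.isfile(path):
            found.append((path, os.path.getsize(path)))
    return found


if __name__ == "__main__":
    copies = find_rule_copies()
    if not copies:
        print("rule file not present in any of the listed directories")
    for path, size in copies:
        note = " (empty)" if size == 0 else ""
        print(f"{path}: {size} bytes{note}")
```

An empty copy reported under /etc/udev/rules.d alongside a non-empty one under /usr/lib corresponds to the masked state described above.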

Also of note is that the "turning off and on again repeatedly" part only happens with my computer's iGPU disabled. Not sure why

rockowitz commented 2 months ago

@JL2210 To clarify, are you saying that the "turning off and on again" problem still occurs with 60-ddcutil-usb.rules disabled? Or that when 60-ddcutil-usb.rules is enabled the problem only occurs when the iGPU is disabled? If the former then I need to look elsewhere for the source of the problem.

JL2210 commented 2 months ago

60-ddcutil-usb.rules is completely absent on my system. The problem occurred much more frequently when it was present (every time udev was triggered) but hasn't stopped.

Running commands like ddcutil chkusbmon hiddev2, ddcutil environment --verbose, or ddcutil probe in a TTY exhibits the problem.

The issue never occurs when the iGPU is enabled, rules file or not.

The suggested workaround for Powerdevil only really works on a Wayland session. In X the problem still happens.

I can make another issue for this if you'd like. It seems to be marginally different than the one described here

pallaswept commented 2 months ago

Not that I mind if you start a new issue, but...

It seems to be marginally different than the one described here

So far what you've described exactly matches my experience. The only difference I can see is that you have been able to test using an iGPU (I don't have one). I assume if I did have an iGPU I'd see the same thing you did.

rockowitz commented 2 months ago

I have a hunch (and it's only a hunch) that in searching for /dev/i2c devices that implement DDC/CI, ddcutil pokes a device that it was unable to exclude a priori (e.g. an SMBus device) and that causes the problem. (The initial part of the "poke" is an attempt to read an EDID at slave address x50, followed by a check that slave address x37 is responsive.)

One way to test this is to execute a command that skips display detection and only touches a single bus. Since it's been mentioned that the probe command can trigger the failure, try executing ddcutil probe --bus .
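To make the "read an EDID" step of the poke a little more concrete: whatever comes back from address x50 only counts as an EDID if it has the fixed 8-byte header and a valid checksum. The sketch below shows just that validation, in Python, on bytes already in memory. It does not touch any I2C bus and is not ddcutil's code; the names `looks_like_edid` and `make_fake_edid` are invented for the example.

```python
# Fixed pattern that begins every EDID base block.
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def looks_like_edid(block):
    """Check the header and checksum of a 128-byte EDID base block."""
    if len(block) < 128:
        return False
    base = block[:128]
    # The base block must start with the fixed 8-byte header.
    if base[:8] != EDID_HEADER:
        return False
    # All 128 bytes of the base block must sum to 0 modulo 256.
    return sum(base) % 256 == 0


def make_fake_edid():
    """Build a synthetic block (header, zeros, checksum) for demonstration."""
    body = bytearray(EDID_HEADER) + bytearray(119)
    body.append((-sum(body)) % 256)  # checksum byte makes the sum 0 mod 256
    return bytes(body)


if __name__ == "__main__":
    print(looks_like_edid(make_fake_edid()))  # prints True
    print(looks_like_edid(bytes(128)))        # prints False (no header)
```

A device that answers at x50 with something other than an EDID, such as a non-display device on an SMBus, would fail this kind of check, but by then it has already been read from.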

pallaswept commented 2 months ago

I gave that a shot and have discovered I can no longer replicate this fault. Which is nice, but also leaves us holding a mystery, which sucks.

Pretty much everything has changed at this end since the original report and the subsequent observations of the same bug during ddcutil [interrogate|environment]: new KDE, new KDE settings, new Qt, new nvidia driver, new kernel. In addition, the service which previously utilised nvidia-smi/nvidia-settings to read and set fans at boot is now using NVML instead (which appears to be far more polished than the CLI tools). I don't think I could practically roll it all back for testing.

Sorry @rockowitz and @JL2210 I don't know if I can be of much more help here. I'll do what I can, but... it might not be much.

My current system for reference:

kde
Operating System: openSUSE Tumbleweed 20240813
KDE Plasma Version: 6.1.4
KDE Frameworks Version: 6.5.0
Qt Version: 6.7.2
Kernel Version: 6.10.3-1-default (64-bit)
Graphics Platform: Wayland

nvidia
Driver Version: 550.100
CUDA Version: 12.4

kernel
6.10.3-1-default

No modification in plasma-powerdevil.service.d/override.conf

This in AC and battery profiles: image

JL2210 commented 2 months ago

I have a hunch (and it's only a hunch) that in searching for /dev/i2c devices that implement DDC/CI, ddcutil pokes a device that it was unable to exclude a priori (e.g. an SMBus device) and that causes the problem. (The initial part of the "poke" is an attempt to read an EDID at slave address x50, followed by a check that slave address x37 is responsive.)

One way to test this is to execute a command that skips display detection and only touches a single bus. Since it's been mentioned that the probe command can trigger the failure, try executing ddcutil probe --bus .

This command actually makes my screen flash regardless of what bus number I pass in. I can do sudo ddcutil probe --bus 120 (it doesn't exist) and the screen will still power cycle.

My steps to reproduce are:

I'm currently using strace to figure out what it's doing

JL2210 commented 2 months ago

I haven't managed to find the exact system call that causes the issue so far, but I did see this in the latest nvidia patch notes:

Fixed a bug that could cause memory corruption while handling ACPI events on some notebooks.

I'm in the process of updating so I'll test again and see if this fixes it.

JL2210 commented 2 months ago

I can't tell exactly what causes it. The crash happens sometime around when master_initializer forks, and GDB can't handle debugging multithreaded applications

Edit: narrowed it down again to check_all_video_adapters_implememt_drm

And again to probe_dri_device_using_drm_api, on /dev/dri/card1, which is the disabled Intel iGPU.

Seems that the line is util/drm_common.c:189 for some reason:

close(fd); // because O_CLOEXEC not recognized

JL2210 commented 2 months ago

So now I can reproduce this by doing:

echo -n '' | sudo tee /dev/dri/card1

I'm not sure if this is a ddcutil bug anymore
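The one-liner above boils down to opening the device node and closing it again without writing anything. A sketch that does the same with explicit calls, and with an optional pause between open and close, might help show which of the two steps lines up with the screen blanking. This is only an illustration: the function name `open_then_close` and its `hold_seconds` parameter are hypothetical, and nothing here is ddcutil code. Running it against a real /dev/dri/cardN needs the same privileges as the tee command and can be expected to trigger the same blanking.

```python
import os
import sys
import time


def open_then_close(path, hold_seconds=0.0, flags=os.O_RDWR | os.O_CLOEXEC):
    """Open path, optionally wait, then close it.

    Returns the time in seconds the descriptor was held open, so the
    caller can compare it with the moment the screen blanks.
    """
    fd = os.open(path, flags)
    opened_at = time.monotonic()
    try:
        time.sleep(hold_seconds)
    finally:
        os.close(fd)  # always close, even if the sleep is interrupted
    return time.monotonic() - opened_at


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path and delay are examples): script.py /dev/dri/card1 5
    delay = float(sys.argv[2]) if len(sys.argv) > 2 else 0.0
    held = open_then_close(sys.argv[1], delay)
    print(f"held {sys.argv[1]} open for {held:.2f}s, now closed")
```

With a delay of a few seconds, a blank right at the start would point at open(), and a blank when the "now closed" line appears would point at close().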

rockowitz commented 2 months ago

@JL2210 Thank you for the detailed debugging.

I've pushed out a change to branch 2.1.5-dev based on your report, which points to a new and experimental segment of code that uses the drm api to determine if a device supports drm. With this change, function submaster_initializer() in ddc_common_init.c no longer calls all_displays_drm_using_drm_api() by default. It is called only if utility option --f13 is specified. Let me know if this change eliminates the crash you're seeing. Command line option --trcfunc submaster_initializer may make it clearer what is going on.

If all_displays_drm_using_drm_api() is indeed the culprit, we can drill down into that function. Unfortunately, it and its called functions are in the lowest, utility, code layer, for which tracing cannot be turned on from the command line - it requires actually editing the code and recompiling or using gdb breakpoints.

I don't see an interrogate report for your system. In this case, only the subset of that report created by ddcutil environment --verbose is needed. There's a segment in that report that explores the system using the drm api. (If the problem lies in the drm api usage, I wouldn't be surprised if the environment command crashes.) So please run the program and submit the output as an attachment. Thanks.

JL2210 commented 2 months ago

Interrogate report is in this message. In the meantime I have made a post in the Nvidia developer forums.

JL2210 commented 2 months ago

Now it doesn't crash in the middle of the command, but when exit is called and all file descriptors are closed it looks like it crashes then.

As to why it happens only when the file is closed, I have no idea. Not sure if it's possible to avoid opening it in the first place or not.

JL2210 commented 2 months ago

Here's the full backtrace at the point it opens the file:

```
#0  __libc_open64 (file=file@entry=0x55555565db60 "/dev/dri/card2", oflag=oflag@entry=524290) at ../sysdeps/unix/sysv/linux/open64.c:30
        mode =
#1  0x00005555555c365d in open (__path=0x55555565db60 "/dev/dri/card2", __oflag=524290) at /usr/include/bits/fcntl2.h:55
No locals.
#2  get_drm_connector_states_by_devname (devname=0x55555565db60 "/dev/dri/card2", verbose=verbose@entry=false, collector=collector@entry=0x55555565dbb0) at drm_connector_state.c:549
        debug = false
        __func__ = "get_drm_connector_states_by_devname"
        result = 0
        cardno = 2
        fd =
        rc =
#3  0x00005555555c3807 in drm_get_all_connector_states () at drm_connector_state.c:579
        driname =
        ndx =
        verbose = false
        devnames = 0x555555663d50
        allstates = 0x55555565dbb0
#4  0x00005555555c388a in redetect_drm_connector_states () at drm_connector_state.c:602
No locals.
#5  0x0000555555582061 in submaster_initializer (parsed_cmd=parsed_cmd@entry=0x555555661ad0) at ddc_common_init.c:469
        __PRETTY_FUNCTION__ = "submaster_initializer"
        debug = false
        __func__ = "submaster_initializer"
        final_result = 0x0
        result1 =
        result2 =
#6  0x000055555556f747 in master_initializer (parsed_cmd=0x555555661ad0) at main.c:363
        debug = false
        ok = false
        submaster_errs =
        bye =
        debug =
        ok =
        submaster_errs =
        bye =
#7  main (argc=, argv=) at main.c:924
        main_debug =
        s =
        main_rc = 1
        start_time_reported =
        explicit_syslog_level =
        syslog_opened =
        preparse_verbose = false
        skip_config =
        parsed_cmd = 0x555555661ad0
        program_start_time = 1724452913
        program_start_time_s =
        __func__ = "main"
        new_argv = 0x55555565db80
        new_argc = 4
        untokenized_cmd_prefix = 0x0
        configure_fn = 0x0
        preparsed_level =
        __PRETTY_FUNCTION__ = "main"
        errs =
        callopts =
        values =
        end_time = 140737488349136
        end_time_s =
```

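As a side note on reading the backtrace, the numeric oflag=524290 in frames #0 and #1 can be decoded back into flag names. The sketch below does that using the constants Python's os module exposes. The helper name `decode_open_flags` is made up, only a handful of common flags are checked, and the result assumes the usual Linux flag values found on x86-64.

```python
import os


def decode_open_flags(value):
    """Turn a numeric open(2) flags value into a list of flag names."""
    # The access mode lives in the low two bits (O_RDONLY is 0).
    access_mask = os.O_RDONLY | os.O_WRONLY | os.O_RDWR
    access_names = {
        os.O_RDONLY: "O_RDONLY",
        os.O_WRONLY: "O_WRONLY",
        os.O_RDWR: "O_RDWR",
    }
    names = [access_names.get(value & access_mask, "?")]
    # Check a few common single-bit flags; this list is not exhaustive.
    for name in ("O_CREAT", "O_EXCL", "O_TRUNC", "O_APPEND",
                 "O_NONBLOCK", "O_CLOEXEC"):
        if value & getattr(os, name):
            names.append(name)
    return names


if __name__ == "__main__":
    # On x86-64 Linux this prints ['O_RDWR', 'O_CLOEXEC']
    print(decode_open_flags(524290))
```

So the open() in get_drm_connector_states_by_devname() is a plain read/write open with close-on-exec set, and nothing in the flags themselves looks unusual.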
rockowitz commented 2 months ago

@JL2210 Your tracing identified a missing close() statement in function get_drm_connector_states_by_devname(). I've put a fix into branch 2.1.5-dev.

pallaswept commented 2 months ago

Legendary effort, kudos to you both

rockowitz commented 2 months ago

@JL2210 Can you confirm that the recent fix in branch 2.1.5-dev resolves the crash problem? Or does it still exist?

If the bug is resolved, what happens if you invoke ddcutil with option --f13? Does that succeed or fail?

Thank you.

JL2210 commented 2 months ago

Sorry, I've been busy trying to get GPU passthrough to work on a virtual machine, which means I needed my iGPU enabled.

I'll test it soon, but I imagine the screen will turn off on that new close you added.

JL2210 commented 2 months ago

The build fails for me at the minute:

echo "// Dummy include file to force rebuilding built_timestamp.c" >
/bin/sh: -c: line 1: syntax error near unexpected token `newline'
/bin/sh: -c: line 1: `echo "// Dummy include file to force rebuilding built_timestamp.c" >'

I think it's supposed to be $@ instead of $0?

JL2210 commented 2 months ago

Seems that the screen now turns off much earlier, as I expected. When it closes the file descriptor for my Intel card /dev/dri/card2, the screen turns off momentarily.

--f13 makes no difference.

rockowitz commented 2 months ago

@JL2210 "$0" should be correct. See the make doc. It's how the example in the cited stackoverflow post is written. However, it appears that the variable is not always set. I've rewritten the echo command in file src/base/Makefile.am to explicitly redirect to build_details.h and pushed the change to 2.1.5-dev.

You wrote: " When it closes the file descriptor for my Intel card /dev/dri/card2, the screen turns off momentarily." I assume by "Intel card" you mean the iGPU. The interrogate output you sent earlier was with the iGPU disabled, so there's no card2. When convenient, please run environment --verbose with the iGPU enabled. Perhaps that will give me some clue as to what is going on.

For now, I'm going to disable the use of libdrm, except for interrogate and environment. The code is used only for comparison with sysfs, which sometimes has unexpected contents. I'll post again when the change has been made.

JL2210 commented 2 months ago

Just to clarify, the problem only happens when the (intel) iGPU is disabled. For some reason /dev/dri/card2 is created for it even when it's been disabled in the BIOS. It also tends to switch between being called card1 and card2, ie. not persistent across reboot.

Accessing either file is fine when the iGPU is enabled; it is only when the iGPU is disabled that I have problems.
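Since the card number moves around between boots, a quick way to see which card node belongs to which driver is to follow the device/driver link under /sys/class/drm. The sketch below does that. The function name `list_drm_cards` and the `sysfs_root` parameter are hypothetical. Whether the node for a BIOS-disabled iGPU shows up with no driver bound is an assumption that has not been verified in this thread, so treat a None result as a hint, not a diagnosis.

```python
import os
import re


def list_drm_cards(sysfs_root="/sys"):
    """Map each cardN under <sysfs_root>/class/drm to its driver name.

    The value is None when no device/driver link exists for the card.
    """
    drm_dir = os.path.join(sysfs_root, "class", "drm")
    cards = {}
    if not os.path.isdir(drm_dir):
        return cards
    for entry in sorted(os.listdir(drm_dir)):
        # Skip connector entries such as card0-eDP-1; keep only cardN.
        if not re.fullmatch(r"card\d+", entry):
            continue
        driver_link = os.path.join(drm_dir, entry, "device", "driver")
        if os.path.exists(driver_link):
            cards[entry] = os.path.basename(os.path.realpath(driver_link))
        else:
            cards[entry] = None
    return cards


if __name__ == "__main__":
    for card, driver in list_drm_cards().items():
        print(f"/dev/dri/{card}: driver={driver}")
```

This only reads sysfs and never opens the /dev/dri nodes themselves, so it should not trigger the blanking described here.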

Will post interrogate results with it enabled in the BIOS soon

rockowitz commented 2 months ago

Ahh! The problem only occurs when ddcutil tries to use /dev/dri/card1 (or 2) and there's nothing "behind it". So it's really a matter of ddcutil avoiding the /dev/dri/card device for this pathological state.

As noted, the relevant code exists only to validate what's in sysfs. It is now disabled by default. Utility option --f6 reenables it for testing purposes.

One other piece of pathology I noted in the interrogate output is that depending on where I look the EDID for the laptop displays may or may not be found on /dev/i2c-2. Does ddcutil detect even report an unsupported laptop display on /dev/i2c-2?

Finally, there was a problem with built file build_details.h not being found when make was executed with option -j. That appears to have been fixed by moving its deletion from src/base/Makefile.am to src/Makefile.am.

JL2210 commented 2 months ago

Sorry for the delay, my keyboard suddenly decided to change layouts and I couldn't access a TTY to test. environment --verbose with iGPU enabled

With the iGPU enabled it does indeed detect an unsupported laptop GPU on /dev/i2c-9: det.log

Will test --f6 soon, just have to reboot

JL2210 commented 2 months ago

Also, both that make doc and Stack Overflow post clearly show @ (at) rather than 0 (zero). Might want to check your font

JL2210 commented 2 months ago

I can't reproduce anymore with or without --f6 (or --f14, I saw that was added).

I suppose it's time to put the bug with /dev/dri/cardX being created for a disabled device on the kernel mailing list now

rockowitz commented 1 month ago

@JL2210 "Might want to check your font." That puts it kindly. A case of once you "know" what you're seeing you stop really looking.

Yes, it's time to regard this as a DRM issue. Though given that it involves both Intel and Nvidia graphics, I expect it will be hard to get it addressed.

Thank you for your persistence in diagnosing this extended issue. We've actually dealt with multiple bugs along the way, and it's important that the problematic diagnostics are no longer enabled by default.

Unless something more comes up, I'll close this issue in a few days. Or feel free to close it yourself.

Regards, Sanford

pallaswept commented 1 month ago

Sorry but I need to give a quiet round of applause from the sidelines here.

Thanks heaps to both of you.

Edit: Can reopen the issue on request if needed :)