Closed pallaswept closed 1 month ago
For me it is obvious why the author wants so many details: to avoid asking a few dozen questions. Just attach the report instead of pasting it directly and then there is no issue that it is long.
As @w8jcik has pointed out, the interrogate command output is voluminous because it has to handle an extreme variety of physical and software configurations. It short-circuits what otherwise could be a long back and forth of specific questions. Just attach the report or submit it in some other way that does not entail pasting it inline. That makes issues unreadable.
Also, please run sudo ddcutil usbenvironment --verbose and submit the output as an attachment. This explores the system's USB configuration.
Monitors that use USB for communication with their virtual control panel, and which adhere to the relevant specifications, are rare. Once you've run sudo ddcutil environment --verbose, try erasing or renaming file /usr/lib/udev/rules.d/60-ddcutil-usb.rules.
Hi there, and thanks for your help!
I must apologise, I did not want to seem to make a fuss, so I was trying to be subtle... But maybe I was too subtle. I do understand and agree why a detailed and long log may be needed, but this one is so detailed that it included such details as real names, banking details, etc. I still feel confident that this is not your intention.
I thought that I must be doing it wrong, and that the expectation was that I run the commands with my applications closed, so that this info is not collected from them - I tried this as soon as I was able, but that didn't work, either. I rebooted as soon as I could, to try again, but again, I find info from the logs it has collected, which is private and personal, and I really can't post it online... I'm sorry! I tried!
The logs that contain the sensitive info, only seem to go back a couple of days, I could maybe try again then? My availability is very low right now, so I may fail in that endeavour. Apologies for this, I am doing my best to help out now while I still can. Please let me know if there's any other useful info I could provide.
Please identify the logs that contain sensitive information, along with the lines that concern you (with the sensitive information redacted, of course).
/var/log/messages and journalctl each contained information from both my browser and pipewire (which contained info from my browser and a couple other apps).
Anyway, the personal stuff has rolled off those, so I'll get you those fresh logs as soon as I can shut down, which should be sometime today. Thanks for your help and patience!
Reading through these compiled logs, I noticed a pattern, and I've managed to isolate the cause (repeatably). This is a conflict between nvidia-settings and ddcutil.
Running the commands you've given actually breaks nvidia-settings ability to set fan speeds, until I restart, and this (after some digging) made me realise that, attempting to read the temp with nvidia-smi, and set the fan speed on the card using nvidia-settings at boot, when the udev rule is calling ddcutil, is the trigger for the behaviour I reported.
Fixing this by removing the usb udev rule, has removed not only the 'new' flickering (10-20 resets) on startup, but also some other flashing (5 resets) that have been a thing for...at least a year... I read were the fault of the nvidia driver/X11/sddm - but, I know better now. Now at boot I see it modeset exactly three times; 1 for the console, one for the DM (sddm on X11), and 1 for the DE (Wayland). As it should be. So I don't think this is actually new, just, the new udev rule really kicked it past 'this is mildly annoying' to 'this is potentially dangerous'.
I tried the suggested kernel params. Nvidia has apparently patched the bug that was the original source for these commands, so they seem to be obsolete, but I tried anyway. No luck.
I'll keep trying to get a clean log to you but hopefully this helps give you something 'tangible'. Thanks again for your patience.
Thanks for the update. Do you know how nvidia-settings is invoked during initialization? Moving the udev rule later in execution order, i.e. by changing its name to 69-ddcutil-usb.rules, might address the problem.
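For context on why a rename changes the timing: udev sorts rules files from all of its directories by file name and processes them in that lexical order. A minimal sketch of that ordering follows; the function name and the example file names other than the ddcutil rule are hypothetical, and it only sorts names, it does not query udev.

```sh
#!/bin/sh
# Sketch: print rules files in the order udev would process them,
# i.e. sorted by base name regardless of which directory they live in.
# udev_rule_order is a hypothetical helper, not part of udev or ddcutil.
udev_rule_order() {
    for path in "$@"; do
        # Strip the directory part; only the file name decides the order.
        printf '%s\n' "${path##*/}"
    done | LC_ALL=C sort
}
```

Called with the paths of the installed rules files, a rule renamed from 60-ddcutil-usb.rules to 69-ddcutil-usb.rules moves below everything numbered 61 through 68. This only affects ordering among udev rules, not relative to systemd services.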
I will consider not automatically installing 60-ddcutil-usb.rules, but instead leaving it for the user to install. As noted, it is relevant only in the highly unusual case of a monitor with which ddcutil can communicate using USB.
nvidia-settings here was invoked from a systemd service. I disabled the service to confirm the conflict in my earlier testing. I had to reboot today, so while I was doing that, I did try delaying it, hopefully allowing the ddcutil rule to run first, but I still saw a bunch of flickering, which actually kicked in at the moment the mode is usually set for the console, courtesy of the nvidia driver's fbdev option.
It reset the monitor a dozen times or so, culminating (possibly after the 10 second delay I put on the service) in kwin (or was it plasma?) freezing up the system and having to hit the power button. The kernel responded to the power button press (I didn't have to hold the switch in), so it seems that the kernel was alive, but certainly, none of my input was working (eg couldn't switch TTY, ctrl+alt+
Related: Since a while ago (a few months), I've had a weird but seemingly cosmetic-only issue with my machine, where when I shutdown/rebooted, and I would normally see a broadcast message notifying me of the shutdown; here, the line above the message, would have some seemingly-random number of @ symbols. That problem has also disappeared with the removal of the usb rule. I initially did not consider these issues might be related but now, I do.
It seems I have various means of exposing conflicts between the nvidia driver and ddcutil. This round of tests has me looking more at the framebuffer console part of the driver (courtesy of weird characters in the console, and the timing of the flickering at boot), and as far as I'm aware, it is still considered experimental, so that checks out. I'd kinda like to try it again without that option set, but I am not game to punish my machine like that again. When you are able to repro I'm sure you'll understand my hesitance, it's got a real "oof, that can't be healthy" vibe to it.
But maybe, it might be a bug for nvidia to fix. But we both know how long that can take, so for now, I've just disabled the rule. Thinking of a long term solution, I wonder if maybe it might be more desirable to have the rule only run if there is no nvidia GPU present in the machine? Edit: Come to think of it, the interrogate command broke stuff, too, so maybe such a rough approach is not accurate enough to avoid problems.
Offhand, checking for an nvidia GPU is not straightforward. However, looking for driver directory /sys/module/nvidia is. chkusbmon could terminate without checking any hiddev device if the nvidia driver is loaded.
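A minimal sketch of that check, assuming the presence of the /sys/module/nvidia directory is a good enough signal that the driver is loaded. The function names and the optional sysfs-root argument are hypothetical and exist only so the logic can be tried against a fake tree; this is not how chkusbmon is implemented.

```sh
#!/bin/sh
# Sketch: succeed if the nvidia kernel module appears to be loaded,
# judged only by the existence of its sysfs directory.
# The optional first argument (default /sys) is a hypothetical override.
nvidia_driver_loaded() {
    sysfs_root="${1:-/sys}"
    [ -d "$sysfs_root/module/nvidia" ]
}

# Hypothetical gate: succeed only when it looks safe to examine hiddev devices.
should_check_hiddev() {
    if nvidia_driver_loaded "${1:-/sys}"; then
        return 1    # nvidia driver present: skip the hiddev check
    fi
    return 0
}
```

A udev-invoked wrapper could then run something like `should_check_hiddev || exit 1` before doing anything else, so that the rule simply does not match on systems with the nvidia driver loaded.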
Similar issue here. Screen flashes multiple times on boot with Optimus turned off on my laptop. Can also be triggered on a TTY after boot using sudo udevadm trigger -s usbmisc. Not present after disabling the usb udev rule.
usbenvironment and interrogate also trigger it. USB Environment: http://0x0.st/XfwW.txt Interrogate: http://0x0.st/Xfw4.txt
Using the listed workaround stops the flashing on boot but not when running ddcutil directly.
I just stumbled upon this by chance because a variety of commands trigger the problem. Including but not limited to:
Udev rule /usr/lib/udev/rules.d/60-ddcutil-usb.rules is no longer installed. It can be installed by the user in those very rare cases where it is actually useful.
EDIT: corrected usb rules file name.
Thanks @rockowitz . You are doing an excellent impression of that random person in Nebraska
@pallaswept I must admit that I had to google "random person in Nebraska" - I don't believe I ever heard/read the phrase before. That's a very stylish way to give a compliment, and much appreciated.
The underlying issue still seems to occur for me if I don't have POWERDEVIL_NO_DDCUTIL=1 set in the service. To be honest I don't know why powerdevil is even using ddcutil since I'm on a laptop.
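For anyone wanting to set that variable "in the service", one way is a systemd user drop-in. The sketch below assumes PowerDevil runs as the user unit plasma-powerdevil.service and that user drop-ins live under ~/.config/systemd/user; the function name and the optional directory argument are hypothetical.

```sh
#!/bin/sh
# Sketch: write a drop-in that sets POWERDEVIL_NO_DDCUTIL=1 for the
# plasma-powerdevil.service user unit.
# write_powerdevil_override is a hypothetical helper; the optional first
# argument overrides the systemd user unit directory.
write_powerdevil_override() {
    unit_dir="${1:-${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user}"
    dropin_dir="$unit_dir/plasma-powerdevil.service.d"
    mkdir -p "$dropin_dir" || return 1
    # Overwrites any existing override.conf in that directory.
    printf '[Service]\nEnvironment=POWERDEVIL_NO_DDCUTIL=1\n' \
        > "$dropin_dir/override.conf"
}
```

After calling `write_powerdevil_override`, running `systemctl --user daemon-reload` and restarting plasma-powerdevil.service (or logging out and back in) should make the setting take effect.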
@pallaswept To confirm, you're saying that the problem continues to occur even with udev rule 60-ddcutil-usb.rules removed from all of /usr/lib/udev/rules.d, /usr/local/lib/udev/rules.d, and /etc/udev/rules.d?
Re Powerdevil using libddcutil even though you're on a laptop, it would have to check beforehand that there is only a laptop display and handle the case where an external display is connected.
I think you maybe meant to tag @JL2210 ?
They said:
usbenvironment and interrogate also trigger it. Using the listed workaround stops the flashing on boot but not when running ddcutil directly.
I do also notice this:
Running the commands you've given actually breaks nvidia-settings ability to set fan speeds, until I restart
Since some of kwin/powerdevil's automatic executions of ddcutil were still happening (I don't know what it does. I had all the brightness controls disabled. There are lots of reports since they started doing this, of people with monitors that have the wrong brightness set at boot or after sleep or something, so kwin still uses ddcutil for other stuff, too), and I know some executions of ddcutil are capable of a clash with the nvidia driver which breaks fan control on the card, I also disabled it in kwin, with POWERDEVIL_NO_DDCUTIL=1, just to be sure.
Edit: I just want to say, to be clear, I am not flinging any mud at ddcutil here. It's very apparent to me that the nvidia driver is at fault.
Yeah, I deleted all of those files by hand. I did have a file named /etc/udev/rules.d/60-ddcutil-usb.rules that was empty so that the rule in /usr/lib would never be loaded (since my package manager manages it).
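The empty-file trick works because a rules file in /etc/udev/rules.d replaces a file of the same name in /usr/lib/udev/rules.d. A minimal sketch, with a hypothetical function name and an optional directory argument added only for illustration:

```sh
#!/bin/sh
# Sketch: mask a packaged udev rule by creating an empty file of the
# same name in the local rules directory.
# mask_udev_rule is a hypothetical helper; the second argument
# (default /etc/udev/rules.d) selects the directory to write into.
mask_udev_rule() {
    rule_name="$1"
    etc_rules_dir="${2:-/etc/udev/rules.d}"
    [ -n "$rule_name" ] || return 1
    mkdir -p "$etc_rules_dir" || return 1
    # Creates the file, or truncates it to zero length if it already exists.
    : > "$etc_rules_dir/$rule_name"
}
```

Run as root, `mask_udev_rule 60-ddcutil-usb.rules` followed by `udevadm control --reload` keeps the packaged rule from being applied, and the mask survives package updates that reinstall the file under /usr/lib.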
Also of note is that the "turning off and on again repeatedly" part only happens with my computer's iGPU disabled. Not sure why
@JL2210 To clarify, are you saying that the "turning off and on again" problem still occurs with 60-ddcutil-usb.rules disabled? Or that when 60-ddcutil-usb.rules is enabled the problem only occurs when the iGPU is disabled? If the former then I need to look elsewhere for the source of the problem.
60-ddcutil-usb.rules is completely absent on my system. The problem occurred much more frequently when it was present (every time udev was triggered) but hasn't stopped.
Running commands like ddcutil chkusbmon hiddev2, ddcutil environment --verbose, or ddcutil probe in a TTY exhibits the problem.
The issue never occurs when the iGPU is enabled, rules file or not.
The suggested workaround for Powerdevil only really works on a Wayland session. In X the problem still happens.
I can make another issue for this if you'd like. It seems to be marginally different than the one described here
Not that I mind if you start a new issue, but...
It seems to be marginally different than the one described here
So far what you've described exactly matches my experience. The only difference I can see is that you have been able to test using an iGPU (I don't have one). I assume if I did have an iGPU I'd see the same thing you did.
I have a hunch (and it's only a hunch) that in searching for /dev/i2c devices that implement DDC/CI, ddcutil pokes a device that it was unable to exclude a priori (e.g. an SMBus device) and that causes the problem. (The initial part of the "poke" is an attempt to read an EDID at slave address x50, and then a check that slave address x37 is responsive.)
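As an aside on the first half of that poke: a valid EDID always begins with the fixed 8-byte header 00 FF FF FF FF FF FF 00, which is how a reader can tell a real EDID from whatever another device returns. The sketch below only inspects bytes already saved to a file (for example a copy of /sys/class/drm/<connector>/edid); it performs no I2C access, the function name is hypothetical, and it is not ddcutil's own code.

```sh
#!/bin/sh
# Sketch: succeed if the file named by $1 starts with the fixed
# 8-byte EDID header 00 FF FF FF FF FF FF 00.
# looks_like_edid is a hypothetical helper.
looks_like_edid() {
    # Dump the first 8 bytes as hex and strip od's spacing and newline.
    header=$(od -An -tx1 -N8 "$1" 2>/dev/null | tr -d ' \n')
    [ "$header" = "00ffffffffffff00" ]
}
```

An empty or missing file fails the check, which matches what sysfs shows for a connector with nothing attached.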
One way to test this is to execute a command that skips display detection and only touches a single bus. Since it's been mentioned that the probe command can trigger the failure, try executing ddcutil probe --bus
I gave that a shot and have discovered I can no longer replicate this fault. Which is nice, but also leaves us holding a mystery, which sucks.
Pretty much everything has changed at this end, since the original report, and the subsequent observations of the same bug during ddcutil [interrogate|environment]: New KDE, new KDE settings, new Qt, new nvidia driver, new kernel, plus, the service which previously utilised nvidia-smi/nvidia-settings to read and set fans at boot, is now using NVML instead (which appears to be far more polished than the CLI tools). I don't think I could practically roll it all back for testing.
Sorry @rockowitz and @JL2210 I don't know if I can be of much more help here. I'll do what I can, but... it might not be much.
My current system for reference:
kde Operating System: openSUSE Tumbleweed 20240813 KDE Plasma Version: 6.1.4 KDE Frameworks Version: 6.5.0 Qt Version: 6.7.2 Kernel Version: 6.10.3-1-default (64-bit) Graphics Platform: Wayland
nvidia Driver Version: 550.100 CUDA Version: 12.4
kernel 6.10.3-1-default
No modification in plasma-powerdevil.service.d/override.conf
This in AC and battery profiles:
I have a hunch (and it's only a hunch) that in searching for /dev/i2c devices that implement DDC/CI, ddcutil pokes a device that it was unable to exclude a priori (e.g. an SMBus device) and that causes the problem. (The initial part of the "poke" is an attempt to read an EDID at slave address x50, and then a check that slave address x37 is responsive.)
One way to test this is to execute a command that skips display detection and only touches a single bus. Since it's been mentioned that the probe command can trigger the failure, try executing ddcutil probe --bus.
This command actually makes my screen flash regardless of what bus number I pass in. I can do sudo ddcutil probe --bus 120 (it doesn't exist) and the screen will still power cycle.
My steps to reproduce are:
systemctl stop sddm
I'm currently using strace to figure out what it's doing
I haven't managed to find the exact system call that causes the issue so far, but I did see this in the latest nvidia patch notes:
Fixed a bug that could cause memory corruption while handling ACPI events on some notebooks.
I'm in the process of updating so I'll test again and see if this fixes it.
I can't tell exactly what causes it. The crash happens sometime around when master_initializer forks, and GDB can't handle debugging multithreaded applications
Edit: narrowed it down again to check_all_video_adapters_implememt_drm
And again to probe_dri_device_using_drm_api, on /dev/dri/card1, which is the disabled Intel iGPU.
Seems that the line is util/drm_common.c:189 for some reason:
close(fd); // because O_CLOEXEC not recognized
So now I can reproduce this by doing:
echo -n '' | sudo tee /dev/dri/card1
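The tee one-liner writes nothing, so in effect it just opens the device node and closes it again. A minimal sketch that makes the two steps explicit follows; the function name is hypothetical, and it opens the file read-only whereas tee opens it for writing, so it is an approximation of the reproduction rather than an exact equivalent.

```sh
#!/bin/sh
# Sketch: open a file on descriptor 3 and immediately close it,
# without reading or writing any data.
# open_and_close is a hypothetical helper. It runs in a subshell so a
# failed open cannot terminate the calling shell.
open_and_close() {
    ( exec 3<"$1" && exec 3<&- )
}
```

On the affected machine, `sudo sh` followed by sourcing this function and running `open_and_close /dev/dri/card1` would be one way to check whether a bare open and close is enough to trigger the screen reset.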
I'm not sure if this is a ddcutil bug anymore
@JL2210 Thank you for the detailed debugging.
I've pushed out a change to branch 2.1.5-dev based on your report, which points to a new and experimental segment of code, which uses the drm api to determine if a device supports drm. With this change, by default, function submaster_initializer() in ddc_common_init.c no longer calls all_displays_drm_using_drm_api(). If utility option --f13 is specified, it is called. Let me know if this change eliminates the crash you're seeing. Command line option --trcfunc submaster_initializer may make it clearer what is going on.
If all_displays_drm_using_drm_api() is indeed the culprit, we can drill down into that function. Unfortunately, it and its called functions are in the lowest, utility, code layer, for which tracing cannot be turned on from the command line - it requires actually editing the code and recompiling or using gdb breakpoints.
I don't see an interrogate report for your system. In this case, only the subset of that report created by ddcutil environment --verbose is needed. There's a segment in that report that explores the system using the drm api. (If the problem lies in the drm api usage, I wouldn't be surprised if the environment command crashes.) So please run the program and submit the output as an attachment. Thanks.
Interrogate report is in this message. In the meantime I have made a post in the Nvidia developer forums.
Now it doesn't crash in the middle of the command, but when exit is called and all file descriptors are closed it looks like it crashes then.
As to why it happens only when the file is closed, I have no idea. Not sure if it's possible to avoid opening it in the first place or not.
Here's the full backtrace at the point it opens the file:
@JL2210 Your tracing identified a missing close() statement in function get_drm_connector_states_by_devname(). I've put a fix into branch 2.1.5-dev.
Legendary effort, kudos to you both
@JL2210 Can you confirm that the recent fix in branch 2.1.5-dev resolves the crash problem? Or does it still exist?
If the bug is resolved, what happens if you invoke ddcutil with option --f13? Does that succeed or fail?
Thank you.
Sorry, I've been busy trying to get GPU passthrough to work on a virtual machine, which means I needed my iGPU enabled.
I'll test it soon, but I imagine the screen will turn off on that new close you added.
The build fails for me at the minute:
echo "// Dummy include file to force rebuilding built_timestamp.c" >
/bin/sh: -c: line 1: syntax error near unexpected token `newline'
/bin/sh: -c: line 1: `echo "// Dummy include file to force rebuilding built_timestamp.c" >'
I think it's supposed to be $@ instead of $0?
Seems that the screen now turns off much earlier, as I expected. When it closes the file descriptor for my Intel card /dev/dri/card2, the screen turns off momentarily.
--f13 makes no difference.
@JL2210 "$0" should be correct. See the make doc. It's how the example in the cited stackoverflow post is written. However, it appears that the variable is not always set. I've rewritten the echo command in file src/base/Makefile.am to explicitly redirect to build_details.h and pushed the change to 2.1.5-dev.
You wrote: " When it closes the file descriptor for my Intel card /dev/dri/card2, the screen turns off momentarily." I assume by "Intel card" you mean the iGPU. The interrogate output you sent earlier was with the iGPU disabled, so there's no card2. When convenient, please run environment --verbose with the iGPU enabled. Perhaps that will give me some clue as to what is going on.
For now, I'm going to disable the use of libdrm, except for interrogate and environment. The code is used only for comparison with sysfs, which sometimes has unexpected contents. I'll post again when the change has been made.
Just to clarify, the problem only happens when the (intel) iGPU is disabled. For some reason /dev/dri/card2 is created for it even when it's been disabled in the BIOS. It also tends to switch between being called card1 and card2, i.e. not persistent across reboot.
Accessing either file is fine when the iGPU is enabled; it's only when the iGPU is disabled that I have problems.
Will post interrogate results with it enabled in the BIOS soon
Ahh! The problem only occurs when ddcutil tries to use /dev/dri/card1 (or 2) and there's nothing "behind it". So it's really a matter of ddcutil avoiding the /dev/dri/card device for this pathological state.
As noted, the relevant code exists only to validate what's in sysfs. It is now disabled by default. Utility option --f6 reenables it for testing purposes.
One other piece of pathology I noted in the interrogate output is that depending on where I look the EDID for the laptop displays may or may not be found on /dev/i2c-2. Does ddcutil detect even report an unsupported laptop display on /dev/i2c-2?
Finally, there was a problem with built file build_details.h not found when make was executed with option -j. That appears to have been fixed by moving its deletion from src/base/Makefile.am to src/Makefile.am.
Sorry for the delay, my keyboard suddenly decided to change layouts and I couldn't access a TTY to test. environment --verbose with iGPU enabled
With the iGPU enabled it does indeed detect an unsupported laptop GPU on /dev/i2c-9:
det.log
Will test --f6 soon, just have to reboot
Also, both that make doc and Stack Overflow post clearly show @ (at) rather than 0 (zero). Might want to check your font
I can't reproduce anymore with or without --f6 (or --f14, I saw that was added).
I suppose it's time to put the bug with /dev/dri/cardX being created for a disabled device on the kernel mailing list now
@JL2210 "Might want to check your font." That puts it kindly. A case of once you "know" what you're seeing you stop really looking.
Yes, it's time to regard this as a DRM issue. Though given that it involves both Intel and Nvidia graphics I expect it will be hard to get it addressed.
Thank you for your persistence in diagnosing this extended issue. We've actually dealt with multiple bugs along the way, and it's important that the problematic diagnostics are no longer enabled by default.
Unless something more comes up, I'll close this issue in a few days. Or feel free to close it yourself.
Regards, Sanford
Sorry but I need to give a quiet round of applause from the sidelines here.
Thanks heaps to both of you.
Edit: Can reopen the issue on request if needed :)
I recently upgraded my system (Tumbleweed) and upon rebooting, noticed a disconcerting number of resets from my second monitor. By this, I mean, there would be text or graphics on screen showing the boot process, and the screen would blink off and then back on again (showing the resolution change OSD as if it just changed modes) over and over rapidly, until it settled at my desktop.
I tracked the problem down to the new udev rule installed by openSUSE's package, which has just had the updates from 2.0.0 through 2.1.4 applied. In trying to find the source, it seems that it is the example file from here. (A similar rule is also present in the docs.)
As best I can tell, what I'm seeing is a monitor reset for every USB HID device attached to my system. There are fewer when I unplug a few of them. Perhaps this is related to the monitor (a Viewsonic XG2703-GS) having an onboard USB hub, with several HID devices (two mice, which appear as several separate HID devices, each) ... The monitor otherwise works with DDC, I use it to alter settings without using the buttons on the side... I'm hesitant to experiment with this, though, as the monitor can not be enjoying being 'strobed' like that. I've gone ahead and commented out the udev rule for the time being.
As instructed, I ran
sudo ddcutil interrogate --verbose
but ahh... are you sure you want it that verbose? It's ~5000 lines and has some rather fine detail of what I'm doing right now, so I thought I should confirm that is what you really need, before I paste a giant self-doxxing spam message ;) Am I doing it wrong?
Sorry I am a little late to report this, it's taken me a while to pin down the source of the fault.
Edit: For future travellers who might have similar issues: I stumbled across this workaround on reddit just now:
A search after seeing this shows me that to find this, you have to be desperate enough to be reading the source code for powerdevil, so I'm reposting here, so that others might know how to disable DDC in KDE until it's a bit more stable/less dangerous to hardware.