nan0s7 / nfancurve

A small and lightweight POSIX script for using a custom fan curve in Linux for those with an Nvidia GPU.
GNU General Public License v3.0

Control display is undefined #12

Open rajamarwah opened 5 years ago

rajamarwah commented 5 years ago

Please help

Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.
Number of Fans detected: Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.
Number of GPUs detected: ./temp.sh: line 184: [: : integer expression expected
Submit an issue on my GitHub page... happy to fix this :D
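
The trailing `integer expression expected` message looks like a secondary symptom rather than a separate bug: it is what the shell's `[` prints when a numeric comparison receives an empty string, which would happen if the GPU count came back empty because `nvidia-settings` failed. A minimal sketch of guarding such a comparison, assuming a hypothetical helper `is_uint` and a hypothetical variable `num_gpus` (neither is taken from `temp.sh`):

```shell
# Hypothetical helper, not part of temp.sh: succeeds only for a non-empty
# string made up entirely of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: validate a (hypothetical) GPU count before comparing it numerically.
num_gpus=""   # what an empty query result would look like
if is_uint "$num_gpus" && [ "$num_gpus" -gt 0 ]; then
    echo "GPUs detected: $num_gpus"
else
    echo "Could not determine the number of GPUs; is an X display reachable?"
fi
```

With a check like this the script can report the real cause (no reachable display) instead of a confusing arithmetic error.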

nan0s7 commented 5 years ago

Hmm... can I have some more information about your setup? Like are you using the X display server, do you have coolbits enabled, etc.

Post the output of the two commands nvidia-settings -q dpys and nvidia-settings -q screens please.

It sounds like you have a strange display configuration, which doesn't use the default display :0.

rajamarwah commented 5 years ago

I have 6 GPUs (1080 Ti) with an Asus Prime Z370-A motherboard and Linux 18.04. Both commands give the same result:

Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.

Output of Nvidia-smi is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   28C    P8    13W / 160W |     31MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   24C    P8    10W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  On   | 00000000:04:00.0 Off |                  N/A |
|  0%   24C    P8    11W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:05:00.0 Off |                  N/A |
|  0%   27C    P8    11W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  On   | 00000000:06:00.0 Off |                  N/A |
|  0%   26C    P8    10W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  On   | 00000000:08:00.0 Off |                  N/A |
|  0%   27C    P8     8W / 160W |      9MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1155      G   /usr/lib/xorg/Xorg                            21MiB |
|    0      1622      G   /usr/bin/gnome-shell                           7MiB |
|    1      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    2      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    3      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    4      1155      G   /usr/lib/xorg/Xorg                             6MiB |
|    5      1155      G   /usr/lib/xorg/Xorg                             6MiB |
+-----------------------------------------------------------------------------+

nan0s7 commented 5 years ago

This is weird; it seems like the nvidia drivers aren't finding your X-server. This isn't a problem with my script, but I'm happy to help as much as I can.

What's the output of lspci -nnk? It might show that the wrong drivers are in use. Also, how are you controlling your machine? Do you have a desktop environment running? I can see gnome-shell is running, but I'm not sure whether that can happen in the background or something.

rajamarwah commented 5 years ago

Sincerely appreciate the help and support in troubleshooting.

I have an 18.04 desktop environment running, but the display is currently disabled (maybe due to my tweaking; I'm a noob).

Here's the output:

00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e1f] (rev 08)
    Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 08)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:01.1 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x8) [8086:1905] (rev 08)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:3e91]
    Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
    Kernel driver in use: i915
    Kernel modules: i915
00:14.0 USB controller [0c03]: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller [8086:a2af]
    Subsystem: ASUSTeK Computer Inc. 200 Series PCH USB 3.0 xHCI Controller [1043:8694]
    Kernel driver in use: xhci_hcd
00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
    Subsystem: ASUSTeK Computer Inc. 200 Series PCH CSME HECI [1043:8694]
    Kernel driver in use: mei_me
    Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation 200 Series PCH SATA controller [AHCI mode] [8086:a282]
    Subsystem: ASUSTeK Computer Inc. 200 Series PCH SATA controller [AHCI mode] [1043:8694]
    Kernel driver in use: ahci
    Kernel modules: ahci
00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1b.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #21 [8086:a2eb] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1c.1 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #2 [8086:a291] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1c.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #5 [8086:a294] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1c.6 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #7 [8086:a296] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1d.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #9 [8086:a298] (rev f0)
    Kernel driver in use: pcieport
    Kernel modules: shpchp
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a2c9]
    Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
00:1f.2 Memory controller [0580]: Intel Corporation 200 Series PCH PMC [8086:a2a1]
    Subsystem: ASUSTeK Computer Inc. 200 Series PCH PMC [1043:8694]
00:1f.4 SMBus [0c05]: Intel Corporation 200 Series PCH SMBus Controller [8086:a2a3]
    Subsystem: ASUSTeK Computer Inc. 200 Series PCH SMBus Controller [1043:8694]
    Kernel driver in use: i801_smbus
    Kernel modules: i2c_i801
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
    Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V [1043:8672]
    Kernel driver in use: e1000e
    Kernel modules: e1000e
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd GP102 [GeForce GTX 1080 Ti] [1458:3751]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd GP102 HDMI Audio Controller [1458:3751]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
    Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti] [19da:4471]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
    Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller [19da:4471]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd GP102 [GeForce GTX 1080 Ti] [1458:3751]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
04:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd GP102 HDMI Audio Controller [1458:3751]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd GP102 [GeForce GTX 1080 Ti] [1458:3751]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd GP102 HDMI Audio Controller [1458:3751]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
    Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti] [19da:2471]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
06:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
    Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller [19da:2471]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
07:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:2142]
    Subsystem: ASUSTeK Computer Inc. Device [1043:8756]
    Kernel driver in use: xhci_hcd
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
    Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti] [19da:2471]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
    Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller [19da:2471]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel

nan0s7 commented 5 years ago

Yeah, I think that may be the issue: not having a display enabled. You should be able to have a faux display running if you're into that, which may convince Nvidia enough so that you can use my script.

There may be another way around this, where I could try controlling the fans without the use of nvidia-settings. However, I don't know how long it'd take to get that working... :P

A few threads I've found say you need to have an X display running on each GPU for nvidia-settings to work.

This may help point you in the right direction: https://devtalk.nvidia.com/default/topic/1024489/nvidia-settings-on-headless-server/

rajamarwah commented 5 years ago

Lol, I followed that thread myself an hour ago and managed to get manual control over the Nvidia GPUs, but I guess I can't use your script for now (which is a pity). Thanks again for all the help and guidance.

nan0s7 commented 5 years ago

No problem! Hope you get things how you would like them :)

I'll keep this issue open to remind myself to look into fan control without nvidia-settings to see if it's possible. If it is, it shouldn't be too hard to add in. :D

dnovischi commented 4 years ago

Running this script manually works as expected; however, it seems it can't be used as a service because of the error shown in the following logs:

>$ systemctl --user status nfancurve.service
● nfancurve.service - Nfancurve service
   Loaded: loaded (/etc/systemd/user/nfancurve.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Vi 2019-11-15 18:33:22 EET; 13s ago
  Process: 14194 ExecStart=/bin/sh /opt/nvidia-fan-control/temp.sh (code=exited, status=1/FAILURE)
 Main PID: 14194 (code=exited, status=1/FAILURE)

>$ journalctl _PID=14194
-- Logs begin at Vi 2019-11-15 16:22:07 EET, end at Vi 2019-11-15 18:34:46 EET. --
nov 15 18:34:28 dan-pc sh[14194]: ################################################################################
nov 15 18:34:28 dan-pc sh[14194]: # nan0s7's script for automatically managing GPU fan speed #
nov 15 18:34:28 dan-pc sh[14194]: ################################################################################
nov 15 18:34:28 dan-pc sh[14194]: Configuration file: /opt/nvidia-fan-control/config
nov 15 18:34:28 dan-pc sh[14194]: Failed to connect to Mir: Failed to connect to server socket: No such file or directory
nov 15 18:34:28 dan-pc sh[14194]: Unable to init server: Could not connect: Connection refused
nov 15 18:34:28 dan-pc sh[14194]: ERROR: The control display is undefined; please run nvidia-settings
nov 15 18:34:28 dan-pc sh[14194]: --help for usage information.
nov 15 18:34:28 dan-pc sh[14194]: No Fans detected

>$ nvidia-settings -q screens
1 X Screen on dan-pc:0

[0] dan-pc:0.0 (GeForce GTX 1080)

  Has the following name:
    SCREEN-0

PS: I have tried various modifications of the service file, all give the same error.

nan0s7 commented 4 years ago

Yeah this isn't a problem with the service file itself, but the way the script is run. By default, NVIDIA should set the display to ":0", but I guess since we're running it from another program, there's no default display set. You can fix this manually in your service file by adding a parameter to the execution of the script: -d 0. Not sure if it needs a colon (:0) though.

I was thinking of adding an option to set this via the config file... so I guess this is a good reason to put it in there! :P
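
For reference, X display names take the form `[host]:display[.screen]`, so a bare `0` is not a valid display name on its own. A minimal sketch of normalising such a value before handing it on, assuming a hypothetical helper `normalize_display` (not part of nfancurve; whether the script's `-d` option does anything similar is not confirmed here):

```shell
# Hypothetical helper: turn a user-supplied display value into an X display
# name. An empty value falls back to $DISPLAY, or to ":0" if that is unset.
normalize_display() {
    case "$1" in
        '')  printf '%s\n' "${DISPLAY:-:0}" ;;  # nothing given: use the environment
        *:*) printf '%s\n' "$1" ;;              # already contains a colon, keep as-is
        *)   printf ':%s\n' "$1" ;;             # bare number such as "0" becomes ":0"
    esac
}
```

The result could then be passed explicitly, for example `nvidia-settings --ctrl-display="$(normalize_display 0)" -q screens`, so the command no longer depends on `DISPLAY` being present in the service's environment.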

cj360 commented 3 years ago

I think I'm having a similar issue, with the systemctl user service not starting at boot due to:

Aug 21 15:46:18 danam4 sh[33353]: Unable to init server: Could not connect: Connection refused
Aug 21 15:46:18 danam4 sh[33353]: ERROR: The control display is undefined; please run nvidia-settings >
Aug 21 15:46:18 danam4 sh[10437]: Fan control set back to auto mode

Starting the script myself has no such issue. Should the -d 0 be in my .service file like: ExecStart=/bin/sh /usr/bin/nfancurve -c -d 0 /etc/nfancurve.conf ?

nan0s7 commented 3 years ago

Sorry for the delay! Yeah if you start the script manually with -d 0, then you would probably need the same values in the service file.

However, if you don't usually need to specify the display when running the script manually, this could be related to another currently open issue that concerns the service file.

Try changing your service file to something like:

[Unit]
Description=Nfancurve service
After=graphical.target

[Service]
ExecStart=/bin/sh /usr/bin/nfancurve -c /etc/nfancurve.conf
KillSignal=SIGINT

[Install]
WantedBy=default.target

Let me know how that goes.
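
If the unit starts before the X server accepts connections, one possible workaround is to have the service wait and retry rather than fail immediately. A minimal sketch, assuming a hypothetical wrapper function `wait_until` and using `nvidia-settings -q screens` (the query requested earlier in this thread) as the readiness probe; the retry count and delay are arbitrary:

```shell
# wait_until TRIES DELAY COMMAND [ARGS...]
# Hypothetical helper: run COMMAND up to TRIES times, sleeping DELAY seconds
# between attempts. Returns 0 as soon as COMMAND succeeds, 1 otherwise.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        # Do not sleep after the final failed attempt.
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use in a wrapper script called from ExecStart= (paths as in the
# unit above):
#   wait_until 30 2 nvidia-settings -q screens &&
#       exec /bin/sh /usr/bin/nfancurve -c /etc/nfancurve.conf
```

This does not help if the service's environment lacks `DISPLAY` or X authority altogether; it only covers the case where the display appears a little later than the service.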

wojciechGaudnik commented 3 years ago

I have the same issue. When I tested -d :0 and -d :1 directly from the command line, everything behaved as expected: :0 works and :1 doesn't. My service file:

[Unit]
Description=Nfancurve service
After=default.target

[Service]
ExecStart=/bin/sh /usr/bin/nfancurve -l -d :0 -c /etc/nfancurve.conf
KillSignal=SIGINT

[Install]
WantedBy=default.target

When I run /bin/sh /usr/bin/nfancurve -l -d :0 -c /etc/nfancurve.conf from the console it works, but the service doesn't. Any suggestions are welcome.

nan0s7 commented 3 years ago

Are your logs the same as the above? What's the actual error?

wojciechGaudnik commented 3 years ago

The error is exactly the same: Unable to init server. But I resolved my problem with:

ExecStart=xinit /opt/nfancurve/temp.sh 01:00.0 run_forever -- :1 -once

I don't need a monitor, so this works for me.

nan0s7 commented 3 years ago

Interesting, I'll have to look into that.

riaqn commented 2 years ago

Hello, is it possible to use this script without running Xorg on the card at all? I'm using the card for deep learning only.

nan0s7 commented 2 years ago

Good question. I personally haven't done any playing around with it so I am not sure. It just depends on whether you can get nvidia-settings to work without Xorg (or by using some sort of dummy display).

If you find anything or figure it out please let me know.
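
One thing that does work without an X server is reading the sensors: `nvidia-smi` produced output earlier in this thread while `nvidia-settings` could not connect. A minimal sketch of reading temperature and fan speed that way, assuming the CSV query interface of `nvidia-smi` and a hypothetical helper `parse_gpu_line`; this only reads values and does not set the fan speed, which is the part the script relies on `nvidia-settings` for:

```shell
# Hypothetical helper: split one CSV line of the form "index, temp, fan"
# into its three fields and print them in a labelled form.
parse_gpu_line() {
    IFS=', ' read -r gpu_idx gpu_temp gpu_fan <<EOF
$1
EOF
    printf 'gpu=%s temp=%sC fan=%s%%\n' "$gpu_idx" "$gpu_temp" "$gpu_fan"
}

# Example use on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,temperature.gpu,fan.speed \
#              --format=csv,noheader,nounits |
#   while IFS= read -r line; do parse_gpu_line "$line"; done
```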

Cabu commented 1 year ago

Like nan0s7, if you find out, I am interested too :)

XinzeZhang commented 1 year ago

I have the same issue when running the project remotely over SSH. The error is as follows:

$ sudo bash temp.sh
Configuration file: /home/xinze/Documents/Github/nfancurve/config
Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.
No Fans detected

Finally, I found the cause of this problem and a solution. As pointed out in https://xinzezhang.github.io/2021/09/01/control-gpu.html, the NVIDIA control software generally requires a logged-in GUI desktop session. To run temp.sh successfully, I simply supplied the xauth credentials with the command:

sudo DISPLAY=:0 XAUTHORITY=/run/user/110/gdm/Xauthority bash temp.sh

where the user ID of the 'gdm' user (110 here) is obtained as described in the link mentioned above.
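
The same invocation can be built without hard-coding the UID. A minimal sketch, assuming the X authority file lives at `/run/user/<uid>/gdm/Xauthority` as in the command above and that the display manager account is named `gdm`; the helper name `gdm_xauthority_path` is hypothetical:

```shell
# Hypothetical helper: build the Xauthority path for a given numeric UID,
# following the /run/user/<uid>/gdm/Xauthority layout used above.
gdm_xauthority_path() {
    printf '/run/user/%s/gdm/Xauthority\n' "$1"
}

# Example use (needs a running GDM session and the gdm account to exist):
#   uid=$(id -u gdm)
#   sudo DISPLAY=:0 XAUTHORITY="$(gdm_xauthority_path "$uid")" bash temp.sh
```

On systems with a different display manager or session layout the path will differ, so the file's existence is worth checking before relying on it.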