gpustatd crash - Githubissues

biesbjerg commented 6 years ago

After running for some time I often get this (pulled from /var/log/messages):

Mar  2 09:30:41 midas systemd: Starting gpustatd fan control daemon from the minotaur project...
Mar  2 09:30:41 midas rsyslogd: fopen() failed: 'Permission denied', path: '/var/lib/rsyslog/imjournal.state.tmp'  [v8.24.0 try http://www.rsyslog.com/e/2013 ]
Mar  2 09:30:41 midas mkdir: /bin/mkdir: cannot create directory ‘/var/run/gpustatd’: File exists
Mar  2 09:30:41 midas xhost: localuser:gpustatd being added to access control list
Mar  2 09:30:41 midas systemd: Started gpustatd fan control daemon from the minotaur project.
Mar  2 09:30:41 midas gpustatd: info: gpustatd 1.1.4 starting up
Mar  2 09:30:41 midas gpustatd: info: scanning devices
Mar  2 09:30:42 midas gpustatd: Traceback (most recent call last):
Mar  2 09:30:42 midas gpustatd: File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/gpustatd.py", line 210, in <module>
Mar  2 09:30:42 midas gpustatd: File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/gpustatd.py", line 33, in __init__
Mar  2 09:30:42 midas gpustatd: File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/singleton.py", line 5, in __call__
Mar  2 09:30:42 midas gpustatd: File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/nvidia.py", line 63, in __init__
Mar  2 09:30:42 midas gpustatd: KeyError: 4

This, of course, causes all sorts of trouble for minotaur. I had negative profits because it was mining daggerhashimoto at 277 W instead of the expected 139 W etc.

gordan-bobic commented 6 years ago

Interesting, I saw the same symptoms but thought the bug was in minotaur. It's possible a gpustatd bug is behind it.

m4rkw commented 6 years ago

Seems to be an edge case where the output of:

/usr/bin/nvidia-settings -q fans

doesn't have an entry for every device on the system.

m4rkw commented 6 years ago

No wait, that's wrong. It looks like a device returned by -q fans is not present in the output of nvidia-smi -L.
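
A minimal sketch of that failure mode, assuming (hypothetically) that the daemon builds a dict keyed by GPU index from `nvidia-smi -L` and then looks up every index reported by `nvidia-settings -q fans`. The function and variable names are illustrative and are not taken from the gpustatd source. A guarded lookup skips the unknown device instead of raising the `KeyError: 4` seen in the traceback:

```python
import logging

log = logging.getLogger("gpustatd-sketch")


def map_fans_to_devices(smi_devices, fan_gpu_ids):
    """Pair each fan with its device, skipping fans whose GPU is unknown.

    smi_devices: dict of GPU index -> name (assumed to come from `nvidia-smi -L`).
    fan_gpu_ids: GPU indices reported for fans (assumed to come from
                 `nvidia-settings -q fans`).
    """
    mapped = {}
    skipped = []
    for gpu_id in fan_gpu_ids:
        if gpu_id not in smi_devices:
            # A bare smi_devices[gpu_id] here would raise KeyError.
            log.warning("fan refers to GPU %d, which nvidia-smi did not list", gpu_id)
            skipped.append(gpu_id)
            continue
        mapped[gpu_id] = smi_devices[gpu_id]
    return mapped, skipped


# Illustrative data: index 4 is absent on the nvidia-smi side.
smi_devices = {i: "GeForce GTX 1080 Ti" for i in (0, 1, 2, 3, 5)}
mapped, skipped = map_fans_to_devices(smi_devices, range(6))
```

With this shape the daemon could keep controlling the fans it can still match and report the rest, rather than exiting during the device scan.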

gordan-bobic commented 6 years ago

Did the device fall off the bus? That is a bastard of a problem to deal with gracefully, and it does happen with mining rigs where everything is a bodge (and in all mining rigs, almost everything is a bodge, starting with PCIe risers).

Does

dmesg | grep Xid

show anything interesting?

The sanest bodge for a workaround I have come up with is using swatch to watch syslog and issue

echo b > /proc/sysrq-trigger

when an error occurs. It's pretty awful (or DevOps as fook, depending on your point of view).
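
For anyone without swatch, the same idea can be sketched in Python. This is a rough outline under assumptions: the two patterns are guesses at what a fatal GPU error looks like in the log, and `watch`, `is_fatal_gpu_line` and `force_reboot` are hypothetical names. Writing `b` to `/proc/sysrq-trigger` reboots immediately without syncing disks and needs root, so the reboot action is passed in as a callback that can be swapped out while testing.

```python
import re

# Patterns treated as fatal GPU errors. Both are assumptions about what the
# kernel log or nvidia-smi prints on a given system; adjust to taste.
FATAL_PATTERNS = [
    re.compile(r"Xid"),
    re.compile(r"GPU is lost"),
]


def is_fatal_gpu_line(line):
    """Return True if a log line matches one of the fatal patterns."""
    return any(p.search(line) for p in FATAL_PATTERNS)


def force_reboot(path="/proc/sysrq-trigger"):
    """Write 'b' to the sysrq trigger: immediate reboot, no sync, needs root."""
    with open(path, "w") as trigger:
        trigger.write("b")


def watch(lines, on_fatal=force_reboot):
    """Scan an iterable of log lines and call on_fatal at the first match."""
    for line in lines:
        if is_fatal_gpu_line(line):
            on_fatal()
            return True
    return False
```

In practice `lines` would be something that follows the syslog file; passing a plain list and a harmless callback is enough to check the matching logic first.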

gordan-bobic commented 6 years ago

Or is this a mismatch between nvidia-smi and xorg.conf?

biesbjerg commented 6 years ago

dmesg doesn't show anything:

[midas@midas ~]$ sudo dmesg | grep Xid
[midas@midas ~]$

gordan-bobic commented 6 years ago

My best guess then is a xorg.conf mismatch with nvidia-smi.

biesbjerg commented 6 years ago

Looks okay to me:

[midas@midas ~]$ cat /etc/X11/xorg.conf
Section "ServerLayout"
    Identifier      "Layout0"
    Screen          0 "Screen0" 0 0
    Screen          1 "Screen1" RightOf "Screen0"
    Screen          2 "Screen2" RightOf "Screen1"
    Screen          3 "Screen3" RightOf "Screen2"
    Screen          4 "Screen4" RightOf "Screen3"
    Screen          5 "Screen5" RightOf "Screen4"
    Option          "Xinerama" "0"
EndSection
Section "Device"
    Identifier      "Device0"
    Driver          "nvidia"
    VendorName      "NVIDIA Corporation"
    BoardName       "GeForce GTX 1080 Ti"
    Option          "UseEDID" "false"
    Option          "AllowEmptyInitialConfiguration" "yes"
    Option          "ConnectToAcpid" "off"
    Option          "NoLogo" "1"
    Option          "Coolbits" "28"
    Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
    BusID           "PCI:1:0:0"
EndSection
Section "Device"
    Identifier      "Device1"
    Driver          "nvidia"
    VendorName      "NVIDIA Corporation"
    BoardName       "GeForce GTX 1080 Ti"
    Option          "UseEDID" "false"
    Option          "AllowEmptyInitialConfiguration" "yes"
    Option          "ConnectToAcpid" "off"
    Option          "NoLogo" "1"
    Option          "Coolbits" "28"
    Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
    BusID           "PCI:3:0:0"
EndSection
Section "Device"
    Identifier      "Device2"
    Driver          "nvidia"
    VendorName      "NVIDIA Corporation"
    BoardName       "GeForce GTX 1080 Ti"
    Option          "UseEDID" "false"
    Option          "AllowEmptyInitialConfiguration" "yes"
    Option          "ConnectToAcpid" "off"
    Option          "NoLogo" "1"
    Option          "Coolbits" "28"
    Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
    BusID           "PCI:5:0:0"
EndSection
Section "Device"
    Identifier      "Device3"
    Driver          "nvidia"
    VendorName      "NVIDIA Corporation"
    BoardName       "GeForce GTX 1080 Ti"
    Option          "UseEDID" "false"
    Option          "AllowEmptyInitialConfiguration" "yes"
    Option          "ConnectToAcpid" "off"
    Option          "NoLogo" "1"
    Option          "Coolbits" "28"
    Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
    BusID           "PCI:6:0:0"
EndSection
Section "Device"
    Identifier      "Device4"
    Driver          "nvidia"
    VendorName      "NVIDIA Corporation"
    BoardName       "GeForce GTX 1080 Ti"
    Option          "UseEDID" "false"
    Option          "AllowEmptyInitialConfiguration" "yes"
    Option          "ConnectToAcpid" "off"
    Option          "NoLogo" "1"
    Option          "Coolbits" "28"
    Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
    BusID           "PCI:7:0:0"
EndSection
Section "Device"
    Identifier      "Device5"
    Driver          "nvidia"
    VendorName      "NVIDIA Corporation"
    BoardName       "GeForce GTX 1080 Ti"
    Option          "UseEDID" "false"
    Option          "AllowEmptyInitialConfiguration" "yes"
    Option          "ConnectToAcpid" "off"
    Option          "NoLogo" "1"
    Option          "Coolbits" "28"
    Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x1"
    BusID           "PCI:8:0:0"
EndSection
Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync      28.0 - 33.0
    VertRefresh    43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync      28.0 - 33.0
    VertRefresh    43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Monitor"
    Identifier     "Monitor2"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync      28.0 - 33.0
    VertRefresh    43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Monitor"
    Identifier     "Monitor3"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync      28.0 - 33.0
    VertRefresh    43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Monitor"
    Identifier     "Monitor4"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync      28.0 - 33.0
    VertRefresh    43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Monitor"
    Identifier     "Monitor5"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync      28.0 - 33.0
    VertRefresh    43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth   24
    SubSection     "Display"
        Depth          24
    EndSubSection
EndSection
Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth   24
    SubSection     "Display"
        Depth          24
    EndSubSection
EndSection
Section "Screen"
    Identifier     "Screen2"
    Device         "Device2"
    Monitor        "Monitor2"
    DefaultDepth   24
    SubSection     "Display"
        Depth          24
    EndSubSection
EndSection
Section "Screen"
    Identifier     "Screen3"
    Device         "Device3"
    Monitor        "Monitor3"
    DefaultDepth   24
    SubSection     "Display"
        Depth          24
    EndSubSection
EndSection
Section "Screen"
    Identifier     "Screen4"
    Device         "Device4"
    Monitor        "Monitor4"
    DefaultDepth   24
    SubSection     "Display"
        Depth          24
    EndSubSection
EndSection
Section "Screen"
    Identifier     "Screen5"
    Device         "Device5"
    Monitor        "Monitor5"
    DefaultDepth   24
    SubSection     "Display"
        Depth          24
    EndSubSection
EndSection

[midas@midas ~]$ nvidia-smi
Fri Mar  2 11:20:22 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:01:00.0 Off |                  N/A |
| 28%   50C    P2   136W / 139W |   2812MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:03:00.0 Off |                  N/A |
| 31%   52C    P2   138W / 139W |   2656MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  On   | 00000000:05:00.0 Off |                  N/A |
| 28%   51C    P2   138W / 139W |   2656MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:06:00.0 Off |                  N/A |
| 28%   51C    P2   137W / 139W |   2656MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  On   | 00000000:07:00.0 Off |                  N/A |
| 28%   51C    P2   138W / 139W |   2656MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  On   | 00000000:08:00.0 Off |                  N/A |
| 31%   51C    P2   139W / 139W |   2656MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

gordan-bobic commented 6 years ago

Yeah, that does look right...

gordan-bobic commented 6 years ago

I have just had a wrong power limit again, but no similar gpustatd crash in the same time period. So the gpustatd crash and the power limits being set wrong are not directly related.

gordan-bobic commented 6 years ago

And no PCIe errors on the device that was mis-set. This looks like at least two separate bugs. The power setting one is quite critical. Until that is fixed, it may be safer and possibly even more profitable to just run ethminer standalone with a static power limit.

biesbjerg commented 6 years ago

minotaur.log

2018-03-02 12:58:59: [info] initialising
2018-03-02 12:59:00: [info] new version available: v1.1.2
2018-03-02 12:59:00: [info] scanning devices
2018-03-02 12:59:00: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/5.yml
2018-03-02 12:59:00: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/4.yml
2018-03-02 12:59:00: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/3.yml
2018-03-02 12:59:00: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/2.yml
2018-03-02 12:59:00: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/1.yml
2018-03-02 12:59:00: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/0.yml
2018-03-02 12:59:00: [info] found 6 nvidia GPUs
2018-03-02 12:59:00: [info] device 0 [1080Ti]:   ccminer: blake2s,blake256r8,decred,cryptonight,x13,keccak,x11gost,neoscrypt
2018-03-02 12:59:00: [info] device 0 [1080Ti]:   ccminer: quark,nist5,qubit,skunk,lbry,lyra2rev2,x15
2018-03-02 12:59:00: [info] device 0 [1080Ti]:  ethminer: daggerhashimoto
2018-03-02 12:59:00: [info] device 0 [1080Ti]: excavator: blake2s,daggerhashimoto_sia,cryptonight,daggerhashimoto,keccak
2018-03-02 12:59:00: [info] device 0 [1080Ti]: excavator: sia,neoscrypt,daggerhashimoto_decred,daggerhashimoto_pascal,pascal
2018-03-02 12:59:00: [info] device 0 [1080Ti]: excavator: decred,nist5,equihash,lbry,lyra2rev2
2018-03-02 12:59:00: [info] device 0 [1080Ti]:  ccminer2: blake2s,keccak,quark,x11gost,qubit,lbry,nist5
2018-03-02 12:59:00: [info] device 1 [1080Ti]:  as above
2018-03-02 12:59:00: [info] device 2 [1080Ti]:  as above
2018-03-02 12:59:00: [info] device 3 [1080Ti]:  as above
2018-03-02 12:59:00: [info] device 4 [1080Ti]:  as above
2018-03-02 12:59:00: [info] device 5 [1080Ti]:  as above
2018-03-02 12:59:00: [info] you have calibration data for all supported algorithms :)
2018-03-02 12:59:00: [info] retrieving state from miner backends
2018-03-02 12:59:00: [info] device 0 [1080Ti]: most profitable is now: ethermine/daggerhashimoto in region: None using ethminer
2018-03-02 12:59:00: [info] device 1 [1080Ti]: most profitable is now: ethermine/daggerhashimoto in region: None using ethminer
2018-03-02 12:59:00: [info] device 2 [1080Ti]: most profitable is now: ethermine/daggerhashimoto in region: None using ethminer
2018-03-02 12:59:00: [info] device 3 [1080Ti]: most profitable is now: ethermine/daggerhashimoto in region: None using ethminer
2018-03-02 12:59:00: [info] device 4 [1080Ti]: most profitable is now: ethermine/daggerhashimoto in region: None using ethminer
2018-03-02 12:59:00: [info] device 5 [1080Ti]: most profitable is now: ethermine/daggerhashimoto in region: None using ethminer
2018-03-02 12:59:01: [info] device 0 [1080Ti]: starting algorithm daggerhashimoto with ethminer [pool=ethermine] [profile=1080ti_oc_daggerhashimoto] [region=None]
2018-03-02 12:59:01: [info] device 1 [1080Ti]: starting algorithm daggerhashimoto with ethminer [pool=ethermine] [profile=1080ti_oc_daggerhashimoto] [region=None]
2018-03-02 12:59:01: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/5.yml
2018-03-02 12:59:01: [info] device 2 [1080Ti]: starting algorithm daggerhashimoto with ethminer [pool=ethermine] [profile=1080ti_oc_daggerhashimoto] [region=None]
2018-03-02 12:59:01: [info] device 5 [1080Ti]: starting algorithm daggerhashimoto with ethminer [pool=ethermine] [profile=1080ti_oc_daggerhashimoto] [region=None]
2018-03-02 12:59:01: [info] device 3 [1080Ti]: starting algorithm daggerhashimoto with ethminer [pool=ethermine] [profile=1080ti_oc_daggerhashimoto] [region=None]
2018-03-02 12:59:01: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/4.yml
2018-03-02 12:59:01: [info] device 4 [1080Ti]: starting algorithm daggerhashimoto with ethminer [pool=ethermine] [profile=1080ti_oc_daggerhashimoto] [region=None]
2018-03-02 12:59:01: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/3.yml
2018-03-02 12:59:01: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/2.yml
2018-03-02 12:59:01: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/1.yml
2018-03-02 12:59:01: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/0.yml
2018-03-02 12:59:01: [warning] failed to set power limit on device 4 (check we have +s on nvidia-smi)
2018-03-02 12:59:21: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/5.yml
2018-03-02 12:59:21: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/4.yml
2018-03-02 12:59:21: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/3.yml
2018-03-02 12:59:21: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/2.yml
2018-03-02 12:59:21: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/1.yml
2018-03-02 12:59:21: [warning] gpustatd metrics file is stale (>5s): /var/run/gpustatd/0.yml

biesbjerg commented 6 years ago

Seems I have a crashing GPU

[midas@midas ~]$ sudo su gpustatd
bash-4.2$ gpustatd
info: gpustatd 1.1.4 starting up
info: scanning devices
Traceback (most recent call last):
  File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/gpustatd.py", line 210, in <module>
  File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/gpustatd.py", line 33, in __init__
  File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/singleton.py", line 5, in __call__
  File "/home/mock/rpmbuild/BUILD/gpustatd-1.1.4/nvidia.py", line 63, in __init__
KeyError: 4
bash-4.2$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-29c69203-c1cd-73d6-7776-82b222317e43)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-f0043b8c-451a-645b-e0f7-d4164bff9443)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-408fdbc5-7d5a-67a9-97cc-e71214807d2a)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-87241d91-62ff-7633-b776-d69789b3bf12)
Unable to determine the device handle for gpu 0000:07:00.0: GPU is lost.  Reboot the system to recover this GPU

GPU 5: GeForce GTX 1080 Ti (UUID: GPU-653ef5a7-d43e-c784-146d-d850bdee0521)

gordan-bobic commented 6 years ago

Time to dial back the OC. What OC offsets do you use?

biesbjerg commented 6 years ago

The question is, does one crashed GPU somehow prevent gpustatd from updating metrics for all devices? That probably shouldn't be the case.
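
As a rough sketch of what tolerating a lost GPU could look like, the snippet below parses the `nvidia-smi -L` output pasted above and separates healthy devices from lost ones. The regular expressions are written against that sample only, and `parse_smi_list` is a hypothetical helper, not something from the gpustatd source.

```python
import re

# Both patterns are derived from the sample output above, not from any spec.
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)$")
LOST_LINE = re.compile(
    r"^Unable to determine the device handle for gpu (\S+): GPU is lost"
)


def parse_smi_list(output):
    """Split `nvidia-smi -L` output into healthy devices and lost bus IDs."""
    devices = {}
    lost = []
    for line in output.splitlines():
        line = line.strip()
        m = GPU_LINE.match(line)
        if m:
            devices[int(m.group(1))] = {"name": m.group(2), "uuid": m.group(3)}
            continue
        m = LOST_LINE.match(line)
        if m:
            lost.append(m.group(1))
    return devices, lost


SAMPLE = """\
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-29c69203-c1cd-73d6-7776-82b222317e43)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-f0043b8c-451a-645b-e0f7-d4164bff9443)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-408fdbc5-7d5a-67a9-97cc-e71214807d2a)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-87241d91-62ff-7633-b776-d69789b3bf12)
Unable to determine the device handle for gpu 0000:07:00.0: GPU is lost.  Reboot the system to recover this GPU

GPU 5: GeForce GTX 1080 Ti (UUID: GPU-653ef5a7-d43e-c784-146d-d850bdee0521)
"""

devices, lost = parse_smi_list(SAMPLE)
```

Here `devices` holds indices 0, 1, 2, 3 and 5, and `lost` holds the bus ID of the missing card, so metrics for the five working GPUs could in principle still be written while the lost one is reported separately.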

biesbjerg commented 6 years ago

Yeah, just dialed OC back a bit, for equihash and daggerhashimoto (the algos that have been running the last couple of hours)

device_profiles:
  default:
    algorithm: all
    device: all
  1080ti_oc:
    algorithm: all
    device: 1080ti
    gpu_clock_offset: 115
    memory_clock_offset: 1500
  1080ti_oc_equihash:
    algorithm: equihash
    device: 1080ti
    gpu_clock_offset: 150 # 155
    memory_clock_offset: 1580 # 1620
  1080ti_oc_neoscrypt:
    algorithm: neoscrypt
    device: 1080ti
    gpu_clock_offset: 105
    memory_clock_offset: 450
  1080ti_oc_x11gost:
    algorithm: x11gost
    device: 1080ti
    gpu_clock_offset: 155
    memory_clock_offset: 1620
  1080ti_oc_lyra2rev2:
    algorithm: lyra2rev2
    device: 1080ti
    gpu_clock_offset: 175
    memory_clock_offset: 1650
  1080ti_oc_daggerhashimoto:
    algorithm: daggerhashimoto
    device: 1080ti
    gpu_clock_offset: 115 # 125
    memory_clock_offset: 1130 # 1150

gordan-bobic commented 6 years ago

The Nvidia API doesn't degrade gracefully: when a GPU crashes, most control goes away, and X usually goes with it.

+1500 RAM is borderline at best on 1080Ti. On Asus, stick with +823, any more and performance gets worse. On other cards, +1000 is safe in P2. Relatively few cards can handle +1500 stably. Any more and you are heading for serious instability very quickly.

+150 on GPU is also borderline. Stick with +125 at any sign of trouble.

And I'm not convinced that a profile per algorithm is more than a huge waste of time for a placebo effect.

gordan-bobic commented 6 years ago

And the API will probably never get graceful degradation, because the problem goes away without OC and fan control. Without those, there is no need for Xorg at all.

biesbjerg commented 6 years ago

I find it really depends on the algo. I have manually calibrated my cards, algo vs. OC, and think I've found the 'sweet spot' performance-wise. Stability not guaranteed, of course.

Anyway, after rebooting, my GPU still doesn't show up. I think maybe the GPU or riser has gone bad.

m4rkw / gpustatd

gpustatd crash #1