madMAx43v3r / chia-gigahorse


Long system timeouts > 1 minute on ada lovelace #74

Open Motophan opened 1 year ago

Motophan commented 1 year ago
[10900.272611] NVRM: Xid (PCI:0000:0b:00): 109, pid=11263, name=Renderer, Ch 00000020, errorString CTX SWITCH TIMEOUT, Info 0xac00d

[11002.151833] r8169 0000:05:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM
[11272.551321] NVRM: Xid (PCI:0000:0b:00): 109, pid=11263, name=Renderer, Ch 00000020, errorString CTX SWITCH TIMEOUT, Info 0xac00d

[12210.204152] NVRM: Xid (PCI:0000:0b:00): 109, pid=11263, name=Renderer, Ch 00000020, errorString CTX SWITCH TIMEOUT, Info 0xac00d

System specs: RTX 4080, Samsung 980 Pro 2TB (under TBW), 128 GB RAM, AMD 5950X, Arch Linux; desktop environment is Xfce.

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080         Off| 00000000:0B:00.0  On |                  N/A |
|  0%   36C    P8                7W / 320W|    575MiB / 16376MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2215      C   cuda_plot_k32                               260MiB |
|    0   N/A  N/A     10767      G   /usr/lib/Xorg                               108MiB |
|    0   N/A  N/A     11263      G   /usr/lib/firefox/firefox                    163MiB |
+---------------------------------------------------------------------------------------+

What happens: after anywhere from 20 minutes to 20 hours, the system becomes unstable. It freezes for more than 1 minute at a time, sometimes once every few minutes, sometimes once every few hours. Sometimes dmesg shows the Xid 109 error (which NVIDIA does not document); other times dmesg shows no error at all, but the system still freezes.
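To line up the freezes with the kernel log, the Xid lines can be pulled out of dmesg and the gaps between them measured. The following is only a rough sketch: the regex is fitted to the three lines quoted above and may not cover other Xid formats, and the names `parse_xid_events` and `gaps_between` are made up for this example.

```python
import re

# Pattern fitted to the NVRM Xid lines quoted in this report (an assumption,
# not an official log grammar).
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), name=(?P<name>[^,]+), .*?"
    r"errorString (?P<err>.+?), Info"
)


def parse_xid_events(dmesg_text):
    """Return one dict per NVRM Xid line found in the dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append({
                "seconds": float(m["ts"]),  # seconds since boot
                "pci": m["pci"],
                "xid": int(m["xid"]),
                "pid": int(m["pid"]),
                "name": m["name"],
                "error": m["err"],
            })
    return events


def gaps_between(events):
    """Seconds elapsed between consecutive Xid events."""
    times = [e["seconds"] for e in events]
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


# The dmesg excerpt from this report, used as sample input.
SAMPLE = """\
[10900.272611] NVRM: Xid (PCI:0000:0b:00): 109, pid=11263, name=Renderer, Ch 00000020, errorString CTX SWITCH TIMEOUT, Info 0xac00d
[11002.151833] r8169 0000:05:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM
[11272.551321] NVRM: Xid (PCI:0000:0b:00): 109, pid=11263, name=Renderer, Ch 00000020, errorString CTX SWITCH TIMEOUT, Info 0xac00d
[12210.204152] NVRM: Xid (PCI:0000:0b:00): 109, pid=11263, name=Renderer, Ch 00000020, errorString CTX SWITCH TIMEOUT, Info 0xac00d
"""

if __name__ == "__main__":
    events = parse_xid_events(SAMPLE)
    for e in events:
        print(e["seconds"], e["xid"], e["pid"], e["name"], e["error"])
    print("gaps (s):", gaps_between(events))
```

On the excerpt above, all three events carry pid 11263, which the nvidia-smi table below lists as /usr/lib/firefox/firefox rather than cuda_plot_k32 (PID 2215). That may or may not be significant, but it seems worth noting when comparing against the freeze times.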

Example plotting command


```
screen -d -m -S plotsink chia_plot_sink /mnt/hdd0{1,2,3,4,5,6,7,8}
```

madMAx43v3r commented 1 year ago

Since this is partial RAM mode (I assume), I would blame the NVMes first.

madMAx43v3r commented 1 year ago

CTX SWITCH TIMEOUT does seem like it could be the cause though...

madMAx43v3r commented 1 year ago

It could also be the other way around though, NVMes causing the CTX SWITCH TIMEOUT.

madMAx43v3r commented 1 year ago

I'm getting these freezes on my dev machine as well, with a GIGABYTE GP-GSM2NE3100TNTD

Motophan commented 1 year ago

I can source a new enterprise NVMe drive. I am juuuuuuuust above the TBW rating on the 980 Pro 2TB (about 1400 TB written on a drive rated for 1200 TBW).

Should I close this, or do you know of anything I can do to isolate the issue? Or should I leave it open until I can source a good high-endurance NVMe?

Also, do you have any recommendations for drives? I was going to get a PCIe U.2 adapter and a used high-endurance Intel DC U.2 drive for about $120.
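For the TBW figures mentioned above, the arithmetic can be sketched as follows. NVMe SMART logs report "Data Units Written" in units of 1000 × 512 bytes, and tools such as smartctl or nvme-cli expose that counter. The function names and the sample counter value here are hypothetical; the counter was chosen only so that it works out to the 1400 TB quoted above.

```python
# One NVMe SMART "data unit" is 1000 * 512 bytes.
BYTES_PER_DATA_UNIT = 512_000


def terabytes_written(data_units_written):
    """Convert the SMART 'Data Units Written' counter to decimal terabytes."""
    return data_units_written * BYTES_PER_DATA_UNIT / 1e12


def endurance_used(tb_written, rated_tbw):
    """Fraction of the rated endurance already consumed (1.0 = at the rating)."""
    return tb_written / rated_tbw


if __name__ == "__main__":
    # Hypothetical counter value, picked to equal 1400 TB written.
    units = 2_734_375_000
    tb = terabytes_written(units)
    used = endurance_used(tb, rated_tbw=1200)
    print(f"{tb:.0f} TB written, {used:.0%} of rated endurance")
```

A drive past its rating is not necessarily failing, so this only quantifies how far over the rating it is; it does not show whether the SSD is behind the freezes.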