mathoudebine / turing-smart-screen-python

Unofficial Python system monitor and library for small IPS USB-C displays like Turing Smart Screen or XuanFang

400% to 700% power usage increase when a Nvidia GPU is detected #534

Open FurretUber opened 2 months ago


Describe the bug

While turing-smart-screen-python is running, the Nvidia GPU is always at maximum frequency, runs at a high temperature, and uses about 8 times the normal power when idle.

To Reproduce
Steps to reproduce the behavior:

  1. Have an Nvidia GPU;
  2. Start turing-smart-screen-python;
  3. Observe abnormal GPU behavior in power consumption, frequency, and temperature (a logging sketch follows below).
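
For reference, here is a minimal standalone logging sketch to capture that behavior in step 3. It is hypothetical and not part of the project; it only uses standard nvidia-smi query fields, and logs idle power, core clock, and temperature once per second so the increase is easy to record while the app is started and stopped:

# Hypothetical helper script, not part of turing-smart-screen-python:
# log idle power, core clock and temperature once per second.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw,clocks.gr,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    first_gpu = out.splitlines()[0]  # single-GPU setup, as in this report
    power_w, core_mhz, temp_c = [field.strip() for field in first_gpu.split(",")]
    print(f"power={power_w} W  core={core_mhz} MHz  temp={temp_c} °C")
    time.sleep(1)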

Expected behavior
GPU frequency, temperature, and power usage are not significantly impacted by turing-smart-screen-python.

Screenshots / photos of the Turing screen

nvtop screenshot using the custom sensors below:

[screenshot]

nvtop screenshot using the default Nvidia detection:

[screenshot]


Additional context
In the last few days, I had observed odd behavior on my headless desktop: the GPU temperature and frequency were always high, as if it were in use, and power usage was much higher than normal (from 5 W to 40 W at idle). As I investigated, I found the problem only happened while turing-smart-screen-python was running. I tried commenting out the entire GPU: sections from the theme file, but the problem persisted.

To work around this, I edited sensors_python.py and removed the Nvidia detection; with that change, GPU temperature and frequency returned to normal.
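
As a sanity check, one could also run the same kind of library-based polling outside turing-smart-screen-python to see whether it alone keeps the GPU awake. A minimal sketch follows, assuming the GPUtil package is installed; whether sensors_python.py actually relies on GPUtil for Nvidia detection is an assumption here, not something confirmed in this report:

# Hypothetical isolation test, not part of the project: poll GPU stats through
# GPUtil once per second and watch idle power/clocks in nvtop while it runs.
import time

import GPUtil  # assumption: the GPUtil package is installed

while True:
    for gpu in GPUtil.getGPUs():  # GPUtil shells out to nvidia-smi internally
        print(f"{gpu.name}: load={gpu.load:.0%} temp={gpu.temperature}°C "
              f"mem={gpu.memoryUsed:.0f}/{gpu.memoryTotal:.0f} MB")
    time.sleep(1)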

What is even stranger is that when I set up custom sensors to read the GPU data, there was no change in power consumption, temperature, or frequency at all (WARNING: Works on my machine™ code):

import subprocess
import time

# These classes are meant to sit next to CustomDataSource (the project's
# custom-sensor mechanism), so only subprocess and time need importing here.

class nvGPUFreq(CustomDataSource):
    def as_numeric(self) -> float:
        # Unused: this sensor is only displayed as text.
        pass

    def as_string(self) -> str:
        try:
            saidaNvidia = obtemDadosNvidia()
            linhaDividida = saidaNvidia.strip().split()
            coreFreq = linhaDividida[5].strip()  # core clock column of the dmon line
            return '{}MHz'.format(coreFreq).rjust(7)
        except Exception as err:
            print(err)
            return ''


class nvGPUTemp(CustomDataSource):
    def as_numeric(self) -> float:
        # Unused: this sensor is only displayed as text.
        pass

    def as_string(self) -> str:
        try:
            saidaNvidia = obtemDadosNvidia()
            linhaDividida = saidaNvidia.strip().split()
            gpuTemp = linhaDividida[2].strip()  # GPU temperature column
            return '{}°C'.format(gpuTemp).rjust(5)
        except Exception as err:
            print(err)
            return ''


class nvGPUMem(CustomDataSource):
    def as_numeric(self) -> float:
        try:
            saidaNvidia = obtemDadosNvidia()
            linhaDividida = saidaNvidia.strip().split()
            # Sum of the three memory-usage columns of the dmon line (MB)
            gpuMem = int(linhaDividida[6]) + int(linhaDividida[7]) + int(linhaDividida[8])
            return gpuMem
        except Exception as err:
            print(err)
            return 0

    def as_string(self) -> str:
        try:
            saidaNvidia = obtemDadosNvidia()
            linhaDividida = saidaNvidia.strip().split()
            gpuMem = int(linhaDividida[6]) + int(linhaDividida[7]) + int(linhaDividida[8])
            return '{} MB'.format(gpuMem).rjust(8)
        except Exception as err:
            print(err)
            return ''


class nvGPUMemPercent(CustomDataSource):
    def as_numeric(self) -> float:
        try:
            saidaNvidia = obtemDadosNvidia()
            linhaDividida = saidaNvidia.strip().split()
            # Used memory (columns 6-8) as a percentage of memory.total
            # (column 15, appended by obtemDadosNvidia())
            gpuMemPercent = int(round(100 * (int(linhaDividida[6]) + int(linhaDividida[7]) + int(linhaDividida[8])) / int(linhaDividida[15]), 0))
            print(gpuMemPercent)  # debug output
            return gpuMemPercent
        except Exception as err:
            print(err)
            return 0

    def as_string(self) -> str:
        # Unused for this sensor.
        pass


# Cache: nvidia-smi is spawned at most once per second, however many
# sensors read from it during a refresh.
saidaNvidia = ""
ultimaExecucaoNvidiaSMI = 0


def obtemDadosNvidia():
    global saidaNvidia
    global ultimaExecucaoNvidiaSMI
    if time.time() - ultimaExecucaoNvidiaSMI < 1:
        return saidaNvidia  # reuse the cached line
    ultimaExecucaoNvidiaSMI = time.time()
    # One dmon sample: power, clocks, memory and utilization in a single line
    processoNV = subprocess.Popen(["nvidia-smi", "dmon", "-s", "pcmu", "-c", "1"],
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    saidaEErro = processoNV.communicate()
    linha = ""
    for linhaSaida in saidaEErro[0].decode(encoding="utf-8").strip().split('\n'):
        if linhaSaida.startswith('#'):
            continue  # skip dmon header lines
        linha = linhaSaida  # keep the data line (single-GPU setup)
    # Append memory.total so the percentage sensor can find it at column 15
    processoNV_2 = subprocess.Popen(["nvidia-smi", "--query-gpu", "memory.total", "--id=0",
                                     "--format=csv,nounits,noheader"],
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    saidaEErro_2 = processoNV_2.communicate()
    saidaNvidia = linha + " " + saidaEErro_2[0].decode(encoding="utf-8").strip()
    print(saidaNvidia)  # debug output
    return saidaNvidia  # return the combined line, not just the dmon line

I doubt this is a bug that was introduced in turing-smart-screen-python, as I had an older version available and it now shows the same behavior. It may be related to kernel or Nvidia driver updates that changed something which now triggers this abnormal behavior. However, if this becomes the new "default", it may cause problems, such as cooking GPUs.

Information about the Nvidia driver: Driver Version: 555.42.06, CUDA Version: 12.5. Tested with the 5.15 and 6.5 kernels available in the Ubuntu repository.