keylase / nvidia-patch

This patch removes restriction on maximum number of simultaneous NVENC video encoding sessions imposed by Nvidia to consumer-grade GPUs.

Stuck in P-State P0 after transcode finished on NVIDIA? #127

Closed comassky closed 5 years ago

comassky commented 5 years ago

Hi,

I noticed that after running transcodes on Plex with hardware acceleration (GPU) enabled, the P-state of my Nvidia GTX 1070 gets stuck in P0. Because of this it draws a lot of power (more than while transcoding in P2…), and the fan spins up unnecessarily as well…

If I stop Plex Media Server, the P-state immediately drops back to P8, as it should when idle. Can you please do something with the transcoder to let the GPU return to P8 once transcoding has finished?

Processes using the GPU while transcoding:

$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
                     plex      18191 F...m Plex Transcoder
/dev/nvidiactl:      root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
                     plex      18191 F...m Plex Transcoder
/dev/nvidia-modeset: root      12516 F.... nvidia-persiste
/dev/nvidia-uvm:     plex      16779 F.... Plex Media Serv
                     plex      18191 F.... Plex Transcoder

Processes after transcoding has finished:

$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
/dev/nvidiactl:      root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
/dev/nvidia-modeset: root      12516 F.... nvidia-persiste
/dev/nvidia-uvm:     plex      16779 F.... Plex Media Serv

Somehow Plex Media Server doesn’t let the GPU go…
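
The comparison of the two listings above can also be done mechanically. The following is only a minimal sketch: it assumes the `fuser -v` column layout shown in this report (user, PID, access flags, command), and the function names `parse_fuser` and `released` are hypothetical helpers, not part of any existing tool.

```python
"""Compare two `fuser -v` listings to see which processes let go of the GPU."""


def parse_fuser(text):
    """Return {device_path: {(pid, command), ...}} from `fuser -v` output."""
    holders = {}
    device = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("/"):
            # Start of a device section, e.g. "/dev/nvidia0:  root 12516 F.... cmd"
            device, _, line = line.partition(":")
        fields = line.split(None, 3)  # user, pid, access flags, command
        if device is None or len(fields) != 4 or not fields[1].isdigit():
            continue  # header row or anything that is not a holder entry
        _user, pid, _access, command = fields
        holders.setdefault(device, set()).add((int(pid), command.strip()))
    return holders


def released(before, after):
    """Return {device: holders present in `before` but gone in `after`}."""
    diff = {}
    for device, procs in before.items():
        gone = procs - after.get(device, set())
        if gone:
            diff[device] = gone
    return diff


# Listings copied from this report.
DURING = """\
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
                     plex      18191 F...m Plex Transcoder
/dev/nvidiactl:      root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
                     plex      18191 F...m Plex Transcoder
/dev/nvidia-modeset: root      12516 F.... nvidia-persiste
/dev/nvidia-uvm:     plex      16779 F.... Plex Media Serv
                     plex      18191 F.... Plex Transcoder
"""

AFTER = """\
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
/dev/nvidiactl:      root      12516 F.... nvidia-persiste
                     plex      16779 F.... Plex Media Serv
/dev/nvidia-modeset: root      12516 F.... nvidia-persiste
/dev/nvidia-uvm:     plex      16779 F.... Plex Media Serv
"""

if __name__ == "__main__":
    # Processes that released each device node once the transcode ended.
    for device, procs in sorted(released(parse_fuser(DURING), parse_fuser(AFTER)).items()):
        print(device, sorted(procs))
    # Processes that still hold the main GPU device node afterwards.
    print("still holding /dev/nvidia0:", sorted(parse_fuser(AFTER)["/dev/nvidia0"]))
```

On these two listings, only the Plex Transcoder process disappears; Plex Media Server keeps its handles on the device nodes, which is consistent with the observation above.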


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8    14W / 151W |      1MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

/*** Transcode begin ***/

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
|  0%   48C    P2    37W / 151W |    113MiB /  8118MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8099      C   /usr/lib/plexmediaserver/Plex Transcoder     101MiB |
+-----------------------------------------------------------------------------+

/*** Transcode end ***/

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
|  0%   53C    P0    38W / 151W |     11MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
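
For anyone who wants to check for the same symptom without reading the tables by eye, here is a rough sketch. It relies on `nvidia-smi --query-gpu=pstate,power.draw,utilization.gpu --format=csv,noheader,nounits`, which prints one comma-separated line per GPU. The helper names and the choice of P8 as the "idle" state are assumptions taken from this report, not something defined by the driver.

```python
"""Flag a GPU that sits in a high performance state while doing no work."""
import subprocess

# nvidia-smi invocation that prints e.g. "P0, 38.00, 0" for each GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=pstate,power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _number(field):
    """Convert a numeric field, or return None for non-numeric placeholders."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_status(csv_text):
    """Parse the query output into one dict per GPU."""
    rows = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pstate, power, util = fields
        rows.append(
            {"pstate": pstate, "power_w": _number(power), "util_pct": _number(util)}
        )
    return rows


def looks_stuck(status, idle_pstate=8):
    """True if the GPU reports 0% utilization in a state faster than idle.

    Lower P-state numbers mean higher performance. `idle_pstate=8` is an
    assumption based on the GTX 1070 in this report idling at P8.
    """
    pstate = status["pstate"]
    if not (pstate.startswith("P") and pstate[1:].isdigit()):
        return False
    return int(pstate[1:]) < idle_pstate and status["util_pct"] == 0


def query_gpu_status():
    """Run nvidia-smi and return the parsed status of every GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_status(out)
```

Calling `looks_stuck(s)` on each entry returned by `query_gpu_status()` gives a per-GPU flag. A single 0% reading can also occur briefly between frames during a transcode, so polling a few times before concluding anything is probably wise.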

The Plex team says it's an interaction between the Nvidia driver (430.26), the kernel (stock 4.19), and the OS (Debian Buster)...

Do you think it's a general bug? Can you reproduce it in your environment?

Snawoot commented 5 years ago

Hi!

I can't reproduce it in my environment because I don't use Plex and I don't have a Plex Pass.

As it appears to me, the OS is irrelevant here (at least, it is not distro-specific). The problem lies somewhere between Plex Server and the Nvidia driver. Power management on Nvidia GPUs is quite strange. Many years ago, when I used ffmpeg for video conversion on the GPU, I noticed GPU power spikes when processing stopped. Sometimes this hung the entire system because my configuration had a weak power supply unit.

The observations you made are very useful. They actually show that the problem can be worked around on the Plex Server side by freeing some GPU resources/handles when no GPU processing is active. So a workaround is definitely within the Plex team's grasp. Even if the root cause of this power mismanagement is inside the Nvidia driver (I believe it is), the Plex team can make an effort and escalate this bug to Nvidia support, because they can see the broad picture and describe how to reproduce the problem in terms of Nvidia API invocations.

As for the nvidia-patch project, I don't think we can address this issue in the near future. If the problem is caused by Plex Server, it is out of scope for us. If the problem is caused by the Nvidia driver, it should be fixed properly in driver code; binary patching is not a substitute for a correct implementation. But I'll leave this issue open in case someone can propose a workaround for resetting the power state without restarting Plex Server. Maybe the Nvidia driver utilities can help manage power; I'm not aware of a way yet.

comassky commented 5 years ago

Hi Snawoot,

Thanks a lot for your answer.

Actually, I didn't necessarily mean this as a bug for this repository; it was more a chance to compare our results and see whether the bug is widespread.

I reported the bug on the Plex forum, where I was told that it is an Nvidia bug, yet it seems to affect only Plex...

I hope they will deal with this issue, or pass it on to Nvidia.

comassky commented 5 years ago

Just for information: the bug seems to be an ffmpeg issue!

More precisely, it is in the version of ffmpeg used by Plex, which is an old 3.x.

They are currently working on an upgrade to 4.x (planned for the end of the year), which will fix the bug.

Snawoot commented 5 years ago

Thanks! Closing issue.

mekya commented 5 years ago

@comassky we're experiencing something similar in one of our projects. Could you give some more information about this ffmpeg issue? I mean some links/issues about it on the ffmpeg side.

comassky commented 5 years ago

https://ffmpeg.org/pipermail/ffmpeg-devel/2016-June/195598.html

But that is an issue from 2016; the bug has been fixed since 4.x...

For the moment, the new Plex transcoder is still affected (https://forums.plex.tv/t/plex-media-server-1-16-7-1597-updated-new-transcoder-preview/451135/137), even with the upgrade to a recent ffmpeg version.

mekya commented 5 years ago

I see. In our project, it gets stuck when opening a new session in a multithreaded environment. We use ffmpeg 4.1.

Thank you for the information

comassky commented 5 years ago

Okay, interesting!

In the Jellyfin project (Plex-like), ffmpeg seems to be fine in this respect; the issue is not present.

Maybe it's handled in a specific way?