keylase / nvidia-patch

This patch removes the restriction on the maximum number of simultaneous NVENC video encoding sessions that Nvidia imposes on consumer-grade GPUs.

Question: Can this enable NVENC on mining GPUs like P102/P104/P106? #247

Closed gordan-bobic closed 4 years ago

gordan-bobic commented 4 years ago

Can this patch be extended to un-disable NVENC support on mining GPUs such as P102-100, P104-100 and P106-100?

Snawoot commented 4 years ago

I can't say for sure. It's unclear to me whether NVENC is locked out on mining boards programmatically in driver code, by a firmware restriction, or because the NVENC IP core is missing. I don't have any mining cards to tinker with, so I can't check.

I'll keep this issue open: who knows, maybe people around have some experience with mining cards and they'll provide some insights.

gordan-bobic commented 4 years ago

I think it is locked out in driver code, the same way that using a GeForce GPU in a VM is locked out by the driver, so that only Tesla/Quadro/Grid cards work in VMs (trivially easy to work around these days). Similarly with headless Optimus laptop GPUs: the driver won't initialize the card unless it detects the battery information via ACPI (again, it is fairly trivial to pass a fake ACPI blob to the VM). Nvidia do a LOT of such product differentiation purely in the driver.

vans163 commented 4 years ago

Whatchu working on gordan? Wanna chat?

gordan-bobic commented 4 years ago

Whatchu working on gordan? Wanna chat?

I'm setting up a headless multi-seat remote gaming rig using RemoteFX and SteamLink. Since it is completely headless, a headless GPU like a mining one is fine, but real time h.264 encoding on the CPU is expensive and laggy, hence why I am asking about enabling NVENC. And in theory a mining GPU should be a lot cheaper than a regular equivalent, since mining is dead and a mining GPU doesn't "just work" for nearly any use other than mining.

vans163 commented 4 years ago

You got discord? I think we can collab to save time

paulhothersall commented 4 years ago

Quadro P400s are like $85 and work great. Also low power, powered from PCIe, and low profile.

vans163 commented 4 years ago

For Video Encode but not for gaming, that thing barely has the juice.

256 cuda cores lol. Actually for workloads that apply effects (or use NPP scaling) this card might choke as well.

gordan-bobic commented 4 years ago

It needs to be usable for 1080p gaming. At the very least something as good as a 1060 (P106), preferably 1070 (P104) or 1080 (P102). I have a GT630/4GB and it's unusably slow at rendering (NVENC part is fine, though), and that's 384 cores vs P400's 256.

I have a similar setup working locally on an Optimus laptop (uncompressed local streaming), but for streaming over the network it needs to be compressed.

vans163 commented 4 years ago

I'm literally working on the same thing lol, though more focused on the software side ATM (to reduce latency); a lot of latency comes from software. Working on a PoC now.

gordan-bobic commented 4 years ago

I have it all working over RemoteFX and SteamLink with the GT630, just need something with more oomph.

Various parts of this guide are directly applicable. https://gist.github.com/Misairu-G/616f7b2756c488148b7309addc940b28

I just want to be cheap with using a mining GPU and do something different "for science".

If you are doing it virtualized, I can share some pro tips on making performance suck as little as possible.

paulhothersall commented 4 years ago

P400 Quadro. Oh, for sure not gaming, but for raw 4k HEVC to 1080p h264, like I have running through machines with 3-4 of them in it, it's absolutely ripping.

Each card can easily sustain 120fps 4k stream to stream, or multiple outputs from a single 4k input.

Saentist commented 4 years ago

I think it is easier for FFmpeg to implement OpenCL transcoding than for a mining card to get NVENC.

vans163 commented 4 years ago

@paulhothersall yeah, but you're doing a downsample of 1 stream, right? With 1 output to 1080p? So you're not touching NPP. As soon as you do, say you want to multiplex different downsample resolutions, that card will choke. If I want to downsample a video library, a P400 Quadro does not work unless I am willing to wait a multiple of the time for every video resolution I want to transcode to.

@Saentist That will take CUDA cores away from the game and move them to encoding. I think it's not a reasonable option.
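The multi-resolution case mentioned above (one input, several downsampled outputs) is usually expressed as a single ffmpeg filtergraph that splits the decoded stream and scales each branch. A minimal sketch, assuming the plain CPU `scale` filter; the function name, the branch labels and the example resolutions are made up for illustration. Passing `scale_filter="scale_npp"` would additionally require the frames to be in GPU memory, which this sketch does not set up.

```python
def build_multiscale_filtergraph(resolutions, scale_filter="scale"):
    """Return an ffmpeg -filter_complex string that splits one video input
    into len(resolutions) branches and scales each branch separately."""
    if not resolutions:
        raise ValueError("need at least one output resolution")
    n = len(resolutions)
    # One labelled pad per branch coming out of the split filter.
    split_outputs = "".join(f"[s{i}]" for i in range(n))
    parts = [f"[0:v]split={n}{split_outputs}"]
    for i, (width, height) in enumerate(resolutions):
        # Each branch gets its own scaler and its own output label.
        parts.append(f"[s{i}]{scale_filter}={width}:{height}[out{i}]")
    return ";".join(parts)


# Hypothetical output ladder, chosen only for the example.
graph = build_multiscale_filtergraph([(1920, 1080), (1280, 720), (854, 480)])
print(graph)
```

Each `[outN]` label would then be mapped to its own encoder with `-map`, so every extra resolution adds one more scaler running on the same frames.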

gordan-bobic commented 4 years ago

I think it is easier for FFmpeg to implement OpenCL transcoding than for a mining card to get NVENC.

Seems unlikely. Patching the driver to ungimp the card typically involves little more than replacing a few bytes of code with 0x90. The tricky bit is figuring out which few bytes.
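For anyone unfamiliar with what "replacing a few bytes with 0x90" means in practice: 0x90 is the single-byte x86 NOP instruction, so overwriting a conditional jump with NOPs makes the code fall through the check. A minimal sketch, assuming a made-up byte sequence; the offset, the length and the bytes below are purely illustrative and are not taken from any Nvidia driver.

```python
NOP = 0x90  # x86 single-byte no-op instruction


def nop_out(blob: bytes, offset: int, length: int) -> bytes:
    """Return a copy of blob with `length` bytes at `offset` replaced by NOPs."""
    if offset < 0 or length < 0 or offset + length > len(blob):
        raise ValueError("patch range falls outside the binary")
    return blob[:offset] + bytes([NOP]) * length + blob[offset + length:]


# Invented example: push ebp; mov ebp, esp; je +5; mov eax, 1; ret
original = bytes.fromhex("5589e57405b801000000c3")
# Overwrite the two-byte conditional jump (74 05) so execution always falls through.
patched = nop_out(original, offset=3, length=2)
print(patched.hex())  # 5589e59090b801000000c3
```

The patch itself is trivial; as said above, the work is in finding the right offset for a given driver build.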

vans163 commented 4 years ago

I think it is easier for FFmpeg to implement OpenCL transcoding than for a mining card to get NVENC.

Seems unlikely. Patching the driver to ungimp the card typically involves little more than replacing a few bytes of code with 0x90. The tricky bit is figuring out which few bytes.

Fire up Ghidra on libnvidia-encode.so lol

Saentist commented 4 years ago

I think it is easier for FFmpeg to implement OpenCL transcoding than for a mining card to get NVENC.

Seems unlikely. Patching the driver to ungimp the card typically involves little more than replacing a few bytes of code with 0x90. The tricky bit is figuring out which few bytes.

nvidia resistor mod ;) info

gordan-bobic commented 4 years ago

@Saentist I did that until KVM introduced the bits required to hide itself from the Nvidia driver. And before that, up to and including on Fermi cards, no hardware modding needed, just reflash the device ID strap bits to change the device ID, and lo and behold, a GeForce was a Quadro. It didn't enable everything (some things were cut out or done in firmware), but it did enable it to work in a VM, it enabled the 2nd DMA channel and it made the driver enable a few extra OpenGL hardware functions that boosted some parts of SPECviewperf. Been there, done that. But that was 5 GPU generations ago.

Ah membah.

paulhothersall commented 4 years ago

@paulhothersall yeah, but you're doing a downsample of 1 stream, right? With 1 output to 1080p? So you're not touching NPP. As soon as you do, say you want to multiplex different downsample resolutions, that card will choke. If I want to downsample a video library, a P400 Quadro does not work unless I am willing to wait a multiple of the time for every video resolution I want to transcode to.

Actually I am currently not using NPP, and instead running it CPU side through various filtergraph stages, which includes a scale for the 1:1 stuff.

Threadrippers are insanely cost effective at this with tons of cheap cores / ram bandwidth / PCIe lanes.

If you are careful about CPU affinity (RAM/lanes/cores), the closeout deals on 12-16 core 1950x/2920x/2950x made for absolute monsters at ripping frames.

The recent AM4 drops mean a 3900x or 3950x in an economy 3 16x (physical) slot ATX board presents a great cost efficiency.
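On Linux, "being careful about CPU affinity" on a multi-die Threadripper can be approximated by pinning each worker process to the cores of one NUMA node. A minimal sketch, assuming a Linux host that exposes NUMA nodes in sysfs; the function names are made up, and choosing the node closest to a given GPU is left out (the `numa_node` attribute under the card's `/sys/bus/pci/devices/` entry is one place to look).

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse the Linux cpulist format, e.g. '0-7,16-23', into a set of CPU ids."""
    cpus: set[int] = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def pin_to_numa_node(node: int) -> set[int]:
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Calling `pin_to_numa_node(0)` before launching the filtergraph work keeps threads and their memory traffic on one die; whether that helps depends on the board and workload.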

vans163 commented 4 years ago

@paulhothersall yeah, but you're doing a downsample of 1 stream, right? With 1 output to 1080p? So you're not touching NPP. As soon as you do, say you want to multiplex different downsample resolutions, that card will choke. If I want to downsample a video library, a P400 Quadro does not work unless I am willing to wait a multiple of the time for every video resolution I want to transcode to.

Actually I am currently not using NPP, and instead running it CPU side through various filtergraph stages, which includes a scale for the 1:1 stuff.

Threadrippers are insanely cost effective at this with tons of cheap cores / ram bandwidth / PCIe lanes.

If you are careful about CPU affinity (RAM/lanes/cores), the closeout deals on 12-16 core 1950x/2920x/2950x made for absolute monsters at ripping frames.

The recent AM4 drops mean a 3900x or 3950x in an economy 3 16x (physical) slot ATX board presents a great cost efficiency.

Didn't realise they were that effective at that, maybe will try one day.

@Saentist I did that until KVM introduced the bits required to hide itself from the Nvidia driver. And before that, up to and including on Fermi cards, no hardware modding needed, just reflash the device ID strap bits to change the device ID, and lo and behold, a GeForce was a Quadro. It didn't enable everything (some things were cut out or done in firmware), but it did enable it to work in a VM, it enabled the 2nd DMA channel and it made the driver enable a few extra OpenGL hardware functions that boosted some parts of SPECviewperf. Been there, done that. But that was 5 GPU generations ago.

Ah membah.

Heh yup, didn't know it enabled the 2nd DMA channel tho.

paulhothersall commented 4 years ago

@paulhothersall yeah, but you're doing a downsample of 1 stream, right? With 1 output to 1080p? So you're not touching NPP. As soon as you do, say you want to multiplex different downsample resolutions, that card will choke. If I want to downsample a video library, a P400 Quadro does not work unless I am willing to wait a multiple of the time for every video resolution I want to transcode to.

Actually I am currently not using NPP, and instead running it CPU side through various filtergraph stages, which includes a scale for the 1:1 stuff. Threadrippers are insanely cost effective at this with tons of cheap cores / ram bandwidth / PCIe lanes. If you are careful about CPU affinity (RAM/lanes/cores), the closeout deals on 12-16 core 1950x/2920x/2950x made for absolute monsters at ripping frames. The recent AM4 drops mean a 3900x or 3950x in an economy 3 16x (physical) slot ATX board presents a great cost efficiency.

Didn't realise they were that effective at that, maybe will try one day.

To be clear, decode and encode are handled GPU side.
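A pipeline of that shape (decode on the GPU, scale on the CPU, encode on the GPU) might look roughly like the ffmpeg invocation built below. This is only a sketch, assuming an ffmpeg build with the `hevc_cuvid` decoder and `h264_nvenc` encoder enabled; the file names and the helper function are placeholders, not the actual setup described in this thread.

```python
import shlex


def build_transcode_command(src, dst, width=1920, height=1080):
    """Build an ffmpeg argv: NVDEC HEVC decode, CPU scale, NVENC H.264 encode."""
    return [
        "ffmpeg",
        "-c:v", "hevc_cuvid",             # input option: hardware HEVC decoder
        "-i", src,
        "-vf", f"scale={width}:{height}",  # software scaler, runs on the CPU
        "-c:v", "h264_nvenc",              # output option: hardware H.264 encoder
        dst,
    ]


# Placeholder file names, for illustration only.
cmd = build_transcode_command("input_4k.mkv", "output_1080p.mp4")
print(shlex.join(cmd))
```

Because the decoder hands frames back to system memory here, the `scale` step runs on the CPU, which is where the core count and memory bandwidth mentioned above come in.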

Snawoot commented 4 years ago

So, at this moment I can't offer a solution for the problem in the original question. It's still unclear how the NVENC restriction is implemented on mining cards, or whether mining cards have an NVENC IP core at all.

I'm closing this issue because I don't expect to put any active effort into this area, but feel free to comment here if you have useful insights that could help resolve this eventually.

snwarch commented 1 year ago

Umm, I know it's late, but there is a weird bug I've found on this card using Linux that enables NVENC, but changing any setting in OBS breaks it forever unless you do a system reinstall...

I installed a P104-100 8GB card on Ubuntu Unity 22.04 with Nvidia drivers "525.xx", with only this card in the system. A while later I turned off my system and added my old P106-100 into a second PCIe slot. The system detected but didn't use the P106-100, yet for some reason NVENC actually worked. When I changed the GPU from "0" to "1", this broke for that install.

Also, when it did work, OBS said "nvenc" instead of "nvenc/h.264", so I do not know why this works, but it worked similarly in 16.04. So I think it's possible that it's using the laptop edition of NVENC, waiting for the x264 to be run that way, and thus errors out, and adding the second desktop GPU enabled desktop functionality but then broke it? I don't know anything technical, but could it mean it just needs to be forced into desktop mode via a patch?

DTL2020 commented 7 months ago

I would also like to use a P104-100 mining board as a replacement for a GTX 1070, as a dual-NVENC board for my mod of mvtools, to run more DX12 ME frame-pair searches per second for temporal video denoising. With new ideas for refining MVs by providing several shifted pairs of frames to the ME engine and averaging the output, we need another 10x or more performance to get somewhat better quality from hardware-assisted denoising. So does the current state of the driver hack allow using the two onboard NVENC units (if present?) under Windows 10 and DirectX 12 to provide the DirectX 12 Motion Estimation API, the same as on 'standard' GTX cards? I will try to test this with my test application at the board vendor before making a purchase, but the board vendor also has very little experience with NVENC tests on these headless mining boards. P104-100 mining cards are about 1.6 times cheaper than GTX 1070 cards here, so it would be great to use headless mining cards in a motion estimation acceleration rig if everything is working OK now in 2024.

bubbaprog commented 3 months ago

This repository has a patch to enable it on P104s etc. but it is unfortunately only for Windows. https://github.com/dartraiden/NVIDIA-patcher