rigaya / QSVEnc

Performance experiments with high-speed encoding using QSV
http://rigaya34589.blog135.fc2.com/blog-category-10.html

Unlimited multi GPU support #124

Closed: ZSC2017IM closed this issue 1 month ago

ZSC2017IM commented 1 year ago

Hi rigaya, QSVEnc currently supports up to 4 GPUs; if you specify "--device 5", an error is reported. I plan to build a server with 6, 8, or more GPUs. Will you add unlimited multi GPU support in the future?

Simply being able to specify a device is sufficient for basic needs. Further, I saw the following description of "load balancing" in your NVEncC build. Will it be added to QSVEncC in the future?

https://github.com/rigaya/NVEnc/blob/master/NVEncC_Options.en.md#--check-device

"device with lower Video Engine Utilization will be favored"
"device with lower GPU Utilization will be favored"

ZSC2017IM commented 1 year ago

NVEncC gets GPU load information from NVML or nvidia-smi. I don't know if Intel has a similar API, but at least I can see the load of each codec engine in HWiNFO.
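
For comparison, the public NVML API reads per-engine load roughly like this (a minimal sketch I put together from the NVML docs, untested; I'm not claiming this is what NVEncC actually does):

/* Minimal NVML sketch (for comparison only; not necessarily what
   NVEncC does). Compile against nvml.h and link libnvidia-ml. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);
    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) continue;
        unsigned int enc = 0, dec = 0, periodUs = 0;
        nvmlDeviceGetEncoderUtilization(dev, &enc, &periodUs); /* NVENC load */
        nvmlDeviceGetDecoderUtilization(dev, &dec, &periodUs); /* NVDEC load */
        printf("GPU %u: encoder %u%%, decoder %u%%\n", i, enc, dec);
    }
    nvmlShutdown();
    return 0;
}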

rigaya commented 1 year ago

Will you add unlimited multi GPU support in the future?

I think I'll be able to add support for 5 or more GPUs in the future. But I must say I won't be able to actually test it, as I don't have a system with that many Intel GPUs.

Further, I saw the following description of "load balancing" in your NVEncC build. Will it be added to QSVEncC in the future?

Actually, the current NVEncC gets GPU load information from Windows performance counters.

QSVEncC is also able to get GPU utilization from Windows performance counters, and can select a GPU automatically when possible, depending on the parameters.
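
For illustration, reading the Windows "GPU Engine" counters through the PDH API looks roughly like this (a minimal sketch, not the actual QSVEncC code; the wildcard counter path is an assumption, and the per-process instances would still need to be summed per adapter in a real implementation):

/* Minimal PDH sketch (not the actual QSVEncC code). The "GPU Engine"
   counter path is an assumption; instances are per-process. */
#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#include <stdlib.h>
#pragma comment(lib, "pdh.lib")

int main(void) {
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    if (PdhOpenQueryW(NULL, 0, &query) != ERROR_SUCCESS) return 1;
    if (PdhAddEnglishCounterW(query,
            L"\\GPU Engine(*engtype_VideoEncode)\\Utilization Percentage",
            0, &counter) != ERROR_SUCCESS) return 1;
    PdhCollectQueryData(query);
    Sleep(1000); /* rate counters need two samples */
    PdhCollectQueryData(query);

    DWORD bufSize = 0, itemCount = 0;
    PdhGetFormattedCounterArrayW(counter, PDH_FMT_DOUBLE, &bufSize, &itemCount, NULL);
    PDH_FMT_COUNTERVALUE_ITEM_W *items = malloc(bufSize);
    if (items && PdhGetFormattedCounterArrayW(counter, PDH_FMT_DOUBLE,
            &bufSize, &itemCount, items) == ERROR_SUCCESS) {
        for (DWORD i = 0; i < itemCount; i++)
            wprintf(L"%ls: %.1f%%\n", items[i].szName, items[i].FmtValue.doubleValue);
    }
    free(items);
    PdhCloseQuery(query);
    return 0;
}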

ZSC2017IM commented 1 year ago

Thank you for your continuous improvement of the software! Of course, users shouldn't expect developers of free software to purchase devices just for testing. Perhaps I will test and report back after the new version is released. Do you mean that I can use "auto", or simply omit "--device", to achieve optimal efficiency? I have seen the phrase "device auto = device 1" in the issues of this project or somewhere else, but I can't recall the original link. If QSVEncC does have the same load-balancing optimization strategy, perhaps you could update the documentation.

Saentist commented 1 year ago

Under Linux, monitoring is simple with https://github.com/Syllo/nvtop (though Intel GPU memory monitoring is problematic).

rigaya commented 1 year ago

I've found a problem while checking the oneVPL SDK: it might not support 5 or more GPUs, as it only defines implementations up to the 4th device. https://spec.oneapi.io/onevpl/latest/API_ref/VPL_enums.html#mfximpl

enum  {
    MFX_IMPL_AUTO         = 0x0000,  /*!< Auto Selection/In or Not Supported/Out. */
    MFX_IMPL_SOFTWARE     = 0x0001,  /*!< Pure software implementation. */
    MFX_IMPL_HARDWARE     = 0x0002,  /*!< Hardware accelerated implementation (default device). */
    MFX_IMPL_AUTO_ANY     = 0x0003,  /*!< Auto selection of any hardware/software implementation. */
    MFX_IMPL_HARDWARE_ANY = 0x0004,  /*!< Auto selection of any hardware implementation. */
    MFX_IMPL_HARDWARE2    = 0x0005,  /*!< Hardware accelerated implementation (2nd device). */
    MFX_IMPL_HARDWARE3    = 0x0006,  /*!< Hardware accelerated implementation (3rd device). */
    MFX_IMPL_HARDWARE4    = 0x0007,  /*!< Hardware accelerated implementation (4th device). */
    MFX_IMPL_RUNTIME      = 0x0008,  /*!< This value cannot be used for session initialization. It may be returned by the MFXQueryIMPL
                                          function to show that the session has been initialized in run-time mode. */
    MFX_IMPL_VIA_ANY      = 0x0100,  /*!< Hardware acceleration can go through any supported OS infrastructure. This is the default value. The default value
                                          is used by the legacy Intel(r) Media SDK if none of the MFX_IMPL_VIA_xxx flags are specified by the application. */
    MFX_IMPL_VIA_D3D9     = 0x0200,  /*!< Hardware acceleration goes through the Microsoft* Direct3D* 9 infrastructure. */
    MFX_IMPL_VIA_D3D11    = 0x0300,  /*!< Hardware acceleration goes through the Microsoft* Direct3D* 11 infrastructure. */
    MFX_IMPL_VIA_VAAPI    = 0x0400,  /*!< Hardware acceleration goes through the Linux* VA-API infrastructure. */
    MFX_IMPL_VIA_HDDLUNITE     = 0x0500,  /*!< Hardware acceleration goes through the HDDL* Unite*. */

    MFX_IMPL_UNSUPPORTED  = 0x0000  /*!< One of the MFXQueryIMPL returns. */
};

Therefore, I'm not sure if it works for more than 4 GPUs.

ZSC2017IM commented 1 year ago

Sad. Will the "New Model to Work with Hardware Acceleration" (https://spec.oneapi.io/onevpl/latest/programming_guide/VPL_prg_hw.html) fix it? Otherwise, I'll just have to pray that Intel updates it. Thanks!
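
If I read that page correctly, the new 2.x dispatcher enumerates implementations by a zero-based index instead of the fixed MFX_IMPL_HARDWARE2..4 values, so the cap might go away. A rough, untested sketch based on the spec:

/* Rough sketch of the oneVPL 2.x dispatcher model (untested, based on
   the spec): implementations are enumerated by index, not by enum. */
#include <stdio.h>
#include <vpl/mfx.h>

int main(void) {
    mfxLoader loader = MFXLoad();
    if (!loader) return 1;

    /* Filter to hardware implementations only. */
    mfxConfig cfg = MFXCreateConfig(loader);
    mfxVariant impl;
    impl.Type = MFX_VARIANT_TYPE_U32;
    impl.Data.U32 = MFX_IMPL_TYPE_HARDWARE;
    MFXSetConfigFilterProperty(cfg, (const mfxU8 *)"mfxImplDescription.Impl", impl);

    /* Enumerate every implementation the dispatcher can find. */
    for (mfxU32 i = 0;; i++) {
        mfxImplDescription *desc = NULL;
        if (MFXEnumImplementations(loader, i, MFX_IMPLCAPS_IMPLDESCSTRUCTURE,
                                   (mfxHDL *)&desc) != MFX_ERR_NONE)
            break;
        printf("impl %u: %s (device %s)\n", i, desc->ImplName, desc->Dev.DeviceID);
        MFXDispReleaseImplDescription(loader, desc);
    }

    /* Create a session on, say, the 5th implementation (index 4). */
    mfxSession session = NULL;
    if (MFXCreateSession(loader, 4, &session) == MFX_ERR_NONE)
        MFXClose(session);

    MFXUnload(loader);
    return 0;
}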

quamt commented 1 year ago

@ZSC2017IM How many GPUs do you have currently running?

I am asking because I only have one A770, but I am able to run two QSVEnc instances on the same GPU, due to its two encoding engines (I assume). This lets me encode two videos (with separate settings) at the same time. The bottleneck for me is my NVMe/SATA connection, but there is near to no loss in FPS: two encodes with the same settings will run at nearly the same FPS.

You might want to give it a try.

ZSC2017IM commented 1 year ago

@quamt I have 3 Intel GPUs: 2x DG1 (each with 2 decoding/encoding engines) and the UHD 630 of my 8700K.

The bottleneck for me is my NVMe/SATA connection, but there is near to no loss in FPS: two encodes with the same settings will run at nearly the same FPS.

I also have 4 NVIDIA GPUs: Tesla P4s (each with 1 decoding and 2 encoding engines). I can run a .bat script to transcode 8 or 16 videos at the same time, with very low CPU usage (0-3% per video) on the 8700K. So I won't be limited by the 4K random read/write speed of an HDD, let alone an SSD. For Intel QSV, I think the bottleneck is PCIe bandwidth and the CPU, though I am not sure whether Intel, QSVEncC, or FFmpeg is responsible. During transcoding, the CPU usage and PCIe bandwidth usage on my Intel GPUs are much higher than on my NVIDIA GPUs.

I have reviewed the NVIDIA technical documentation (https://developer.nvidia.com/blog/nvidia-ffmpeg-transcoding-guide/), which mentions: "Adding the -hwaccel cuvid option means the raw decoded frames will not be copied and the transcoding will be faster and use less system resources". I think that's the reason. But I'm not sure whether it's the Intel GPU hardware, the Intel driver, or FFmpeg/QSVEncC that doesn't do this.

It should be noted that CPU and PCIe usage are positively correlated with transcoding FPS: the simpler the video (e.g. 360p at 200 kbps), the higher the transcoding speed (hundreds of FPS) and thus the higher the CPU and PCIe usage. If you don't have any low-bitrate videos in your workload, this may not be a problem; otherwise, given the above, I think the bottleneck in your system (or any Intel GPU-based transcoding system) may also be the CPU. I would appreciate it if you could run a test on the A770.

quamt commented 1 year ago

@ZSC2017IM What kind of test? The one that you just opened in issue #126?

ZSC2017IM commented 1 year ago

@quamt Yes. If we use consistent testing, the results will be comparable.

quamt commented 1 year ago

@ZSC2017IM I can try to run the test. The only problem might be that I'm using the A770 with an AMD CPU, and the hypermode won't work there if I recall correctly. I have an Intel system, but it would be complicated to swap the GPU between the systems.

ZSC2017IM commented 1 year ago

the hypermode won't work there if I recall correctly.

@quamt The hypermode also won't work on my system, so I think it doesn't matter.

rigaya commented 1 year ago

QSVEnc 7.36 removes the limit from the --device option on the application side. There might still be a limitation on the SDK side, but unfortunately I won't be able to test it.

rigaya commented 1 month ago

I'll close this issue, as the update has already been applied to QSVEnc.