jazzfool / iced_video_player

Video player component for Iced
Apache License 2.0
127 stars 18 forks

How to improve performance #5

Closed EMUNES closed 1 month ago

EMUNES commented 1 year ago

Dear author, this repo is my only lead for studying video playback in Iced with gstreamer. Thanks a lot for sharing!

It works, but playback can be laggy when the video has a higher resolution like 1920 x 1080. So I wonder whether the problem is in the appsink callback (writing video data to the frame property), or whether Iced has trouble refreshing its Image from the frame data.

Both tools seem to lack debugging facilities (I can't find a way to debug gstreamer from Rust), so I'm asking: do you have any ideas for improving video playback performance based on your code?
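For reference, the pattern I mean (the appsink callback writing the latest frame into shared memory that the UI side then copies out of) looks roughly like this gstreamer-free sketch. The names here are illustrative, not the crate's actual API; the point is the two per-frame copies:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for the appsink `new_sample` callback: copy the decoded frame
// into shared memory (copy #1).
fn write_frame(shared: &Arc<Mutex<Vec<u8>>>, decoded: &[u8]) {
    let mut buf = shared.lock().unwrap();
    buf.clear();
    buf.extend_from_slice(decoded);
}

// Stand-in for the UI side: snapshot the latest frame so it can be
// uploaded as an Image/texture (copy #2).
fn latest_frame(shared: &Arc<Mutex<Vec<u8>>>) -> Vec<u8> {
    shared.lock().unwrap().clone()
}

fn main() {
    let shared = Arc::new(Mutex::new(Vec::new()));
    write_frame(&shared, &[255, 0, 0, 255]); // one "decoded" RGBA pixel
    let pixels = latest_frame(&shared);
    println!("{:?}", pixels); // [255, 0, 0, 255]
}
```

So even before rendering, every frame is copied twice on the CPU, which is what I suspect hurts at 1080p.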

I've updated gstreamer to 0.21 and iced to 0.10 in dependencies.

jazzfool commented 1 year ago

Yes, there are many things suboptimal about this video player. It really comes down to two problems:

So, to improve the performance, you would need to do the following:

That should net you close to the best performance possible.

EMUNES commented 1 year ago

Now that I know about them, I will try those optimizations. Thanks for your reply! That clears up my confusion.

mattmart3 commented 8 months ago

@jazzfool, according to README.md, it looks like some of the performance issues have been fixed (commit e347a9b3249d33d9c9e40bfb8b149c60490e09d8):

  • Decent performance. Skips a lot of the overhead from Iced Image and copies frame data directly to a WGPU texture, and renders using a custom WGPU render pipeline. For a very subjective reference, I can play back 1080p HEVC video with hardware decoding without hitches, in debug mode.

Can you confirm this? Is the minimal example using hardware decoding by default? Are there any other performance issues still to be aware of?

Anyway this repository is of great help, thanks!

jazzfool commented 7 months ago

From my testing, yes, performance is a lot better than when I last wrote. My earlier points still stand for squeezing out more performance, but with hardware decoding (which seems to work by default now) it's very usable.

mattmart3 commented 7 months ago

Thanks @jazzfool, I am also able to run it with hardware decoding now, but I still face some performance issues. I'll leave the walkthrough and results of my tests here.

Whether hardware acceleration is used now depends entirely on the system's gstreamer setup: its plugins and the underlying video driver.

By enabling gstreamer logs I could see that I was using the avdec_h264 decoder, which to my understanding is a software decoder:

$ GST_DEBUG=2,videodecoder:INFO cargo run --example minimal
    Finished dev [unoptimized + debuginfo] target(s) in 0.12s
     Running `target/debug/examples/minimal`
0:00:00.102594958  9109 0x785ee0000f10 INFO            videodecoder gstvideodecoder.c:1631:gst_video_decoder_sink_event_default:<avdec_h264-0> upstream tags: taglist, video-codec=(string)"H.264\ /\ AVC", container-specific-track-id=(string)1, bitrate=(uint)5152237;
...

I was testing this on a PC with an Nvidia GTX 1060 GPU, Arch Linux, and nvidia drivers 550.54.14. HW acceleration should be handled by the NVDEC/NVENC codecs via the nvcodec gstreamer plugin.

However, inspecting the plugin with gst-inspect-1.0 nvcodec listed no features at all.

This was because I was missing the cuda package. After installing it I also had to clear the gstreamer cache:

$ rm -r ~/.cache/gstreamer-1.0

After this, gst-inspect-1.0 properly showed the encoder and decoder features:

$ gst-inspect-1.0 nvcodec
Plugin Details:
  Name                     nvcodec
  Description              GStreamer NVCODEC plugin
  Filename                 /usr/lib/gstreamer-1.0/libgstnvcodec.so
  Version                  1.24.0
  License                  LGPL
  Source module            gst-plugins-bad
  Documentation            https://gstreamer.freedesktop.org/documentation/nvcodec/
  Source release date      2024-03-04
  Binary package           Arch Linux GStreamer 1.24.0-1
  Origin URL               https://www.archlinux.org/

  cudaconvert: CUDA colorspace converter
  cudaconvertscale: CUDA colorspace converter and scaler
  cudadownload: CUDA downloader
  cudaipcsink: CUDA IPC Sink
  cudaipcsrc: CUDA IPC Src
  cudascale: CUDA video scaler
  cudaupload: CUDA uploader
  nvautogpuh264enc: NVENC H.264 Video Encoder Auto GPU select Mode
  nvautogpuh265enc: NVENC H.265 Video Encoder Auto GPU select Mode
  nvcudah264enc: NVENC H.264 Video Encoder CUDA Mode
  nvcudah265enc: NVENC H.265 Video Encoder CUDA Mode
  nvh264dec: NVDEC H.264 Decoder
  nvh264enc: NVENC H.264 Video Encoder
  nvh265dec: NVDEC H.265 Decoder
  nvh265enc: NVENC HEVC Video Encoder
  nvjpegdec: NVDEC jpeg Video Decoder
  nvjpegenc: NVIDIA JPEG Encoder
  nvmpeg2videodec: NVDEC mpeg2video Video Decoder
  nvmpeg4videodec: NVDEC mpeg4video Video Decoder
  nvmpegvideodec: NVDEC mpegvideo Video Decoder
  nvvp9dec: NVDEC VP9 Decoder

  21 features:
  +-- 21 elements

and the minimal example then used the nvh264dec hardware decoder:

GST_DEBUG=2,videodecoder:INFO cargo run --example minimal
    Finished dev [unoptimized + debuginfo] target(s) in 0.12s
     Running `target/debug/examples/minimal`
0:00:01.314568227 10681 0x77b5fc000f10 INFO            videodecoder gstvideodecoder.c:1631:gst_video_decoder_sink_event_default:<nvh264dec0> upstream tags: taglist, video-codec=(string)"H.264\ /\ AVC", container-specific-track-id=(string)1, bitrate=(uint)5152237;
...

However, despite using the hardware decoder, there is still a performance issue: I get about 80% CPU in both tests (with and without hardware decoding), whereas if I play the video directly with gstreamer:

gst-launch-1.0 playbin uri=file:///$(pwd)/.media/test.mp4

it says it uses the nvh264dec decoder and takes about 20% CPU.

@jazzfool, would you expect such a difference? Is it due to what you were referring to in your point above?

> If hardware decoding is used (which on my system it is not, but this can probably be fixed by enabling the VA-API GStreamer plugin), then you face the issue of getting the image memory to somewhere that Iced can render it. It's quite a performance nightmare in terms of GPU-host synchronization, host-visible memory, image layout transitions, etc.

The same video played by mpv takes about 10% CPU, and it also says it uses the nvdec hw decoder.

On another PC with an Intel Celeron N3350 (dual-core CPU with Intel HD Graphics 500) I had to install intel-media-driver and gst-plugin-va, which lets gstreamer use VA-API and thus the underlying Intel hardware codec. After this the minimal example started to use the vah264dec hardware codec; however, performance was even worse in this case: higher CPU usage (~130% compared to ~100% without the hw decoder) and video lagging (there was no lagging without the hw decoder). I think that in this case something is also wrong with gstreamer itself, because even if I play the video directly with gstreamer:

gst-launch-1.0 playbin uri=file:///$(pwd)/.media/test.mp4

it says it's using the vah264dec hw decoder, but I still see high CPU usage (~80%) and video lagging.

However if I play the same video with mpv:

mpv --hwdec=auto .media/test.mp4

it still says it's using hardware decoding (vaapi, so it should be the same underlying codec), but I only get 20% CPU usage and smooth video playback.

I also tried to force the Vulkan decoder by setting the env var GST_PLUGIN_FEATURE_RANK=vulkanh264dec:MAX, but on both machines it fails during initialization (even though gst-inspect-1.0 vulkan shows the decoder), so gstreamer falls back to the other available decoders.

jazzfool commented 7 months ago

Thanks for the detailed tests! Yes, I would expect higher CPU usage, per my earlier points. MPV and gst-launch almost certainly skip the CPU overhead by keeping everything on the GPU when using hw decoding. However, I would not expect a difference on the order of 20% vs 80%. I tested MPV and gst-launch against the minimal example in a release build and saw closer to 2-3% vs 5-6%.

I suspect something in the gstreamer/system configuration is causing such a big difference - perhaps how the gstreamer sink pipeline is set up. It's hard to reproduce this myself; I would want to capture some profiles and look at what gstreamer is doing in more detail.

VanderBieu commented 4 months ago

I have noticed a significant performance gap between the OpenGL renderer provided by gstreamer's autovideosink plugin and this iced video player: the latter has higher latency and consumes more resources. Is it possible to close the gap?

jazzfool commented 4 months ago

I found a small bug whose fix should improve performance slightly. Regarding overall performance, though, I did find the source of the issue, i.e. why it ends up slower than e.g. gst-launch:

Video frames are usually encoded in a YUV colour space, to help with spatial compression. The problem is that converting YUV to RGBA is not a simple operation, and in this case it is being performed on the CPU (by the 'videoconvert' plugin). Now that wouldn't really be a problem if we could just accept the frames in YUV and place them into a YUV WGPU texture (NV12) so that the conversion can be done on the GPU - but... NV12 textures need the NV12 feature gate when creating the WGPU device, and Iced does not let us select the features we want.

With GPU colour space conversion I anticipate that CPU usage% would drop by roughly 15-20% (from my local tests). The rest of the CPU usage comes from write_texture (i.e., copy CPU memory to GPU texture) and I see no simple way to reduce that.
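To make that cost concrete, the per-pixel math videoconvert is doing on the CPU looks roughly like this. This is only a sketch assuming BT.601 full-range coefficients; the matrix gstreamer actually negotiates may be BT.709 and/or limited range:

```rust
// Illustrative only: one pixel of the YUV -> RGB transform that
// `videoconvert` applies across the whole frame, every frame, on the CPU.
// Coefficients are BT.601 full-range (an assumption, see above).
fn yuv_to_rgb(y: u8, u: u8, v: u8) -> (u8, u8, u8) {
    let (y, u, v) = (y as f32, u as f32 - 128.0, v as f32 - 128.0);
    let r = y + 1.402 * v;
    let g = y - 0.344136 * u - 0.714136 * v;
    let b = y + 1.772 * u;
    let clamp = |x: f32| x.round().clamp(0.0, 255.0) as u8;
    (clamp(r), clamp(g), clamp(b))
}

fn main() {
    // Neutral chroma (U = V = 128) maps to grey.
    assert_eq!(yuv_to_rgb(128, 128, 128), (128, 128, 128));
    assert_eq!(yuv_to_rgb(255, 128, 128), (255, 255, 255));
    println!("ok");
}
```

Running this multiply-add over every pixel of a 1080p frame at 30+ fps is exactly the work that moving the conversion to the GPU would remove.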

Looking into the future, the biggest leap forward would be https://github.com/gfx-rs/wgpu/issues/2330, but there's no sign of that feature any time soon.

VanderBieu commented 4 months ago

> I found a small bug which should improve performance slightly. However, regarding the performance overall, I did find the source of the issue as to why performance ends up being slower than e.g., gst-launch:
>
> Video frames are usually encoded in a YUV colour space, to help with spatial compression. The problem is that converting YUV to RGBA is not a simple operation, and in this case is being performed on the CPU (by the 'videoconvert' plugin). Now that wouldn't really be a problem if we could just accept the frames in YUV then place it into a YUV WGPU texture (NV12) so that the conversion can be done on the GPU - but... NV12 textures need the NV12 feature gate when creating the WGPU device, and Iced does not let us select features we want.
>
> With GPU colour space conversion I anticipate that CPU usage% would drop by roughly 15-20% (from my local tests). The rest of the CPU usage comes from write_texture (i.e., copy CPU memory to GPU texture) and I see no simple way to reduce that.
>
> Looking into the future, the biggest leap forward would be gfx-rs/wgpu#2330, but there's no sign of that feature any time soon.

Thanks for your reply. I profiled my pipeline and colour conversion did account for a huge part of the running time (around 80% of the total on my M1 Pro MacBook). By the way, I'm curious why you consider video decoding in WGPU the biggest leap forward: is your point that if decoding lands in WGPU we can discard the gstreamer pipeline and so eliminate the unnecessary memory copies?

jazzfool commented 4 months ago

That's right. If WGPU implements the native video decoding extensions for each API then that would result in almost no overhead since the memory doesn't need to move anywhere. The next best thing would be if WGPU implemented external memory extensions (VK_KHR_external_memory or equivalent) so that decoding is done in e.g., OpenGL but the texture memory can be imported as e.g., a VkImage.

For now I may investigate compute shaders as an alternative for speeding up the YUV -> RGB conversion.

VanderBieu commented 4 months ago

> That's right. If WGPU implements the native video decoding extensions for each API then that would result in almost no overhead since the memory doesn't need to move anywhere. The next best thing would be if WGPU implemented external memory extensions (VK_KHR_external_memory or equivalent) so that decoding is done in e.g., OpenGL but the texture memory can be imported as e.g., a VkImage.
>
> For now I may investigate compute shaders as an alternative for speeding up the YUV -> RGB conversion.

But to my knowledge gstreamer's appsink cannot return GPU memory (D3D, OpenGL, CUDA), so it seems we would need to rewrite almost the entire gstreamer pipeline in Rust to eliminate all the unnecessary memory copies. That sounds like a hell of a lot of work.

jazzfool commented 4 months ago

Regarding the external memory extensions: gstreamer actually does expose glimagesink, vulkansink, and d3d11videosink. Of course, in the interest of supporting interop with all WGPU backends, you'd want to pick the most portable one. Whether the gstreamer pipeline itself internally decodes on the GPU is another question. Though, to be honest, instead of using glimagesink at that point I would consider switching entirely from gstreamer to libmpv.

jazzfool commented 1 month ago

I have implemented hardware accelerated NV12 to RGB conversion that does not rely on the WGPU feature gate in 9d60f26.

With that, CPU usage has been reduced by around 30-40%. From my testing, CPU usage is now comparable with other video players. At this point the only further CPU-side optimization that could be made is zero-copy frames (currently it copies GPU to CPU to GPU), but without changes in wgpu that is not currently possible.

As such, I will be closing this issue.