jocover / jetson-ffmpeg

ffmpeg support on jetson nano

Request for hardware accelerated filtering #92

Open JezausTevas opened 3 years ago

JezausTevas commented 3 years ago

Everyone can agree this project is quite good at what it does so far. Decoding and encoding work great, and the only issues left are in NVIDIA's own bundled libraries. So we finally have something awesome and don't need to look back at gst ever again, right? Well... if you plan on doing anything more than just transcoding between formats, unfortunately you're out of luck. Even simple scaling from 1080p to 720p on a Jetson Nano will pin one of the CPU cores at 80%~100% load.

Tl;dr what I'm trying to say is it would be very nice to use some of that sweet GPU power for filtering the video. Gst has support for hardware-accelerated filtering, so it should be possible to port to ffmpeg. If anyone would be willing to take a crack at this, please start with a scaling filter; something like npp_scale or even decoder-side scaling would be much appreciated.
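For reference, this is the kind of pipeline that hits the bottleneck today (file names are placeholders): decode and encode run on the hardware codecs via nvmpi, but the scale filter runs in software on the CPU, which is what pins a core.

# hardware decode/encode, but scaling happens on the CPU
ffmpeg -c:v h264_nvmpi -i in_1080p.mp4 \
-vf scale=1280:720 \
-c:v h264_nvmpi out_720p.mp4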

cizekmilan commented 3 years ago

I am definitely in favor. I would like something like scale_npp or cuda_yadif (deinterlacing). But whether that's possible, I have no idea.

JezausTevas commented 3 years ago

If gstreamer can do it, so should ffmpeg. In gst, resizing is requested via caps on the video pipeline.
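For illustration, a typical caps-based resize with the NVIDIA GStreamer elements looks roughly like this (a sketch only; element names and exact caps vary between L4T releases):

# decode, scale via nvvidconv + caps, re-encode, all on the hardware blocks
gst-launch-1.0 filesrc location=in_1080p.mp4 ! qtdemux ! h264parse ! \
nvv4l2decoder ! nvvidconv ! \
'video/x-raw(memory:NVMM), width=1280, height=720' ! \
nvv4l2h264enc ! h264parse ! qtmux ! filesink location=out_720p.mp4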

vectronic commented 1 year ago

@JezausTevas wouldn't scale_cuda work for this? https://ffmpeg.org/ffmpeg-filters.html#scale_005fcuda

JezausTevas commented 1 year ago

@vectronic unfortunately the Jetson Nano does not have support for CUDA. If it had, there wouldn't be a need for a custom implementation.

bmegli commented 1 year ago

> please start with a scaling filter; something like npp_scale or even decoder-side scaling would be much appreciated.

> @JezausTevas wouldn't scale_cuda work for this? https://ffmpeg.org/ffmpeg-filters.html#scale_005fcuda

See this fork and pull request:

bmegli commented 1 year ago

> @vectronic unfortunately the Jetson Nano does not have support for CUDA. If it had, there wouldn't be a need for a custom implementation.

See this fork and pull request:

bmegli commented 1 year ago

> it would be very nice to use some of that sweet GPU power for filtering the video

For some hardware-accelerated operations on Jetson see this API reference

Technically it is not the GPU but a dedicated chipset.

The jetson-ffmpeg implementation uses the Video Encoder/Decoder APIs; the Converter is from the same "family".


Apart from that, there is an ISP on Jetson, but it is more for (pre-)processing raw data from the camera.

vectronic commented 1 year ago

Thank you for your replies to my reply :-)

I am still trying to clarify in my head the whole Jetson/ffmpeg scenario...

> @vectronic unfortunately the Jetson Nano does not have support for CUDA. If it had, there wouldn't be a need for a custom implementation.

Jetson Nano does support CUDA in general:

https://docs.nvidia.com/jetson/archives/r35.1/DeveloperGuide/text/AR/JetsonSoftwareArchitecture.html

https://docs.nvidia.com/cuda/archive/10.2/cuda-for-tegra-appnote/index.html

https://developer.nvidia.com/blog/simplifying-cuda-upgrades-for-nvidia-jetson-users/

So does this statement mean "Jetson Nano doesn't support CUDA based frame output from the nvmpi decoder" or "Jetson Nano doesn't have decode implemented in CUDA"?


I am using ffmpeg 6.0 and a fork of jetson-ffmpeg here: https://github.com/Keylost/jetson-ffmpeg

I am successfully using scale_cuda on a Jetson Nano:

ffmpeg -c:v h264_nvmpi -i in.mp4 \
-filter_complex "[0:v]hwupload_cuda[gpu];[gpu]scale_cuda=w=1200:h=1200[scaled];[scaled]hwdownload,format=yuv420p" \
-c:v h264_nvmpi out.mp4

This however suffers from the issue raised here related to excessive memory transfers:

https://github.com/jocover/jetson-ffmpeg/issues/67#issue-792081536
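One rough way to quantify that overhead (a sketch, reusing the same filter chain as above and discarding the output via the null muxer) is to compare decode alone against decode plus the CUDA round trip with ffmpeg's -benchmark option:

# baseline: hardware decode only
ffmpeg -benchmark -c:v h264_nvmpi -i in.mp4 -f null -
# same decode plus upload/scale/download through CUDA
ffmpeg -benchmark -c:v h264_nvmpi -i in.mp4 \
-vf "hwupload_cuda,scale_cuda=w=1200:h=1200,hwdownload,format=yuv420p" \
-f null -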


As far as I can tell, although the Jetson nvmpi codec uses the Jetson's dedicated codec hardware blocks, the output of the decoder/input of the encoder should be usable directly with CUDA. This is due to the iGPU SoC architecture, where the GPU and CPU share the same DRAM.

https://docs.nvidia.com/cuda/archive/10.2/cuda-for-tegra-appnote/index.html#memory-selection

There is an example of decoded frames being used by CUDA without extra copying here:

https://docs.nvidia.com/jetson/l4t-multimedia/l4t_mm_02_video_dec_cuda.html
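For reference, that sample can be built and run on the device itself (a sketch; on recent JetPack releases the samples live under /usr/src/jetson_multimedia_api, older releases used tegra_multimedia_api, and the exact arguments differ per L4T version, so check the sample's README first):

# build and run the linked decode+CUDA sample (paths/arguments may differ per release)
cd /usr/src/jetson_multimedia_api/samples/02_video_dec_cuda
sudo make
./video_dec_cuda in.h264 H264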

Ideally the invocation would be something like:

ffmpeg -c:v h264_nvmpi -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 \
-filter_complex "[0:v]scale_cuda=w=1200:h=1200" \
-c:v h264_nvmpi out.mp4

I am still exploring whether this would be possible by modifying the jetson-ffmpeg nvmpi codec implementation to use NvVideoConverter/EGLImage (?) as per the example code and hand the decoded frames to ffmpeg as CUDA frames (and the reverse for encoding).


Please (!) let me know if I am completely off track...

JezausTevas commented 1 year ago

@vectronic you are correct. Try looking for the GST Jetson code implementation, although it might not be open-sourced by NVIDIA yet. GStreamer does video scaling pretty well on Jetson, but it is just horrific to work with compared to FFmpeg. Also, while using GStreamer I noticed multiple issues with hardware encoding where video frames would randomly contain blocks from previous frames, possibly due to erroneous memory management.

bmegli commented 1 year ago

> Try looking for the GST Jetson code implementation, although it might not be open-sourced by NVIDIA yet

The source for gst-nvvideo4linux2 can be found here

User level NVIDIA GStreamer docs with scaling pipelines are here:

The GStreamer nvvidconv, "VIC" hardware path is the same one I linked in the post above (same hardware, accessed through a different layer of software)

There is also a CUDA path example


The pull request mentioned earlier, with scaling on the decoder,

seems to be using this API

bmegli commented 1 year ago

> Please (!) let me know if I am completely off track...

I believe your understanding is correct.

Some notes at the same time.

Hardware paths

  1. PVA/VIC (dedicated chipset)
  2. iGPU (and CUDA)

See the Jetson AGX Orin technical brief; the Nano also has a VIC.

PVA/VIC is the dedicated hardware for image processing. If it fits the use case, I would prefer it over CUDA for better power efficiency and to leave the GPU free for other tasks. You may access it through the GStreamer nvvidconv element and the Jetson Multimedia API. The jetson-ffmpeg pull request I mentioned earlier seems to be using it for scaling as well.
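A quick, rough way to check which path a given pipeline actually ends up on is to watch the GPU load while it runs (a sketch; the exact fields tegrastats prints differ a bit between releases):

# run the scaling pipeline in one terminal and watch the load in another;
# if GR3D (the GPU) stays near idle while scaling happens, the work is being
# done elsewhere (CPU or the VIC path), not in CUDA
sudo tegrastats --interval 1000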

GStreamer vs FFmpeg

NVIDIA maintains GStreamer support for Jetson.

At the same time, it has changed the APIs a few times, broken older functionality, and introduced performance regressions with the same code on newer platforms/L4T releases.

For FFmpeg, this means the community struggles to keep the support functional.

If GStreamer works for your use case on Jetson, prefer it over FFmpeg. At least NVIDIA takes responsibility for making it work and maintaining it in the long run.

I understand the need for FFmpeg on Jetson here; I need it myself.

madsciencetist commented 1 year ago

Oh, hey guys. I was just working on this. Let me explain.

The Jetson multimedia engine returns decoded frames as a DMA-BUF file descriptor, essentially an NvBuffer, which contains a block-linear NV12 image. Ideally, this hardware handle would be the output, and other Jetson filters would work directly on this NvBuffer. This is not a CUDA device pointer, but it can be mapped to a texture that can almost be read by CUDA kernels (as long as the kernels handle any necessary NV12->NV12_ER conversion manually). The VIC can operate directly on these NvBuffers with no problem, though.

Perhaps, like the cuvid decoder has the -hwaccel_output_format cuda option, we could add -hwaccel_output_format options like nvbuffer, texture, and/or cuda. The latter would be compatible with existing cuda filters, while the former two would be optimal if new filters were created that work directly with the NvBuffer or texture. Those new filters could then use VPI to dispatch to the VIC (or the PVA or GPU, if specified via a filter option).
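To make the idea concrete, a purely hypothetical invocation for the nvbuffer variant might look like this (none of these options or filters exist today; scale_vic is a made-up name for a filter that would dispatch to the VIC):

# hypothetical, not implemented: frames stay in NvBuffers end to end
ffmpeg -c:v h264_nvmpi -hwaccel_output_format nvbuffer -i in.mp4 \
-vf "scale_vic=w=1200:h=1200" \
-c:v h264_nvmpi out.mp4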

The current state though, as you point out, is that nvmpi copies to CPU memory, so a hwupload is required to use cuda filters.

My PR you linked that adds a -resize option to the decoder was easy (I didn't have to add new interfaces) and cheap (I just modified an existing VIC operation). But this only works if all consumers want the resized video, so it's not a complete solution. Ideally we would output the dmabuf/NvBuffer hardware handle, and filters specialized for Jetson platforms would use it.