Open elmuz opened 1 year ago
These are the specifications of the video, as reported by ffprobe:
ffprobe version 5.1.3 Copyright (c) 2007-2022 the FFmpeg developers
built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
configuration: --prefix=/usr/local/ --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --nvccflags='-gencode arch=compute_75,code=sm_75 -O2' --disable-doc --disable-static --enable-gnutls --enable-shared --enable-gpl --enable-nonfree --enable-libfdk-aac --enable-libmp3lame --enable-libopus --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-cuda-nvcc --enable-nvenc --enable-cuvid --enable-libnpp --enable-nvdec
libavutil 57. 28.100 / 57. 28.100
libavcodec 59. 37.100 / 59. 37.100
libavformat 59. 27.100 / 59. 27.100
libavdevice 59. 7.100 / 59. 7.100
libavfilter 8. 44.100 / 8. 44.100
libswscale 6. 7.100 / 6. 7.100
libswresample 4. 7.100 / 4. 7.100
libpostproc 56. 6.100 / 56. 6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'tests/files/lagarde.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf59.27.100
Duration: 00:13:36.00, start: 0.000000, bitrate: 883 kb/s
Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, bt709, progressive), 480x480 [SAR 1:1 DAR 1:1], 750 kb/s, 25 fps, 25 tbr, 12800 tbn (default)
Metadata:
handler_name : VideoHandler
vendor_id : [0][0][0][0]
encoder : Lavc59.37.100 libx264
Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 127 kb/s (default)
Metadata:
handler_name : SoundHandler
vendor_id : [0][0][0][0]
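The stream parameters above (25 fps, 12800 tbn, 00:13:36.00 duration) give a simple mapping between seek times, frame indices, and presentation timestamps. This assumes a constant frame rate and timestamps starting at zero, which the output suggests but does not guarantee. The following is a minimal sketch; the helper names are illustrative and not part of any library.

```python
from fractions import Fraction

# Values read from the ffprobe output above.
FPS = Fraction(25)              # "25 fps"
TIME_BASE = Fraction(1, 12800)  # "12800 tbn" means one tick is 1/12800 s
DURATION_S = 13 * 60 + 36       # "00:13:36.00" (container duration, approximate)


def frame_index_at(seconds):
    """Index of the frame displayed at `seconds` (assumes constant frame rate)."""
    return int(Fraction(seconds) * FPS)


def pts_of_frame(index):
    """Presentation timestamp of frame `index`, in time-base ticks (assumes pts starts at 0)."""
    return int(Fraction(index) / FPS / TIME_BASE)


# Expected number of frames under the same assumptions.
TOTAL_FRAMES = int(DURATION_S * FPS)
```

Comparing the frame returned after a seek against `frame_index_at(t)` is one way to check whether a backend lands where it should.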
Another inconsistency: `VideoReader.get_metadata()` returns a dict. The key "fps" is a float for "cuda", while it is a List[float] for "video_reader".

Hi @elmuz, thanks for the reports. To be completely transparent, the video decoder (and in particular the GPU video decoder) is still in Beta stage, and we acknowledge that there are a number of bugs and edge cases that aren't completely covered yet. We're still trying to figure out the level of support we can provide for these, and hopefully we'll be able to provide a suitable alternative soon.
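Regarding the "fps" type mismatch reported above, calling code can normalize the value until the backends agree. The following is a minimal sketch; `normalize_fps` is a hypothetical helper, not a torchvision function, and it assumes that for a list the first entry is the stream of interest.

```python
from typing import Sequence, Union


def normalize_fps(value: Union[float, int, Sequence[float]]) -> float:
    """Return a single float whether `value` is a scalar or a list of floats."""
    if isinstance(value, (int, float)):
        # Shape reported for the "cuda" backend in this issue.
        return float(value)
    values = list(value)
    if not values:
        raise ValueError("fps list is empty")
    # Shape reported for the "video_reader" backend: take the first stream
    # (an assumption; adjust if several video streams are present).
    return float(values[0])
```

The caller extracts the "fps" entry from the metadata dict and passes it through this helper, so downstream code sees one type regardless of the backend.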
Hey, thanks. I understand video decoding is a hard topic, especially since there is no reference or de facto way of doing things, as there is on the audio side. However, on this topic I see many points of contact between torchvision, torchaudio, and even Nvidia DALI. Unfortunately, at the moment unpacking a video catalog into frames is still the smoothest option (as long as you have enough memory to hold them). Otherwise, it's a pain...
🐛 Describe the bug
I want to exploit the CUDA backend for the new VideoReader object. However, I believe it doesn't work as expected. In particular, I noticed the following:
- With the `video_reader` backend, frames are returned as `[C, H, W]`, while with `cuda` they are returned as `[H, W, C]`. I believe this isn't expected.
- Color space conversion is not consistent (the "cuda" output looks a little more yellowish). The following are the 0-th frame of the video decoded with the two backends: [image: this is from CPU decoding] and [image: this is from CUDA decoding].
- Random seek is buggy in CUDA, while it works fine on CPU (although it takes quite some time). The output is really strange.
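For the layout mismatch in the first point above, a caller can compute the axis order needed to bring either backend's frames into `[C, H, W]`. This is a minimal sketch; the helper names and the `BACKEND_LAYOUT` table are illustrative, and the table only records the layouts observed in this report, not a documented contract.

```python
from typing import Tuple


def permute_order(src: str, dst: str = "CHW") -> Tuple[int, ...]:
    """Return the axis order that rearranges layout `src` into layout `dst`."""
    if sorted(src) != sorted(dst) or len(set(src)) != len(src):
        raise ValueError(f"incompatible layouts: {src!r} -> {dst!r}")
    # For each axis wanted in the output, find where it sits in the input.
    return tuple(src.index(axis) for axis in dst)


# Layouts observed in this report (assumption: fixed per backend).
BACKEND_LAYOUT = {"video_reader": "CHW", "cuda": "HWC"}


def order_for_backend(backend: str) -> Tuple[int, ...]:
    """Axis order that converts a frame from `backend` into [C, H, W]."""
    return permute_order(BACKEND_LAYOUT[backend], "CHW")
```

With a PyTorch tensor, the result can be passed to `Tensor.permute`, for example `frame.permute(*order_for_backend("cuda"))`.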
You can try to reproduce these results using the following script:
As a side note I can comment that `torchaudio.io.StreamReader` using the `cuvid` decoder as per this tutorial.

Versions