pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[Bug?] Lack of frames using torchvision.io.video_read #2490

Closed JuanFMontesinos closed 4 years ago

JuanFMontesinos commented 4 years ago

🐛 Bug

Hi, I’ve realized that torchvision, as well as other libraries such as skvideo and opencv, retrieves fewer frames than ffmpeg. I found this happens only for some videos.

Context: I have a re-encoded dataset of videos at 25.0 FPS. Re-encoding was done via ffmpeg.

The recording (.mkv) contains an audio stream and a video stream. Both streams have the same duration (according to the metadata reported by ffprobe), and the audio stream's duration matches the one stated in the metadata.

Extracting frames via the unix command line with ffmpeg yields the expected number of frames (3688 for the given example video): ffmpeg -i /media/jfm/Slave/SkDataset/videos/cello/1u3yHICR_BU.mkv %05d.bmp

Extracting frames with other libraries such as skvideo or opencv obtains only 3537 frames. My knowledge of the internals of these libraries is limited. I verified that the torchvision reader is not discarding frames with negative timestamps (that seems not to be the case).

I found a library which can extract the proper number of frames: imageio. However, its reader also counts only 3537 frames (while actually reading 3688).

To Reproduce

Video example to reproduce the issue: https://drive.google.com/file/d/1DIRsDf1SrLOTGbVejoL-PEIlxDPP0LMC/view?usp=sharing

from imageio import get_reader, mimread
from torchvision.io import read_video

PATH = '/media/jfm/Slave/SkDataset/videos/cello/1u3yHICR_BU.mkv'

torchvision_video, torchvision_audio, info = read_video(PATH, pts_unit='sec')

# Expected duration
dur = torchvision_audio.shape[1] / info['audio_fps']
mins = dur // 60  # avoid shadowing the built-in min()
secs = dur % 60
print('Expected duration: %d min, %d sec'%(mins,secs))
print('Expected amount of frames %d'%int(dur*25))
reader = get_reader(PATH)
print('Expected frames by different readers %d'%reader.count_frames())
print('Frames obtained by torchvision: %d '%torchvision_video.shape[0])
imageio_video = mimread(PATH, memtest=False)
print('Frames obtained by imageio: %d' %len(imageio_video))
print('')
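The duration-to-frame arithmetic used in the script above can be isolated into a small helper. This is a pure-Python sketch; `expected_frames` is an illustrative name, and the sample count below is derived from the ~147.538 s duration reported later in this thread, not read from the file:

```python
def expected_frames(num_audio_samples, audio_fps, video_fps):
    """Frame count implied by the audio stream's duration, as computed above."""
    duration_sec = num_audio_samples / audio_fps  # audio samples / sample rate
    return int(duration_sec * video_fps)

# ~147.538 s of 48 kHz audio, nominal 25 fps video:
print(expected_frames(7_081_824, 48_000, 25))  # → 3688
```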

Environment

Torchvision version: 0.5.0
Imageio version: 2.5.0

bjuncek commented 4 years ago

Hi @JuanFMontesinos , thanks for raising this!

Are you using the video_reader backend or the pyav backend for this? If it's not too much of a bother, could you try using the former?

From what I see, it seems like libav's probing is, for one reason or another, missing the timestamps of these frames - it sees 3537 frames, so it returns only that. Both pyav and CV2 use libav's probing for optimization purposes, so that might be the root of the issue, and I believe we have some fixes in place in the video_reader backend.

JuanFMontesinos commented 4 years ago

Hi, BTW I liked your paper Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization.

So, a bit more hands-on: according to this post by fmassa, https://github.com/pytorch/vision/issues/2216, the video_reader backend requires compiling from source. There is more info about video_reader here: https://github.com/pytorch/vision/releases/tag/v0.4.2

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/home/jfm/vision/torchvision/csrc -I/home/jfm/.local/lib/python3.6/site-packages/torch/include -I/home/jfm/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/jfm/.local/lib/python3.6/site-packages/torch/include/TH -I/home/jfm/.local/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c /home/jfm/vision/torchvision/csrc/vision.cpp -o build/temp.linux-x86_64-3.6/home/jfm/vision/torchvision/csrc/vision.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
In file included from /home/jfm/vision/torchvision/csrc/vision.cpp:14:0:
/home/jfm/vision/torchvision/csrc/ROIAlign.h: In function ‘at::Tensor roi_align(const at::Tensor&, const at::Tensor&, double, int64_t, int64_t, int64_t, bool)’:
/home/jfm/vision/torchvision/csrc/ROIAlign.h:28:25: error: ‘class c10::Dispatcher’ has no member named ‘findSchemaOrThrow’; did you mean ‘findSchema’?
                        .findSchemaOrThrow("torchvision::roi_align", "")
                         ^~~~~~~~~~~~~~~~~
                         findSchema
/home/jfm/vision/torchvision/csrc/ROIAlign.h:29:31: error: expected primary-expression before ‘decltype’
                        .typed<decltype(roi_align)>();
                               ^~~~~~~~
/home/jfm/vision/torchvision/csrc/ROIAlign.h: In function ‘at::Tensor ROIAlign_autocast(const at::Tensor&, const at::Tensor&, double, int64_t, int64_t, int64_t, bool)’:
/home/jfm/vision/torchvision/csrc/ROIAlign.h:49:14: error: ‘ExcludeDispatchKeyGuard’ is not a member of ‘c10::impl’
   c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
              ^~~~~~~~~~~~~~~~~~~~~~~
/home/jfm/vision/torchvision/csrc/ROIAlign.h:49:14: note: suggested alternative: ‘ExcludeTensorTypeIdGuard’
   c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
              ^~~~~~~~~~~~~~~~~~~~~~~
              ExcludeTensorTypeIdGuard
/home/jfm/vision/torchvision/csrc/ROIAlign.h: In function ‘at::Tensor _roi_align_backward(const at::Tensor&, const at::Tensor&, double, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool)’:
/home/jfm/vision/torchvision/csrc/ROIAlign.h:76:12: error: ‘class c10::Dispatcher’ has no member named ‘findSchemaOrThrow’; did you mean ‘findSchema’?
           .findSchemaOrThrow("torchvision::_roi_align_backward", "")
            ^~~~~~~~~~~~~~~~~
            findSchema
/home/jfm/vision/torchvision/csrc/ROIAlign.h:77:18: error: expected primary-expression before ‘decltype’
           .typed<decltype(_roi_align_backward)>();
                  ^~~~~~~~
In file included from /home/jfm/vision/torchvision/csrc/vision.cpp:17:0:
/home/jfm/vision/torchvision/csrc/nms.h: In function ‘at::Tensor nms(const at::Tensor&, const at::Tensor&, double)’:
/home/jfm/vision/torchvision/csrc/nms.h:18:25: error: ‘class c10::Dispatcher’ has no member named ‘findSchemaOrThrow’; did you mean ‘findSchema’?
                        .findSchemaOrThrow("torchvision::nms", "")
                         ^~~~~~~~~~~~~~~~~
                         findSchema
/home/jfm/vision/torchvision/csrc/nms.h:19:31: error: expected primary-expression before ‘decltype’
                        .typed<decltype(nms)>();
                               ^~~~~~~~
/home/jfm/vision/torchvision/csrc/nms.h: In function ‘at::Tensor nms_autocast(const at::Tensor&, const at::Tensor&, double)’:
/home/jfm/vision/torchvision/csrc/nms.h:28:14: error: ‘ExcludeDispatchKeyGuard’ is not a member of ‘c10::impl’
   c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
              ^~~~~~~~~~~~~~~~~~~~~~~
/home/jfm/vision/torchvision/csrc/nms.h:28:14: note: suggested alternative: ‘ExcludeTensorTypeIdGuard’
   c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
              ^~~~~~~~~~~~~~~~~~~~~~~
              ExcludeTensorTypeIdGuard
/home/jfm/vision/torchvision/csrc/vision.cpp: At global scope:
/home/jfm/vision/torchvision/csrc/vision.cpp:45:14: error: expected constructor, destructor, or type conversion before ‘(’ token TORCH_LIBRARY(torchvision, m) {
              ^
/home/jfm/vision/torchvision/csrc/vision.cpp:59:19: error: expected constructor, destructor, or type conversion before ‘(’ token TORCH_LIBRARY_IMPL(torchvision, CPU, m) {
                   ^
/home/jfm/vision/torchvision/csrc/vision.cpp:67:19: error: expected constructor, destructor, or type conversion before ‘(’ token TORCH_LIBRARY_IMPL(torchvision, CUDA, m) {
                   ^
/home/jfm/vision/torchvision/csrc/vision.cpp:76:19: error: expected constructor, destructor, or type conversion before ‘(’ token TORCH_LIBRARY_IMPL(torchvision, Autocast, m) {
                   ^
/home/jfm/vision/torchvision/csrc/vision.cpp:82:19: error: expected constructor, destructor, or type conversion before ‘(’ token TORCH_LIBRARY_IMPL(torchvision, Autograd, m) {
                   ^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

I've got that wonderful error, so I don't feel qualified to debug it 😞 Let me know if you discover anything.

bjuncek commented 4 years ago

Gotcha - yeah, I agree the compilation is annoying to get right, especially outside of a clean env. I really hope the fix for that comes soon :)

The error itself is not necessarily helpful; it's more of a pointer that something has gone wrong. If you have conda installed, can you build this from scratch in a clean env? Just a simple build - the following worked for me:

conda create --name repro python=3.7
conda activate repro

# install prereqs from forge 
conda install -y av -c conda-forge

conda install -y pytorch torchvision -c pytorch

# TODO: install torchvision from source to support video reader
### first remove the one installed by conda (DUMB and hacky way, but conda installs all the binary which is convenient)
pip uninstall torchvision
### Then install it from scratch
mkdir -p ~/bin; cd ~/bin
git clone git@github.com:pytorch/vision.git
cd vision
python setup.py install

In the meantime, I'll take a look at your video as well to see what I get.

bjuncek commented 4 years ago

Ok, so I've had a pass at this and it seems like it's a problem with your video, not a reader problem - every way I check, there are only 3537 frames registered; specifically:

FFPROBE:

(tv08) bjuncek@qgpu:~/work/issue_repro$ ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of default=nokey=1:noprint_wrappers=1 data/1u3yHICR_BU.mkv
3537

FFMPEG (see last line)

(tv08) bjuncek@qgpu:~/work/issue_repro$ ffmpeg -i data/1u3yHICR_BU.mkv -map 0:v:0 -c copy -f null -
Input #0, matroska,webm, from 'data/1u3yHICR_BU.mkv':
  Metadata:
    MINOR_VERSION   : 0
    COMPATIBLE_BRANDS: iso6avc1mp41
    MAJOR_BRAND     : dash
    ENCODER         : Lavf57.83.100
  Duration: 00:02:27.54, start: 0.000000, bitrate: 2607 kb/s
    Stream #0:0: Video: h264 (High), yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], 25 fps, 25 tbr, 1k tbn, 50 tbc (default)
    Metadata:
      HANDLER_NAME    : VideoHandler
      ENCODER         : Lavc57.107.100 libx264
      DURATION        : 00:02:27.523000000
    Stream #0:1(eng): Audio: vorbis, 48000 Hz, stereo, fltp (default)
    Metadata:
      ENCODER         : Lavc57.107.100 libvorbis
      DURATION        : 00:02:27.538000000
Output #0, null, to 'pipe:':
  Metadata:
    MINOR_VERSION   : 0
    COMPATIBLE_BRANDS: iso6avc1mp41
    MAJOR_BRAND     : dash
    encoder         : Lavf58.29.100
    Stream #0:0: Video: h264 (High), yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 25 fps, 25 tbr, 1k tbn, 1k tbc (default)
    Metadata:
      HANDLER_NAME    : VideoHandler
      ENCODER         : Lavc57.107.100 libx264
      DURATION        : 00:02:27.523000000
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
frame= 3537 fps=0.0 q=-1.0 Lsize=N/A time=00:02:27.40 bitrate=N/A speed=5.13e+03x

PYAV and CV2

import av
images_av = []
container = av.open(PATH)
# container.streams.video[0].thread_count = 1  # optionally force single-threaded decoding
for frame in container.decode(video=0):
    images_av.append(frame.to_rgb().to_ndarray())
len(images_av)
# 3537

import cv2

cap = cv2.VideoCapture(PATH)
images_cv2 = []
while(cap.isOpened()):
    ret, frame = cap.read()
    if ret is True:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        images_cv2.append(rgb)  # append the converted frame, not the raw BGR one
    else:
        break
cap.release()
len(images_cv2)
# 3537

Finally TV with video reader backend:

import torchvision
from torchvision.io import read_video
PATH = 'data/1u3yHICR_BU.mkv'
torchvision.set_video_backend("video_reader")
torchvision_video, torchvision_audio, info = read_video(PATH)
print("TV version", torchvision.__version__)
print("video fps by TV", info['video_fps'])
print('Frames obtained by torchvision: %d '%torchvision_video.shape[0])

ImageIO: fails on reading this video with your code, with OSError: [Errno 12] Cannot allocate memory

I'm going to close this as it seems to be a video-specific thing, and FFMPEG and FFPROBE show the same number of frames as returned by video_reader.

JuanFMontesinos commented 4 years ago

Hi, I think that doesn't prove my point. This is a webm video stream (so, bad quality) downloaded from youtube (that's why it ended up as mkv) and resampled by ffmpeg. This could be a typical DL pipeline. Of course, I imagine there is some "issue" with the video from either the downloader, the container, or youtube itself. However, I understand that the original aim of torchvision's reader was to provide a robust reader. Therefore it should be able to deal with shitty videos (variable framerate, weird framerates like 18.53), and I assume that's why the source code checks timestamps rather than computing time*FPS for both the audio and video streams. If the idea is to force the user to discard a sample, the user will simply look for a workaround (in my case, using Nvidia DALI or imageio). That's why I took my time to report this issue and provide an example: it's tricky and requires more expertise than I can provide.

I already mentioned that readers are counting 3537 frames and that it was consistent across ffmpeg, ffprobe, opencv, scikit-video, and torchvision. Even imageio "detects"/counts that amount. You would need between 16 and 32 GB of RAM to run the full code (which loads the video twice).

What I wanted to highlight again is that 2 min 27.54 sec = 2x60x25 + 27.54x25 = 3688 frames (there are only 3537). I propose this workaround, which calls raw ffmpeg in a subprocess, in case you don't have access to enough RAM. Feel free to run it if you find time. Or you can directly run unix ffmpeg with ffmpeg -i /media/jfm/Slave/SkDataset/videos/cello/1u3yHICR_BU.mkv %05d.bmp

ffmpeg code from https://stackoverflow.com/questions/10957412/fastest-way-to-extract-frames-using-ffmpeg

from imageio import get_reader
from torchvision.io import read_video
import torchvision

import subprocess
import os
import shutil

torchvision.set_video_backend('video_reader')
PATH = '/media/jfm/Slave/SkDataset/videos/cello/1u3yHICR_BU.mkv'

torchvision_video, torchvision_audio, info = read_video(PATH, pts_unit='sec')

# Expected duration
dur = torchvision_audio.shape[1] / info['audio_fps']
mins = dur // 60  # avoid shadowing the built-in min()
secs = dur % 60
print('Backend: %s' % torchvision.get_video_backend())
print('Expected duration: %d min, %d sec' % (mins, secs))
print('Expected amount of frames %d' % int(dur * 25))
reader = get_reader(PATH)
print('Expected frames by different readers %d' % reader.count_frames())
print('Frames obtained by torchvision: %d ' % torchvision_video.shape[0])
os.makedirs('./bmp_files', exist_ok=True)
dst = os.path.abspath('./bmp_files')
dst = os.path.join(dst, '%05d.bmp')
print('Writing frames at %s' % dst)
print('Executing Popen: %s' % "ffmpeg -i " + PATH + " " + dst)
result = subprocess.Popen(["ffmpeg", "-i", PATH, dst],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output = [str(x) for x in result.stdout.readlines()]
for line in output:
    print(line)

print('Frames obtained by ffmpeg: %d' % len(os.listdir('./bmp_files')))
print('')
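To make the frame-count mismatch concrete, here is a quick back-of-the-envelope check in pure Python, using only numbers already reported in this thread (the metadata duration and the decoded count):

```python
# Nominal duration from the metadata: 2 min 27.54 s at 25 fps
expected = int((2 * 60 + 27.54) * 25)  # frames the metadata implies
decoded = 3537                         # frames the decoders actually return
missing = expected - decoded
print(expected, missing, missing / 25)  # → 3688 151 6.04
```

So the decoders come up roughly 151 frames (about six seconds of video) short of what the container's duration promises.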

Regards

bjuncek commented 4 years ago

I understand that the original aim of torchvision's reader was providing a robust reader.

The idea of the torchvision video reader is to be robust and flexible - it doesn't make assumptions, and it reads whatever can be read from the format as long as ffmpeg supports it (since it uses ffmpeg in the underlying implementation). In the case of a video which is in one way or another corrupted (like the one you have here), it won't break or ask you to re-encode the video in a particular way - it will read whatever is salvageable from the video and not fail.

What I wanted to highligh again is that 2 min 27.54 sec = 2x60x25+27.54x25 = 3688 frames (there are 3557)

I understand that - that's why I'm saying that it's likely an issue of re-encoding and packaging the video rather than of decoding itself. If ffmpeg itself cannot see more frames, that means the issue stems from there - whether headers are missing or packets are corrupted. All the implementations you have mentioned (ffmpeg, ffprobe, opencv, scikit-video, and torchvision) call the C implementation of ffmpeg under the hood.
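Since all of these decoders sit on the same libav code, ffprobe's decoded-frame count is a convenient ground truth to script against. A stdlib-only sketch wrapping the exact ffprobe call shown earlier in this thread (it assumes the ffmpeg tools are on PATH; the function names are illustrative):

```python
import subprocess

def ffprobe_count_cmd(path):
    """Build the ffprobe invocation used earlier to count decoded video frames."""
    return [
        "ffprobe", "-v", "error", "-count_frames",
        "-select_streams", "v:0",
        "-show_entries", "stream=nb_read_frames",
        "-of", "default=nokey=1:noprint_wrappers=1",
        path,
    ]

def count_frames(path):
    # Requires the ffmpeg tools on PATH; raises CalledProcessError if ffprobe fails.
    out = subprocess.run(ffprobe_count_cmd(path),
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

print(" ".join(ffprobe_count_cmd("data/1u3yHICR_BU.mkv")))
```

On the example video this should print 3537 from `count_frames`, matching pyav, cv2, and the video_reader backend.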

the user will simply look for a workaround (in my case using Nvidia DALI or imageio)

(I would also add decord.) These are all viable alternatives which have their strengths and weaknesses, but are ultimately just as amazing as ours. Note that DALI and decord use almost exactly the same ffmpeg calls as torchvision/cv2/pyav (but make some approximations as a trade-off for speed, so they repeat or skip some frames and ignore additional streams), so I'm not sure how different the results you can expect would be, but they are well worth looking into.

Also, please note that this issue is not written off - it will be revisited once we better understand what broke during the re-encoding of the video.

JuanFMontesinos commented 4 years ago

I see, thanks for the clarification. I was just worried about why plain ffmpeg frame extraction can see the 3688 frames while the ffmpeg backend used by these many libraries reads fewer. But I imagine, as you said, it's a problem of headers and metadata. BTW, I didn't know about decord. Thanks for pointing it out.

Thank you very much for your time. Juan

fepegar commented 4 years ago

@bjuncek I'm using the commands you shared in https://github.com/pytorch/vision/issues/2490#issuecomment-664674758 to build, but I'm getting the following error. Do you know what could be wrong?

gcc -pthread -B /home/fernando/.conda/envs/repro/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/fernando/bin/vision/torchvision/csrc -I/home/fernando/.conda/envs/repro/lib/python3.7/site-packages/torch/include -I/home/fernando/.conda/envs/repro/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/fernando/.conda/envs/repro/lib/python3.7/site-packages/torch/include/TH -I/home/fernando/.conda/envs/repro/lib/python3.7/site-packages/torch/include/THC -I/home/fernando/.conda/envs/repro/include/python3.7m -c /home/fernando/bin/vision/torchvision/csrc/vision.cpp -o build/temp.linux-x86_64-3.7/home/fernando/bin/vision/torchvision/csrc/vision.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/fernando/bin/vision/torchvision/csrc/vision.cpp:14:0:
/home/fernando/bin/vision/torchvision/csrc/ROIAlign.h: In function ‘at::Tensor roi_align(const at::Tensor&, const at::Tensor&, double, int64_t, int64_t, int64_t, bool)’:
/home/fernando/bin/vision/torchvision/csrc/ROIAlign.h:28:25: error: ‘class c10::Dispatcher’ has no member named ‘findSchemaOrThrow’; did you mean ‘findSchema’?
                        .findSchemaOrThrow("torchvision::roi_align", "")
                         ^~~~~~~~~~~~~~~~~
                         findSchema
/home/fernando/bin/vision/torchvision/csrc/ROIAlign.h:29:31: error: expected primary-expression before ‘decltype’
                        .typed<decltype(roi_align)>();
                               ^~~~~~~~
/home/fernando/bin/vision/torchvision/csrc/ROIAlign.h: In function ‘at::Tensor _roi_align_backward(const at::Tensor&, const at::Tensor&, double, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool)’:
/home/fernando/bin/vision/torchvision/csrc/ROIAlign.h:76:12: error: ‘class c10::Dispatcher’ has no member named ‘findSchemaOrThrow’; did you mean ‘findSchema’?
           .findSchemaOrThrow("torchvision::_roi_align_backward", "")
            ^~~~~~~~~~~~~~~~~
            findSchema
/home/fernando/bin/vision/torchvision/csrc/ROIAlign.h:77:18: error: expected primary-expression before ‘decltype’
           .typed<decltype(_roi_align_backward)>();
                  ^~~~~~~~
In file included from /home/fernando/bin/vision/torchvision/csrc/vision.cpp:17:0:
/home/fernando/bin/vision/torchvision/csrc/nms.h: In function ‘at::Tensor nms(const at::Tensor&, const at::Tensor&, double)’:
/home/fernando/bin/vision/torchvision/csrc/nms.h:18:25: error: ‘class c10::Dispatcher’ has no member named ‘findSchemaOrThrow’; did you mean ‘findSchema’?
                        .findSchemaOrThrow("torchvision::nms", "")
                         ^~~~~~~~~~~~~~~~~
                         findSchema
/home/fernando/bin/vision/torchvision/csrc/nms.h:19:31: error: expected primary-expression before ‘decltype’
                        .typed<decltype(nms)>();
                               ^~~~~~~~
/home/fernando/bin/vision/torchvision/csrc/vision.cpp: At global scope:
/home/fernando/bin/vision/torchvision/csrc/vision.cpp:45:14: error: expected constructor, destructor, or type conversion before ‘(’ token
 TORCH_LIBRARY(torchvision, m) {
              ^
/home/fernando/bin/vision/torchvision/csrc/vision.cpp:59:19: error: expected constructor, destructor, or type conversion before ‘(’ token
 TORCH_LIBRARY_IMPL(torchvision, CPU, m) {
                   ^
/home/fernando/bin/vision/torchvision/csrc/vision.cpp:82:19: error: expected constructor, destructor, or type conversion before ‘(’ token
 TORCH_LIBRARY_IMPL(torchvision, Autograd, m) {
                   ^
error: command 'gcc' failed with exit status 1

fmassa commented 4 years ago

@fepegar I believe you need to update your PyTorch version and recompile torchvision again.

fepegar commented 4 years ago

@fepegar I believe you need to update your PyTorch version and recompile torchvision again.

Thanks. For some reason, conda was installing 1.4 so I had to explicitly ask for pytorch=1.6.