tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Evaluate Decord for video IO #840

Open jjedele opened 4 years ago

jjedele commented 4 years ago

I stumbled upon Decord a while ago, played around with it a bit, and it made working with video data really convenient.

Might be worth having a look at whether we could gain anything by integrating this instead of working with FFmpeg directly.

yongtang commented 4 years ago

Thanks @jjedele. NV Codecs could be an interesting direction to explore.

There were some discussions about why FFmpeg is hard to integrate:

  1. FFmpeg is LGPL, so we cannot build it in directly; instead we have to link against it. But then which version do we link, and what if a user installs a different version on their machine? For that reason, in tensorflow-io we only explicitly link against Ubuntu 16.04's and 18.04's ffmpeg libraries, and even then we face lots of issues.
  2. FFmpeg is supposed to be cross-platform, but in reality this is not the case. On macOS there is AVFoundation, which is quite easy to integrate (we already link AVFoundation for mp4a audio in tensorflow-io and plan to do the same for video). On Windows there is Media Foundation, which I think does something similar (I haven't worked on Windows for a long time).
  3. In theory, FFmpeg still carries license/patent risk, since usage of some codecs like MP4 is a grey area. This is not the case for macOS AVFoundation and the Windows native APIs, as they are covered as part of your macOS or Windows license.

Another thing to consider is the separation of container parsing, codec, and color conversion. Only the codec faces license issues on Linux. What we could do is split out the container parser and color conversion in a cross-platform way. For codecs, we provide different options based on platform and user choice: we reroute the codec to AVFoundation on macOS and Media Foundation on Windows, and fall back to FFmpeg on Linux.
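A rough sketch of how that routing could look at the Python level. The function name and backend strings are purely illustrative, not existing tensorflow-io APIs; container parsing and color conversion would stay in shared cross-platform code:

```python
import sys

def select_codec_backend(prefer_gpu: bool = False) -> str:
    """Pick a codec backend per platform and user choice (illustrative only)."""
    if prefer_gpu:
        return "nvdec"            # NVIDIA hardware decoder, if the user opts in
    if sys.platform == "darwin":
        return "avfoundation"     # macOS system decoder, no codec licensing concerns
    if sys.platform == "win32":
        return "mediafoundation"  # Windows system decoder
    return "ffmpeg"               # Linux fallback, linked against the distro's libraries

print(select_codec_backend())
```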

I could see NV Codecs being an excellent choice as well, and users may tend to choose it for their GPUs. This fits nicely with the separation of container parser, codec, and color conversion.

One more thing to consider: with the upcoming MLIR, vendors other than Nvidia might also be willing to provide solutions, so eventually we could easily swap a codec in and out based on user selection.

yongtang commented 4 years ago

Ref #536 which briefly covered the separation of color conversion and codec.

yongtang commented 4 years ago

@jjedele One more thing to think through is that video is a little special with respect to random access and whole-buffer decoding.

Unlike audio, where a small file can hold an audio clip that is many minutes long, video is pretty big. A minute of video, even in MP4, can be huge, and the decoded frames are even bigger.
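A back-of-the-envelope calculation makes this concrete (assuming 1080p RGB at 30 fps, which is an assumption for illustration, not a statement about any particular dataset):

```python
# One minute of decoded 1080p RGB video at 30 fps (assumed parameters):
width, height, channels = 1920, 1080, 3
bytes_per_frame = width * height * channels          # ~6.2 MB per frame
frames_per_minute = 30 * 60                          # 1800 frames
decoded_bytes = bytes_per_frame * frames_per_minute  # ~11.2 GB per minute, uncompressed
print(f"{decoded_bytes / 1e9:.1f} GB per minute of decoded frames")
```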

So decode_mp3 (audio) <=> decode_mp4 (video) is not exactly a fair comparison.

Random access into video is also problematic, as a frame has to be decoded starting from an I-frame. For some clips that means even if we only want to access the last few frames, we might be forced to decode all the way from the beginning.

Ideally, a tf.data.Dataset type of API for sequential access would be a good match for video.

We could still provide decode_mp4 (video) and VideoIOTensor at the same level, just for the sake of completeness and for users who are not so concerned with the slowness or memory consumption.
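As a rough sketch of what sequential access could look like today, Decord's VideoReader can be wrapped in a tf.data.Dataset via from_generator. This assumes Decord's documented Python API (VideoReader, cpu, asnumpy); it is not an existing tensorflow-io op, and "video.mp4" is a placeholder path:

```python
import tensorflow as tf
from decord import VideoReader, cpu  # Decord's Python API, per its upstream docs

def frame_generator(path):
    vr = VideoReader(path, ctx=cpu(0))
    for i in range(len(vr)):
        yield vr[i].asnumpy()  # decode frames sequentially, one at a time

dataset = tf.data.Dataset.from_generator(
    lambda: frame_generator("video.mp4"),  # placeholder path
    output_signature=tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
)

for frame in dataset.take(5):
    print(frame.shape)
```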

bhack commented 4 years ago

@yongtang Could the snapshot feature play a role? But I really don't know if the mainstream case is about random slicing.

yongtang commented 4 years ago

@bhack Random slicing of video is more about feature engineering and preprocessing. For example, a video could be mapped to a subtitle that covers a time range. It would be quite handy if a long video could easily be cut and sliced into small chunks that map to subtitles, as part of the input pipeline. It may not be the mainstream case for training today, probably because in training the video clips are already processed and normalized, with the input dataset already in a usable format. Random slicing support will make that processing and normalization handy (no need to write customized programs for it), and will be especially helpful for inference and in production.
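A minimal sketch of that kind of slicing, assuming Decord's VideoReader exposes get_avg_fps and get_batch as in its documentation; the path and timestamps below are placeholders:

```python
from decord import VideoReader, cpu

def subtitle_clip(path, start_sec, end_sec):
    """Decode only the frames covered by a subtitle's [start_sec, end_sec] range."""
    vr = VideoReader(path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    start, end = int(start_sec * fps), int(end_sec * fps)
    indices = list(range(start, min(end, len(vr))))
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)

clip = subtitle_clip("movie.mp4", start_sec=12.0, end_sec=14.5)  # placeholder values
print(clip.shape)
```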

bhack commented 4 years ago

@yongtang For random slicing, since we cannot use the snapshot feature, the only solution to scale up seems to me to be what the TF team announced at https://youtu.be/n7byMbl2VUQ?t=499

yongtang commented 4 years ago

@bhack In tensorflow-io we purposely implement Dataset as a resource handle (DT_RESOURCE) plus a definite index for most of the dataset ops, exactly so that we can easily utilize the distribute strategy at the Python level in the future. In TensorFlow core, the TFRecord/CSV Datasets are implemented as customized TF subgraphs constructed in C++. In tensorflow-io, we stay with DT_RESOURCE so that every Dataset is a collection of basic ops (not an integrated subgraph) and the graph manipulation happens at the Python level. This means the implementation of scale-up/distribution could be done at the Python level in tensorflow-io as well.

The reason for going this route (vs. the C++ subgraph route in tf.data) is that C++ developers are really scarce nowadays. For a project like tensorflow-io, where maintainers and contributors are mostly from outside of Google (since most of the formats in tensorflow-io like Kafka/Arrow/etc. are not used by Google), moving more components to the Python level helps ease the challenge for contributors.
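A simplified Python-level sketch of that pattern. The init/read functions below are hypothetical stand-ins for the actual tensorflow-io C++ kernels (written as plain Python so the sketch runs); the point is that the dataset is just basic ops over (handle, index), so the graph can be assembled, sharded, and distributed at the Python level:

```python
import tensorflow as tf

def init_resource(path):
    # Stand-in for an op that opens/indexes a file and returns a DT_RESOURCE handle.
    return tf.constant(list(range(10)), dtype=tf.int64)

def read_item(resource, index):
    # Stand-in for an op that reads one item given (handle, index).
    return tf.gather(resource, index)

resource = init_resource("example.bin")   # "example.bin" is a placeholder path
num_items = int(resource.shape[0])

# Dataset built from basic ops over (handle, index); Python owns the graph shape.
dataset = tf.data.Dataset.range(num_items).map(lambda i: read_item(resource, i))

for item in dataset.take(3):
    print(item.numpy())
```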

See #700 for a related note. Also /cc @BryanCutler as you might be interested in dataset + distribute as well.

bhack commented 3 years ago

You might be interested in https://www.khronos.org/blog/an-introduction-to-vulkan-video