Closed faroit closed 4 years ago
Back then this was added by @yongtang and @carlthome.
Added PR #494 as a first step to obtain the shape and type info of FFmpeg audio/video streams.
PR #494 adds the initial track information parsing for FFmpeg, which is the first step toward sub stream support. More issues arose while looking into FFmpeg: 1) FFmpeg is time-based, not frame-based, for video/audio. While that makes sense in situations where video is processed sequentially, it is not very helpful for training, where you need accurately indexed frames. 2) FFmpeg tries to fit everything (video/audio/subtitles) into one pipeline, which makes parsing certain containers non-intuitive. For example, it is very hard to list the keyframes that are present in some formats.
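To illustrate the time-based vs. frame-based point: FFmpeg stamps packets with presentation timestamps (PTS) measured in a per-stream "time base" rather than with frame numbers. A minimal sketch of the conversion, assuming a common 1/90000 time base (e.g., MPEG-TS) and a constant frame rate (both assumptions of this example, not details from the PRs):

```python
from fractions import Fraction

# Assumed time base for this sketch; real streams carry their own.
TIME_BASE = Fraction(1, 90000)

def pts_to_seconds(pts: int) -> float:
    """Convert an FFmpeg presentation timestamp to seconds."""
    return float(pts * TIME_BASE)

def nearest_frame_index(pts: int, fps: float) -> int:
    """Map a timestamp to a frame index, assuming constant frame rate.
    With variable frame rate this mapping is not exact, which is why
    time-based indexing is awkward for training that needs exact frames."""
    return round(pts_to_seconds(pts) * fps)

# 90000 ticks at a 1/90000 time base is exactly 1 second,
# i.e. frame 30 of a 30 fps stream.
print(pts_to_seconds(90000))           # 1.0
print(nearest_frame_index(90000, 30))  # 30
```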
Overall, I think more work will need to be done:
Split the container parsing from the decoding. Container parsing will parse mp4/wav/etc. into raw data that can be fed into different decoders. There are several advantages:
a) Decoder ops could be reused and combined with other container parsers that are native to the system. For example, Windows/macOS or mobile devices provide many frequently used video/audio processing routines in system APIs with appropriate licenses. It is not necessary to always go with FFmpeg, whose license has limitations.
b) Some container formats are so frequently used that it is easy to find a high-quality library to process them, and some formats such as WAV are straightforward enough that it is easy to just implement one ourselves.
c) While FFmpeg incorporates many hardware-accelerated decoders (e.g., from NVIDIA), they are not necessarily installed or enabled properly on all systems. Supporting a few frequently used decoders with native APIs is actually much easier. FFmpeg's strength is "supporting all formats", which is actually cumbersome for certain frequently used formats.
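The WAV case from point b) shows what the proposed split looks like: the container parser only locates the format metadata and the raw PCM bytes, and hands the bytes off to a separate decoder op. A minimal sketch (assuming a canonical little-endian WAV with a plain `fmt ` chunk followed by a `data` chunk; this is illustration, not the tensorflow-io implementation):

```python
import io
import struct
import wave

def parse_wav_container(data: bytes):
    """Parse just the RIFF/WAVE container: return (format_info, raw_pcm_bytes).
    Decoding the PCM samples is left to a separate decoder op."""
    assert data[0:4] == b"RIFF" and data[8:12] == b"WAVE"
    pos = 12
    fmt = None
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack("<4sI", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + size]
        if chunk_id == b"fmt ":
            _audio_format, channels, sample_rate = struct.unpack("<HHI", body[:8])
            fmt = {"channels": channels, "sample_rate": sample_rate}
        elif chunk_id == b"data":
            return fmt, body
        pos += 8 + size + (size & 1)  # chunks are word-aligned
    raise ValueError("no data chunk found")

# Build a 1-second mono 16 kHz WAV in memory to exercise the parser.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

fmt, pcm = parse_wav_container(buf.getvalue())
print(fmt)  # {'channels': 1, 'sample_rate': 16000}
```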
Add time-based indexing. Instead of a single index, I believe multi-indexing (both frame-index and time-index) will greatly help the processing of video/audio/subtitles.
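Multi-indexing could look like the following sketch: the same decoded stream addressable both by frame index and by time in seconds. The class and the placeholder "frames" are hypothetical, purely to show the shape of the API:

```python
class DualIndexedStream:
    """Hypothetical sketch of multi-indexing: one stream, two indices."""

    def __init__(self, frames, fps):
        self.frames = frames
        self.fps = fps

    def by_frame(self, i):
        """Exact frame-index access, as needed for training."""
        return self.frames[i]

    def by_time(self, t):
        """Time-index access; rounds to the nearest frame.
        Assumes a constant frame rate, which real streams may not have."""
        return self.frames[round(t * self.fps)]

s = DualIndexedStream([f"frame{i}" for i in range(60)], fps=30)
print(s.by_frame(30))  # frame30
print(s.by_time(1.0))  # frame30
```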
/cc @ivelin here as audio might be related to your work.
One thing I am thinking of regarding your work is to encode audio locally and send a raw byte stream through the channel. On the server side, decode first, then preprocess, then pass to tf.keras. See https://github.com/tensorflow/io/issues/453#issuecomment-533802219 about splitting container (e.g., mp4/WAV) parsing from decoding (H.264).
Local encoding should be fine, as I believe most edge devices are powerful enough nowadays to handle it. On the server side, decoding could be hardware-accelerated (assuming the license is not an issue).
Alternatively, the edge could do all the preprocessing to convert the features into floating-point arrays and send tensor-serialized data through gRPC, which the server would use directly with tf.keras.
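The edge-side serialization step can be as simple as packing the float32 features into raw bytes and unpacking them on the server; a minimal stdlib sketch (the gRPC transport itself is omitted, and a real protocol would also pin the byte order, e.g. via `tf.io.serialize_tensor` or explicit `struct` packing):

```python
from array import array

# Edge side: preprocessed features as float32, packed to raw bytes
# that could be sent over gRPC. Note: array() uses machine byte order.
features = array("f", [0.1, 0.5, -0.25])
payload = features.tobytes()  # 3 floats * 4 bytes = 12 bytes

# Server side: reconstruct the float32 array from the received bytes
# (here it would then be fed to tf.keras).
received = array("f", payload)
```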
@ivelin Apparently FFmpeg requires too many parameters in order to decode a raw packet into video. So unless those parameters can be serialized (which is pretty messy in FFmpeg), this is not very feasible.
Added PR #499 to add sub stream support for audio with FFmpeg. Not all formats are supported yet, but adding any audio format should be straightforward with FFmpeg.
One limitation of FFmpeg is seeking/indexing. FFmpeg is time-based and may not perform well in certain situations.
@yongtang Thank you for keeping me in the loop. As I mentioned earlier, I've spent more time looking at the options for real-time media streaming feeds into TF I/O. It seems we are independently reaching similar conclusions. FFmpeg works well for transforming a large number of file containers and formats to raw media, but it lacks fine-grained control, which is especially important for capturing and replaying real-time media with all of its network characteristics like dropped packets, latency, jitter, etc. It's probably best to split the discussion into several buckets:
I think for 1. the recent PRs have made a ton of good progress and the topic can soon be closed. For 2 & 3, I am researching ways forward with tools like gstreamer, but don't have results to report yet.
@faroit With PR #499 in place I think this issue is resolved.
@ivelin Those are good discussions, let me open a new issue to continue the discussion.
@faroit With PR #499 in place, you can now pass a substream with the name `a:0` to select audio sub stream 0, `v:0` to select video sub stream 0, and `s:0` to select subtitles. (Yes, subtitles are supported with PR #499.)
Also, I exposed `tfio.IOTensor.from_ffmpeg`, which gives you random/indexable access to audio and video sub streams for convenient usage. Let me know if you want to see additional enhancements.
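The `type:index` naming convention can be sketched as follows. The `parse_substream` helper is hypothetical, purely to make the convention concrete; the actual selection happens inside the tensorflow-io FFmpeg ops, and the commented `tfio` call is an assumed API shape (check the tensorflow-io docs for the exact signature):

```python
# Assumed usage shape of the tensorflow-io API mentioned above:
#
#   import tensorflow_io as tfio
#   media = tfio.IOTensor.from_ffmpeg("sample.mp4")
#   audio = media("a:0")  # indexable access to audio sub stream 0

def parse_substream(spec: str):
    """Hypothetical helper: split an 'a:0'/'v:0'/'s:0' spec into
    (stream kind, sub stream index)."""
    kinds = {"a": "audio", "v": "video", "s": "subtitle"}
    kind, _, index = spec.partition(":")
    if kind not in kinds or not index.isdigit():
        raise ValueError(f"bad substream spec: {spec!r}")
    return kinds[kind], int(index)

print(parse_substream("a:0"))  # ('audio', 0)
print(parse_substream("s:1"))  # ('subtitle', 1)
```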
@yongtang this is incredibly useful. I will give it a try next week and get back here with feedback. Thanks!
49 added support for audio decoding using the ffmpeg ops. `tf.contrib.ffmpeg.decode_audio` supported the ability to select the audio substream, which is really useful for some applications like music separation. I would propose to also add this to tf.io.