tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Add support for FFMPEG (audio) substreams #453

Closed faroit closed 4 years ago

faroit commented 5 years ago

#49 added support for audio decoding using the ffmpeg ops. tf.contrib.ffmpeg.decode_audio supported selecting the audio substream, which is really useful for some applications like music separation. I would propose adding this to tf.io as well.

faroit commented 5 years ago

back then this was added by @yongtang and @carlthome

yongtang commented 4 years ago

Added PR #494 as the first step to obtain the shape and type info of the FFmpeg audio/video info

yongtang commented 4 years ago

PR #494 adds the initial track-information parsing for FFmpeg, which is the first step toward sub-stream support. More issues came up while looking into FFmpeg: 1) FFmpeg is time-based, not frame-based, for video/audio. While that makes sense when video is processed sequentially, it is not very helpful for training, where you need accurately indexed frames. 2) FFmpeg tries to fit everything (video/audio/subtitles) into one pipeline, which makes parsing certain containers non-intuitive. For example, it is very hard to list the keyframes present in some formats.
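To make the time-based vs. frame-based point concrete, here is a minimal sketch (the helper name is hypothetical, not part of tensorflow/io) of the conversion a frame-indexed reader has to perform before it can even ask FFmpeg to seek: a frame index must be turned into a presentation timestamp expressed in the stream's time base, which assumes an accurate, constant frame rate.

```python
from fractions import Fraction

def frame_to_pts(frame_index, frame_rate, time_base):
    """Convert a frame index into an FFmpeg-style presentation timestamp.

    FFmpeg addresses media by timestamps counted in units of a stream's
    time_base, so random access by frame index first requires this
    conversion -- and it is only exact for constant-frame-rate streams.
    """
    seconds = Fraction(frame_index) / Fraction(frame_rate)
    return int(seconds / Fraction(time_base))

# Frame 50 of a 25 fps stream with the common 1/90000 time base
# lands at 2 seconds, i.e. pts 180000.
pts = frame_to_pts(50, 25, Fraction(1, 90000))
print(pts)  # → 180000
```

For variable-frame-rate streams this mapping does not hold, which is why indexed access for training is harder than it first appears.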

Overall, I think more work will need to be done.

yongtang commented 4 years ago

/cc @ivelin here as audio might be related to your work.

One thing I am thinking about regarding your work is to encode audio locally and send a raw byte stream through the channel. On the server side, decode first, then preprocess, then pass to tf.keras. See https://github.com/tensorflow/io/issues/453#issuecomment-533802219 about splitting container parsing (e.g., MP4/WAV) from decoding (e.g., H.264).

Local encoding should be fine, as I believe most edge devices are powerful enough nowadays to handle it. On the server side, decoding could be hardware-accelerated (assuming licensing is not an issue).

Alternatively, the edge could do all the preprocessing to convert all features into floating-point arrays and send tensor-serialized data through gRPC, which the server would use directly with tf.keras.
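The edge-preprocessing idea can be sketched with the standard library alone. This is an illustrative stand-in, not the tensorflow/io or gRPC wire format (in TensorFlow one would more likely use tf.io.serialize_tensor / tf.io.parse_tensor); the helper names are hypothetical.

```python
import struct

def pack_features(features):
    """Serialize a flat list of float32 features into raw little-endian
    bytes, as an edge device might do before sending them over gRPC."""
    return struct.pack(f"<{len(features)}f", *features)

def unpack_features(payload):
    """Inverse of pack_features, run on the server before feeding the
    restored array to a model."""
    count = len(payload) // 4  # float32 is 4 bytes
    return list(struct.unpack(f"<{count}f", payload))

# 0.5, -1.25, and 3.0 are exactly representable in float32,
# so the round trip is lossless here.
wire = pack_features([0.5, -1.25, 3.0])
restored = unpack_features(wire)
print(len(wire), restored)  # → 12 [0.5, -1.25, 3.0]
```

Note that general float32 round-tripping loses precision relative to Python's float64, which is one reason a proper tensor-serialization format records the dtype explicitly.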

yongtang commented 4 years ago

@ivelin Apparently FFmpeg has too many required parameters for decoding a raw packet into video. So unless those parameters could be serialized (which is pretty messy in FFmpeg), this is not very feasible.

yongtang commented 4 years ago

Added PR #499 for initial sub-stream support for audio with FFmpeg. Not all formats are supported yet, but adding any audio format should be straightforward with FFmpeg.

One limitation of FFmpeg is seeking/indexing. FFmpeg is time-based and may not perform well in certain situations.

ivelin commented 4 years ago

@yongtang Thank you for keeping me in the loop. As I mentioned earlier, I've spent more time looking at the options for real-time media streaming feeds into TF I/O, and it seems we are independently reaching similar conclusions. ffmpeg works well for transforming a large number of file containers and formats into raw media, but it lacks fine-grained control, which is especially important for capturing and replaying real-time media with all of its network characteristics: dropped packets, latency, jitter, etc. It's probably best to split the discussion into several buckets:

  1. Basic use case of reading from an arbitrary recorded media file into TFIO. This does not deal with network artifacts. Applicable for scenarios where the media is pre-recorded. Since the file is static, the same original source can be reused many times for training.
  2. Real time media streaming for inference. The media is live and has network artifacts, but does not need to be recorded for training purpose. The aim here is real time inference performance.
  3. Real time media streaming for training. The media has network artifacts that need to be precisely captured (e.g. via pcap) and organized for model training. The emphasis here is on precision in capturing all important dataset features: not just the packets carrying the media signal, but also the network delays and other artifacts that can impact a model's Bayesian baseline benchmark.

I think for 1, the recent PRs have made a ton of good progress and the topic can be closed soon. For 2 and 3, I am researching ways forward with tools like GStreamer, but don't have results to report yet.

yongtang commented 4 years ago

@faroit With PR #499 in place I think this issue is resolved.

@ivelin Those are good discussions, let me open a new issue to continue the discussion.

yongtang commented 4 years ago

@faroit With PR #499 in place, you can now pass a substream name: a:0 selects audio sub-stream 0, v:0 selects video sub-stream 0, and s:0 selects subtitles. (Yes, subtitles are supported with PR #499.)
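These names follow FFmpeg's stream-specifier convention (a = audio, v = video, s = subtitle, followed by a zero-based index). A minimal sketch of how such specifiers break down; the helper is hypothetical and not part of the tensorflow/io API:

```python
# Map FFmpeg-style stream-type letters to their meanings.
STREAM_TYPES = {"a": "audio", "v": "video", "s": "subtitle"}

def parse_substream(spec):
    """Split a specifier like 'a:0' into ('audio', 0), rejecting
    anything that is not <type-letter>:<non-negative index>."""
    kind, _, index = spec.partition(":")
    if kind not in STREAM_TYPES or not index.isdigit():
        raise ValueError(f"bad substream specifier: {spec!r}")
    return STREAM_TYPES[kind], int(index)

print(parse_substream("a:0"))  # → ('audio', 0)
print(parse_substream("s:1"))  # → ('subtitle', 1)
```

The same a:N / v:N / s:N strings appear in plain ffmpeg command lines as well (e.g. `-map 0:a:0` to pick the first audio stream), so the convention should be familiar to ffmpeg users.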

Also, I exposed tfio.IOTensor.from_ffmpeg, which gives you random/indexable access to audio and video sub-streams for convenience. Let me know if you want to see additional enhancements.

faroit commented 4 years ago

@yongtang this is incredibly useful. I will give it a try next week and get back here with feedback. Thanks!