tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Video optimized models #638

Open ivelin opened 4 years ago

ivelin commented 4 years ago

Following up on the SIG IO monthly call today:

@yongtang mentioned that an intermediate approach would be to extract key frames and only run inference on these. I think this is a viable intermediate solution that will reduce compute.

For motion detection of objects, it would still be interesting to hear from folks whether there are TF models that can work with frame deltas.

yongtang commented 4 years ago

@ivelin I think it is actually possible to solve this issue in a native NN way:

1. When the video stream comes in, we get both key frames and non-key frames. Those frames should not be decoded (as that would defeat the purpose).
2. We construct two series of data inputs: one is key frames only, the other is non-key frames.
3. We take both as the input, so the input will be a tuple of (key-frame, non-key-frame). Note: depending on the properties of the non-key frames, we could also take P-frames or B-frames into account.
4. Note the size of a non-key frame is substantially smaller.
5. We decode only the key frames in the pipeline.
6. That means the input is (decoded-key-frame, non-decoded-non-key-frame).
7. With the pair (decoded-key-frame, non-decoded-non-key-frame) as input, we could build a model and train it.
8. For inference, we pass (decoded-key-frame, non-decoded-non-key-frame) to predict.

Note "non-decoded-non-key-frame" is the delta.
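The grouping in steps 2–6 above can be sketched in plain Python. This is only an illustration, assuming a hypothetical demuxer that yields packets with an `is_keyframe` flag and raw compressed `data` bytes, and a caller-supplied `decode` function for key frames; the names `Packet` and `pair_key_and_deltas` are made up for the sketch:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Packet:
    # Hypothetical demuxer output: raw compressed bytes plus a keyframe flag.
    data: bytes
    is_keyframe: bool


def pair_key_and_deltas(
    packets: Iterable[Packet],
    decode: Callable[[bytes], bytes],
) -> Iterator[Tuple[bytes, List[bytes]]]:
    """Group each GOP into (decoded key frame, undecoded delta packets).

    Only the key frame is decoded; the much smaller P/B packets are
    passed through as raw bytes, i.e. (decoded-key-frame,
    non-decoded-non-key-frame) pairs.
    """
    key, deltas = None, []
    for pkt in packets:
        if pkt.is_keyframe:
            if key is not None:
                yield key, deltas
            key, deltas = decode(pkt.data), []
        elif key is not None:
            deltas.append(pkt.data)
    if key is not None:
        yield key, deltas


# Toy usage: bytes.upper stands in for a real decoder.
stream = [
    Packet(b"k1", True),
    Packet(b"d1", False),
    Packet(b"d2", False),
    Packet(b"k2", True),
]
pairs = list(pair_key_and_deltas(stream, bytes.upper))
# pairs == [(b"K1", [b"d1", b"d2"]), (b"K2", [])]
```

Each yielded pair could then feed the (key-frame, non-key-frame) tuple input described above.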

yongtang commented 4 years ago

From an implementation point of view, the challenge is mostly in separating demuxing from decoding. That way we could selectively decode only key frames and pass non-decoded non-key frames through as-is. However, in the current FFmpeg integration in tensorflow/io, the two are mixed together due to the complexity of FFmpeg.

ivelin commented 4 years ago

Understood. ffmpeg has a flag that extracts only key frames; I've tested it on a couple of projects: `-skip_frame nokey`
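For reference, a typical invocation of that flag (the input filename is a placeholder; `-vsync vfr` keeps ffmpeg from duplicating frames to fill the original timestamps):

```shell
# Decode only key frames (I-frames) and write each one out as a PNG.
ffmpeg -skip_frame nokey -i input.mp4 -vsync vfr keyframe_%04d.png
```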

It would be good to hear from other folks who might be working on video inputs whether there are good pre-trained models that can effectively process partial image updates.

yongtang commented 4 years ago

@ivelin To continue the discussion: I think this issue is very much codec-centric. Different codecs may need different algorithms, so a general solution does not apply.

However, H.265 could actually be a good candidate for optimization here, as its coding tree unit (CTU) concept naturally fits sub-areas of the image (and thus benefits image segmentation/object detection). So it is a pixelation+keyframe scenario.

The implementation might be involved, though, as I don't see any existing library that does this out of the box (or close to it). It might take several steps: 1) parse mp4/mov and demux into frame streams (not decoded); 2) parse H.265 frames and extract enough information about the coding tree units.

Demuxing to get raw frame bytes in 1) might be easier, as I found quite a few libraries that get most of the way there. 2) is unknown, as I have never gone down to the CTU level before.
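To illustrate how approachable step 1) is at the container level: top-level MP4/MOV boxes are just a big-endian 32-bit size plus a 4-byte type, so a few lines of stdlib Python can walk them. This is a toy sketch, not a replacement for a real demuxer (reaching actual frame bytes requires recursing into `moov`/`mdat` and consulting the sample tables):

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_boxes(stream: BinaryIO) -> Iterator[Tuple[str, int, bytes]]:
    """Yield (box_type, size, payload) for top-level ISO BMFF boxes."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        size, box_type = struct.unpack(">I4s", header)
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the header.
            size = struct.unpack(">Q", stream.read(8))[0]
            payload = stream.read(size - 16)
        else:
            payload = stream.read(size - 8)
        yield box_type.decode("ascii", "replace"), size, payload


# Toy usage on synthetic bytes: a 16-byte 'ftyp' box and an empty 'moov' box.
import io

data = struct.pack(">I4s", 16, b"ftyp") + b"isom\x00\x00\x00\x00"
data += struct.pack(">I4s", 8, b"moov")
boxes = list(iter_boxes(io.BytesIO(data)))
# boxes == [("ftyp", 16, b"isom\x00\x00\x00\x00"), ("moov", 8, b"")]
```

In practice an existing library would handle this layer; the sketch just shows that nothing about the demuxing step requires decoding any video data.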