tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Video optimized models #638

Open ivelin opened 4 years ago

ivelin commented 4 years ago

Following up on the SIG IO monthly call today:

@yongtang mentioned that an intermediate approach would be to extract key frames and only run inference on these. I think this is a viable intermediate solution that will reduce compute.

For motion detection of objects, it would still be interesting to hear from folks whether there are TF models that can work with frame deltas.

yongtang commented 4 years ago

@ivelin I think it is actually possible to solve this issue in a native NN way:

1. When the video stream comes in, we get both key frames and non-key frames. Those frames should not be decoded (as that would defeat the purpose).
2. We construct two series of data inputs: one is key frames only, the other is non-key frames.
3. We take both as the input, so the input will be a tuple of (key-frame, non-key-frame). Note: depending on the properties of the non-key frames, we could also take P-frames or B-frames into account.
4. Note the size of a non-key frame is substantially smaller.
5. We decode only the key frames in the pipeline.
6. That means the input is (decoded-key-frame, non-decoded-non-key-frame).
7. With the pair (decoded-key-frame, non-decoded-non-key-frame) as input, we could build a model and train it.
8. For inference, we pass (decoded-key-frame, non-decoded-non-key-frame) to predict.

Note "non-decoded-non-key-frame" is the delta.
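The grouping in steps 2–6 above can be sketched in plain Python. This is only an illustration, assuming a hypothetical demuxer that yields packets with an `is_keyframe` flag and raw compressed `data` bytes, and a caller-supplied `decode` function for key frames; the names `Packet` and `pair_key_and_deltas` are made up for the sketch:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Packet:
    # Hypothetical demuxer output: raw compressed bytes plus a keyframe flag.
    data: bytes
    is_keyframe: bool


def pair_key_and_deltas(
    packets: Iterable[Packet],
    decode: Callable[[bytes], bytes],
) -> Iterator[Tuple[bytes, List[bytes]]]:
    """Group each GOP into (decoded key frame, undecoded delta packets).

    Only the key frame is decoded; the much smaller P/B packets are
    passed through as raw bytes, i.e. (decoded-key-frame,
    non-decoded-non-key-frame) pairs.
    """
    key, deltas = None, []
    for pkt in packets:
        if pkt.is_keyframe:
            if key is not None:
                yield key, deltas
            key, deltas = decode(pkt.data), []
        elif key is not None:
            deltas.append(pkt.data)
    if key is not None:
        yield key, deltas


# Toy usage: bytes.upper stands in for a real decoder.
stream = [
    Packet(b"k1", True),
    Packet(b"d1", False),
    Packet(b"d2", False),
    Packet(b"k2", True),
]
pairs = list(pair_key_and_deltas(stream, bytes.upper))
# pairs == [(b"K1", [b"d1", b"d2"]), (b"K2", [])]
```

Each yielded pair could then feed the (key-frame, non-key-frame) tuple input described above.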

yongtang commented 4 years ago

From an implementation point of view, the challenge is mostly in separating demuxing from decoding. That way we could selectively decode only key frames and pass non-decoded non-key frames through as-is. However, in the current FFmpeg integration in tensorflow/io, the two are mixed together due to the complexity of FFmpeg.

ivelin commented 4 years ago

Understood. ffmpeg has a flag that extracts only key frames; I've tested it on a couple of projects: `-skip_frame nokey`
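For reference, a typical invocation of that flag (the input filename is a placeholder; `-vsync vfr` keeps ffmpeg from duplicating frames to fill the original timestamps):

```shell
# Decode only key frames (I-frames) and write each one out as a PNG.
ffmpeg -skip_frame nokey -i input.mp4 -vsync vfr keyframe_%04d.png
```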

It would be good to hear from other folks who might be working on video inputs whether there are good pre-trained models that can effectively process partial image updates.

yongtang commented 4 years ago

@ivelin To continue the discussion: I think this issue is very much codec-centric. Different codecs may need different algorithms, so a general solution does not apply.

However, H.265 could actually be a good candidate for optimization here, as its coding tree unit (CTU) concept naturally fits sub-areas of the image (and thus benefits image segmentation/object detection). So it is a pixelation+keyframe scenario.

The implementation might be involved, though, as I don't see any existing library that does this out of the box (or close to it). It might take several steps: 1) parse mp4/mov and demux into frame streams (not decoded); 2) parse H.265 frames and extract enough information about the coding tree units.

Demuxing to get raw frame bytes in 1) might be easier, as I found quite a few libraries that get most of the way there. 2) is unknown, as I have never gone down to the CTU level before.
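To illustrate how approachable step 1) is at the container level: top-level MP4/MOV boxes are just a big-endian 32-bit size plus a 4-byte type, so a few lines of stdlib Python can walk them. This is a toy sketch, not a replacement for a real demuxer (reaching actual frame bytes requires recursing into `moov`/`mdat` and consulting the sample tables):

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_boxes(stream: BinaryIO) -> Iterator[Tuple[str, int, bytes]]:
    """Yield (box_type, size, payload) for top-level ISO BMFF boxes."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        size, box_type = struct.unpack(">I4s", header)
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the header.
            size = struct.unpack(">Q", stream.read(8))[0]
            payload = stream.read(size - 16)
        else:
            payload = stream.read(size - 8)
        yield box_type.decode("ascii", "replace"), size, payload


# Toy usage on synthetic bytes: a 16-byte 'ftyp' box and an empty 'moov' box.
import io

data = struct.pack(">I4s", 16, b"ftyp") + b"isom\x00\x00\x00\x00"
data += struct.pack(">I4s", 8, b"moov")
boxes = list(iter_boxes(io.BytesIO(data)))
# boxes == [("ftyp", 16, b"isom\x00\x00\x00\x00"), ("moov", 8, b"")]
```

In practice an existing library would handle this layer; the sketch just shows that nothing about the demuxing step requires decoding any video data.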