Benchmark verification vs. transcoding on the same CPU machine

yondonfu commented 4 years ago

We want to know what the current compute speed/cost of verification is relative to transcoding i.e. verification compute speed/cost is X% of transcoding compute speed/cost.

cyberj0g commented 4 years ago

Compared transcoding a 10 sec video from 1080p to 720p with the libx264 codec through the Ffmpeg CLI against verification of the same source/rendition pair on CPU. Verification is only about 15% faster on average (2.6 sec vs 3.0 sec). A call graph of verification with profiling data is attached. Iterating through the video with OpenCV takes 70% of execution time. Speed up can be achieved by deeper integration with Ffmpeg through third party binding like this or developing a C module. Main goals for such integration are seeking support and direct access to PTS timestamps.
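For anyone wanting to reproduce the transcoding side of this comparison, a minimal timing sketch could look like the following. The exact FFmpeg flags used in the test above are not stated in this thread, so the `scale=-2:720` filter, the function names and the repeat count are all assumptions.

```python
import subprocess
import time


def transcode_cmd(src, dst, height=720):
    """Build an FFmpeg CLI call that downscales src to `height` with libx264.

    The flag set is an assumption; the thread does not list the real command.
    """
    return [
        "ffmpeg", "-y",               # overwrite the output without prompting
        "-i", src,
        "-vf", f"scale=-2:{height}",  # keep aspect ratio, force an even width
        "-c:v", "libx264",
        dst,
    ]


def time_call(fn, repeats=3):
    """Return the mean wall-clock seconds of calling fn() `repeats` times."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def benchmark_transcode(src, dst, repeats=3):
    """Time the FFmpeg transcode; requires ffmpeg on PATH."""
    cmd = transcode_cmd(src, dst)
    return time_call(
        lambda: subprocess.run(cmd, check=True, capture_output=True),
        repeats=repeats,
    )
```

The same `time_call` helper can wrap the verification entry point, so that both numbers are wall-clock means measured the same way.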

ndujar commented 4 years ago

> Speed up can be achieved by deeper integration with Ffmpeg through third party binding like this or developing a C module. Main goals for such integration are seeking support and direct access to PTS timestamps.

I think this is closely related to this issue: https://github.com/livepeer/verification-classifier/issues/39

Side experiments were made in that direction (i.e. using seeking to accelerate the process and avoid decoding the whole sequence), but the results were only satisfactory for a large number of frames. In this context, chunks are too short. A potential improvement to the current iterator might be multithreading: https://nrsyed.com/2018/07/05/multithreading-with-opencv-python-to-improve-video-processing-performance/
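The multithreading idea can be sketched as a generic read-ahead wrapper: a worker thread pulls frames from any iterator into a bounded queue while the main thread consumes them. This is only an illustration of the pattern, not the code from the linked article; `prefetch` and `max_queue` are hypothetical names, and any gain depends on the frame source releasing the GIL while it decodes.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(frames, max_queue=16):
    """Yield items from `frames` in order while a worker thread reads ahead."""
    buf = queue.Queue(maxsize=max_queue)

    def worker():
        try:
            for item in frames:
                buf.put(item)   # blocks when the consumer falls behind
        finally:
            buf.put(_DONE)      # always signal the end, even after an error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

A caveat with this sketch: an exception in the worker simply ends the stream early, and a consumer that stops iterating leaves the daemon thread blocked on a full queue until the process exits.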

cyberj0g commented 4 years ago

I have implemented seeking through the ffmpeg CLI with the -ss option, and the above is true in this case: spawning a new ffmpeg process for each frame adds significant overhead, which limits performance gains at high N_samples/N_total ratios. If this obvious overhead is eliminated, seeking should always be faster when the ratio is below some value to be determined experimentally. The remaining overhead is that the decoder has to rebuild its internal state at the requested location by decoding the closest previous I-frame and, possibly, some P-frames. I'd expect the ratio to be around 1/4 worst case. Most current experiments are using 10 seconds videos and 10-30 samples, which gives ratios from 1/8 to 1/60.
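To make the break-even argument concrete, here is a toy cost model. It counts decoded frames only and assumes each seek has to decode a fixed number of frames (`frames_per_seek`, a hypothetical parameter standing in for the distance from the previous I-frame). It ignores seek I/O, B-frames and samples that share a GOP, so it is a rough sketch and not a measurement.

```python
def break_even_ratio(frames_per_seek):
    """N_samples/N_total ratio below which seeking decodes fewer frames."""
    return 1.0 / frames_per_seek


def seeking_is_faster(n_total, n_samples, frames_per_seek):
    """Compare decoded-frame counts: per-sample seeks vs. one sequential scan."""
    seek_cost = n_samples * frames_per_seek  # frames decoded across all seeks
    scan_cost = n_total                      # sequential scan decodes everything
    return seek_cost < scan_cost
```

Under this model a break-even ratio of 1/4 corresponds to about 4 decoded frames per seek; with longer keyframe intervals the threshold drops accordingly.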

cyberj0g commented 4 years ago

Improved video scanning performance by:

- getting the PTS through OpenCV without the need for Ffmpeg
- using VideoCapture.grab() to iterate frames, and retrieving image data only for the necessary frames

The experiment is the same as above; verification time improved to ~1.2 sec and now takes only 40% of transcoding time. For 30 uniformly distributed samples of a 301-frame video, the scan takes roughly the same time as metric computation.
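The scanning approach described above can be sketched roughly as follows. This is not the repository's actual code: `uniform_sample_indices` and `scan_video` are hypothetical names, and the sampling scheme is an assumption. `cv2` is imported lazily so the index helper can be used without OpenCV installed.

```python
def uniform_sample_indices(n_total, n_samples):
    """Return up to n_samples frame indices spread evenly over [0, n_total - 1]."""
    if n_total <= 0 or n_samples <= 0:
        return []
    n_samples = min(n_samples, n_total)
    if n_samples == 1:
        return [0]
    step = (n_total - 1) / (n_samples - 1)
    return sorted({round(i * step) for i in range(n_samples)})


def scan_video(path, wanted):
    """Walk the video with grab() and call retrieve() only for wanted frames.

    Returns a list of (frame_index, pts_ms, image) tuples.
    """
    import cv2  # lazy import: only needed when a video is actually scanned

    wanted = set(wanted)
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"cannot open {path}")
    samples = []
    index = 0
    try:
        while cap.grab():                 # advance one frame, no image returned
            if index in wanted:
                # Position of the grabbed frame in milliseconds
                pts_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
                ok, image = cap.retrieve()  # materialize the image only here
                if ok:
                    samples.append((index, pts_ms, image))
            index += 1
    finally:
        cap.release()
    return samples
```

For the case mentioned above, `uniform_sample_indices(301, 30)` gives 30 indices from 0 to 300, which can be passed to `scan_video` as `wanted`.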

ndujar commented 4 years ago

> Most current experiments are using 10 seconds videos and 10-30 samples, which gives ratios from 1/8 to 1/60.

I believe the chunk limit in Livepeer was 4s. But maybe that has been redefined?

> getting the PTS through OpenCV without the need for Ffmpeg

Interesting. Are you using a different backend for OpenCV? Gstreamer instead?

Oh, I see. VideoCapture::grab doesn't do decoding, unlike VideoCapture::read: https://docs.opencv.org/2.4/modules/highgui/doc/reading_and_writing_images_and_video.html

yondonfu commented 4 years ago

@cyberj0g Thanks for posting the updated results! For reference, what were the specs of the machine that verification and transcoding were run on?

Regarding this point:

> I believe the chunk limit in Livepeer was 4s. But maybe that has been redefined?

The results with 10s chunks are definitely still helpful, but as indicated here, it would also be helpful to test with 2s and 4s chunks since those are more common in Livepeer's live streaming workflows.

> Are you using a different backend for OpenCV? Gstreamer instead?

@ndujar I believe the latest code uses a newer version of OpenCV that contains an update where, when using ffmpeg as the backend, capture.get(cv2.CAP_PROP_POS_MSEC) returns the actual frame PTS instead of a value computed from the frame number and the FPS.

cyberj0g commented 4 years ago

@yondonfu test results are for an Intel Core i7-8750H, 6 cores, HT disabled, DDR4 RAM. I tried to disable parallelization both in Ffmpeg and Numpy to make a single-thread test, but it still generated load on more than one core, so I just let it run unrestricted and got almost 100% CPU utilization.
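If a single-thread run is still wanted, one possible approach is sketched below. Which settings were actually tried is not stated in this thread, so everything here is an assumption: the environment variables are the usual knobs for the OpenMP, OpenBLAS and MKL thread pools that NumPy may use, and `os.sched_setaffinity` is Linux-only. FFmpeg's own thread count can be limited separately with its `-threads 1` option.

```python
import os

# Thread-pool knobs commonly honoured by NumPy's numerical backends
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def single_thread_env(base=None):
    """Return a copy of the environment with backend thread pools limited to 1.

    Pass the result as env= to subprocess.run, or apply it to os.environ
    before NumPy is imported.
    """
    env = dict(os.environ if base is None else base)
    for name in THREAD_ENV_VARS:
        env[name] = "1"
    return env


def pin_to_core(core=0):
    """Pin the current process to one CPU core; returns False if unsupported."""
    if hasattr(os, "sched_setaffinity"):  # available on Linux only
        os.sched_setaffinity(0, {core})   # pid 0 means the calling process
        return True
    return False
```

Pinning is the stricter of the two: child processes inherit the affinity mask, so an FFmpeg process spawned afterwards is confined to the same core no matter how many threads it creates.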

Results for 4 sec video: Verification: 1.14 sec, transcoding: 2.33 sec

Results for 2 sec video: Verification: 0.94 sec, transcoding: 1.35 sec

So transcoding time seems to scale almost linearly with video length, while the verification process has a fixed overhead from loading models on every verification. I plan to refactor verifier.py to make it a class and load models once on initialization.
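The planned refactor could take roughly this shape. It is a sketch of the load-once pattern only, not the eventual verifier.py: `load_model` is a hypothetical zero-argument callable, and the model is assumed to expose a `predict(features)` method.

```python
class Verifier:
    """Holds a model that is loaded once and reused for every verification."""

    def __init__(self, load_model):
        # load_model is a hypothetical callable that does the expensive load
        self.model = load_model()

    def verify(self, features):
        """Run one verification without reloading the model."""
        return self.model.predict(features)
```

With this layout the load cost is paid once per process, so per-chunk verification time should be dominated by the video scan and metric computation.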

cyberj0g commented 4 years ago

Results for 10 samples (all above tests are 30 samples):

Results for 4 sec video: Verification: 0.78 sec, transcoding: 2.15 sec

Results for 2 sec video: Verification: 0.5 sec, transcoding: 1.10 sec
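To answer the original question in the "X% of transcoding" form, the timings posted in this thread can be turned into percentages as below. The 10 sec row uses the approximate ~1.2 sec figure against the 3.0 sec transcoding time from the first test; the function names are only illustrative.

```python
RESULTS = [
    # (chunk seconds, samples, verification sec, transcoding sec)
    (10, 30, 1.2, 3.0),    # verification time is approximate (~1.2 sec)
    (4, 30, 1.14, 2.33),
    (2, 30, 0.94, 1.35),
    (4, 10, 0.78, 2.15),
    (2, 10, 0.50, 1.10),
]


def verification_share(verification_s, transcoding_s):
    """Verification time as a percentage of transcoding time."""
    return 100.0 * verification_s / transcoding_s


def summarize(results=RESULTS):
    """Return (chunk seconds, samples, percentage rounded to 0.1) per row."""
    return [
        (chunk, samples, round(verification_share(v, t), 1))
        for chunk, samples, v, t in results
    ]


if __name__ == "__main__":
    for chunk, samples, pct in summarize():
        print(f"{chunk:>2}s chunk, {samples} samples: "
              f"verification = {pct}% of transcoding time")
```

This gives roughly 40%, 49% and 70% for the 10 sec, 4 sec and 2 sec chunks at 30 samples, and roughly 36% and 45% for the 4 sec and 2 sec chunks at 10 samples.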

livepeer / verification-classifier

Benchmark verification vs. transcoding on the same CPU machine #110