@oscar-davids Question regarding these points from your post on the CUDA implementation:
First, calculate the signature and download the binary to the host CPU per frame, make a linked list of them on the CPU, and write it to disk at the end of each segment.
While both approaches sound plausible, I am leaning towards the first approach for the following reasons:
1. Managing a linked list in GPU memory is tricky and is not recommended.
2. Downloading signature data as soon as possible and freeing GPU memory will reduce traffic load from GPU memory to CPU memory and also reduce GPU usage.
Do you have a rough estimate of the byte size of each frame signature that needs to be downloaded to the CPU per frame? I'm assuming that, relative to the size of the decompressed frame, the frame signature is fairly small, but I think it'd be useful to get a sense of the average per-frame data download onto the CPU that this approach will require. If the frame signature is really small then the per-frame download onto the CPU may not be a problem, but since we're still downloading data to the CPU with this approach it would be good to verify that (to avoid the worst case of the max PCIe bandwidth still being a problem).
You are right. The byte size of each frame signature is 2784 (348 x 8-byte uint64_t values). First I am going to write rough test code without the algorithm and estimate the speed. If there is no problem, I will complete the rest of the algorithm.
I estimated the download speed from GPU to CPU; the download size was 3072 bytes. The benchmarking log files are attached. I think reducing the download size is effective.
- singlelane_fulllog_singonmode.log
- singlelane_fulllog_singoffmode.log
- 8lanes_fulllog_singonmode.log
- 8lanes_fulllog_singoffmode.log
The motivation for the CUDA implementation is to overcome the PCIe lane bottleneck caused by the large amount of GPU<->CPU data copying.
In an attempt to roughly estimate the performance gain after the CUDA implementation, I originally planned to write skeleton code that skips the signature calculation and outputs data of the same size as the actual signature. However, given that the signature filter shrinks raw frames to 32x32 and applies a filter to get 348 int64 values (2784 bytes), I thought that applying the scale_cuda filter to the renditions before we download them to the CPU could reduce the size of the data that travels from GPU to CPU.
Applying scale_cuda(w=64:h=32) -> hwdownload with nv12 output, the size of the data to be downloaded to the CPU is 3072 bytes (64 x 32 luma bytes + 64 x 16 chroma bytes), which is almost the same as the actual size of the signature (2784 bytes).
My thought was that applying the scale_cuda filter to the renditions before downloading frames to the CPU would reduce the data size, and we could use this experiment to roughly estimate the performance gain under the assumption that the main bottleneck is the PCIe lanes. If, ideally, the accuracy were as good as the original, we wouldn't even need the CUDA implementation but could just add the scale_cuda filter before downloading raw frames to the CPU.
The test results can be viewed from two angles - the performance perspective and the accuracy perspective.
From here, my conclusion is that the approach looks good from a performance perspective (only ~3KB transferred per frame), but not good enough from an accuracy perspective.
Thanks for the summary @oscar-davids!
I agree with the conclusion that ~3KB transfer per frame is much better compared to downloading raw un-scaled frames.
If, ideally, the accuracy were as good as the original, we wouldn't even need the CUDA implementation but could just add the scale_cuda filter before downloading raw frames to the CPU.
Although the test results were good in terms of performance, the accuracy turned out to be low.
The accuracy problem needs more inspection. If it turns out that applying the scale filter to downsize the raw frames affects the signature, I will implement both the scaling of raw frames to 32x32 and the filter application (to avoid confusion, 'filter' here means the image filter, also termed a kernel in image processing) in CUDA, and download only the signature to the CPU.
If I understand correctly - the plan was to inspect in detail whether CUDA scaling combined with the CPU-based mpeg7 filter (scale_cuda(w=32:h=32) -> hwdownload -> signature) is accurate enough for our purposes - and in the initial testing performed it wasn't very accurate.
Taking a look at how the CPU mpeg7 signature filter implements scaling - it seems like they use box sampling interpolation, where each pixel in the scaled-down 32x32 image is obtained by averaging the brightness information of all the pixels in the original image that correspond to that pixel location. This is different from the default nearest-neighbour interpolation used by the scale_cuda filter - which discards pixels from the original image while scaling down. I think this explains why accuracy was poor in the tests. Using other interpolations that scale_cuda supports, like bi-linear or bi-cubic, might give slightly better accuracy, but even those will discard many pixels when going from a big image down to 32x32.
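To illustrate the idea, here is a minimal CUDA sketch of box-sampling downscaling - each output pixel averages every source pixel in the rectangle that maps onto it. This is not FFmpeg's actual filter code; the buffer layout, names and launch parameters are assumptions for illustration. (As discussed later in the thread, the signature filter's internal scaling keeps the block sum rather than the average.)

```c
// Minimal box-sampling (area-average) downscale sketch for a single 8-bit
// plane. Assumes the source is at least as large as the destination.
#include <cuda_runtime.h>
#include <stdint.h>

__global__ void box_downscale_kernel(const uint8_t *src, int src_w, int src_h, int src_pitch,
                                     uint8_t *dst, int dst_w, int dst_h, int dst_pitch)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;
    int dy = blockIdx.y * blockDim.y + threadIdx.y;
    if (dx >= dst_w || dy >= dst_h)
        return;

    // Source rectangle covered by this destination pixel.
    int x0 = (dx * src_w) / dst_w;
    int x1 = ((dx + 1) * src_w) / dst_w;
    int y0 = (dy * src_h) / dst_h;
    int y1 = ((dy + 1) * src_h) / dst_h;

    unsigned int sum = 0;
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            sum += src[y * src_pitch + x];

    int count = (x1 - x0) * (y1 - y0);
    if (count < 1)
        count = 1;
    // Average over the box; the signature filter would keep `sum` instead.
    dst[dy * dst_pitch + dx] = (uint8_t)(sum / count);
}

// Example launch for scaling a luma plane down to 32x32:
//   dim3 block(8, 8);
//   dim3 grid((32 + block.x - 1) / block.x, (32 + block.y - 1) / block.y);
//   box_downscale_kernel<<<grid, block>>>(d_src, src_w, src_h, src_pitch,
//                                         d_dst, 32, 32, 32);
```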
Given all of that - I agree with the plan to implement a similar "Box sampling" scaling inside the CUDA kernel of our new mpeg7 signature filter, as was done in the CPU filter. WDYT?
Given all of that - I agree with the plan to implement a similar "Box sampling" scaling inside the CUDA kernel of our new mpeg7 signature filter, as was done in the CPU filter. WDYT?
Thinking more about this, maybe we should first implement only the scaling in a CUDA kernel, and try again with copying the 32x32 pixel image to the CPU and using the existing signature filter.
If that performs poorly or is slow, only then should we move to implementing the mpeg7 kernels in CUDA as well. It might save a lot of effort and time!
Right. Today I tested the scale_cuda(w=32:h=32) -> hwdownload -> signature filter chain, but the result is poor. I found why the accuracy is low: the reason is a difference in implementation between the Sum scale and the Normal scale. So I am going to implement the algorithm parts in the new mpeg7 signature filter.
The reason is a difference in implementation between the Sum scale and the Normal scale.
@oscar-davids I'm not sure that I totally understand your previous comment. A few questions:
Are the two implementations that differ here the scaling implementation in the scale_cuda filter and the 32x32 scaling implementation in the signature filter?
Yes, that's right.
What is the relevance of the Sum scale and Normal scale here - does one implementation use one scale and the other implementation use the other scale?
Sum scale means that one pixel of the scaled image is the sum of all pixel values within a rectangle (OrgW/32 x OrgH/32 in our case).
Are you planning on following this suggestion or are you planning to use a different approach?
Correct. I agree with @jailuthra's suggestion in order to reduce the amount of work; we either need to write a new signature filter with the SUM scale or add a parameter to the original signature filter.
I spoke with @oscar-davids offline about the current plan. Here is a summary (cc @jailuthra):
1. Implement the box/sum sampling downscale on the GPU so that renditions are scaled to 32x32 before being downloaded to the CPU.
2. Modify the signature filter so that its internal scaling can be skipped when it is fed already-scaled 32x32 frames.
Regarding 1 - I wonder if we can add a box sampling kernel to the file that contains all the algorithms currently supported by scale_cuda, and then add box sampling as an interp_algo option for scale_cuda?
Regarding 2 - I wonder if we can add a scale boolean option to the signature filter that defaults to true, and then we can manually set it to false to disable scaling within the signature filter.
The end filtergraph workflow that we're working towards is: scale_cuda (32x32) -> hwdownload -> signature
I wonder if we can add a box sampling kernel to the file that contains all the algorithms currently supported by scale_cuda and then add box sampling as an interp_algo option for scale_cuda?
Sure! That sounds much cleaner than writing a new filter - and can also be a good patch for upstream as it is a helpful interpolation algo in general.
Regarding 2 - I wonder if we can add a scale boolean option to the signature filter that defaults to true and then we can manually set it to false to disable scaling within the signature filter.
Makes sense. I think even sending a 32x32 image to the current state of the signature filter should theoretically work - although it will lead to some extra calculations around creating the new scaled-down picture buffer. So adding a new boolean option to skip those calculations sounds good!
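As a rough sketch of what these two proposed options could look like - not the actual patch - the option tables might be extended roughly as below. The names (the box value, INTERP_ALGO_BOX, the scale boolean, and the OFFSET/FLAGS macros) follow FFmpeg's usual AVOption conventions but are assumptions here, not confirmed implementation details.

```c
/* Illustrative sketch only - option and constant names are assumptions. */

/* 1. Expose box (area) sampling as a new interp_algo value on scale_cuda.
 *    INTERP_ALGO_BOX would be a new enum value backed by the new kernel. */
static const AVOption scale_cuda_options[] = {
    { "interp_algo", "Interpolation algorithm used for resizing",
      OFFSET(interp_algo), AV_OPT_TYPE_INT, { .i64 = INTERP_ALGO_DEFAULT },
      0, INTERP_ALGO_COUNT - 1, FLAGS, "interp_algo" },
        { "nearest",  "nearest neighbour",   0, AV_OPT_TYPE_CONST, { .i64 = INTERP_ALGO_NEAREST  }, 0, 0, FLAGS, "interp_algo" },
        { "bilinear", "bilinear",            0, AV_OPT_TYPE_CONST, { .i64 = INTERP_ALGO_BILINEAR }, 0, 0, FLAGS, "interp_algo" },
        { "box",      "box (area) sampling", 0, AV_OPT_TYPE_CONST, { .i64 = INTERP_ALGO_BOX      }, 0, 0, FLAGS, "interp_algo" },
    { NULL },
};

/* 2. A "scale" boolean on the signature filter, defaulting to true, so the
 *    internal 32x32 downscale can be disabled for pre-scaled input. */
static const AVOption signature_options[] = {
    /* ...existing options... */
    { "scale", "scale input to 32x32 internally before computing the signature",
      OFFSET(scale), AV_OPT_TYPE_BOOL, { .i64 = 1 }, 0, 1, FLAGS },
    { NULL },
};
```

With something like that in place, the end filtergraph above would be expressed as something like scale_cuda=32:32:interp_algo=box,hwdownload,format=nv12,signature=scale=false (exact option spelling to be settled in the patch).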
New signature filter benchmarking result:
I think the result is good. The one shortcoming is that the comparison accuracy between the GPU signature results and the CPU's is 98.33 percent, due to differences between GPU and CPU calculation. Here is the test script for accuracy.
|  | Sign On (max session) | Sign Off (max session) |
| -- | -- | -- |
| 1 lane GPU | 22 | 22 |
| 8 lane GPU | 19 | 19 |
Based on https://github.com/livepeer/internal-project-tracking/issues/161.