@oscar-davids Question regarding these points from your post on the CUDA implementation:
First, calculate the signature and download the binary to the host CPU per frame, make a linked list of them on the CPU, and write it to disk at the end of each segment.
While both approaches sound plausible, I am leaning towards the first approach for the following reasons:
1. Managing a linked list in GPU memory is tricky and is not recommended.
2. Downloading signature data as soon as possible and freeing GPU memory will reduce traffic load from GPU memory to CPU memory and also reduce GPU usage.
Do you have a rough estimate of the byte size of each frame signature that needs to be downloaded to the CPU per frame? I'm assuming that, relative to the size of the decompressed frame, the frame signature is fairly small, but I think it'd be useful to get a sense of the average per-frame data download onto the CPU that this approach will require. If the frame signature is really small then the per-frame download onto the CPU may not be a problem, but since we're still downloading data to the CPU with this approach it would be good to verify that (to avoid the worst case of the max PCIe bandwidth still being a problem).
You are right. The byte size of each frame signature is 2784 (348 x 8-byte uint64_t values). First I am going to write rough test code without the algorithm and estimate the speed. If there is no problem, I will complete the rest of the algorithm.
I estimated the download speed from GPU to CPU; the download size was 3072 bytes. The benchmarking log files are attached. I think reducing the download size is effective.
- singlelane_fulllog_singonmode.log
- singlelane_fulllog_singoffmode.log
- 8lanes_fulllog_singonmode.log
- 8lanes_fulllog_singoffmode.log
The motivation for the CUDA implementation is to overcome the PCIe lane bottleneck caused by the large amount of GPU<->CPU data copying.
In an attempt to roughly estimate the performance gain after the CUDA implementation, I originally planned to write skeleton code that skips the signature calculation and outputs data of the same size as the actual signature. However, given that the signature filter shrinks raw frames to 32x32 and applies a filter to get 348 int64 values (2784 bytes), I thought that applying the scale_cuda filter to the renditions before we download them to the CPU could reduce the size of the data that travels from GPU to CPU.
Applying scale_cuda(w=64:h=32) -> hwdownload with nv12 output, the size of the data to be downloaded to the CPU is 3072 bytes (64 x 32 luma bytes + 64 x 16 chroma bytes), which is almost the same as the actual size of the signature (2784 bytes).
My thought was that applying the scale_cuda filter to the renditions before downloading frames to the CPU would reduce the data size, and we could use this experiment to roughly estimate the performance gain under the assumption that the main bottleneck is the PCIe lanes. If, ideally, the accuracy were as good as the original, we wouldn't even need the CUDA implementation but could just add the scale_cuda filter before downloading raw frames to the CPU.
The test results can be viewed from two angles - the performance perspective and the accuracy perspective.
From here, my conclusion is that the approach looks good from a performance perspective (only ~3KB transferred per frame), but not good enough from an accuracy perspective.
Thanks for the summary @oscar-davids!
I agree with the conclusion that ~3KB transfer per frame is much better compared to downloading raw un-scaled frames.
If, ideally, the accuracy were as good as the original, we wouldn't even need the CUDA implementation but could just add the scale_cuda filter before downloading raw frames to the CPU.
Although the test results were good in terms of performance, the accuracy turned out to be low.
The accuracy problem needs more inspection. If it turns out that applying the scale filter to downsize the raw frames affects the signature, I will implement both the scaling of raw frames to 32x32 and the filter application (to avoid confusion, 'filter' here means the image filter, also termed a kernel in image processing) in CUDA, and download only the signature to the CPU.
If I understand correctly - the plan was to inspect in detail whether CUDA scaling combined with the CPU-based mpeg7 filter (scale_cuda(w=32:h=32) -> hwdownload -> signature) is accurate enough for our purposes - and in the initial testing performed it wasn't very accurate.
Taking a look at how the CPU mpeg7 signature filter implements scaling - it seems like they use box sampling interpolation, where each pixel in the scaled-down 32x32 image is obtained by averaging the brightness information of all the pixels in the original image that correspond to that pixel location. This is different from the default nearest-neighbour interpolation used by the scale_cuda filter - which discards pixels from the original image while scaling down. I think this explains why accuracy was poor in the tests. Using other interpolations that scale_cuda supports, like bi-linear or bi-cubic, might give slightly better accuracy, but even those will discard many pixels when going from a big image down to 32x32.
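To illustrate the idea, here is a minimal CUDA sketch of box-sampling downscaling - each output pixel averages every source pixel in the rectangle that maps onto it. This is not FFmpeg's actual filter code; the buffer layout, names and launch parameters are assumptions for illustration. (As discussed later in the thread, the signature filter's internal scaling keeps the block sum rather than the average.)

```c
// Minimal box-sampling (area-average) downscale sketch for a single 8-bit
// plane. Assumes the source is at least as large as the destination.
#include <cuda_runtime.h>
#include <stdint.h>

__global__ void box_downscale_kernel(const uint8_t *src, int src_w, int src_h, int src_pitch,
                                     uint8_t *dst, int dst_w, int dst_h, int dst_pitch)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;
    int dy = blockIdx.y * blockDim.y + threadIdx.y;
    if (dx >= dst_w || dy >= dst_h)
        return;

    // Source rectangle covered by this destination pixel.
    int x0 = (dx * src_w) / dst_w;
    int x1 = ((dx + 1) * src_w) / dst_w;
    int y0 = (dy * src_h) / dst_h;
    int y1 = ((dy + 1) * src_h) / dst_h;

    unsigned int sum = 0;
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            sum += src[y * src_pitch + x];

    int count = (x1 - x0) * (y1 - y0);
    if (count < 1)
        count = 1;
    // Average over the box; the signature filter would keep `sum` instead.
    dst[dy * dst_pitch + dx] = (uint8_t)(sum / count);
}

// Example launch for scaling a luma plane down to 32x32:
//   dim3 block(8, 8);
//   dim3 grid((32 + block.x - 1) / block.x, (32 + block.y - 1) / block.y);
//   box_downscale_kernel<<<grid, block>>>(d_src, src_w, src_h, src_pitch,
//                                         d_dst, 32, 32, 32);
```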
Given all of that - I agree with the plan to implement a similar "Box sampling" scaling inside the CUDA kernel of our new mpeg7 signature filter, as was done in the CPU filter. WDYT?
Given all of that - I agree with the plan to implement a similar "Box sampling" scaling inside the CUDA kernel of our new mpeg7 signature filter, as was done in the CPU filter. WDYT?
Thinking more about this, maybe we should first implement only the scaling in a CUDA kernel, and try again with copying the 32x32 pixel image to the CPU and using the existing signature filter.
If that performs poorly or is slow, only then should we move to implementing the mpeg7 kernels in CUDA as well. It might save a lot of effort and time!
Right. Today I tested the scale_cuda(w=32:h=32) -> hwdownload -> signature filter chain, but the result is poor. I found why the accuracy is low: the reason is a difference in implementation between the Sum scale and the Normal scale. So I am going to implement the algorithm parts in the new mpeg7 signature filter.
The reason is a difference in implementation between the Sum scale and the Normal scale.
@oscar-davids I'm not sure that I totally understand your previous comment. A few questions:
Are the two implementations that differ here the scaling implementation in the scale_cuda filter and the 32x32 scaling implementation in the signature filter?
Yes, that's right.
What is the relevance of the Sum scale and Normal scale here - does one implementation use one scale and the other implementation use the other scale?
Sum scale means that one pixel of the scaled image is the sum of all pixel values within a rectangle (OrgW/32 x OrgH/32 in our case).
Are you planning on following this suggestion or are you planning to use a different approach?
Correct. I agree with @jailuthra's suggestion in order to reduce the amount of work; we either need to write a new signature filter with the SUM scale or add a parameter to the original signature filter.
I spoke with @oscar-davids offline about the current plan. Here is a summary (cc @jailuthra):
1. Implement the box/sum sampling downscale on the GPU so that renditions are scaled to 32x32 before being downloaded to the CPU.
2. Modify the signature filter so that its internal scaling can be skipped when it is fed already-scaled 32x32 frames.
Regarding 1 - I wonder if we can add a box sampling kernel to the file that contains all the algorithms currently supported by scale_cuda, and then add box sampling as an interp_algo option for scale_cuda?
Regarding 2 - I wonder if we can add a scale boolean option to the signature filter that defaults to true, and then we can manually set it to false to disable scaling within the signature filter.
The end filtergraph workflow that we're working towards is: scale_cuda (32x32) -> hwdownload -> signature
I wonder if we can add a box sampling kernel to the file that contains all the algorithms currently supported by scale_cuda and then add box sampling as an interp_algo option for scale_cuda?
Sure! That sounds much cleaner than writing a new filter - and can also be a good patch for upstream as it is a helpful interpolation algo in general.
Regarding 2 - I wonder if we can add a scale boolean option to the signature filter that defaults to true and then we can manually set it to false to disable scaling within the signature filter.
Makes sense. I think even sending a 32x32 image to the current state of the signature filter should theoretically work - although it will lead to some extra calculations around creating the new scaled-down picture buffer. So adding a new boolean option to skip those calculations sounds good!
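As a rough sketch of what these two proposed options could look like - not the actual patch - the option tables might be extended roughly as below. The names (the box value, INTERP_ALGO_BOX, the scale boolean, and the OFFSET/FLAGS macros) follow FFmpeg's usual AVOption conventions but are assumptions here, not confirmed implementation details.

```c
/* Illustrative sketch only - option and constant names are assumptions. */

/* 1. Expose box (area) sampling as a new interp_algo value on scale_cuda.
 *    INTERP_ALGO_BOX would be a new enum value backed by the new kernel. */
static const AVOption scale_cuda_options[] = {
    { "interp_algo", "Interpolation algorithm used for resizing",
      OFFSET(interp_algo), AV_OPT_TYPE_INT, { .i64 = INTERP_ALGO_DEFAULT },
      0, INTERP_ALGO_COUNT - 1, FLAGS, "interp_algo" },
        { "nearest",  "nearest neighbour",   0, AV_OPT_TYPE_CONST, { .i64 = INTERP_ALGO_NEAREST  }, 0, 0, FLAGS, "interp_algo" },
        { "bilinear", "bilinear",            0, AV_OPT_TYPE_CONST, { .i64 = INTERP_ALGO_BILINEAR }, 0, 0, FLAGS, "interp_algo" },
        { "box",      "box (area) sampling", 0, AV_OPT_TYPE_CONST, { .i64 = INTERP_ALGO_BOX      }, 0, 0, FLAGS, "interp_algo" },
    { NULL },
};

/* 2. A "scale" boolean on the signature filter, defaulting to true, so the
 *    internal 32x32 downscale can be disabled for pre-scaled input. */
static const AVOption signature_options[] = {
    /* ...existing options... */
    { "scale", "scale input to 32x32 internally before computing the signature",
      OFFSET(scale), AV_OPT_TYPE_BOOL, { .i64 = 1 }, 0, 1, FLAGS },
    { NULL },
};
```

With something like that in place, the end filtergraph above would be expressed as something like scale_cuda=32:32:interp_algo=box,hwdownload,format=nv12,signature=scale=false (exact option spelling to be settled in the patch).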
New signature filter benchmarking result:
I think the result is good. The one shortcoming is that the comparison accuracy between the GPU signature results and the CPU's is 98.33 percent, due to differences between GPU and CPU calculation. Here is the test script for accuracy.
|  | Sign On (max session) | Sign Off (max session) |
| -- | -- | -- |
| 1 lane GPU | 22 | 22 |
| 8 lane GPU | 19 | 19 |
Based on https://github.com/livepeer/internal-project-tracking/issues/161.