alex77g2 commented 1 year ago

Describe the feature and motivation

Usually/often both Sobel X and Y are needed together (considered usecase here). This applies to all Arch (e.g. ARM, x86, x64, OpenCL, CUDA, ..), if ImageSize > L1-Cache. Principal idea is to go from image pipelining (finish each operator, before next one starts) to row-pipelining (each row enters L1 only once). General benefits: much faster (with typical L1 cache sizes) + less lines of code + less temporal memory. Edge-detection is a major method in CV in general, and Sobel is taken here as example.

Additional context

The Issue applies to both Python and C-Interface This Interface Code should only explain the idea, it is not intended to be final.

1. the grayscale case (only 1 channel) -- [RGB/BGR see below]

ddepth = cv2.CV_16S # no important now
gray = cv2.imread(image, CV_LOAD_IMAGE_GRAYSCALE) # cv2.IMREAD_COLOR
grad_x = cv2.Sobel(gray, ddepth, 1, 0, ksize=3, scale=1, delta=0, borderType=cv2.BORDER_DEFAULT)
grad_y = cv2.Sobel(gray, ddepth, 0, 1, ksize=3, scale=1, delta=0, borderType=cv2.BORDER_DEFAULT)
# a typical Image (8bit, 1..8 Megapixels) fits nowadays into L3-cache (some Megabytes), but not into L1-cache (32..100 KB)
# if you calculate both Sobels successively, SobelY still works from L3
# but if you would work in in parallel (e.g. Line1_gx, Line1_gy, Line2_gx, Line2_gy, ..) you would gain L1-speed for GradY
grad_x,grad_y = SobelGradBoth(gray, ksize) # possible python syntax

2. grayscale cases, but combinations of Sobel X+Y

# 2a) grad = abs(grad_x) + abs(grad_y)
abs_grad_x = cv.convertScaleAbs(grad_x)
abs_grad_y = cv.convertScaleAbs(grad_y)
grad = cv.addWeighted(abs_grad_x, 0.5, abs_grad_y, 0.5, 0)
grad8= cv.addWeighted(abs_grad_x, 0.25, abs_grad_y, 0.25, 0) # (in CV_8U) even better, as pixel-bandwidth limits CV
# a single optimized call would merge 5 runs (2x Sobel, 2x convertScaleAbs, 1x addWeighted) into one (and gain L1-speed for 4 of them)
grad = GradSumFromSobels(gray, ksize, OUT_FORMAT) # new possible syntax

# 2b) thres_8U = threshold(dx^2 + dy^2, thresh*thresh, 255, cv.THRESH_BINARY)
# same here, including extra 16 to 8 bit method (thres_8U is similar to Canny without NMS)

# 2c) Magnitude^2 = (dx^2 + dy^2)
mag2_16 = Magn2FromSobels(dx^2 + dy^2) # 16S or 16U
mag_scl8U  = Magn8FromSobels(dx^2 + dy^2) / 512 # (255^2 + 255^2) /510 = 255, 512=2^9
# same here = many operations together (6 vs 1)

# 2d) mag = sqrt(dx^2 + dy^2) (using cv::magnitude, but caution sqrt is slow)
mag_32F = Magn32F_FromSobels(dx^2 + dy^2)
mag_8F  = Magn8F_FromSobels(dx^2 + dy^2) # Minifloat, less memory + faster sqrt
# same here (and even more improvments thinkable by avoiding sqrt)
# like: log2_mag = log2(dx^2 + dy^2) * 0.5 # log2( CV_16U ) = __builtin_clz() or LZCNT()
# benefit: fast (no sqrt), similar information (rounded), output could be CV_8U (skip /2)
# optional: direction bits (part of 8U byte), often direction information can be rough (strong rounded)
# 2d1) 1 bit direction: bit = abs(dx) > abs(dy)  [e.g. bit array 7+1]
# 2d2) 2 bit direction: bit1= sign(dx), bit2 = sign(dy) // a fast atan2 with maxerror=45deg

3. Color images, RGB/BGR (3 channel)

# single direction, gain = 9 : 1
sobel_sum_x = abs(grad_red_x) + abs(grad_green_x) + abs(grad_blue_x) = SobelRgbX(img,ksize) # out = 16U & 8U [scaled /1024]
sobel_sum_y = abs(grad_red_y) + abs(grad_green_y) + abs(grad_blue_y) = SobelRgbY(img,ksize) # out = 16U & 8U [scaled /1024]
# both direction, gain = 19 : 1  !!!!
grad_rgb = sobel_sum_x + sobel_sum_y = SobelRgbXandY(img,ksize) # output = 16U & 8U [scaled /2048 or /(6.0*255) ]
mag2_rgb = Magn2_FromSobelRgb(dx^2 + dy^2) # output = 16U
magT_rgb = MagnT_FromSobelRgb(dx^2 + dy^2 > thresh*thresh) # THRESH_BINARY

Thanks for reading.

asmorkalov commented 1 year ago

@alex77g2 Please take a look on G-API.

alex77g2 commented 1 year ago

@alex77g2 Please take a look on G-API.

Thanks. But 2 points I'm still not sure:

Does G-API-piplining also mean, that temporal memory can be saved? (meaning: if only the final result-image is needed, no megabytes for the in-between steps is reserved/wasted (because it is done within single lines).
From my understanding, an hand-optimized version of an many-in-one method, should still improve much. Especially RGB-image to combined edge-magnitude. Auto-pipelining should only solve the L1-cache limit. Do you belive that G-API will reach the same level of possible performance?

I understand, that not all CV-algo-combinations can be hand-optimized. But Sobel-pairs is the start of a huge family of CV-algorithms. So it should be concidered separately.

Please consider these points again.

asmorkalov commented 1 year ago

cc @TolyaTalamanov @dmatveev

dmatveev commented 1 year ago

Principal idea is to go from image pipelining (finish each operator, before next one starts) to row-pipelining (each row enters L1 only once). General benefits: much faster (with typical L1 cache sizes) + less lines of code + less temporal memory. Edge-detection is a major method in CV in general, and Sobel is taken here as example.

This is exactly what G-API does with the Fluid backend.

dmatveev commented 1 year ago

Does G-API-piplining also mean, that temporal memory can be saved? (meaning: if only the final result-image is needed, no megabytes for the in-between steps is reserved/wasted (because it is done within single lines).

Exactly. With the Fluid backend, the intermediate memory is optimized out.

From my understanding, an hand-optimized version of an many-in-one method, should still improve much. Especially RGB-image to combined edge-magnitude. Auto-pipelining should only solve the L1-cache limit. Do you belive that G-API will reach the same level of possible performance?

There's SobelXY in G-API: https://github.com/opencv/opencv/blob/4.x/modules/gapi/src/backends/fluid/gfluidimgproc.cpp#L1061

It is currently limited to 3x3 window size but the engine supports any window size and contributions are welcome!

alex77g2 commented 1 year ago

Does G-API-piplining also mean, that temporal memory can be saved? (meaning: if only the final result-image is needed, no megabytes for the in-between steps is reserved/wasted (because it is done within single lines).

Exactly. With the Fluid backend, the intermediate memory is optimized out.

From my understanding, an hand-optimized version of an many-in-one method, should still improve much. Especially RGB-image to combined edge-magnitude. Auto-pipelining should only solve the L1-cache limit. Do you belive that G-API will reach the same level of possible performance?

There's SobelXY in G-API: https://github.com/opencv/opencv/blob/4.x/modules/gapi/src/backends/fluid/gfluidimgproc.cpp#L1061

It is currently limited to 3x3 window size but the engine supports any window size and contributions are welcome!

Thanks for details about G-API. There is a mathematical problem with Sobel 5x5, which I think is the reason for seldom usage. Sobel 5x5 writen as separated product, consists of a blur vector in one direction and a difference in the orthogonal direction. The Blur usually is a rounded gaussian, e.g. [1,2,4,2,1], but also used [1,1,1,1,1] or [1,1,2,1,1] or others. Fine until here. But in gradient direction I'm aware of different approaches: [+2,+1, 0,-1,-2] and [+1,+2, 0,-2,-1]-like delta-vectors. In most cases Blur(3x3) before Sobel(3x3) give a reasonable and symmetric result.

opencv / opencv

Sobel XandY-combining for performance #24091

Describe the feature and motivation

Additional context

1. the grayscale case (only 1 channel) -- [RGB/BGR see below]

2. grayscale cases, but combinations of Sobel X+Y

3. Color images, RGB/BGR (3 channel)