OE-17 "New Filter Engine" discussion

vpisarev commented 6 years ago

the feature request about evolution proposal OE-17

kinchungwong commented 6 years ago

While tile-based and strip-based execution engine is a good addition to OpenCV, they should be seen as complements (in addition) to row-based execution engine. Based on my experience, I take objection to some of the wording and tone, and also to their underlying premise.

I do agree that FilterEngine needs a redesign.

I do not intend to make claims without showing any code to back it up. However, I need to secure my legal release from my former employer, which may take time.

Here is my claim:

Certain image algorithms are most efficient with row-based execution engine and corresponding memory management strategy. Specifically, any algorithms whose coordinate transformation matrix is [ [a, 0], [0, b] ] (for any a, b, including but not necessarily 1.0) (in other words, "non-rotating", or "horizontal and vertical axis preserving". My math terminology may be wrong, sorry about that), will benefit from a row-based strategy.
A correctly-implemented row-based engine can work with strip-based data trivially.
Strip can be considered a special case of Tile, where the number of "horizontal partitions" is one. True tiling refers to both horizontal and vertical partitioning of pixel data.
Image algorithms which are "rotating", including rotation (where angle is not 0 degree modulo 360, and specifically including 90, 80, 270, and all values in between) and non-affine transformations (perspective, log-polar, etc), will benefit from tile-based execution.
Generalized remap benefit from tile-based execution if and only if the remap is a diffeomorphism. Same idea as cache locality. Otherwise, any non-flat memory layout will harm its performance in these situations.
For types of algorithms which can be implemented with both tile-based and row-based, tile-based typically result in more data copying than row-based.

In simple words, tile-based is beneficial to a few algorithms, but we should continue to use row-based execution for all other algorithms.

Tile-based execution may be necessary to inter-operate with GPU, but GPU usage already implies memory-copying. GPU may also impose maximum texture size.

Here is a teaser: imagine having this class, which is a "pool" of "row buffers (aligned)", and imagine algorithms which can be written to exploit such usefulness:

// for simplicity, here we ignore strict-OOP, visibility, etc. 
// We also skip class template, which is trivial to add later.
// Reference counting for each row (either shared_ptr or on top of that) can also be added later.
// "RowBufferU8" and aligned_vector is only for illustration. 
// A correct implementation would need unique_ptr for a row, along with aligned allocator/deallocator.  
// Use of int, size_t not taken cared of carefully here. A real implementation will try to get those correct.
struct RowBufferPool  
{
    using RowBufferU8 = std::aligned_vector<uchar, 16uL>;
    size_t elementCount;
    std::unordered_map<int, RowBufferU8> m_active;
    std::vector<RowBufferU8> m_unused;
public:
    RowBufferU8& add_or_get(int key);
    RowBufferU8* get(int key) const;
    void remove(int key);
 private:
    void ensure_spare(ptrdiff_t count);
    void add_spare();
};

In my experience, this can greatly simplify row-based image processing as well as greatly reduce unnecessary data copying (memory transfer). SIMD code can also be greatly simplified.

If an algorithm involves change of horizontal number of pixels, multiple instances of row buffer pools can be used; one pool for each pixel width (depending on the algorithm). For example, if an algorithm involves doubling of pixel width, then one pool can have width W, another pool can have width 2*W.

If an algorithm involves mixed numerical types, multiple instances of row buffer pools can be used, one for each numeric type.

kinchungwong commented 6 years ago

Please read this write-up carefully.

an illustration of "row buffer" concept with applicability toward OpenCV OE-17

https://gist.github.com/kinchungwong/152f0423698312d8e40870280d6ade7a

kinchungwong commented 6 years ago

Proposed preparation study

Identify all current usage of Filter Engine in OpenCV 3.x
For each current usage, evaluate whether that usage will benefit from:
- Status quo, using the same Filter Engine as in OpenCV 3.x
- A new engine of Vadim's design
- A new engine of Ryan's (kinchungwong) design
- Calling out to a third-party library or implementation
Casually (and collaboratively) look for examples of code in OpenCV which currently do not use Filter Engine, but could potentially benefit from a future implementation of it.
For each future potential use case, identify whether it will benefit best from:
- Status quo (not using any Filter Engine, keeping its code independent of any)
- Existing Filter Engine (from OpenCV 3.x)
- A new design of Vadim's choice
- A new design of Ryan's choice
- A new design of (some future team contractor or code contributor)
- Calling out to a third-party implementation

The reason I say "(someone's) choice", is that the future may benefit from having more than one Filter Engines co-existing. At the minimum, there needs to be two or three: Row-based, Tile-based. The tile-based Filter Engine is further divided into regular tiles (every tile has same width and height) and irregular tiles (arbitrary output rectangle is chosen; the input rectangle is computed by inverting the coordinate mapping function and finding the bounding rectangle) The person advocating a particular design is tasked with explaining the reason.

There is no reason to rush, as the best design may not emerge immediately.

(Disclaimer: despite this, I might rush myself to do this, but I will not push for a decision to be made, as that would impair the collective decision making process.)

One reason not to rush, is that the best design may come from outside - researchers or C++ experts worldwide. We need some outreach to solicit ideas.

kinchungwong commented 6 years ago

@vpisarev @alalek

I would like to know roughly what image processing operations are intended to be within scope of OE-17.

"Fixed window size filtering" (such as convolutions or morphological operation)
"Resampling without rotation" (resize with or without interpolation)
"Tile-based transposition readout" (rotation by 90 degrees or its multiples)
"Remapping" (rotation, affine, and other non-affine coordinate transforms)

Code skeletons:

(Note. code skeletons on my gist have been removed. Please see https://github.com/kinchungwong/imgproc_engine_vs2017/tree/master/source instead.)

(Right now, I am posting code skeletons which don't compile. This is because I am still waiting for a computer which I can start software development. Until then, I will be writing code skeletons with notepad.)

kinchungwong commented 6 years ago

@vpisarev @alalek

The code skeleton have been moved to https://github.com/kinchungwong/imgproc_engine_vs2017/tree/master/source

Thee code skeleton is provided for discussion. The coding style and implementation details currently don't conform to OpenCV standard, but these will be resolved after the basic requirements and design have been confirmed.

As mentioned in the README.md above, I would like to know roughly what image processing operations are intended to be within scope of OE-17.

"Fixed window size filtering" (such as convolutions or morphological operation)
"Resampling without rotation" (resize with or without interpolation)
"Tile-based transposition readout" (rotation by 90 degrees or its multiples)
"Remapping" (rotation, affine, and other non-affine coordinate transforms)

(Edited 2018-03-21)

The documentation for my approach has been moved to https://github.com/kinchungwong/imgproc_engine_design/blob/master/README.md

This is so that the understanding and discussion can take place, without being distracted by issues surrounding the source code.

kinchungwong commented 6 years ago

See #11138 cv::resize with INTER_AREA for 5 or more channels.

kinchungwong commented 6 years ago

Regarding SIMD coding style, the following project and sample usage shows how to use C++ templates and type system to define a general 3x3 convolution for uint8. It is arguably more readable than SIMD intrinsics. The following highlighted snippet generates 16 output pixels, when provided with an input window of 3-row-by-48-column pixels, all of which either in SIMD register or in aligned stack space. The outer loop will then advance 16 pixels at a time to generate an entire output row. Memory I/O is minimized. I have checked the disassembly with the MSVC compiler, and I will check with GCC soon.

https://github.com/kinchungwong/simd_style_helper/blob/2496282e6dafc430af59757937a6825d650e841a/simd_style_helper_experimental.h#L171-L197

kinchungwong commented 6 years ago

I have decided to abandon the coding style mentioned in my previous comments.

My current recommended coding style for the SIMD moving window kernel is here:

~~https://gist.github.com/kinchungwong/94a090ace1bdd67c02cf932c4a4b8a2a~~ Updated https://gist.github.com/kinchungwong/ff2392d9454c52253cbfc85b8075bb22 (2018-12-08)

The main idea is that it is not necessary to define superfluous union-structs for holding arrays of vectors. Instead, just use C++11 std::array

// nv = number of vectors (SIMD registers) in this array
template <typename vec_type, int nv>
using v_arr = std::array<vec_type, nv>;

// nvr = number of vectors (SIMD regs) in each row of this 2d array
// nvc = number of vectors (SIMD regs) in each column of this 2d array
template <typename vec_type, int nvr, int nvc> 
using v_arr2d = std::array<std::array<vec_type, nvc>, nvr>;

Functions that extract or insert SIMD worth of data from/to a larger SIMD array will accept std::array as argument.

template <typename vec_type, int nvr, int nvc>
struct v_arr_center_info
{
    enum
    {
        center_row = ((nvr - 1) / 2),
        lanes_per_vec = vec_type::nlanes,
        lanes_per_row = (lanes_per_vec * nvc),
        center_col = ((lanes_per_row - lanes_per_vec) / 2),
    };
};

template <int row_offset, int col_offset, typename vec_type, int nvr, int nvc>
inline _Tpvec v_extract_around_center(const v_arr2d<vec_type, nvr, nvc>& input)
{
    using center_info = v_arr_center_info<vec_type, nvr, nvc>;
    return v_extract<
        (row_offset + center_info::center_row), 
        (col_offset + center_info::center_col), 
        vec_type, nvr, nvc>(input);
}

Further suggestions to make the user's code (i.e. code that implement CPU-SIMD kernels for image processing algorithms) more elegant are welcome.

FantasqueX commented 2 months ago

@kinchungwong Hi, I'm interested in your proposal. However, I cannot access your code currently. Is it possible to make those code public again :)

kinchungwong commented 2 months ago

@kinchungwong Hi, I'm interested in your proposal. However, I cannot access your code currently. Is it possible to make those code public again :)

Sorry, thanks. I've made it public again. The notes hasn't been updated for 6 years and were quite out of date. Be sure to check with OpenCV maintainers first, because I suspect they have already made a lot of progress that were not reflected in my notes.

opencv / opencv

OE-17 "New Filter Engine" discussion #11012

an illustration of "row buffer" concept with applicability toward OpenCV OE-17

Proposed preparation study