opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0
78.89k stars 55.8k forks source link

OE-17 "New Filter Engine" discussion #11012

Open vpisarev opened 6 years ago

vpisarev commented 6 years ago

the feature request about evolution proposal OE-17

kinchungwong commented 6 years ago

While tile-based and strip-based execution engine is a good addition to OpenCV, they should be seen as complements (in addition) to row-based execution engine. Based on my experience, I take objection to some of the wording and tone, and also to their underlying premise.

I do agree that FilterEngine needs a redesign.

I do not intend to make claims without showing any code to back it up. However, I need to secure my legal release from my former employer, which may take time.

Here is my claim:

In simple words, tile-based is beneficial to a few algorithms, but we should continue to use row-based execution for all other algorithms.

Tile-based execution may be necessary to inter-operate with GPU, but GPU usage already implies memory-copying. GPU may also impose maximum texture size.


Here is a teaser: imagine having this class, which is a "pool" of "row buffers (aligned)", and imagine algorithms which can be written to exploit such usefulness:

// for simplicity, here we ignore strict-OOP, visibility, etc. 
// We also skip class template, which is trivial to add later.
// Reference counting for each row (either shared_ptr or on top of that) can also be added later.
// "RowBufferU8" and aligned_vector is only for illustration. 
// A correct implementation would need unique_ptr for a row, along with aligned allocator/deallocator.  
// Use of int, size_t not taken cared of carefully here. A real implementation will try to get those correct.
struct RowBufferPool  
{
    using RowBufferU8 = std::aligned_vector<uchar, 16uL>;
    size_t elementCount;
    std::unordered_map<int, RowBufferU8> m_active;
    std::vector<RowBufferU8> m_unused;
public:
    RowBufferU8& add_or_get(int key);
    RowBufferU8* get(int key) const;
    void remove(int key);
 private:
    void ensure_spare(ptrdiff_t count);
    void add_spare();
};

In my experience, this can greatly simplify row-based image processing as well as greatly reduce unnecessary data copying (memory transfer). SIMD code can also be greatly simplified.

If an algorithm involves change of horizontal number of pixels, multiple instances of row buffer pools can be used; one pool for each pixel width (depending on the algorithm). For example, if an algorithm involves doubling of pixel width, then one pool can have width W, another pool can have width 2*W.

If an algorithm involves mixed numerical types, multiple instances of row buffer pools can be used, one for each numeric type.

kinchungwong commented 6 years ago

Please read this write-up carefully.

an illustration of "row buffer" concept with applicability toward OpenCV OE-17

https://gist.github.com/kinchungwong/152f0423698312d8e40870280d6ade7a

kinchungwong commented 6 years ago

Proposed preparation study

The reason I say "(someone's) choice", is that the future may benefit from having more than one Filter Engines co-existing. At the minimum, there needs to be two or three: Row-based, Tile-based. The tile-based Filter Engine is further divided into regular tiles (every tile has same width and height) and irregular tiles (arbitrary output rectangle is chosen; the input rectangle is computed by inverting the coordinate mapping function and finding the bounding rectangle) The person advocating a particular design is tasked with explaining the reason.

There is no reason to rush, as the best design may not emerge immediately.

(Disclaimer: despite this, I might rush myself to do this, but I will not push for a decision to be made, as that would impair the collective decision making process.)

One reason not to rush, is that the best design may come from outside - researchers or C++ experts worldwide. We need some outreach to solicit ideas.

kinchungwong commented 6 years ago

@vpisarev @alalek

I would like to know roughly what image processing operations are intended to be within scope of OE-17.

Code skeletons:

(Right now, I am posting code skeletons which don't compile. This is because I am still waiting for a computer which I can start software development. Until then, I will be writing code skeletons with notepad.)

kinchungwong commented 6 years ago

@vpisarev @alalek

The code skeleton have been moved to https://github.com/kinchungwong/imgproc_engine_vs2017/tree/master/source

Thee code skeleton is provided for discussion. The coding style and implementation details currently don't conform to OpenCV standard, but these will be resolved after the basic requirements and design have been confirmed.

As mentioned in the README.md above, I would like to know roughly what image processing operations are intended to be within scope of OE-17.

(Edited 2018-03-21)

The documentation for my approach has been moved to https://github.com/kinchungwong/imgproc_engine_design/blob/master/README.md

This is so that the understanding and discussion can take place, without being distracted by issues surrounding the source code.

kinchungwong commented 6 years ago

See #11138 cv::resize with INTER_AREA for 5 or more channels.

kinchungwong commented 6 years ago

Regarding SIMD coding style, the following project and sample usage shows how to use C++ templates and type system to define a general 3x3 convolution for uint8. It is arguably more readable than SIMD intrinsics. The following highlighted snippet generates 16 output pixels, when provided with an input window of 3-row-by-48-column pixels, all of which either in SIMD register or in aligned stack space. The outer loop will then advance 16 pixels at a time to generate an entire output row. Memory I/O is minimized. I have checked the disassembly with the MSVC compiler, and I will check with GCC soon.

https://github.com/kinchungwong/simd_style_helper/blob/2496282e6dafc430af59757937a6825d650e841a/simd_style_helper_experimental.h#L171-L197

kinchungwong commented 6 years ago

I have decided to abandon the coding style mentioned in my previous comments.

My current recommended coding style for the SIMD moving window kernel is here:

https://gist.github.com/kinchungwong/94a090ace1bdd67c02cf932c4a4b8a2a Updated https://gist.github.com/kinchungwong/ff2392d9454c52253cbfc85b8075bb22 (2018-12-08)

The main idea is that it is not necessary to define superfluous union-structs for holding arrays of vectors. Instead, just use C++11 std::array

// nv = number of vectors (SIMD registers) in this array
template <typename vec_type, int nv>
using v_arr = std::array<vec_type, nv>;

// nvr = number of vectors (SIMD regs) in each row of this 2d array
// nvc = number of vectors (SIMD regs) in each column of this 2d array
template <typename vec_type, int nvr, int nvc> 
using v_arr2d = std::array<std::array<vec_type, nvc>, nvr>;

Functions that extract or insert SIMD worth of data from/to a larger SIMD array will accept std::array as argument.

template <typename vec_type, int nvr, int nvc>
struct v_arr_center_info
{
    enum
    {
        center_row = ((nvr - 1) / 2),
        lanes_per_vec = vec_type::nlanes,
        lanes_per_row = (lanes_per_vec * nvc),
        center_col = ((lanes_per_row - lanes_per_vec) / 2),
    };
};

template <int row_offset, int col_offset, typename vec_type, int nvr, int nvc>
inline _Tpvec v_extract_around_center(const v_arr2d<vec_type, nvr, nvc>& input)
{
    using center_info = v_arr_center_info<vec_type, nvr, nvc>;
    return v_extract<
        (row_offset + center_info::center_row), 
        (col_offset + center_info::center_col), 
        vec_type, nvr, nvc>(input);
}

Further suggestions to make the user's code (i.e. code that implement CPU-SIMD kernels for image processing algorithms) more elegant are welcome.

FantasqueX commented 2 months ago

@kinchungwong Hi, I'm interested in your proposal. However, I cannot access your code currently. Is it possible to make those code public again :)

kinchungwong commented 2 months ago

@kinchungwong Hi, I'm interested in your proposal. However, I cannot access your code currently. Is it possible to make those code public again :)

Sorry, thanks. I've made it public again. The notes hasn't been updated for 6 years and were quite out of date. Be sure to check with OpenCV maintainers first, because I suspect they have already made a lot of progress that were not reflected in my notes.