Open vpisarev opened 6 years ago
While tile-based and strip-based execution engine is a good addition to OpenCV, they should be seen as complements (in addition) to row-based execution engine. Based on my experience, I take objection to some of the wording and tone, and also to their underlying premise.
I do agree that FilterEngine
needs a redesign.
I do not intend to make claims without showing any code to back it up. However, I need to secure my legal release from my former employer, which may take time.
Here is my claim:
[ [a, 0], [0, b] ]
(for any a, b, including but not necessarily 1.0) (in other words, "non-rotating", or "horizontal and vertical axis preserving". My math terminology may be wrong, sorry about that), will benefit from a row-based strategy.remap
benefit from tile-based execution if and only if the remap
is a diffeomorphism. Same idea as cache locality. Otherwise, any non-flat memory layout will harm its performance in these situations. In simple words, tile-based is beneficial to a few algorithms, but we should continue to use row-based execution for all other algorithms.
Tile-based execution may be necessary to inter-operate with GPU, but GPU usage already implies memory-copying. GPU may also impose maximum texture size.
Here is a teaser: imagine having this class, which is a "pool" of "row buffers (aligned)", and imagine algorithms which can be written to exploit such usefulness:
// for simplicity, here we ignore strict-OOP, visibility, etc.
// We also skip class template, which is trivial to add later.
// Reference counting for each row (either shared_ptr or on top of that) can also be added later.
// "RowBufferU8" and aligned_vector is only for illustration.
// A correct implementation would need unique_ptr for a row, along with aligned allocator/deallocator.
// Use of int, size_t not taken cared of carefully here. A real implementation will try to get those correct.
struct RowBufferPool
{
using RowBufferU8 = std::aligned_vector<uchar, 16uL>;
size_t elementCount;
std::unordered_map<int, RowBufferU8> m_active;
std::vector<RowBufferU8> m_unused;
public:
RowBufferU8& add_or_get(int key);
RowBufferU8* get(int key) const;
void remove(int key);
private:
void ensure_spare(ptrdiff_t count);
void add_spare();
};
In my experience, this can greatly simplify row-based image processing as well as greatly reduce unnecessary data copying (memory transfer). SIMD code can also be greatly simplified.
If an algorithm involves change of horizontal number of pixels, multiple instances of row buffer pools can be used; one pool for each pixel width (depending on the algorithm). For example, if an algorithm involves doubling of pixel width, then one pool can have width W, another pool can have width 2*W.
If an algorithm involves mixed numerical types, multiple instances of row buffer pools can be used, one for each numeric type.
Please read this write-up carefully.
https://gist.github.com/kinchungwong/152f0423698312d8e40870280d6ade7a
The reason I say "(someone's) choice", is that the future may benefit from having more than one Filter Engines co-existing. At the minimum, there needs to be two or three: Row-based, Tile-based. The tile-based Filter Engine is further divided into regular tiles (every tile has same width and height) and irregular tiles (arbitrary output rectangle is chosen; the input rectangle is computed by inverting the coordinate mapping function and finding the bounding rectangle) The person advocating a particular design is tasked with explaining the reason.
There is no reason to rush, as the best design may not emerge immediately.
(Disclaimer: despite this, I might rush myself to do this, but I will not push for a decision to be made, as that would impair the collective decision making process.)
One reason not to rush, is that the best design may come from outside - researchers or C++ experts worldwide. We need some outreach to solicit ideas.
@vpisarev @alalek
I would like to know roughly what image processing operations are intended to be within scope of OE-17.
Code skeletons:
(Right now, I am posting code skeletons which don't compile. This is because I am still waiting for a computer which I can start software development. Until then, I will be writing code skeletons with notepad.)
@vpisarev @alalek
The code skeleton have been moved to https://github.com/kinchungwong/imgproc_engine_vs2017/tree/master/source
Thee code skeleton is provided for discussion. The coding style and implementation details currently don't conform to OpenCV standard, but these will be resolved after the basic requirements and design have been confirmed.
As mentioned in the README.md above, I would like to know roughly what image processing operations are intended to be within scope of OE-17.
(Edited 2018-03-21)
The documentation for my approach has been moved to https://github.com/kinchungwong/imgproc_engine_design/blob/master/README.md
This is so that the understanding and discussion can take place, without being distracted by issues surrounding the source code.
See #11138 cv::resize
with INTER_AREA
for 5 or more channels.
Regarding SIMD coding style, the following project and sample usage shows how to use C++ templates and type system to define a general 3x3 convolution for uint8. It is arguably more readable than SIMD intrinsics. The following highlighted snippet generates 16 output pixels, when provided with an input window of 3-row-by-48-column pixels, all of which either in SIMD register or in aligned stack space. The outer loop will then advance 16 pixels at a time to generate an entire output row. Memory I/O is minimized. I have checked the disassembly with the MSVC compiler, and I will check with GCC soon.
I have decided to abandon the coding style mentioned in my previous comments.
My current recommended coding style for the SIMD moving window kernel is here:
https://gist.github.com/kinchungwong/94a090ace1bdd67c02cf932c4a4b8a2a
Updated https://gist.github.com/kinchungwong/ff2392d9454c52253cbfc85b8075bb22 (2018-12-08)
The main idea is that it is not necessary to define superfluous union-structs for holding arrays of vectors. Instead, just use C++11 std::array
// nv = number of vectors (SIMD registers) in this array
template <typename vec_type, int nv>
using v_arr = std::array<vec_type, nv>;
// nvr = number of vectors (SIMD regs) in each row of this 2d array
// nvc = number of vectors (SIMD regs) in each column of this 2d array
template <typename vec_type, int nvr, int nvc>
using v_arr2d = std::array<std::array<vec_type, nvc>, nvr>;
Functions that extract or insert SIMD worth of data from/to a larger SIMD array will accept std::array
as argument.
template <typename vec_type, int nvr, int nvc>
struct v_arr_center_info
{
enum
{
center_row = ((nvr - 1) / 2),
lanes_per_vec = vec_type::nlanes,
lanes_per_row = (lanes_per_vec * nvc),
center_col = ((lanes_per_row - lanes_per_vec) / 2),
};
};
template <int row_offset, int col_offset, typename vec_type, int nvr, int nvc>
inline _Tpvec v_extract_around_center(const v_arr2d<vec_type, nvr, nvc>& input)
{
using center_info = v_arr_center_info<vec_type, nvr, nvc>;
return v_extract<
(row_offset + center_info::center_row),
(col_offset + center_info::center_col),
vec_type, nvr, nvc>(input);
}
Further suggestions to make the user's code (i.e. code that implement CPU-SIMD kernels for image processing algorithms) more elegant are welcome.
@kinchungwong Hi, I'm interested in your proposal. However, I cannot access your code currently. Is it possible to make those code public again :)
@kinchungwong Hi, I'm interested in your proposal. However, I cannot access your code currently. Is it possible to make those code public again :)
Sorry, thanks. I've made it public again. The notes hasn't been updated for 6 years and were quite out of date. Be sure to check with OpenCV maintainers first, because I suspect they have already made a lot of progress that were not reflected in my notes.
the feature request about evolution proposal OE-17