GpuMat as input/output to cv::dnn::Net

YashasSamaga commented 4 years ago

System information (version)

OpenCV => 4.2.0

APIs that need changes

Suitable APIs for GpuMat input:

void setInput(InputArray blob, const String& name = "", double scalefactor = 1.0, const Scalar& mean = Scalar());

Suitable APIs for GpuMat output:

void forward(OutputArrayOfArrays outputBlobs, const String& outputName = String());
void forward(OutputArrayOfArrays outputBlobs, const std::vector<String>& outBlobNames);

Modifications to support GpuMat:

void blobFromImage(InputArray image, OutputArray blob, double scalefactor=1.0, const Size& size = Size(), const Scalar& mean = Scalar(), bool swapRB=false, bool crop=false, int ddepth=CV_32F);

Might also require a new API that would allow a stream to be specified (required if blobFromImage needs to be performed asynchronously).

The CUDA backend uses fully packed tensors everywhere and expects a fully packed tensor as input. GpuMat given as input might be using a 2D memory layout which may not be contiguous. blobFromImage must return a contiguous copy of the input.

The outputs of the CUDA backend are also in fully packed tensors. Hence, for further optimal use of the output (in the case it's an image) would require a transformation to 2D memory layout.

Problems:

GpuMat cannot store an arbitrary number of dimensions. How can the batch of images be passed?

alalek commented 4 years ago

@YashasSamaga Thank you for collecting overview of this problem! Good job!

GpuMat cannot store an arbitrary number of dimensions

Right, current GpuMat implementation supports 2D layout only. Problem is that existed CUDA API code expects 2D layout only - need to avoid massive refactoring of code (there are tons of user code which we are not able to fix anyway - compatibility problem).

new API that would allow a stream to be specified

Consider using of dedicated cuda (cv:dnn::cuda) namespace. Also we can introduce thread-local "default" stream API.

czgdp1807 commented 4 years ago

@alalek @YashasSamaga Is this issue still open for a PR? I would like to work on it.

GpuMat cannot store an arbitrary number of dimensions. How can the batch of images be passed?

Though I am very new to this side of OpenCV. However, may be we can use row major order at the time of indexing containers with arbitrary number of dimensions and in the memory we can store these containers as linear arrays. Please let me know if I have misinterpreted something wrong here.

YashasSamaga commented 4 years ago

@czgdp1807 and how do you plan to store the shape? The DNN module would require a 4D+ tensor and you need to pass this shape information somehow through GpuMat.

czgdp1807 commented 4 years ago

Thanks for the response. IMO, we can store the shape as another linear array. For example, GpuMat can have two data pointers, data_type* shape, and another one through data_type* data_gpu. We can use static methods to initialise both the shape and dimension pointers. This approach can generalise the number of dimensions in the data. Please let me know your views on this. P.S. - I will do more research on the current architecture of GpuMat and come up with something that can be integrated more easily. I will analyse how the API suggestions in the description of this issue can used to support arbitrary number of dimensions.

YashasSamaga commented 4 years ago

I don't have any strong opinions on how GpuMat should be designed or improved (I do not know enough about GpuMat to discuss it anyway).

The fundamental problem with GpuMat currently is that you cannot pass arbitrary dimension tensors (even if the actual memory layout is linear, 2D or 3D). Once this feature is implemented, it's quite easy to support GpuMat as input and output.

czgdp1807 commented 4 years ago

Thanks for letting me know. I will look into the current code and API and will try to come up with a new API design for the current and the new use case of arbitrary number of dimensions.

czgdp1807 commented 4 years ago

Hi, I was just going through the code and observed that _InputArray supports 2D containers only. Is there any container already implemented that supports multidimensional arrays, something like Tensor in tensorflow? If not, then may be we can refactor the code to support arbitrary dimensions in basic containers(like _InputArray, _OutputArray) without changing the interface to avoid repeated code. What do you say?

alalek commented 4 years ago

_InputArray supports 2D containers only

This is not correct. See .sizend() method. Main problem is GpuMat - it is 2D only.

_InputArray is not a container (in terms of storing buffers). It is "unified" wrapper which stores type/kind and a pointer on real container object. Also it doesn't even hold container (doesn't extend lifetime through reference counters), so it should not be stored anywhere (except local stack).

refactor the code

Please don't try to start changing of such "core" interfaces. It is bad idea in general (we don't want to break users code).

czgdp1807 commented 4 years ago

Sure, so probably we need changes only in GpuMat. Or Is supporting arbitrary dimension in GpuMat feasible in the current state. May be, we can just work on the APIs suggested in the OP.

YashasSamaga commented 4 years ago

@czgdp1807

Excerpt from alalek's comment https://github.com/opencv/opencv/issues/16433#issuecomment-578436958

Right, current GpuMat implementation supports 2D layout only. Problem is that existed CUDA API code expects 2D layout only - need to avoid massive refactoring of code (there are tons of user code which we are not able to fix anyway - compatibility problem).

The CUDA backend of the DNN module operates on fully packed tensors (i.e. 1D memory irrespective of the logical shape of the tensor).

Before you can begin working on the APIs suggested in the OP, GpuMat needs to support arbitrary shape with 1D memory layout. The issue currently at hand is that the existing codebase (non-DNN stuff like cuda::resize) assumes that GpuMat data is stored in 2D pitched memory. Hence, when arbitrary dimension support on 1D memory is added to GpuMat, it would require updating the existing codebase to deal with new GpuMat need not necessarily store data in 2D pitched memory. Otherwise, you'll have algorithms operating on 1D memory thinking it's 2D memory which could cause unexpected problems. One way to go about would be to add checks to enforce the 2D layout requirement wherever necessary.

Adding arbitrary dimension support isn't very hard but updating the entire codebase is quite a lot of work.

czgdp1807 commented 4 years ago

Adding arbitrary dimension support isn't very hard but updating the entire codebase is quite a lot of work.

Agreed. That's why I asked for feasibility and infact it's now clear to me that the APIs suggested in OP need arbitrary dimension support so first step would be to do that.

One way to go about would be to add checks to enforce the 2D layout requirement wherever necessary.

This is reminding me of a scenario where if-else checks were added in every method of the class to separate handling of two use cases. This suggests that we can create a new class for arbitrary dimension support and use it wherever required. Another alternative is for whatever new methods are added from OP should use the arbitrary dimension support and the already existing are using the 2D layout. We can make changes to the existing method where ever required to use the arbitrary dimension 1D layout. I am assuming that algorithms right now using GpuMat in OpenCV are using 2D layout and we will add new code(not added yet) for the algorithms which will use the arbitrary dimension and in that case what to use from GpuMat.

czgdp1807 commented 4 years ago

Hi, I was thinking about this issue. I have the following two API designs in mind for multidimensional support in GpuMat.

Overloading functions We can overload functions of GpuMat which will support multiple dimensions. For example, currently one of the constructors of GpuMat is,

GpuMat(int rows, int cols, int type, Allocator* allocator = defaultAllocator());

Here we have to give, rows, cols and type. We can replace rows and cols with something like, MatShape(I came to know about this while working on a recent PR) or a data pointer, like, int* shape.

GpuMat(MatShape shape, type, Allocator* allocator = defaultAllocator());

or,

GpuMat(int* shape, type, Allocator* allocator = defaultAllocator());

Creating a separate class To keep the current implementation of GpuMat untouched we can create a new class, Tensor which will suit the purpose of arbitrary dimensions. We can keep it something like torch.tensor.

Tensor(int* shape, int type, Allocator* allocator = defaultAllocator());

May be tensors can go to opencv_contrib till they become more mature. Please let me know your views. Thanks.

Aberzelius commented 4 years ago

Greetings people. I'm looking for a tesseract ocr coding about find duplicate numbers in an image and marking them. For example I have numbers listed from 1 to 99 in an image and i dont want any number to repeat more than once, but in my paperworks there are some repetitions (like two "65" randomly in the paper like 5 23 65 86 65 12 33 21 etc.) and i want to discard those repeated paperworks. In real life it is a huge struggle to read them all can you guide me sir? I want to make an app for mobile phone to use that with camera is it possible? Or is there an existing program like this scans and finds dublicate numbers/texts? Please God help me I'm so much in a mess...

alalek commented 4 years ago

Perhaps separate new class GpuMatND ('ND' states for n-dimensional) should work great.

Tensor

is a more common name (may conflict with other existed containers in OpenCV or should be designed perfectly to unify all of them)

czgdp1807 commented 4 years ago

I would like to make a PR with the interface of GpuMatND and will continue with adding implementation for the same. Please let me know if it's feasible and where should I make the PR, to opencv or opencv_contrib. Thanks.

pwuertz commented 4 years ago

Wouldn't it be easier to just reuse the existing ndarray/tensor type cv::dnn::cuda4dnn::csl::Tensor and support copies from/to GpuMat?

blob_in = cv.cuda_GpuBlobFromImage(gpu_mat, stream)
blob_in = cv.cuda_GpuBlobFromArray([gpu_mat1, gpu_mat2, gpu_mat3], stream=stream)

net.setInput(blob)
blob_out = net.forward(stream=stream)

gpu_mat_out = cv.cuda_GpuMatFromBlob(blob_out)

The blob type itself could be a completely opaque handle with no public interface to commit to, akin to the existing OpenCV concept of creating a "blob" as input argument from existing Mat types.

YashasSamaga commented 4 years ago

@pwuertz cv::dnn::cuda4dnn::csl::Tensor belongs to the CUDA backend's internal API and is not exposed to the end-user. Exposing Tensor through public API will require exposing several other classes from the internal API which doesn't seem like a great idea.

We can maybe use a new lightweight proxy which takes the role of Tensor from blobFromImage to net.forward and vice-versa. But this will complicate the currently straightforward simple-to-use DNN API (which is I think is one of the biggest pros of OpenCV DNN).

pwuertz commented 4 years ago

@YashasSamaga Thats what I meant by "opaque blob type", a lightweight, i.e. no or minimal public interface handle, internally backed by your Tensor.

Why do you think this will complicate the existing DNN API? The example above looks like a normal OpenCV DNN pattern to me.

Besides, doesn't this project require an alternative API anyway due to the CUDA specific stream argument?

czgdp1807 commented 4 years ago

@pwuertz Are you trying to say that we can create, GpuMatND as a wrapper for accessing the functionality of Tensor?

pwuertz commented 4 years ago

I was proposing that initially, we don't need to design / define any new type at all. For this specific task, an opaque blob handle you can convert from/to Mat/GpuMat would be sufficient. Internally this could be backed by Tensor.

Generally, it would be really nice to have a GpuMatND, and OpenCV shifting more towards ndarray containers instead of 2d focused data objects. But as mentioned by alalek, maybe this is a topic best tackled in conjunction with a generalized OpenCV tensor / ndarray API?

MonocleSecurity commented 4 years ago

Is there any update on the progress of this?

czgdp1807 commented 4 years ago

@MonocleSecurity I am working on it in #16666 I am just waiting for some comments to get a direction in which I should proceed.

YashasSamaga commented 4 years ago

I think we should use a lightweight opaque proxy object instead of GpuMatND. The GpuMat (even cv::Mat) interfaces are quite old. Maybe in the near feature, OpenCV will get new C++11 based interface for all these containers. Maybe it's not the right time to add a new container. This is just my opinion.

The proxy object is quite simple to make. It's an RAII object which essentially contains a smart pointer and a shape array. That's it.

czgdp1807 commented 4 years ago

We can do it. To the best of my understanding, the proxy object will be a wrapper to GpuMat which uses 2D layout internally, so supporting multi-dimensional arrays will again be problem. May be we can make a proxy to Tensor(with limited access) as suggested in https://github.com/opencv/opencv/issues/16433#issuecomment-591408101

YashasSamaga commented 4 years ago

@czgdp1807 You just need one single block of linear memory.

class Proxy {
public:

private:
GpuMat data; // 1D memory
std::vector<int> shape; // store shape here
};

czgdp1807 commented 4 years ago

I got your point. Thanks for the hint, GpuMat can be used for 1D memory by keeping either row or col 1. Will update my already open PR with Proxy object.

pwuertz commented 4 years ago

@czgdp1807 I assume it must be a single row (i.e. not a single column) though. GpuMat does impose certain row-stride constraints like requiring multiple of 512 or something to that effect, right?

YashasSamaga commented 4 years ago

@czgdp1807 are you still working on this?

ChrisWardy commented 3 years ago

I am keen to see this feature added, I posted a bounty but I am unsure if I am allowed to link to it.

ChrisWardy commented 3 years ago

Here is a link to the bounty if anyone is interested: https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net

YashasSamaga commented 3 years ago

It's best if we could come up with a plan before anyone begins to write the code. There would be changes in the public API. It will have to be approved by the maintainers. This is my proposal.

A new header #include <opencv2/dnn/cuda.hpp> will be added which won't be included by default by dnn.hpp. All the CUDA specific API will be added to its own namespace in cv::dnn::cuda.

proxy object for `cuda4dnn::csl::Tensor`:

// public API
namespace cv { namespace dnn { namespace cuda {

    class GpuBlob {
    public:
        GpuBlob();
        ~GpuBlob();

    private:
        friend class GpuBlobAccessor;

        struct C4DTensorProxy;
        std::unique_ptr<C4DTensorProxy> proxy;
    };

}}};

// internal implementation
template <class T>
using Tensor = cuda4dnn::csl::Tensor<T>;

struct GpuBlob::C4DTensorProxy {
    std::variant<Tensor<float>, Tensor<half>> tensor; // or some other type-safe union
};

blobFromImage(s) and ImagesFromBlob

void blobFromImage(const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void blobFromImage(const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void imagesFromBlob(const cv::dnn::cuda GpuBlob& blob, OutputArrayOfArrays images);

We need to provide a way to specify the stream. I think the best way would be to duplicate the three APIs and add a stream argument. This seems to be more user-friendly than having a default argument.

void blobFromImage(cv::cuda::Stream& stream, const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void blobFromImage(cv::cuda::Stream& stream,const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void imagesFromBlob(cv::cuda::Stream& stream, const cv::dnn::cuda GpuBlob& blob, OutputArrayOfArrays images);

The declarations will be present in opencv2/dnn/cuda.hpp. The implementation will be in dnn.cpp which is a wrapper around the actual implementation. The actual implementation should be inside cuda4dnn directory. The dnn.cpp implementation will invoke the actual implementation depending on whether the DNN module was built with the CUDA backend. Something like:

// dnn.cpp

#ifdef HAVE_CUDA
#include "cuda4dnn/blobFromImages.hpp"
#endif

void blobFromImages(InputArrayOfArrays images_, cuda::GpuBlob& blob, ...)
{
    CV_TRACE_FUNCTION();

#ifdef HAVE_CUDA
    std::vector<cv::cuda::GpuMat> images;
    images_.getGpuMatVector(images);
    cuda4dnn::blobFromImages(images, blob, ...);
#else
    CV_Error(Error::GpuNotSupported, "DNN module was not built with CUDA backend");
#endif
}

We could use cudaimgproc module for resizing images but I think we should roll our own kernel for the DNN module to avoid a new dependency (maybe we will get rid of the opencv_contrib dependency altogether in the future; see #17488 ). There are resize kernels in the CUDA backend which are used by ResizeLayer. We might be able to use it as it is or maybe as a template for a new resize kernel.

We could also require new kernels that can perform scaling, mean subtraction and channel swap. It would be best to fuse all these together into one kernel. The operation is memory bound and hence we will probably have enough compute resources to accommodate all the calculations (even if redundant).

Currently, with the FP16 target, the first operation is to convert the input (which is always transferred to the GPU in FP32 format) to FP16 on the GPU. We can fuse this operation with the previous step. The first step during inference would then reduce to copying the tensor in GpuBlob to the input blob. We could directly use the GpuBlob tensor as input as well.

Memory allocations are costly in CUDA. We should avoid any allocation at all costs. We should allocate the memory just once if GpuBlob is empty. The end-user can reuse the same blob. We should also avoid any synchronization in the stream-based APIs.

`setInput`

void cv::dnn::Net::setInput(const cuda::GpuBlob& blob, const String& name = "");

`forward`

Mat forward(const String& outputName = String()); // already exists; should work with GPU inputs too and return output in cv::Mat
void forward(OutputArrayOfArrays outputBlobs, const String& outputName = String()); // same as above

void forward(GpuBlob& blob, const String& outputName = String());
void forward(std::vector<GpuBlob>& outputBlobs, const String& outputName = String()); // same as above
void forward(cv::cuda::Stream& stream, GpuBlob& blob, const String& outputName = String());
void forward(cv::cuda::Stream& stream, std::vector<GpuBlob>& outputBlobs, const String& outputName = String()); // same as above

The stream-based API should not cause any synchronization. This will allow users to seamlessly incorporate inference inside their CUDA pipeline.

Example

cv::cuda::Stream stream;
cv::cuda::GpuMat image_d, output_d;

auto net = cv::dnn::readNet("data/yolov4/yolov4.cfg", "data/yolov4/yolov4.weights");
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

cv::dnn::cuda::GpuBlob blob_input;
cv::dnn::cuda::GpuBlob blob_output;

{
    // CUDA stuff

    cv::dnn::cuda::blobFromImage(stream, image_d, blob_input);
    net.setInput(blob_input);
    net.forward(stream, blob_output);
    cv::dnn::cuda::imagesFromBlob(stream, blob_output, output_d);

   // CUDA stuff

   // fully asynchronous block will allow DNN inference to be placed nicely into the end-user's stream
}

YashasSamaga commented 3 years ago

There is still a problem with the above proposal. In the imagesFromBlob step, we are still limited to GpuMat's capabilities. If we compact detection outputs to a 1D block, you would still have trouble storing the shape.

One solution would be to have member functions that return the shape in GpuBlob but all this will probably make the end-user's code look nasty. We could add some of cv::Mat's shape API to GpuBlob. It will still look awful though.

ChrisWardy commented 3 years ago

I am very interested in seeing this feature being added, so I have added another $300 bounty for this feature for a total of $400.

https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net

ChrisWardy commented 3 years ago

The bounty is now 800USD!

https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net

collinalexbell commented 3 years ago

I'll pick this issue up full time over the Thanksgiving holiday.

ChrisWardy commented 3 years ago

Hi @kuberlog any progress?

YashasSamaga commented 3 years ago

I think it would be nice to have a more general GpuMat instead of patchwork in the DNN module. The more general GpuMat has many more usecases (postprocessing or NMS on inference results on GPU for example).

nglee commented 3 years ago

I'm also interested in implementing GpuMatND.

Is this patch for the master branch only? If so, maybe we can use shared_mat for reference counting.
Does this GpuMatND need to support padding on every axis? Or is it sufficient to provide padding only on the first axis? ~~The current implementation for Mat seems to support padding only on the first axis, which seems enough to provide memory alignment.~~ => (update) This is false. Mat supports padding on every axis.
Should we allow GpuMatND with more than two axes to convert to GpuMat? Maybe we can convert to vector<GpuMat> instead, in the case of a 3D GpuMatND?

ghfue commented 1 year ago

Since there is now an implementation of GpuMatND, does it make it possible to resolve this issue?

MineYuanlu commented 1 year ago

I want to know, it's already 4.7.0. Has this problem been resolved?

nglee commented 1 year ago

I'll continue to look into this issue.

RezaHajimohammadi commented 8 months ago

Not yet. cv::dnn::blobFromImage does not accept cv::cuda::GpuMatND.

opencv / opencv