Open YashasSamaga opened 4 years ago
@YashasSamaga Thank you for collecting overview of this problem! Good job!
GpuMat cannot store an arbitrary number of dimensions
Right, current GpuMat implementation supports 2D layout only. Problem is that existed CUDA API code expects 2D layout only - need to avoid massive refactoring of code (there are tons of user code which we are not able to fix anyway - compatibility problem).
new API that would allow a stream to be specified
Consider using of dedicated cuda
(cv:dnn::cuda
) namespace.
Also we can introduce thread-local "default" stream API.
@alalek @YashasSamaga Is this issue still open for a PR? I would like to work on it.
GpuMat
cannot store an arbitrary number of dimensions. How can the batch of images be passed?
Though I am very new to this side of OpenCV. However, may be we can use row major order at the time of indexing containers with arbitrary number of dimensions and in the memory we can store these containers as linear arrays. Please let me know if I have misinterpreted something wrong here.
@czgdp1807 and how do you plan to store the shape? The DNN module would require a 4D+ tensor and you need to pass this shape information somehow through GpuMat
.
Thanks for the response.
IMO, we can store the shape as another linear array. For example, GpuMat
can have two data pointers, data_type* shape
, and another one through data_type* data_gpu
. We can use static methods to initialise both the shape
and dimension
pointers. This approach can generalise the number of dimensions in the data. Please let me know your views on this.
P.S. - I will do more research on the current architecture of GpuMat
and come up with something that can be integrated more easily. I will analyse how the API suggestions in the description of this issue can used to support arbitrary number of dimensions.
I don't have any strong opinions on how GpuMat should be designed or improved (I do not know enough about GpuMat
to discuss it anyway).
The fundamental problem with GpuMat
currently is that you cannot pass arbitrary dimension tensors (even if the actual memory layout is linear, 2D or 3D). Once this feature is implemented, it's quite easy to support GpuMat
as input and output.
Thanks for letting me know. I will look into the current code and API and will try to come up with a new API design for the current and the new use case of arbitrary number of dimensions.
Hi,
I was just going through the code and observed that _InputArray
supports 2D containers only. Is there any container already implemented that supports multidimensional arrays, something like Tensor in tensorflow? If not, then may be we can refactor the code to support arbitrary dimensions in basic containers(like _InputArray
, _OutputArray
) without changing the interface to avoid repeated code.
What do you say?
_InputArray supports 2D containers only
This is not correct. See .sizend()
method.
Main problem is GpuMat
- it is 2D only.
_InputArray
is not a container (in terms of storing buffers). It is "unified" wrapper which stores type/kind and a pointer on real container object. Also it doesn't even hold container (doesn't extend lifetime through reference counters), so it should not be stored anywhere (except local stack).
refactor the code
Please don't try to start changing of such "core" interfaces. It is bad idea in general (we don't want to break users code).
Sure, so probably we need changes only in GpuMat
. Or Is supporting arbitrary dimension in GpuMat
feasible in the current state. May be, we can just work on the APIs suggested in the OP.
@czgdp1807
Excerpt from alalek's comment https://github.com/opencv/opencv/issues/16433#issuecomment-578436958
Right, current GpuMat implementation supports 2D layout only. Problem is that existed CUDA API code expects 2D layout only - need to avoid massive refactoring of code (there are tons of user code which we are not able to fix anyway - compatibility problem).
The CUDA backend of the DNN module operates on fully packed tensors (i.e. 1D memory irrespective of the logical shape of the tensor).
Before you can begin working on the APIs suggested in the OP, GpuMat
needs to support arbitrary shape with 1D memory layout. The issue currently at hand is that the existing codebase (non-DNN stuff like cuda::resize
) assumes that GpuMat
data is stored in 2D pitched memory. Hence, when arbitrary dimension support on 1D memory is added to GpuMat
, it would require updating the existing codebase to deal with new GpuMat
need not necessarily store data in 2D pitched memory. Otherwise, you'll have algorithms operating on 1D memory thinking it's 2D memory which could cause unexpected problems. One way to go about would be to add checks to enforce the 2D layout requirement wherever necessary.
Adding arbitrary dimension support isn't very hard but updating the entire codebase is quite a lot of work.
Adding arbitrary dimension support isn't very hard but updating the entire codebase is quite a lot of work.
Agreed. That's why I asked for feasibility and infact it's now clear to me that the APIs suggested in OP need arbitrary dimension support so first step would be to do that.
One way to go about would be to add checks to enforce the 2D layout requirement wherever necessary.
This is reminding me of a scenario where if-else
checks were added in every method of the class to separate handling of two use cases. This suggests that we can create a new class for arbitrary dimension support and use it wherever required. Another alternative is for whatever new methods are added from OP should use the arbitrary dimension support and the already existing are using the 2D layout. We can make changes to the existing method where ever required to use the arbitrary dimension 1D layout. I am assuming that algorithms right now using GpuMat in OpenCV are using 2D layout and we will add new code(not added yet) for the algorithms which will use the arbitrary dimension and in that case what to use from GpuMat
.
Hi,
I was thinking about this issue. I have the following two API designs in mind for multidimensional support in GpuMat
.
Overloading functions
We can overload functions of GpuMat
which will support multiple dimensions. For example,
currently one of the constructors of GpuMat
is,
GpuMat(int rows, int cols, int type, Allocator* allocator = defaultAllocator());
Here we have to give, rows
, cols
and type
. We can replace rows
and cols
with something like, MatShape
(I came to know about this while working on a recent PR) or a data pointer, like, int* shape
.
GpuMat(MatShape shape, type, Allocator* allocator = defaultAllocator());
or,
GpuMat(int* shape, type, Allocator* allocator = defaultAllocator());
Creating a separate class
To keep the current implementation of GpuMat
untouched we can create a new class, Tensor
which will suit the purpose of arbitrary dimensions. We can keep it something like torch.tensor
.
Tensor(int* shape, int type, Allocator* allocator = defaultAllocator());
May be tensors can go to opencv_contrib
till they become more mature.
Please let me know your views. Thanks.
Greetings people. I'm looking for a tesseract ocr coding about find duplicate numbers in an image and marking them. For example I have numbers listed from 1 to 99 in an image and i dont want any number to repeat more than once, but in my paperworks there are some repetitions (like two "65" randomly in the paper like 5 23 65 86 65 12 33 21 etc.) and i want to discard those repeated paperworks. In real life it is a huge struggle to read them all can you guide me sir? I want to make an app for mobile phone to use that with camera is it possible? Or is there an existing program like this scans and finds dublicate numbers/texts? Please God help me I'm so much in a mess...
Perhaps separate new class GpuMatND
('ND' states for n-dimensional) should work great.
Tensor
is a more common name (may conflict with other existed containers in OpenCV or should be designed perfectly to unify all of them)
I would like to make a PR with the interface of GpuMatND
and will continue with adding implementation for the same. Please let me know if it's feasible and where should I make the PR, to opencv
or opencv_contrib
. Thanks.
Wouldn't it be easier to just reuse the existing ndarray/tensor type cv::dnn::cuda4dnn::csl::Tensor
and support copies from/to GpuMat
?
blob_in = cv.cuda_GpuBlobFromImage(gpu_mat, stream)
blob_in = cv.cuda_GpuBlobFromArray([gpu_mat1, gpu_mat2, gpu_mat3], stream=stream)
net.setInput(blob)
blob_out = net.forward(stream=stream)
gpu_mat_out = cv.cuda_GpuMatFromBlob(blob_out)
The blob type itself could be a completely opaque handle with no public interface to commit to, akin to the existing OpenCV concept of creating a "blob" as input argument from existing Mat types.
@pwuertz cv::dnn::cuda4dnn::csl::Tensor
belongs to the CUDA backend's internal API and is not exposed to the end-user. Exposing Tensor
through public API will require exposing several other classes from the internal API which doesn't seem like a great idea.
We can maybe use a new lightweight proxy which takes the role of Tensor
from blobFromImage
to net.forward
and vice-versa. But this will complicate the currently straightforward simple-to-use DNN API (which is I think is one of the biggest pros of OpenCV DNN).
@YashasSamaga Thats what I meant by "opaque blob type", a lightweight, i.e. no or minimal public interface handle, internally backed by your Tensor
.
Why do you think this will complicate the existing DNN API? The example above looks like a normal OpenCV DNN pattern to me.
Besides, doesn't this project require an alternative API anyway due to the CUDA specific stream
argument?
@pwuertz Are you trying to say that we can create, GpuMatND
as a wrapper for accessing the functionality of Tensor
?
I was proposing that initially, we don't need to design / define any new type at all. For this specific task, an opaque blob handle you can convert from/to Mat
/GpuMat
would be sufficient. Internally this could be backed by Tensor
.
Generally, it would be really nice to have a GpuMatND
, and OpenCV shifting more towards ndarray containers instead of 2d focused data objects. But as mentioned by alalek, maybe this is a topic best tackled in conjunction with a generalized OpenCV tensor / ndarray API?
Is there any update on the progress of this?
@MonocleSecurity I am working on it in #16666 I am just waiting for some comments to get a direction in which I should proceed.
I think we should use a lightweight opaque proxy object instead of GpuMatND
. The GpuMat
(even cv::Mat
) interfaces are quite old. Maybe in the near feature, OpenCV will get new C++11 based interface for all these containers. Maybe it's not the right time to add a new container. This is just my opinion.
The proxy object is quite simple to make. It's an RAII object which essentially contains a smart pointer and a shape array. That's it.
We can do it. To the best of my understanding, the proxy object will be a wrapper to GpuMat
which uses 2D layout internally, so supporting multi-dimensional arrays will again be problem. May be we can make a proxy to Tensor
(with limited access) as suggested in https://github.com/opencv/opencv/issues/16433#issuecomment-591408101
@czgdp1807 You just need one single block of linear memory.
class Proxy {
public:
private:
GpuMat data; // 1D memory
std::vector<int> shape; // store shape here
};
I got your point. Thanks for the hint, GpuMat can be used for 1D memory by keeping either row or col 1. Will update my already open PR with Proxy object.
@czgdp1807 I assume it must be a single row (i.e. not a single column) though. GpuMat
does impose certain row-stride constraints like requiring multiple of 512 or something to that effect, right?
@czgdp1807 are you still working on this?
I am keen to see this feature added, I posted a bounty but I am unsure if I am allowed to link to it.
Here is a link to the bounty if anyone is interested: https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net
It's best if we could come up with a plan before anyone begins to write the code. There would be changes in the public API. It will have to be approved by the maintainers. This is my proposal.
A new header #include <opencv2/dnn/cuda.hpp>
will be added which won't be included by default by dnn.hpp
. All the CUDA specific API will be added to its own namespace in cv::dnn::cuda
.
cuda4dnn::csl::Tensor
:// public API
namespace cv { namespace dnn { namespace cuda {
class GpuBlob {
public:
GpuBlob();
~GpuBlob();
private:
friend class GpuBlobAccessor;
struct C4DTensorProxy;
std::unique_ptr<C4DTensorProxy> proxy;
};
}}};
// internal implementation
template <class T>
using Tensor = cuda4dnn::csl::Tensor<T>;
struct GpuBlob::C4DTensorProxy {
std::variant<Tensor<float>, Tensor<half>> tensor; // or some other type-safe union
};
void blobFromImage(const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
const Size& size = Size(), const Scalar& mean = Scalar(),
bool swapRB=false, bool crop=false, int ddepth=CV_32F);
void blobFromImage(const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
const Size& size = Size(), const Scalar& mean = Scalar(),
bool swapRB=false, bool crop=false, int ddepth=CV_32F);
void imagesFromBlob(const cv::dnn::cuda GpuBlob& blob, OutputArrayOfArrays images);
We need to provide a way to specify the stream. I think the best way would be to duplicate the three APIs and add a stream argument. This seems to be more user-friendly than having a default argument.
void blobFromImage(cv::cuda::Stream& stream, const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
const Size& size = Size(), const Scalar& mean = Scalar(),
bool swapRB=false, bool crop=false, int ddepth=CV_32F);
void blobFromImage(cv::cuda::Stream& stream,const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
const Size& size = Size(), const Scalar& mean = Scalar(),
bool swapRB=false, bool crop=false, int ddepth=CV_32F);
void imagesFromBlob(cv::cuda::Stream& stream, const cv::dnn::cuda GpuBlob& blob, OutputArrayOfArrays images);
The declarations will be present in opencv2/dnn/cuda.hpp
. The implementation will be in dnn.cpp
which is a wrapper around the actual implementation. The actual implementation should be inside cuda4dnn
directory. The dnn.cpp
implementation will invoke the actual implementation depending on whether the DNN module was built with the CUDA backend. Something like:
// dnn.cpp
#ifdef HAVE_CUDA
#include "cuda4dnn/blobFromImages.hpp"
#endif
void blobFromImages(InputArrayOfArrays images_, cuda::GpuBlob& blob, ...)
{
CV_TRACE_FUNCTION();
#ifdef HAVE_CUDA
std::vector<cv::cuda::GpuMat> images;
images_.getGpuMatVector(images);
cuda4dnn::blobFromImages(images, blob, ...);
#else
CV_Error(Error::GpuNotSupported, "DNN module was not built with CUDA backend");
#endif
}
We could use cudaimgproc module for resizing images but I think we should roll our own kernel for the DNN module to avoid a new dependency (maybe we will get rid of the opencv_contrib dependency altogether in the future; see #17488 ). There are resize kernels in the CUDA backend which are used by ResizeLayer
. We might be able to use it as it is or maybe as a template for a new resize kernel.
We could also require new kernels that can perform scaling, mean subtraction and channel swap. It would be best to fuse all these together into one kernel. The operation is memory bound and hence we will probably have enough compute resources to accommodate all the calculations (even if redundant).
Currently, with the FP16 target, the first operation is to convert the input (which is always transferred to the GPU in FP32 format) to FP16 on the GPU. We can fuse this operation with the previous step. The first step during inference would then reduce to copying the tensor in GpuBlob
to the input blob. We could directly use the GpuBlob tensor as input as well.
Memory allocations are costly in CUDA. We should avoid any allocation at all costs. We should allocate the memory just once if GpuBlob
is empty. The end-user can reuse the same blob. We should also avoid any synchronization in the stream-based APIs.
setInput
void cv::dnn::Net::setInput(const cuda::GpuBlob& blob, const String& name = "");
forward
Mat forward(const String& outputName = String()); // already exists; should work with GPU inputs too and return output in cv::Mat
void forward(OutputArrayOfArrays outputBlobs, const String& outputName = String()); // same as above
void forward(GpuBlob& blob, const String& outputName = String());
void forward(std::vector<GpuBlob>& outputBlobs, const String& outputName = String()); // same as above
void forward(cv::cuda::Stream& stream, GpuBlob& blob, const String& outputName = String());
void forward(cv::cuda::Stream& stream, std::vector<GpuBlob>& outputBlobs, const String& outputName = String()); // same as above
The stream-based API should not cause any synchronization. This will allow users to seamlessly incorporate inference inside their CUDA pipeline.
cv::cuda::Stream stream;
cv::cuda::GpuMat image_d, output_d;
auto net = cv::dnn::readNet("data/yolov4/yolov4.cfg", "data/yolov4/yolov4.weights");
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
cv::dnn::cuda::GpuBlob blob_input;
cv::dnn::cuda::GpuBlob blob_output;
{
// CUDA stuff
cv::dnn::cuda::blobFromImage(stream, image_d, blob_input);
net.setInput(blob_input);
net.forward(stream, blob_output);
cv::dnn::cuda::imagesFromBlob(stream, blob_output, output_d);
// CUDA stuff
// fully asynchronous block will allow DNN inference to be placed nicely into the end-user's stream
}
There is still a problem with the above proposal. In the imagesFromBlob
step, we are still limited to GpuMat
's capabilities. If we compact detection outputs to a 1D block, you would still have trouble storing the shape.
One solution would be to have member functions that return the shape in GpuBlob
but all this will probably make the end-user's code look nasty. We could add some of cv::Mat
's shape API to GpuBlob
. It will still look awful though.
I am very interested in seeing this feature being added, so I have added another $300 bounty for this feature for a total of $400.
https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net
The bounty is now 800USD!
https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net
I'll pick this issue up full time over the Thanksgiving holiday.
Hi @kuberlog any progress?
I think it would be nice to have a more general GpuMat
instead of patchwork in the DNN module. The more general GpuMat
has many more usecases (postprocessing or NMS on inference results on GPU for example).
I'm also interested in implementing GpuMatND
.
Is this patch for the master branch only? If so, maybe we can use shared_mat
for reference counting.
Does this GpuMatND
need to support padding on every axis? Or is it sufficient to provide padding only on the first axis? The current implementation for => (update) This is false. Mat
seems to support padding only on the first axis, which seems enough to provide memory alignment.Mat
supports padding on every axis.
Should we allow GpuMatND
with more than two axes to convert to GpuMat
? Maybe we can convert to vector<GpuMat>
instead, in the case of a 3D GpuMatND
?
Since there is now an implementation of GpuMatND, does it make it possible to resolve this issue?
I want to know, it's already 4.7.0. Has this problem been resolved?
I'll continue to look into this issue.
Not yet. cv::dnn::blobFromImage does not accept cv::cuda::GpuMatND.
System information (version)
APIs that need changes
Suitable APIs for
GpuMat
input:void setInput(InputArray blob, const String& name = "", double scalefactor = 1.0, const Scalar& mean = Scalar());
Suitable APIs for
GpuMat
output:void forward(OutputArrayOfArrays outputBlobs, const String& outputName = String());
void forward(OutputArrayOfArrays outputBlobs, const std::vector<String>& outBlobNames);
Modifications to support
GpuMat
:void blobFromImage(InputArray image, OutputArray blob, double scalefactor=1.0, const Size& size = Size(), const Scalar& mean = Scalar(), bool swapRB=false, bool crop=false, int ddepth=CV_32F);
Might also require a new API that would allow a stream to be specified (required if
blobFromImage
needs to be performed asynchronously).The CUDA backend uses fully packed tensors everywhere and expects a fully packed tensor as input.
GpuMat
given as input might be using a 2D memory layout which may not be contiguous.blobFromImage
must return a contiguous copy of the input.The outputs of the CUDA backend are also in fully packed tensors. Hence, for further optimal use of the output (in the case it's an image) would require a transformation to 2D memory layout.
Problems:
GpuMat
cannot store an arbitrary number of dimensions. How can the batch of images be passed?