Can you show the code for "In this other case, we are updating a temporal Tensor of 15 images ..."

chenscottus commented 10 months ago

Hello,

  Can you show the code for:
       In this other case, we are updating a temporal Tensor of 15 images, with a new image that needs to be resized and normalized, and 14 other images that were normalized in previous iterations, that need to be split to planar mode and copied in different positions of the temporal Tensor. Some CUDA threads will be doing the normalization, and some others will be just copying the old images, all in parallel.

 Thanks!

-Scott

morousg commented 10 months ago

Hi!

The code necessary to make that example is all published. I can't show the exact line of code that combines the specific operations, because it's closed source, but you can find very similar examples in the file "test_circularbatchread_x_write3D.cu", and you can very easily write the same example yourself by using this library.

The difference is in the cvGS::CircularTensor::update method, where you can use a different set of operations to be applied to the new image.

Sorry for the lack of comments and documentation. I'm doing this 90% in my spare time.

chenscottus commented 10 months ago

How about modifying the example you posted on the home page? https://github.com/morousg/cvGPUSpeedup/blob/2e9bfb1410c7dd8fb1bc4de279466637881b8843/cvGPUSpeedupVersion.png

Thanks!

chenscottus commented 10 months ago

[Image: cvGPUSpeedupVersion]

morousg commented 10 months ago

If you tell me what you need, I can try to write an example that does what you need using CircularTensor.

chenscottus commented 10 months ago

For example, we need to use the last 32 frames (frame 1 to frame 32) as the input. When a new frame (frame 33) arrives, how can I reuse what we have already processed (frame 2 to frame 32) on the CPU/GPU?

On the GPU, it requires contiguous memory, so it looks like we have to do a lot of memory copying. Can we avoid that?

Thanks!

-Scott

chenscottus commented 10 months ago

On the GPU, it requires contiguous memory, so I am not sure CircularTensor will work. Thanks, -Scott

morousg commented 10 months ago

I'm guessing many things here. I will list them; correct me if I'm wrong:

1. You are using OpenCV, therefore you want to use the cvGS interface
2. Your input is uchar4
3. Your output is float4
4. You need the last 32 images, including the new one for this iteration

To create the circular tensor:

cvGS::CircularTensor<CV_8UC4, CV_32FC4, 1, 32, fk::CircularTensorOrder::OldestFirst> myTensor(WIDTH, HEIGHT);

To add a new image:

myTensor.update(opencv_cuda_stream, /* sequence of operations to apply to the new image, before saving it */);

To take the raw pointer and pass it to the network (or whatever you do after):

// Note that you should use the same CUDA stream for the inference kernels
float4* myRawContiguousData = myTensor.data();

This will always keep the last 32 images, starting with the oldest image and ending with the newest image, as you show in your diagrams.

All this will be performed by a single kernel, and yes, there will be some intermediate pointer inside the CircularTensor, but since it's all done by the same kernel, it's done in parallel, and it's fast.
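To make the idea concrete, here is a minimal CPU sketch in plain C++ of what such an update does conceptually. It is not the cvGS implementation: `SlidingWindow`, its two internal buffers, and the divide-by-255 normalization are assumptions made for illustration only. A single pass fills every output slot: older frames are copied one slot towards the front, and the new frame is normalized into the last slot.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU stand-in for a circular tensor; NOT the cvGS API.
// Keeps the last `batch` frames, oldest first, in one contiguous buffer.
class SlidingWindow {
public:
    SlidingWindow(std::size_t batch, std::size_t frameElems)
        : batch_(batch), frameElems_(frameElems),
          front_(batch * frameElems, 0.0f), back_(batch * frameElems, 0.0f) {}

    // One pass over the whole output. Slots 0..batch-2 copy the frame that
    // sat one slot later in the previous buffer; the last slot receives the
    // new frame, normalized to [0, 1] (assumed normalization). On a GPU,
    // each element of this loop nest could be handled by its own thread.
    void update(const std::vector<unsigned char>& newFrame) {
        assert(newFrame.size() >= frameElems_);
        for (std::size_t slot = 0; slot < batch_; ++slot) {
            for (std::size_t e = 0; e < frameElems_; ++e) {
                float& dst = back_[slot * frameElems_ + e];
                if (slot + 1 < batch_) {
                    dst = front_[(slot + 1) * frameElems_ + e];       // copy old frame
                } else {
                    dst = static_cast<float>(newFrame[e]) / 255.0f;   // process new frame
                }
            }
        }
        std::swap(front_, back_);  // the freshly written buffer becomes current
    }

    // Contiguous data, oldest frame first.
    const float* data() const { return front_.data(); }

private:
    std::size_t batch_;
    std::size_t frameElems_;
    std::vector<float> front_;  // current window
    std::vector<float> back_;   // scratch buffer written during update
};
```

After each `update`, `data()` points at the frames in oldest-first order in one contiguous block, so nothing needs to be rearranged before handing it to the next stage.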

morousg commented 10 months ago

Regarding the difference in the case of the GPU memory, I think you can use several 32-image CircularTensor objects, which will do the sorting of the images very fast, and then just copy each entire buffer to the contiguous memory region.

I'm sure this will be faster than using OpenCV.

// Do this allocation somewhere, only once, and reuse the variable.
// Size assumes each tensor holds 32 images of WIDTH x HEIGHT float4 pixels.
float4* myContiguousBigMemory;
cudaMalloc(&myContiguousBigMemory, (32 * NUM_BUFFERS) * WIDTH * HEIGHT * sizeof(float4));

std::array<cvGS::CircularTensor<...>, NUM_BUFFERS> circularTensors;
// Initialize each tensor with the Alloc method, do it once and reuse
for (int i = 0; i < NUM_BUFFERS; i++) {
    circularTensors[i].Alloc(WIDTH, HEIGHT);
}

// Now, on each iteration
float4* currentPointer;
for (int i = 0; i < NUM_BUFFERS; i++) {
    // I'm more used to CUDA than OpenCV, you can use the OpenCV way, but use the opencv_cuda_stream
    circularTensors[i].update(opencv_cuda_stream, ...);
    // Destination pointer
    if (i == 0) currentPointer = myContiguousBigMemory;
    else currentPointer = currentPointer + circularTensors[i-1].getNumElems();
    const cudaStream_t cuda_stream = cv::cuda::StreamAccessor::getStream(opencv_cuda_stream);
    cudaMemcpyAsync(currentPointer, circularTensors[i].data(), circularTensors[i].sizeInBytes(), cudaMemcpyDeviceToDevice, cuda_stream);
}
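The destination-pointer arithmetic in that loop is just a running sum of element counts. As a rough host-side illustration, here is a small C++ sketch of the same packing pattern. `packContiguous` is a hypothetical helper, not part of cvGPUSpeedup, and `std::copy` stands in for `cudaMemcpyAsync`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT part of cvGPUSpeedup: copies each source buffer
// into the preallocated `dst` back to back and returns, for each buffer,
// the element offset at which it starts inside `dst`.
std::vector<std::size_t> packContiguous(const std::vector<std::vector<float>>& sources,
                                        std::vector<float>& dst) {
    std::size_t total = 0;
    for (const auto& s : sources) total += s.size();
    assert(dst.size() >= total);  // destination is allocated once, up front

    std::vector<std::size_t> offsets;
    offsets.reserve(sources.size());
    std::size_t offset = 0;
    for (const auto& s : sources) {
        offsets.push_back(offset);
        // On the GPU this copy would be a device-to-device memcpy.
        std::copy(s.begin(), s.end(),
                  dst.begin() + static_cast<std::ptrdiff_t>(offset));
        offset += s.size();  // next buffer starts right after this one
    }
    return offsets;
}
```

Because the offset is advanced by each buffer's own size, the buffers do not need to have the same length for the result to stay contiguous.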

chenscottus commented 10 months ago

Good! I will try it later! Many thanks! -Scott

morousg commented 10 months ago

Hi!

Just found out that there was a bug with the code snippet that I shared.

I just fixed it with the latest commit on the main branch.

Let me know if you have any issues.

Thanks!

morousg commented 10 months ago

If it works, please, let me know. If you are working for a company, and you are using cvGPUSpeedup, could you tell me which company, so that I can put it in the front page?

If it does not work, let me know too, maybe I can do something about it.

Thanks!

chenscottus commented 10 months ago

I am looking for action recognition solutions, and it looks like they require several frames as input. cvGPUSpeedup will probably be useful. Once I find a solution, I will try cvGPUSpeedup.

Right now I am using OpenCV 4.8.1, CUDA 12.3 and TensorRT 8.6.1.6 for detection, and I still need to find out how to use cvGPUSpeedup.

If you have more use cases and examples, please let me know.

Many thanks!


morousg commented 9 months ago

Closed for inactivity. Feel free to use the discussions section to ask questions, or open a new issue.

Thanks!

morousg / cvGPUSpeedup

Can you show the code for "In this other case, we are updating a temporal Tensor of 15 images ..." #67