marcoslucianops / DeepStream-Yolo

NVIDIA DeepStream SDK 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models
MIT License

Why is the GPU bbox parser slightly slower than the CPU bbox parser in V100 GPU tests? #438

Open ccqedq opened 10 months ago

marcoslucianops commented 10 months ago

Probably because the data needs to be copied from the CPU to the GPU and then back from the GPU to the CPU. It's not possible to get the data directly on the GPU in the current DeepStream version.

ccqedq commented 10 months ago

From the code:

```cpp
thrust::device_vector<NvDsInferParseObjectInfo> objects(outputSize);

float minPreclusterThreshold = *(std::min_element(detectionParams.perClassPreclusterThreshold.begin(),
    detectionParams.perClassPreclusterThreshold.end()));
int threads_per_block = 1024;
int number_of_blocks = ((outputSize - 1) / threads_per_block) + 1;
decodeTensorYoloECuda<<<number_of_blocks, threads_per_block>>>(
    thrust::raw_pointer_cast(objects.data()), (float*) (boxes.buffer), (float*) (scores.buffer),
    (float*) (classes.buffer), outputSize, networkInfo.width, networkInfo.height, minPreclusterThreshold);
objectList.resize(outputSize);
thrust::copy(objects.begin(), objects.end(), objectList.begin());
```

it seems that this implementation only copies data from the GPU to the CPU, and the `decodeTensorYoloECuda` kernel does not copy any data from the CPU to the GPU.

marcoslucianops commented 10 months ago

- Raw pointer for access on the GPU: `thrust::raw_pointer_cast`
- GPU to CPU copy: `thrust::copy`
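To illustrate the pattern (this is a minimal sketch, not the repo's code; `fillKernel` is a hypothetical stand-in for `decodeTensorYoloECuda`): the kernel writes into device memory through the raw pointer, and `thrust::copy` then moves the results from device to host.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <cstdio>

// Hypothetical kernel standing in for decodeTensorYoloECuda: writes
// results into a device buffer accessed through a raw pointer.
__global__ void fillKernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i * 2;
}

int main() {
    const int n = 8;
    thrust::device_vector<int> d(n);                // lives in GPU memory
    int* raw = thrust::raw_pointer_cast(d.data());  // raw pointer for the kernel
    fillKernel<<<1, 32>>>(raw, n);
    cudaDeviceSynchronize();

    thrust::host_vector<int> h(n);
    thrust::copy(d.begin(), d.end(), h.begin());    // the only device-to-host copy
    std::printf("h[7] = %d\n", h[7]);
    return 0;
}
```

Note there is no explicit host-to-device copy here: the inputs (`boxes.buffer` etc. in the real parser) are already device pointers, so the only transfer is the final `thrust::copy` back to the host.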

marcoslucianops commented 10 months ago

I said CPU to GPU to make the process easier to understand. The best approach would be full batch processing on the GPU, but that isn't available in DeepStream.