Open ccqedq opened 10 months ago
From the code:

    thrust::device_vector
    float minPreclusterThreshold = *(std::min_element(
        detectionParams.perClassPreclusterThreshold.begin(),
        detectionParams.perClassPreclusterThreshold.end()));
    int threads_per_block = 1024;
    int number_of_blocks = ((outputSize - 1) / threads_per_block) + 1;
    decodeTensorYoloECuda<<<number_of_blocks, threads_per_block>>>(
        thrust::raw_pointer_cast(objects.data()),
        (float*) (boxes.buffer), (float*) (scores.buffer), (float*) (classes.buffer),
        outputSize, networkInfo.width, networkInfo.height, minPreclusterThreshold);
    objectList.resize(outputSize);
    thrust::copy(objects.begin(), objects.end(), objectList.begin());

It seems that the whole implementation only copies data from the GPU to the CPU, and that the decodeTensorYoloECuda function does not copy any data from the CPU to the GPU.
Raw pointer to access on GPU: thrust::raw_pointer_cast
GPU to CPU: thrust::copy
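The two calls named above can be sketched in isolation. This is only a minimal illustration, not the repository's implementation: `Detection`, `decodeSketch`, and `decodeOnGpu` are hypothetical names, and the sketch uploads its own test input, so it says nothing about how the real `boxes`, `scores`, and `classes` buffers reach the GPU.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical stand-in for the parser's per-detection output struct.
struct Detection {
    float score;
    int   index;
};

// Hypothetical kernel: one thread per candidate, writing straight into device memory.
__global__ void decodeSketch(Detection* out, const float* scores, int n, float threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // the last block may be only partly used
    out[i].index = i;
    out[i].score = scores[i] >= threshold ? scores[i] : 0.0f;
}

std::vector<Detection> decodeOnGpu(const std::vector<float>& hostScores, float threshold)
{
    const int n = static_cast<int>(hostScores.size());
    if (n == 0) return {};

    // Assumption for this sketch only: the input starts on the host and is uploaded here.
    thrust::device_vector<float> scores(hostScores.begin(), hostScores.end());
    // Output storage is allocated on the GPU; nothing is uploaded into it.
    thrust::device_vector<Detection> objects(n);

    const int threadsPerBlock = 256;
    const int blocks = (n - 1) / threadsPerBlock + 1;  // ceiling division, as in the issue

    // raw_pointer_cast turns Thrust's device pointer into a plain pointer the kernel accepts.
    decodeSketch<<<blocks, threadsPerBlock>>>(
        thrust::raw_pointer_cast(objects.data()),
        thrust::raw_pointer_cast(scores.data()), n, threshold);

    // GPU to CPU: thrust::copy from device iterators into a host container.
    std::vector<Detection> result(n);
    thrust::copy(objects.begin(), objects.end(), result.begin());
    return result;
}
```

Calling `decodeOnGpu({0.9f, 0.1f, 0.6f}, 0.5f)` returns three entries in which the second score has been zeroed by the threshold. The output vector never needs an upload because the kernel is its only writer.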
I said "CPU to GPU" to make the process easier to understand. The best approach would be to do all of the processing on the GPU in a batch, but that is not available in DeepStream.
Probably because the data needs to be copied from the CPU to the GPU and then from the GPU to the CPU again. It's not possible to get the data directly from the GPU in the current DeepStream version.
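The round trip described here (CPU to GPU, work on the GPU, then GPU back to CPU) can be sketched with Thrust alone. This is a generic illustration under stated assumptions, not DeepStream code: `ClampToThreshold` and `roundTrip` are hypothetical names, and the per-element operation is a placeholder for the real work.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Hypothetical per-element operation standing in for the real GPU work.
struct ClampToThreshold {
    float threshold;
    __host__ __device__ float operator()(float x) const
    {
        return x >= threshold ? x : 0.0f;
    }
};

std::vector<float> roundTrip(const std::vector<float>& host, float threshold)
{
    // CPU to GPU: constructing a device_vector from host iterators performs the upload.
    thrust::device_vector<float> dev(host.begin(), host.end());

    // GPU work, done in place on the device copy.
    thrust::transform(dev.begin(), dev.end(), dev.begin(), ClampToThreshold{threshold});

    // GPU to CPU: download the result into a host container.
    std::vector<float> out(host.size());
    thrust::copy(dev.begin(), dev.end(), out.begin());
    return out;
}
```

Each call pays for two transfers in addition to the computation, which is the cost the reply above attributes to the copy in each direction.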