This is interesting, thanks for your effort! Do you have sample code where you use the GPU buffer, maybe with PyTorch/TensorRT? Did you notice a significant performance improvement?
Thanks! Currently, I have only used it with a custom transform that applies to the whole pointcloud (a rotation along the X axis), using numba.cuda-accelerated Python code:
import math
import time

import numpy as np
import cupy as cp
from numba import cuda
from numba.cuda.cudadrv.devicearray import DeviceNDArray
import numba
import torch

from typing import Union, Tuple, Callable, TypeVar
from typing_extensions import ParamSpec

T = TypeVar('T')
P = ParamSpec('P')
RotateCallable = Callable[[float, float, float, float], Tuple[float, float]]

Y = 1
Z = 2
ROT = 69.0


class TimeMe:
    '''
    Timing decorator that takes into account whether the timing would be done on GPU or CPU.
    @depends: torch
    Ref: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964/2
    '''
    def __init__(self, gpu: bool = False) -> None:
        self._gpu: bool = gpu
        if gpu:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)

    def __call__(self, func: Callable[P, T]) -> Callable[P, T]:
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> Tuple[float, T]:
            all_ts: float = 0.0
            ts: float = 0.0
            if self._gpu:
                self._start.record()
            else:
                ts = time.time()
            result = func(*args, **kwargs)
            if self._gpu:
                self._end.record()
                torch.cuda.synchronize()
            all_ts = self._start.elapsed_time(self._end) if self._gpu else (time.time() - ts) * 1000.0
            return all_ts, result
        return wrapper
def rotate_yz_sin_cos(y: float, z: float, sin_: float, cos_: float) -> Tuple[float, float]:
    if math.isfinite(y) and math.isfinite(z):
        yy = y * cos_ - z * sin_
        zz = y * sin_ + z * cos_
        return yy, zz
    return math.nan, math.nan


# JIT compile for both CPU and GPU
rotate_yz_sin_cos_cpu: RotateCallable = numba.njit(nogil=True, cache=True)(rotate_yz_sin_cos)
rotate_yz_sin_cos_gpu: RotateCallable = cuda.jit(device=True, cache=True)(rotate_yz_sin_cos)
@cuda.jit
def rotate_pcl_kernel(pcl: DeviceNDArray, rot: float) -> None:
    '''
    GPU kernel to rotate a pointcloud by an angle rot (in place; CUDA kernels cannot return values).
    Ref: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb
    '''
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    for i in range(startY, h, gridY):
        for j in range(startX, w, gridX):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_gpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)
@TimeMe(gpu=True)
def rotate_pcl_gpu(pcl: Union[cp.ndarray, np.ndarray],
                   rot: float,
                   h2d: bool = False) -> Union[cp.ndarray, np.ndarray]:
    blockdim = (16, 16)
    griddim = (16, 16)
    pcl = cuda.to_device(pcl) if h2d else pcl
    rotate_pcl_kernel[griddim, blockdim](pcl, rot)
    pcl = pcl.copy_to_host() if h2d else pcl
    # The rotation only happens out of place when H2D/D2H transfers were requested,
    # i.e. when pcl originally lives in host memory
    return pcl
@TimeMe()
@numba.njit(nogil=True, parallel=True, cache=True)
def rotate_pcl_cpu(pcl: np.ndarray, rot: float) -> None:
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape
    for i in numba.prange(h):
        for j in numba.prange(w):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_cpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)
if __name__ == '__main__':
    # TODO: add the ZED init and Mat creation
    pcl_gpu = pcl_mat.get_data(memory_type=sl.MEM.GPU)
    # Note that this will take "D2H" time
    # Ref: https://docs.cupy.dev/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.get
    pcl_cpu = pcl_gpu.get()
    # The rotation will happen in place since pcl_gpu lives in device memory
    rot_gpu_lat, _ = rotate_pcl_gpu(pcl_gpu, ROT)
    rot_cpu_lat, _ = rotate_pcl_cpu(pcl_cpu, ROT)
    # NOTE: Tests for correctness were performed to make sure the rotation is identical
    # on GPU/CPU, and that the rotation actually happened, in place.
As you can see, the gain is significant, and we can write all the pcl processing code in the same manner, thus never having to bring the pcl to host memory. Additionally, we can run DL inference on the frames using TensorRT without having to pay for the D2H/H2D transfers.
CuPy has great interoperability with most common frameworks (notably numba.cuda, PyTorch, and TensorRT): https://docs.cupy.dev/en/stable/user_guide/interoperability.html.
The same results might be achieved using numba.cuda DeviceNDArrays, but we didn't test it.
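For illustration, a minimal zero-copy interop sketch (the array below is just a stand-in for the retrieved Mat data, and it assumes reasonably recent CuPy and a CUDA-enabled PyTorch build):

import cupy as cp
import torch

# Stand-in for the cupy view returned by get_data(memory_type=sl.MEM.GPU)
frame_cp = cp.zeros((720, 1280, 4), dtype=cp.uint8)

# Zero-copy hand-off through DLPack: both objects share the same device memory
frame_torch = torch.from_dlpack(frame_cp)
frame_back = cp.from_dlpack(frame_torch)

# Writing through one view is visible through the others (same buffer)
frame_torch[0, 0, 0] = 255
assert int(frame_cp[0, 0, 0]) == 255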
Impressive! We need to think about how we could go forward with this. Ideally, I think it would be best to make the cupy dependency optional to limit complexity for most users. I'll try to test this soon.
When it comes to dependency management, I do agree that cupy being an optional dependency is the way to go.
I also want to note that we've seen an increase in the maximum FPS. We went from around 12 FPS (grab/retrieve on CPU) to around 15 FPS when we started keeping the data on the GPU (even with 2 ZED Mini cameras working at the same time, on an Orin 32GB). This led us to believe that your specs are measured using the C++ API!?
I spent some time testing it in depth, and I saw some confusing behavior that I would love your help with. So, as I mentioned, I had code to test the correctness of the retrieved data. The code compares the CPU frame/pcl with its GPU counterpart element-wise (I considered the CPU data to be the ground truth) and gives me back the match percentage:
@numba.njit(nogil=True, cache=True)
def almost_eq(a: Union[int, float], b: Union[int, float], e: Union[int, float]) -> bool:
    if not (math.isfinite(a) and math.isfinite(b)):
        # If either is nan, consider them equal
        return True
    return abs(a - b) <= e


@numba.njit(nogil=True, cache=True)
def count_almost_eq(mat1: np.ndarray, mat2: np.ndarray, eupsilon: float) -> int:
    '''
    This method is needed (instead of np.count_nonzero(mat1 == mat2)) because
    np.nan == np.nan >>> False. Thus we need to manually check the elements.
    '''
    sh1 = mat1.shape
    sh2 = mat2.shape
    assert sh1 == sh2
    assert mat1.size == mat2.size
    h, w, _ = sh1
    eq = mat1.size
    for j in range(h):
        for i in range(w):
            x_eq = almost_eq(mat1[j][i][X], mat2[j][i][X], eupsilon)
            y_eq = almost_eq(mat1[j][i][Y], mat2[j][i][Y], eupsilon)
            z_eq = almost_eq(mat1[j][i][Z], mat2[j][i][Z], eupsilon)
            eq -= 1 * (not (x_eq and y_eq and z_eq))
    return eq
# Retrieve the image/measure
gpu_pcl = pcl_mat.get_data(memory_type=sl.MEM.GPU)
gpu_frame = frame_mat.get_data(memory_type=sl.MEM.GPU)

# Bring the data to host memory to visualize it and test its correctness
gpu_pcl_np = gpu_pcl.get()
gpu_frame_np = gpu_frame.get()

print(f'PCL -- {(count_almost_eq(pcl_cpu, gpu_pcl_np, 0.000001) / pcl_cpu.size) * 100.0:.2f} % almost identical')
print(f'RGB -- {(count_almost_eq(cpu_frame, gpu_frame_np, 1) / cpu_frame.size) * 100.0:.2f} % almost identical')
I tested with live streams from a ZED Mini, a ZED 2, and SVOs, both on my laptop and on an Nvidia AGX Orin dev kit, with the latest SDK (4.0).
The confusing behavior is that in HD1080 and HD720 (in all FPS options), everything works perfectly (100% match on both PCL and Frame), but when I choose either HD2K or VGA (in all FPS options), the PCL still matches 100%, while the Frame only matches around 74 ±3 %.
Additionally, when I visualize the frames (after converting the cupy array to a numpy array), the GPU one looks kinda shifted (for HD2K and VGA only):
I tried a few things, but nothing seems to fix this.
Am I missing something about how CUDA works, or about how the frame data is "organized" in memory? Is this related to some striding mechanism in the frame's representation on the GPU (I am just guessing; not really sure what I am asking)?
Thank you for indulging me.
@adujardin I was wondering if you had time to test this out, or maybe if you can point me towards why this issue might be happening.
In order to further test the data retrieval, I created an image from the RGBA data in the PCL using this function:
@numba.njit(cache=True, parallel=True)
def xyzrgba2rgba(src: np.ndarray, dst: np.ndarray) -> None:
    '''
    Fill the dst matrix with the RGBA data from the XYZRGBA pcl.
    '''
    RGBA_AXIS = 3
    BYTE_MASK = 0xFF
    R, G, B = (0, 1, 2)
    RED_MASK = np.uint32(BYTE_MASK << (R * 8))
    GREEN_MASK = np.uint32(BYTE_MASK << (G * 8))
    BLUE_MASK = np.uint32(BYTE_MASK << (B * 8))
    h, w, _ = dst.shape
    for i in numba.prange(w):
        for j in numba.prange(h):
            if not np.isfinite(src[j][i][RGBA_AXIS]):
                continue
            rgba_bin = np.asarray(src[j][i][RGBA_AXIS], np.float32).view(np.uint32)
            dst[j][i][R] = np.uint8((rgba_bin & RED_MASK) >> (R * 8))
            dst[j][i][G] = np.uint8((rgba_bin & GREEN_MASK) >> (G * 8))
            dst[j][i][B] = np.uint8((rgba_bin & BLUE_MASK) >> (B * 8))
            dst[j][i][RGBA_AXIS] = BYTE_MASK


def image_from_pcl_rgba(pcl):
    im = np.ones_like(pcl, dtype=np.uint8)
    xyzrgba2rgba(pcl, im)
    return im
And it gave me the following two images (CPU retrieval vs. the image built from the GPU PCL), which points me towards the fact that the data is in fact being read correctly.
Do you have any ideas for me to try?
Hi! I'm just now using the Python API for a project, and I realized that one of the main bottlenecks is, also for me, the double passage of textures from GPU to CPU. I'd be super interested in testing your PR, but I'm rather confused about how to test the repo. Could I ask you for some directions? Thanks a lot!
Hey, @Neeklow! I would appreciate another tester. To test it, you need to build the Python API from source. For that:
- Go to the src folder and follow the README instructions for building the plugin.
- Then, for the memory_type argument, supply sl.MEM.GPU instead of sl.MEM.CPU.
Note that I only tested this with the ZED 4.0 SDK and with Python 3.8.
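For reference, a minimal usage sketch pieced together from the calls shown elsewhere in this thread (init parameters and error handling kept to a bare minimum):

import pyzed.sl as sl

zed = sl.Camera()
if zed.open(sl.InitParameters()) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError('Failed to open the camera')

# Mat backed by device memory, so get_data() can return a cupy view
image_gpu = sl.Mat(memory_type=sl.MEM.GPU)
if zed.grab() == sl.ERROR_CODE.SUCCESS:
    zed.retrieve_image(image_gpu, view=sl.VIEW.LEFT, type=sl.MEM.GPU)
    frame_cp = image_gpu.get_data(memory_type=sl.MEM.GPU)  # cupy.ndarray, no D2H copy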
Thanks a lot @Rad-hi!
I'm having trouble building the plugin.
I'm currently in a Python 3.8 virtual environment (I'm assuming that shouldn't be a problem?).
After I install all the requirements, I run python setup.py build and I first receive:
ZED SDK Version: OK
compilation flags:
include dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/include', 'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/include']
library dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/lib', 'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/lib/x64']
libraries: ['sl_zed64']
Building module: ('pyzed.sl', ['pyzed/sl.pyx'])
Compiling pyzed/sl.pyx because it changed.
[1/1] Cythonizing pyzed/sl.pyx
Followed by a mega Cython error (Cython.Compiler.Errors.CompileError: pyzed/sl.pyx ... the error is huge, so I'm attaching errors-build.txt instead of cluttering this thread). The currently installed Cython is 3.0.10 (not sure if this makes any difference).
Any help is really appreciated!!
Okay, so after inspecting the build log, I think I can see the issue:
When running the line from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString, Python is not able to find sl_c.pxd:

from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString
^
------------------------------------------------------------
pyzed\sl.pyx:28:0: 'sl_c.pxd' not found

(this is from errors-build.txt, around line 21)
Two possibilities:
- Either you're not running the python3 setup.py build command under the src folder,
- Or Python cannot locate the sl_c.pxd file, in which case you could help it by adding the path to the pyzed folder to the system path, like so:

from libcpp.map cimport map
# Add the next two lines to sl.pyx, ~ line 27
import sys
sys.path.append('/ABSOLUTE/PATH/TO/ZED/API/zed-python-api/src/pyzed')
# And modify the import: note the pyzed.sl_c
from pyzed.sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString
Make sure to update the path with the appropriate value (I see you're using Windows, and I have no idea how paths work there).
Hope this helps.
@Rad-hi thanks for the tips. That definitely helped, but it seems I'm having trouble with the MSVC compiler:
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
is where I get to (I've uploaded the whole build error.txt again just in case). I've been bashing my head against this for the past few days, and I'm a bit short on ideas. If you have any experience or ideas on where to look for a solution, I'd really appreciate it (but I understand you're not on Windows, and I gather this is a strictly Windows issue)! Thanks a lot!
I see @Neeklow, you seem to be encountering the same issues I hit when trying to build this for the Orin (an NVIDIA board I am working on).
The actual error you're encountering is (around line 303 in the build log):
pyzed/sl.cpp(146250): error C2039: 'gnss_ignore_threshold': is not a member of 'sl::PositionalTrackingFusionParameters'
And the quickest fix for me was to comment out the gnss_ignore_threshold related code. You can find the changes in this branch: https://github.com/Rad-hi/zed-python-api/tree/changes-for-orin.
Beware of the changes to sl.pyx in that branch:
# This is added because sl_c.pxd wasn't being found
import os
import sys
HOME = os.path.expanduser('~')
sys.path.append(os.path.join(HOME, 'Desktop/zed-python-api/src'))
since I placed the API under my Desktop at the time of testing this. Apply the changes from your earlier fix here instead of keeping it this way.
I hope this helps and the API builds.
Hi @Rad-hi, I'm also interested, since I grab the image, depth and pointcloud, and moving the data back and forth from GPU memory is a waste of time (the processing is entirely on the GPU with YOLO and PyTorch). I followed your instructions; then I ran it and the system asked me to install cupy. OK, I forgot it. I installed it through pip, but I had to downgrade Cython from 3.0.11 to 0.29.37 (compilation error). Now cupy is OK and almost everything runs with sl.MEM.CPU, but if I switch to sl.MEM.GPU I receive the following runtime error:
File "pyzed/sl.pyx", line 4556, in pyzed.sl.Mat.get_data
ValueError: Provided MEM type doesn't match Mat's memory_type.
I tried to change the default memory to MEM.GPU (it looks like an enum error) but I got the same error.
System: Core i5 7600K, Nvidia GTX 1050, CUDA 12.2, ZED SDK 4.0, Ubuntu 22.04
Note: after switching to the custom pyzed, I got an error about POSITIONAL_TRACKING_MODE. The error in my Python script said there is no "mode" attribute in the positional tracking parameters. I commented it out and moved on.
Hello @andreacelani, the feature is indeed very useful.
As for the error message you are getting, that's actually an error message I added: make sure you initialize the Mat with the MEM.GPU type.
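That is, something along the lines of (same constructor argument as shown later in this thread):

image_gpu = sl.Mat(memory_type=sl.MEM.GPU)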
Yes, you are right. I forgot that Mat has a type initialization.
Let me try. I'll keep you updated about performance.
Wow, I got a 2x speedup with GPU reading of an SVO file! And even more with a real camera stream! Thank you @Rad-hi!
The following takes around 32-33 msec with CPU memory and streaming from an SVO file:
self.zed.retrieve_image(self.image, view=sl.VIEW.LEFT, type=sl.MEM.CPU) # read image from left imager grabber
self.zed.retrieve_measure(self.depth, measure=sl.MEASURE.DEPTH, type=sl.MEM.CPU) # read depth image from grabber
self.zed.retrieve_measure(self.point_cloud, measure=sl.MEASURE.XYZRGBA, type=sl.MEM.CPU) # Get the point cloud
self.zed_image = self.image.get_data(memory_type=sl.MEM.CPU)
self.zed_depth = self.depth.get_data(memory_type=sl.MEM.CPU)
self.zed_pointcloud = self.point_cloud.get_data(memory_type=sl.MEM.CPU)
While this takes only 15-16 msec with GPU memory and streaming from an SVO file:
self.zed.retrieve_image(self.image_gpu, view=sl.VIEW.LEFT, type=sl.MEM.GPU) # read image from left imager grabber
self.zed.retrieve_measure(self.depth_gpu, measure=sl.MEASURE.DEPTH, type=sl.MEM.GPU) # read depth image from grabber
self.zed.retrieve_measure(self.point_cloud_gpu, measure=sl.MEASURE.XYZRGBA, type=sl.MEM.GPU) # Get the point cloud
self.zed_image = self.image_gpu.get_data(memory_type=sl.MEM.GPU)
self.zed_depth = self.depth_gpu.get_data(memory_type=sl.MEM.GPU)
self.zed_pointcloud = self.point_cloud_gpu.get_data(memory_type=sl.MEM.GPU)
With a real camera stream I went from 15-17 msec to 5.5-8 msec with GPU memory (the same pipeline as above: image, depth and pointcloud).
SVO and real camera resolution: 1280x720.
SVO rate: 30 fps; real camera rate: 60 fps capped at 30 fps (I think the 60 fps setting helped reach 5.5-8 msec).
Depth settings in both cases (SVO and real camera):
init_params.depth_mode = sl.DEPTH_MODE.ULTRA # Use ULTRA depth mode
init_params.depth_minimum_distance = 0.3 # Set the minimum depth perception distance to 30cm
init_params.depth_maximum_distance = 10 # Set the maximum depth perception distance to 10m
After the readout of the image from the ZED 2i on the GPU, I only need to convert to torch with .from_dlpack (from cupy to torch):
self.zed_image_without_alpha = self.zed_image[:, :, :3]
torch_image = torch.from_dlpack(self.zed_image_without_alpha)
(I removed the alpha channel because I feed it into YOLO's input; the torch_image tensor is then rearranged to match YOLO's expected input format.)
Same for depth: ZedDepthGPU = torch.from_dlpack(self.zed_depth)
and for the pointcloud: PCtensorGPU = torch.from_dlpack(self.zed_pointcloud)
The CPU load (4 cores @ 4 GHz) went from 48% to 32%, while the GPU load went from 25% to 37%.
The complete pipeline went from 100 msec to 56-60 msec (tested with the SVO file stream; no Cython compilation, pure Python).
The only small drawback that I found is related to POSITIONAL_TRACKING_MODE:

ptParams.mode = 1 # sl.POSITIONAL_TRACKING_MODE.GEN_2
AttributeError: 'pyzed.sl.PositionalTrackingParameters' object has no attribute 'mode'

where ptParams is:

ptParams = sl.PositionalTrackingParameters() # set parameters for Positional Tracking

It started to give me the AttributeError when I switched to the new pyzed. I suppose it uses the default right now (per the API, the default is GEN_1). I cannot understand how this is related to the custom pyzed, since there is no change in the code there.
For the sake of completeness, I initialized the Mats as follows:
self.image_gpu = sl.Mat(memory_type=sl.MEM.GPU)
self.depth_gpu = sl.Mat(memory_type=sl.MEM.GPU)
self.point_cloud_gpu = sl.Mat(memory_type=sl.MEM.GPU)
self.image = sl.Mat()
self.depth = sl.Mat()
self.point_cloud = sl.Mat()
@andreacelani these are some great findings, thank you for the detailed measurements!
I have no idea about the positional tracking thing, though; I never worked with it. But I am impressed by the impact!
Can you please try grabbing a single frame at HD2K resolution and tell me if you get the same output I got? For this PR, I only had a couple of weeks to get it working, and since I needed HD2K, I couldn't continue working on it once I hit the issue described in the previous comments. I would greatly appreciate you testing it for me!
@Rad-hi grabbing to MEM.CPU is always fine. Resolutions VGA, HD720, HD1080 and HD2K (15 fps, capped at 10 fps, in case these settings affect anything) are always fine. If I grab to MEM.GPU, HD720 and HD1080 are fine, while VGA and HD2K give lines. It seems the byte order is messed up.
VGA grabbed to GPU:
HD720 grabbed to GPU:
HD1080 grabbed to GPU:
HD2K grabbed to GPU:
The same lines, less evident, with depth (HD2K grabbed to GPU):
I also attach a test grabbing to MEM.CPU @ HD2K for reference:
If I look at the resolution specs here: https://www.stereolabs.com/docs/api/python/classpyzed_1_1sl_1_1RESOLUTION.html
HD2K (2208 x 1242) and VGA (which is actually not real VGA, but 672 x 376) are both "strange" resolutions that could mess up the byte order if standard functions are used. If you look at the picture, you can see the plant pot on the left being split over several columns. Try looking at your code with this in mind; I'll do the same.
All the acquisitions have been done with a live camera, a ZED 2i. ZED SDK: 4.0.8, CUDA: 12.1.105.
@andreacelani I appreciate your efforts, thank you! As for your findings, they do seem to align with what my tests (back when I worked on this) yielded. On the grabbing side, all I am doing is wrapping the GPU data pointer into a cupy array with the same shape as the one you would get from the CPU grab.
In this reply, I compared the frame grabbed on the CPU with the frame extracted from the PCL grabbed on the GPU in XYZ_RGBA mode, and I got a good-ish frame in HD2K. So, coupled with you validating my observation that the frames are correct but shifted somehow, I am starting to think that the issue is in the D2H transfer and visualization, instead of in the grabbing.
I will try to test these hypotheses in the upcoming days, and hopefully we can close this chapter.
@Rad-hi I do the D2H with image_cpu = torch.from_dlpack(image_gpu).detach().cpu().numpy(), and I cannot understand how this could affect the byte order.
Maybe one test could be the following: on the GPU, try to crop and pad the VGA image from 672x376 to 640x480, send it to the CPU, and see what happens.
But doing some basic math, I found the following: in the VGA image I get 7 repetitions. This means that instead of reading an image 672 pixels wide, the image should be 672/7 = 96 pixels wider. 96 pixels wider means a total of 768 pixels, which sounds "more common" than 672. If I do the same with the HD2K image (2208 pixels wide), I get 23 repetitions -> 2208/23 = 96 pixels. The same extra pixels as VGA. Pretty strange. The image would be 2304 pixels wide, which is not common but is exactly 3 times 768.
From this calculation, I suppose images in VGA and HD2K are larger (in width, and maybe height) on the GPU than on the CPU. On the CPU, images are cropped to, e.g., 672x376. On the GPU, the cupy pointer points to the same memory, but the readout tries to squeeze a larger image into a smaller format.
Does this make sense to you?
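One quick way to sanity-check this hypothesis (the 512-byte row alignment and 4 bytes per pixel are assumptions on my side; the real pitch is whatever the SDK actually allocates):

# Round each row up to a hypothetical 512-byte pitch and see which widths grow
BYTES_PER_PIXEL = 4   # e.g. BGRA, uint8
ALIGN = 512           # assumed pitch alignment

for name, width in (('VGA', 672), ('HD720', 1280), ('HD1080', 1920), ('HD2K', 2208)):
    row_bytes = width * BYTES_PER_PIXEL
    padded_width = (-(-row_bytes // ALIGN) * ALIGN) // BYTES_PER_PIXEL  # ceil to ALIGN, back to pixels
    print(f'{name}: {width} -> {padded_width} pixels per padded row')

This prints 768 for VGA and 2304 for HD2K (the +96 measured above), while HD720 (1280) and HD1080 (1920) are already aligned and stay unchanged, which would also explain why only VGA and HD2K misbehave.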
@andreacelani you are spot on!
The moment I read "96 pixels" of shift, it clicked! Because if you look at the light bulb in the images you provided in your tests, it's "spread" across multiple columns.
Here's a small test I did:
import cv2
import numpy as np
import os

IMG_VGA = 'test_vga.jpg'
IMG_HD2K = 'test_hd2k.jpg'
TRUE_VGA_WIDTH = 672 + 96
TRUE_HD2K_WIDTH = 2208 + 96


def get_corrected_img(faulty: np.ndarray, true_width: int) -> np.ndarray:
    h, w_old, c = faulty.shape
    canvas = np.zeros((h, true_width, c), dtype=faulty.dtype)
    for i in range(0, canvas.size, c):
        if i == h * w_old:
            break
        y_new = i // true_width
        x_new = i % true_width
        y_old = i // w_old
        x_old = i % w_old
        canvas[y_new][x_new] = faulty[y_old][x_old]
    return canvas


def correct_and_show_res(img_name: str, vga: bool = False) -> None:
    img_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), img_name)
    faulty = cv2.imread(img_path)
    corrected = get_corrected_img(faulty, TRUE_VGA_WIDTH if vga else TRUE_HD2K_WIDTH)
    cv2.imshow(f"FAULTY-{'VGA' if vga else 'HD2K'}", faulty)
    cv2.imshow(f"CORRECTED-{'VGA' if vga else 'HD2K'}", corrected)


correct_and_show_res(IMG_VGA, True)
correct_and_show_res(IMG_HD2K, False)
cv2.waitKey(0)
Where my test image is the one you supplied for the grab on GPU at the "VGA" resolution:
And here's the output of the above script:
Which confirms that the issue is indeed in the shaping of the data. Nice catch, I don't think I would've ever noticed such a thing!
Also, I tested on the HD2K image, and the 96-pixel shift is again the culprit (tested on the image you uploaded in your HD2K test):
NOTE: I don't currently have access to any camera, but I expect to have one back next week. I will update the code and test the resizing on the GPU side once I have access.
After looking at the code,

elif memory_type.value == MEM.GPU.value and not deep_copy:
    src_p = self.get_pointer(memory_type=memory_type)
    mem = cp.cuda.UnownedMemory(src_p, size, self)
    memptr = cp.cuda.MemoryPointer(mem, offset=0)
    arr = cp.ndarray(shape, dtype=dtype, memptr=memptr)
you can see that I am using shape and size, which I am calculating using the supplied functions (link to referenced code):
shape = (self.mat.getHeight(), self.mat.getWidth(), self.mat.getChannels())
...
size = self.mat.getHeight()*self.mat.getWidth()*self.mat.getChannels()
Where getHeight() and getWidth() are exposed from the C++ SDK:
cdef cppclass Mat 'sl::Mat':
    ...
    size_t getWidth() const
    size_t getHeight() const
And the SDK is not open source (as far as I can tell), which puts us in a situation where, if we want to make it work, we'd need to hardcode the shape when grabbing on the GPU (or maybe we open an issue?).
@adujardin I have found the issue. The trick is in the striding. I am closing this PR, since it has lived for so long and has fallen behind the master branch.
I'll thoroughly test my current solution and will open a better-quality PR.
@andreacelani, thank you for your input. It was invaluable in figuring this out!
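For anyone landing here later, a rough sketch of what a stride-aware wrapping could look like (the helper name is mine, and how the pitch is obtained is an assumption; the C++ SDK exposes it as sl::Mat::getStepBytes()):

import cupy as cp
import numpy as np

def wrap_pitched_gpu_buffer(dev_ptr: int, shape, dtype, step_bytes: int, owner=None) -> cp.ndarray:
    '''Wrap a pitched (row-padded) GPU buffer into a cupy view without copying.

    dev_ptr    -- raw CUDA device pointer to the first pixel
    shape      -- (height, width, channels) of the logical image
    step_bytes -- row pitch in bytes, as reported by the SDK
    '''
    height, width, channels = shape
    itemsize = np.dtype(dtype).itemsize
    mem = cp.cuda.UnownedMemory(dev_ptr, step_bytes * height, owner)
    memptr = cp.cuda.MemoryPointer(mem, offset=0)
    # Explicit strides: rows are step_bytes apart, pixels are channels * itemsize apart
    return cp.ndarray(shape, dtype=dtype, memptr=memptr,
                      strides=(step_bytes, channels * itemsize, itemsize))

A cp.ascontiguousarray() on the result would then give a tightly packed on-device copy whenever one is needed.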
@Rad-hi thank you, but it's nothing compared with the cupy integration. Excellent feature (or excellent patch for a long-awaited ZED feature). I purchased an RTX 4060 and I was busy testing it out. I will try to hardcode the image size as you suggest and report back here. Yes, the C++ SDK is closed, so we cannot do anything more than raise an issue. If nobody has raised one, I guess very few people use GPU readout with the 2K resolution (but there could be someone using 640). I'm looking forward to seeing what you call a better-quality PR.
Keep the retrieved data on GPU

Description
Through the utilization of cupy.ndarrays (extra dependency), we were able to keep the retrieved data on the GPU and provide a view of the memory on the Python side.
This PR introduces a slight modification to the get_data() function in a way that extends the API but doesn't break any of its current functionality.

Notes
In an effort to stay true to the original API, data on the GPU can be either deeply copied or viewed by the Python API consumer.
A list of references is kept as a comment above the added code for potential enhancements and/or better understanding of the feature.

Tests
This was tested with the ZED SDK 4.0, on both SVOs (recorded in HD2K-15) and on a live stream from a ZED Mini, with a custom wrapper of the Python API, on an Ubuntu 20.04.6 LTS system with an Nvidia GTX 1650 4GB graphics card.
Additionally, we are attempting to test it on an Nvidia AGX Orin 32GB developer kit, though we're having some Cython compilation errors that are prohibiting us from proceeding with the tests (we'd love to discuss them, but that's not our topic here).

Disclaimers
This is not thoroughly tested, and this PR is intended to spark a discussion around how this feature might be developed by the community, rather than being a final product (though we hope it's ready as is).