stereolabs / zed-python-api

Python API for the ZED SDK
https://www.stereolabs.com/docs/app-development/python/install/
MIT License
213 stars 94 forks source link

Feature/get data gpu #230

Closed Rad-hi closed 2 months ago

Rad-hi commented 1 year ago

Keep the retrieved data on GPU

Description

Through the utilization of cupy.ndarrays (extra dependency), we were able to keep the retrieved data on GPU and provide a view of the memory on the Python side.

This PR introduces a slight modification to the get_data() function in a way that extends the API, but doesn't break any of its current functionalities.

Notes

In an effort to stay true to the original API, data on the GPU could be either deeply copied, or viewed by the Python API consumer.

A list of references is kept as a comment above the added code for potential enhancements, and/or better understanding of the feature.

Tests

This was tested with the ZED SDK 4.0, on both SVOs (recorded in HD2K-15), and on a live stream from a ZED Mini, with a custom wrapper of the Python API, on a Ubuntu 20.04.6 LTS system, with an Nvidia GTX 1650 4Gb graphics card.

Additionally, we are attempting its test on an Nvidia AGX Orin 32Gb developer kit, though we're having some Cython compilation errors that are prohibiting us from proceeding with the tests (would love to discuss, but not our topic).

Disclaimers

This is not thoroughly tested, and this PR is intended to spark a discussion around how this feature might be developed by the community, rather than being a final product (though we hope it's ready as is).

adujardin commented 1 year ago

This is interesting, thanks for your effort! Do you have a sample code where you use the GPU buffer, maybe with Pytorch/TensorRT? Did you notice a significant performance improvement?

Rad-hi commented 1 year ago

Thanks! Currently, I only used it with a custom TF that applies to the whole pointcloud (rotation along the X axis) using numba.cuda accelerated Python code:

import math
import time
import numpy as np
import cupy as cp
from numba import cuda
from numba.cuda.cudadrv.devicearray import DeviceNDArray
import numba
import torch

from typing import Union, Tuple, Callable, TypeVar
from typing_extensions import ParamSpec

T = TypeVar('T')
P = ParamSpec('P')
RotateCallable = Callable[[float, float, float, float], Tuple[float, float]]

Y = 1
Z = 2
ROT = 69.0

class TimeMe:
    '''
    Timing decorator that takes into account whether the timing would be done in GPU or CPU
    @depends: torch
    Ref: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964/2
    '''
    def __init__(self, gpu: bool = False) -> None:
        self._gpu: bool = gpu
        if gpu:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)

    def __call__(self, func: Callable[P, T]) -> Callable[P, T]:
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> Tuple[float, T]:
            all_ts: float = 0.0
            ts: float = 0.0

            if self._gpu:
                self._start.record()
            else:
                ts = time.time()

            result = func(*args, **kwargs)

            if self._gpu:
                self._end.record()
                torch.cuda.synchronize()

            all_ts = self._start.elapsed_time(self._end) if self._gpu else (time.time() - ts) * 1000.0

            return all_ts, result

        return wrapper

def rotate_yz_sin_cos(y: float, z: float, sin_: float, cos_: float) -> Tuple[float, float]:
    if math.isfinite(y) and math.isfinite(z):
        yy = y * cos_ - z * sin_
        zz = y * sin_ + z * cos_
        return yy, zz
    return math.nan, math.nan

# JIT compile for both CPU and GPU
rotate_yz_sin_cos_cpu: RotateCallable = numba.njit(nogil=True, cache=True)(rotate_yz_sin_cos)
rotate_yz_sin_cos_gpu: RotateCallable = cuda.jit(device=True, cache=True)(rotate_yz_sin_cos)

@cuda.jit
def rotate_pcl_kernel(pcl: DeviceNDArray, rot: float) -> DeviceNDArray:
    '''
    GPU kernel to rotate pointclouds with an angle rot
    Ref: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb
    '''
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape

    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y

    for i in range(startY, h, gridY):
        for j in range(startX, w, gridX):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_gpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)

@TimeMe(gpu=True)
def rotate_pcl_gpu(pcl: Union[cp.ndarray, np.ndarray],
                   rot: float,
                   h2d: bool = False) -> Union[cp.ndarray, np.ndarray]:
    blockdim = (16, 16)
    griddim = (16, 16)

    pcl = cuda.to_device(pcl) if h2d else pcl
    rotate_pcl_kernel[griddim, blockdim](pcl, rot)
    pcl = pcl.copy_to_host() if h2d else pcl

    # The rotation doesn't happen inplace only when H2D/D2H
    # transfers were requested; pcl originally on host memory 
    return pcl

@TimeMe()
@numba.njit(nogil=True, parallel=True, cache=True)
def rotate_pcl_cpu(pcl: np.ndarray, rot: float) -> None:
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape
    for i in numba.prange(h):
        for j in numba.prange(w):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_cpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)

if __name__ == '__main__':

    # TODO: add the Zed init and Mat creation
    pcl_gpu = pcl_mat.get_data(memory_type=sl.MEM.GPU)

    # Note that this will take "D2H" time
    # Ref: https://docs.cupy.dev/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.get
    pcl_cpu = pcl_gpu.get()

    # rotation will happen inplace since pcl is on Device memory
    rot_gpu_lat, _ = rotate_pcl_gpu(pcl_gpu, ROT)
    rot_cpu_lat, _ = rotate_pcl_cpu(pcl_cpu, ROT)

    # NOTE: Tests for correctness were performed to make sure the rotation is identical on GPU/CPU,
    # and that the rotation actually happened; inplace.

Benchmark

System info:

GPU warmup iterations:

Numbers:

As you can see, the gain is significant, and we can write all pcl processing code in the same manner, thus, never having to bring the pcl to the host memory. Additionally, we can run DL inference on the frames using TRT without having to pay for the D2H-H2D transfers.

Notes about the cupy choice:

Cupy has great interoperability with most common frameworks (notably numba.cuda, Pytorch, and TensorRT) https://docs.cupy.dev/en/stable/user_guide/interoperability.html.

The same results might be achieved using numba.cuda DeviceNDArrays, but we didn't test it.

adujardin commented 1 year ago

Impressive! We need to think about how we could go forward with this. Ideally, I think it would be best to have the cupy dependency optional to limit complexity for most users. I'll try to test this soon

Rad-hi commented 1 year ago

When it comes to dependency management, I do agree that cupy being an optional dependency is the way to go.

I also want to note that we've seen an increase of the maximum FPS. We went from around 12 FPS (grab/retrieve on CPU), to around 15 when we started keeping the data on the GPU (even with 2 Zed mini cameras working at the same time, on an Orin 32Gb). This led us to believe that your specs are measured using the C++ API !?

Testing session's findings

I spent some time testing it in depth, and I saw some confusing behavior that I would love your help about. So, as I mentioned, I had code to test the correctness of the retrieved data. The code compares the CPU frame/pcl with its GPU counterpart (I considered the CPU data as the ground truth) element wise, and gives me back the match percentage:

@numba.njit(nogil=True, cache=True)
def almost_eq(a: Union[int, float], b: Union[int, float], e: Union[int, float]) -> bool: 
    if not (math.isfinite(a) and math.isfinite(b)):
        # If either is nan, consider them equal
        return True
    return abs(a - b) <= e

@numba.njit(nogil=True, cache=True)
def count_almost_eq(mat1: np.ndarray, mat2: np.ndarray, eupsilon: float) -> None:
    '''
    This method is needed (instead of np.count_nonzero(mat1 == mat2)) becauce
    np.nan == np.nan >>> False. Thus we need to manually check the elements.
    '''
    sh1 = mat1.shape
    sh2 = mat2.shape
    assert sh1 == sh2
    assert mat1.size == mat2.size

    h, w, _ = sh1

    eq = mat1.size

    for j in range(h):
        for i in range(w):
            x_eq = almost_eq(mat1[j][i][X], mat2[j][i][X], eupsilon)
            y_eq = almost_eq(mat1[j][i][Y], mat2[j][i][Y], eupsilon)
            z_eq = almost_eq(mat1[j][i][Z], mat2[j][i][Z], eupsilon)
            eq -= 1 * (not (x_eq and y_eq and z_eq))
    return eq

# Retrieve the image/measure
gpu_pcl = pcl_mat.get_data(memory_type=sl.MEM.GPU)
gpu_frame = frame_mat.get_data(memory_type=sl.MEM.GPU)

# Bring the data to Host memory to visualize it and test its correctness
gpu_pcl_np = gpu_pcl.get()
gpu_frame_np = gpu_frame.get()

print(f'PCL -- {(count_almost_eq(gpu_pcl, gpu_pcl_np, 0.000001) / pcl_cpu.size) * 100.0:.2f} % almost identical')
print(f'RGB -- {(count_almost_eq(cpu_frame, gpu_frame_np, 1) / cpu_frame.size) * 100.0:.2f} % almost identical')

I tested with live streams from a ZED Mini, a ZED 2, and SVOs, both on my laptop, and an Nvidia AGX Orin Dev kit, with the latest SDK (4.0).

The confusing behavior is that in HD1080 and HD720 (in all FPS options), everything works perfectly (100% match on both PCL, and Frame), but when I choose either HD2K or VGA (in all FPS options), the PCL still matches 100%, but the Frame only matches around 74 ±3 %.

Additionally, when I visualize the Frames (after converting Cupy arr to Numpy arr), the GPU one looks kinda shifted (for HD2K and VGA only): CPU_frame_HD2K GPU_frame_HD2K

I tried:

But nothing seems to fix this,

Am I missing something about how CUDA works, or how the frame data is “organized” in memory ? Is this related to a striding mechanism in the frame representation in GPU (I am just guessing, not really sure what I am asking) ?

Thank you for indulging me.

Rad-hi commented 12 months ago

@adujardin I was wondering if you had time to test this out, or maybe if you can point me towards why this issue might be happening. In order to further test the data retrieval, I created the image from the RGBA data in the PCL using this function:

@numba.njit(cache=True, parallel=True)
def xyzrgba2rgba(src: np.ndarray, dst: np.ndarray) -> None:
    '''
    Fill dst matrix with RGBA data from the XYZRGBA pcl.
    '''
    RGBA_AXIS = 3
    BYTE_MASK = 0xFF
    R, G, B = (0, 1, 2)
    RED_MASK = np.uint32(BYTE_MASK << (R * 8))
    GREEN_MASK = np.uint32(BYTE_MASK << (G * 8))
    BLUE_MASK = np.uint32(BYTE_MASK << (B * 8))

    h, w, _ = dst.shape
    for i in numba.prange(w):
        for j in numba.prange(h):
            if not np.isfinite(src[j][i][RGBA_AXIS]):
                continue
            rgba_bin = np.asarray(src[j][i][RGBA_AXIS], np.float32).view(np.uint32)
            dst[j][i][R] = np.uint8((rgba_bin & RED_MASK) >> (R * 8))
            dst[j][i][G] = np.uint8((rgba_bin & GREEN_MASK) >> (G * 8))
            dst[j][i][B] = np.uint8((rgba_bin & BLUE_MASK) >> (B * 8))
            dst[j][i][RGBA_AXIS] = BYTE_MASK

def image_from_pcl_rgba(pcl):
    im = np.ones_like(pcl, dtype=np.uint8)
    xyzrgba2rgba(pcl, im)
    return im

And it gave me the next two images (CPU retrieval VS image from GPU PCL), which points me towards the fact that the data is in fact being read correctly.

CPU frame_screenshot_28 11 2023 GPU frame_screenshot_28 11 2023

Do you have any ideas for me to try ?

Neeklow commented 7 months ago

Hi! I'm using just now the python API for a project, and I realized that one of the main bottlenecks is, also for me, the double passage of textures from gpu to cpu. I'd be super interested in testing your PR but I'm rather confused on how to test the repo. Could I ask you some directions? Thanks a lot!

Rad-hi commented 7 months ago

Hey, @Neeklow! I would appreciate another tester. So to test it, you need to build the Python API from source, for that:

It's to note that I only tested this with the ZED 4.0 SDK and with Python 3.8.

Neeklow commented 7 months ago

Thanks a lot @Rad-hi ! I'm having troubles to build the plugin. I'm currently on a python 3.8 virtual environment (I'm assuming that shouldn't be a problem?). After I install all the requirements, I run python setup.py build and I first receive

ZED SDK Version: OK
compilation flags: 
include dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/include', ``'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/include']
library dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/lib', 'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/lib/x64']
libraries: ['sl_zed64']
Building module: ('pyzed.sl', ['pyzed/sl.pyx'])
Compiling pyzed/sl.pyx because it changed.
[1/1] Cythonizing pyzed/sl.pyx

Followed by a mega Cython error from (Cython.Compiler.Errors.CompileError: pyzed/sl.pyx ... the error is huge, so I'm attaching errors-build.txt printed to avoid cluttering here). The currently installed Cython is Cython 3.0.10 (not sure if this make any difference).

Any help is really appreciated!!

Rad-hi commented 7 months ago

Okay so after inspecting the build log, I think I can see the issue:

When running this line, from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString, python is not able to find sl_c.pxd:

from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString
^
------------------------------------------------------------
pyzed\sl.pyx:28:0: 'sl_c.pxd' not found

this is from the errors-build.txt ~ line 21

Two possibilities:

  1. you're not running the python3 setup.py build command under the src folder
  2. this might be a path issue of some sort, where python is not able to locate the sl_c.pxd file, in which case you could help Python by adding the path to the pyzed folder to the system path like:
from libcpp.map cimport map

# Add the next two lines to sl.pyx, ~ line 27
import sys
sys.path.append('/ABSOLUTE/PATH/TO/ZED/API/zed-python-api/src/pyzed')

# And modify the import: note the pyzed.sl_c
from pyzed.sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString

Make sure to update the path with the appropriate value (I see you're using windows, and I have no idea how it works)

Hope this helps.

Neeklow commented 6 months ago

@Rad-hi thansk for the tips. That defineteyl helped but it seems I'm having troubles with MSCV compiler:

error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2

Is where I get to (i've uploaded the whole build error.txt again just in case). I've been bashing my head around this for the past few days, and I'm a bit short of ideas. If you have any experience/i

dea where to look for solution I'd really appreciate (but I understand you're not on Win and I seem to understand this is a strictly windows Issue)! Thanks a lot!

Rad-hi commented 6 months ago

I see @Neeklow, you seem to be encountering the same issues I encountered when trying to build this for the Orin (an NVIDIA board I am working on).

The actual error you're encountering is (line ~303 in the build log):

pyzed/sl.cpp(146250): error C2039: 'gnss_ignore_threshold': is not a member of 'sl::PositionalTrackingFusionParameters'

And the quickest fix for me was to comment out the gnss_ignore_threshold related code. You can find the changes in this branch https://github.com/Rad-hi/zed-python-api/tree/changes-for-orin.

Beware of the changes to the sl.pyx in the branch:

# This is added because sl_c.pxd wasn't being found
import os
import sys
HOME = os.path.expanduser('~')
sys.path.append(os.path.join(HOME, 'Desktop/zed-python-api/src'))

Since I placed the API under my desktop at the time of testing this. Apply the changes that you applied in the past fix here instead of keeping it this way.

I hope this helps and the API builds.

andreacelani commented 2 months ago

Hi @Rad-hi , I'm also interested since I grab image, depth and pointcloud and the moving of data back and forth from GPU memory is a waste of time (the processing is entirely in GPU with YOLO and Pytorch). I followed your instruction:

the I run and the system asked to install cupy. Ok I forgot it. I installed thourgh pip but I had to downgrade cython to 0.29.37 from 3.0.11 (compilation error). Now cupy is ok and almost everything runs with sl.MEM.CPU

if I switch to sl.MEM.GPU I receive the following runtime error:

File "pyzed/sl.pyx", line 4556, in pyzed.sl.Mat.get_data
ValueError: Provided MEM type doesn't match Mat's memory_type.

I tried to change the default memory to MEM.GPU (looks like an enum error) but I got the same error.

System: Core i5 7600k Nvidia GTX1050

Cuda 12.2 Zed SDK 4.0 Ubuntu 22.04

Note: after the use of custom pyzed I got an error about POSITIONAL_TRACKING_MODE. The error on my python script said that there is no "mode" in positional tracking module. I commented it and go ahead.

Rad-hi commented 2 months ago

Hello @andreacelani, the feature is indeed very useful,

As for the error message you are getting, that's actually an error message I added:

image

Make sure you initialize the Mat with the MEM.GPU type.

andreacelani commented 2 months ago

yes, you are right. I forgot that Mat has a type initialization. Let me try. I'll keep you updated about performances.

andreacelani commented 2 months ago

wow, I got 2X speed up with GPU reading of SVO file! And even more with real Camera Stream! Thank you @Rad-hi !

The following takes around 32-33 msec with CPU mem and streaming from SVO file:

self.zed.retrieve_image(self.image, view=sl.VIEW.LEFT, type=sl.MEM.CPU)  # read image from left imager grabber
self.zed.retrieve_measure(self.depth, measure=sl.MEASURE.DEPTH, type=sl.MEM.CPU)  # read depth image from grabber
self.zed.retrieve_measure(self.point_cloud, measure=sl.MEASURE.XYZRGBA, type=sl.MEM.CPU)  # Get the point cloud
self.zed_image = self.image.get_data(memory_type=sl.MEM.CPU)
self.zed_depth = self.depth.get_data(memory_type=sl.MEM.CPU)
self.zed_pointcloud = self.point_cloud.get_data(memory_type=sl.MEM.CPU)

While this takes only 15-16 msec with GPU mem and streaming from SVO file:

self.zed.retrieve_image(self.image_gpu, view=sl.VIEW.LEFT, type=sl.MEM.GPU)  # read image from left imager grabber
self.zed.retrieve_measure(self.depth_gpu, measure=sl.MEASURE.DEPTH, type=sl.MEM.GPU)  # read depth image from grabber
self.zed.retrieve_measure(self.point_cloud_gpu, measure=sl.MEASURE.XYZRGBA, type=sl.MEM.GPU)  # Get the point cloud
self.zed_image = self.image_gpu.get_data(memory_type=sl.MEM.GPU)
self.zed_depth = self.depth_gpu.get_data(memory_type=sl.MEM.GPU)
self.zed_pointcloud = self.point_cloud_gpu.get_data(memory_type=sl.MEM.GPU)

With real camera stream I went from 15-17 msec to 5.5-8 msec with GPU MEM (the same pipeline of above, image, depth and pointcloud)

SVO and real camera settings is: 1280X720

SVO rate: 30fps real camera rate: 60fps with capping at 30fps ( I think the 60 fps setting helped to reach 5.5-8 msec)

Depth in both (SVO and real camera):

init_params.depth_mode = sl.DEPTH_MODE.ULTRA  # Use ULTRA depth mode
init_params.depth_minimum_distance = 0.3  # Set the minimum depth perception distance to 30cm
init_params.depth_maximum_distance = 10  # Set the maximum depth perception distance to 10m

After the readout of the image from Zed2i in GPU I only need to convert to torch with .from_dlpack (from "cupy" to torch)

self.zed_image_without_alpha = self.zed_image[:, :, :3]
torch_image = torch.from_dlpack(self.zed_image_without_alpha)

(I removed alpha channel because I fed it into YOLO input, the torch_image tensor is then rearranged to match the input need of YOLO)

same for depth: ZedDepthGPU = torch.from_dlpack(self.zed_depth)

and pointcloud: PCtensorGPU = torch.from_dlpack(self.zed_pointcloud)

the CPU load (4 core@4Ghz) went from 48% to 32% while the GPU load from 25% to 37%

The complete pipeline went from 100msec to 56/60 msec (tested with SVO file stream) (no cython compilation, pure python)

The only small drawback that I found it's related to POSITIONAL_TRACKING_MODE

ptParams.mode = 1                 # sl.POSITIONAL_TRACKING_MODE.GEN_2  
AttributeError: 'pyzed.sl.PositionalTrackingParameters' object has no attribute 'mode'

where ptParams: ptParams = sl.PositionalTrackingParameters() # set parameters for Positional Tracking

It starts to give me the Attribute error when I switched to the new pyzed. I suppose it use default right now (from API, the default is GEN1). I cannot understand how it is related to the custom pyzed since there is no change in code there.

For the sake of completeness I initialized Mat as follows:

self.image_gpu = sl.Mat(memory_type=sl.MEM.GPU)
self.depth_gpu = sl.Mat(memory_type=sl.MEM.GPU)
self.point_cloud_gpu = sl.Mat(memory_type=sl.MEM.GPU)
self.image = sl.Mat()
self.depth = sl.Mat()
self.point_cloud = sl.Mat()
Rad-hi commented 2 months ago

@andreacelani these are some great findings, thank you for the detailed measurements !

I have no idea about the positional tracking thing, though, I never worked with it. But I am impressed by the impact !

Can you please try grabbing a single frame at HD2K resolution and tell me if you get the same output as I got ? For this PR, I only had a couple of weeks available for me to get it working, and since I needed HD2K, I couldn't continue working on it when I encountered the issue described in the previous comments. I would greatly appreciate you test it for me !

andreacelani commented 2 months ago

@Rad-hi grabbing to MEM.CPU is always fine. Resolution VGA, HD720, HD1080 and HD2K (15 fps, capped at 10 fps, if these settings could affect something) are always fine. If I grab to MEM.GPU, resolution HD720, HD1080 are fine while VGA and HD2K gives lines. Seems the byte order is messed up.

VGA grabbed to GPU: imageGPUtest_VGA

HD720 grabbed to GPU: imageGPUtest_720P

HD1080 grabbed to GPU: imageGPUtest_1080P

HD2K grabbed to GPU: imageGPUtest_HD2K

the same lines, less evident, with depth (HD2K grabbed to GPU): depthGPUtest_HD2K

attach also a test with grabbing to CPU.MEM @ HD2K for reference: imageCPUtest_HD2K

If I look at resolution spec here: https://www.stereolabs.com/docs/api/python/classpyzed_1_1sl_1_1RESOLUTION.html image

HD2K 2208 X 1242 and VGA (actually is not a real VGA but 672 X 376) are both "strange" resolutions that could mess up byte order if standard function are used. If you look at picture you can see the plant pot on the left that is splitted over several columns. Try to look at your code in this sense. I'll do the same.

All the acquisitions has been done with live camera, ZED 2i. Zed SDK: 4.0.8 CUDA: 12.1.105

Rad-hi commented 2 months ago

@andreacelani I appreciate your efforts, thank you ! As for your findings, they do seems to align with what my tests (back when I worked on this) yielded. On the grabbing side, all I am doing is wrapping the GPU data pointer into a cupy array, with the same shape as the one you would supply to the CPU grab.

In this reply, I compared the frame grabbed on CPU, with the frame extracted from the PCL grabbed on GPU in XYZ_RGBA mode. And I got a good-ish frame in HD2K. So, coupled with you validating my observations that the frames are correct but shifted somehow, I am starting to think that the issue is in the D2H transfer and visualization, instead of the grabbing.

I will try to test these hypotheses in the upcoming days, and hopefully we can close this chapter.

andreacelani commented 2 months ago

@Rad-hi I made the D2H by using image_cpu = torch.from_dlpack(image_gpu).detach().cpu().numpy() I cannot understand how this can affect byte order. Maybe one test cloud be the following: In gpu try to crop and pad the VGA image from 672x376 to 640x480, send to cpu and see what happens. But making some basic math I found the following: In VGA image I got 7 repetitions. This means that instead of reading an image of 672 pixels wide the image should be 672/7 = 96 pixels wider. 96 pixels wider means a total of 768 pixels that sounds “more common” than 672. If I do the same with HD2K 2208 pixels wide image I got 23 repetitions —> 2208/23 = 96 pixels. Same extra pixels of VGA. Pretty strange. The image is 2304 pixels wide that is not common but it’s exactly 3 times 768. From these calculation I suppose images in VGA and HD2K are larger (wide and height) in gpu than in cpu. In cpu images are cropped to ex: 672x376. In GPU the cupy pointer point to the same memory but the readout try to squeeze a larger image in smaller format. Make sense to you?

Rad-hi commented 2 months ago

EDITED (added the HD2K test as well, and better code)

@andreacelani you are spot on !

The moment I read 96 pixels shifts, it clicked ! Because if you look at the light bulb in the images you provided in your tests, it's “spread” across multiple columns.

Here's a small test I did:

import cv2
import numpy as np
import os

IMG_VGA = 'test_vga.jpg'
IMG_HD2K = 'test_hd2k.jpg'

TRUE_VGA_WIDTH = 672 + 96
TRUE_HD2K_WIDTH = 2208 + 96

def get_corrected_img(faulty: np.ndarray, true_width: int) -> np.ndarray:
    h, w_old, c = faulty.shape

    canvas = np.zeros((h, true_width, c), dtype=faulty.dtype)

    for i in range(0, canvas.size, c):
        if i == h * w_old:
            break

        y_new = i // true_width
        x_new = i % true_width
        y_old = i // w_old
        x_old = i % w_old

        canvas[y_new][x_new] = faulty[y_old][x_old]

    return canvas

def correct_and_show_res(img_name: str, vga: bool = False) -> None:
    img_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), img_name)
    faulty = cv2.imread(img_path)
    corrected = get_corrected_img(faulty, TRUE_VGA_WIDTH if vga else TRUE_HD2K_WIDTH)
    cv2.imshow(f"FAULTY-{'VGA' if vga else 'HD2K'}", faulty)
    cv2.imshow(f"CORRECTED-{'VGA' if vga else 'HD2K'}", corrected)

correct_and_show_res(IMG_VGA, True)
correct_and_show_res(IMG_HD2K, False)

cv2.waitKey(0)

Where my test image is the image you supplied for the grab on GPU at the “VGA” resolution:

test_vga

And, here's the output of the above script:

fixed_shift

Which concludes that the issue is indeed in the shaping of the data. Nice catch, I don't think I would've ever noticed such a thing !

Do you want to apply the changes to the code (happy to assist) ?

Also, I tested on the HD2K image, and the 96 pixels shift is again the thing (tested on the image you uploaded in your test of HD2K):

CORRECTED_HD2K

NOTE: I don't currently have access to any camera, but I expect to have it back in next week. I will update the code and test on resizing from the GPU side once I have access.

Another edit:

After looking at the code,

elif memory_type.value == MEM.GPU.value and not deep_copy:
    src_p = self.get_pointer(memory_type=memory_type)
    mem = cp.cuda.UnownedMemory(src_p, size, self)
    memptr = cp.cuda.MemoryPointer(mem, offset=0)
    arr = cp.ndarray(shape, dtype=dtype, memptr=memptr)

And as you can see, I am using shape and size which I am calculating using the supplied functions (link to referenced code):

shape = (self.mat.getHeight(), self.mat.getWidth(), self.mat.getChannels())
...
size = self.mat.getHeight()*self.mat.getWidth()*self.mat.getChannels()

Where getHeight(), getWidth() are exposed from the C++ SDK:

cdef cppclass Mat 'sl::Mat':
    ...
    size_t getWidth() const
    size_t getHeight() const

And the SDK is not open source (as far as I can tell), which puts us in a situation where if we want to make it work, we'd need to hardcode the shape when the grabbing is in GPU (or maybe we open an issue ?)

Rad-hi commented 2 months ago

@adujardin I have found the issue. The trick is in the striding. I am closing this PR, since it has lived for so long, and has fallen behind the master branch.

I'll test thoroughly my current solution, and will open a better quality PR.

@andreacelani, thank you for your input. It was invaluable in figuring this out!

andreacelani commented 2 months ago

@Rad-hi thank you, but it's nothing compared with the cupy integration. Excellent feature (or excellent patch for a long waited Zed feature). I purchased an RTX4060 and I was busy testing it out. I will try to hardcode the image size as you suggest and report here after. Yes, the C++ sdk is locked so we cannot do anything more than rise an issue. If nobody raised one I guess very few people use gpu readout with 2K resolution (but there could be someone use 640). I'm looking forward to see what you call better quality PR.