Feature/get data gpu - Githubissues

Rad-hi commented 10 months ago

Keep the `retrieved` data on GPU

Description

Through the utilization of cupy.ndarrays (extra dependency), we were able to keep the retrieved data on GPU and provide a view of the memory on the Python side.

This PR introduces a slight modification to the get_data() function in a way that extends the API, but doesn't break any of its current functionalities.

Notes

In an effort to stay true to the original API, data on the GPU could be either deeply copied, or viewed by the Python API consumer.

A list of references is kept as a comment above the added code for potential enhancements, and/or better understanding of the feature.

Tests

This was tested with the ZED SDK 4.0, on both SVOs (recorded in HD2K-15), and on a live stream from a ZED Mini, with a custom wrapper of the Python API, on a Ubuntu 20.04.6 LTS system, with an Nvidia GTX 1650 4Gb graphics card.

Additionally, we are attempting its test on an Nvidia AGX Orin 32Gb developer kit, though we're having some Cython compilation errors that are prohibiting us from proceeding with the tests (would love to discuss, but not our topic).

Disclaimers

This is not thoroughly tested, and this PR is intended to spark a discussion around how this feature might be developed by the community, rather than being a final product (though we hope it's ready as is).

adujardin commented 10 months ago

This is interesting, thanks for your effort! Do you have a sample code where you use the GPU buffer, maybe with Pytorch/TensorRT? Did you notice a significant performance improvement?

Rad-hi commented 10 months ago

Thanks! Currently, I only used it with a custom TF that applies to the whole pointcloud (rotation along the X axis) using numba.cuda accelerated Python code:

import math
import time
import numpy as np
import cupy as cp
from numba import cuda
from numba.cuda.cudadrv.devicearray import DeviceNDArray
import numba
import torch

from typing import Union, Tuple, Callable, TypeVar
from typing_extensions import ParamSpec

T = TypeVar('T')
P = ParamSpec('P')
RotateCallable = Callable[[float, float, float, float], Tuple[float, float]]

Y = 1
Z = 2
ROT = 69.0

class TimeMe:
    '''
    Timing decorator that takes into account whether the timing would be done in GPU or CPU
    @depends: torch
    Ref: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964/2
    '''
    def __init__(self, gpu: bool = False) -> None:
        self._gpu: bool = gpu
        if gpu:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)

    def __call__(self, func: Callable[P, T]) -> Callable[P, T]:
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> Tuple[float, T]:
            all_ts: float = 0.0
            ts: float = 0.0

            if self._gpu:
                self._start.record()
            else:
                ts = time.time()

            result = func(*args, **kwargs)

            if self._gpu:
                self._end.record()
                torch.cuda.synchronize()

            all_ts = self._start.elapsed_time(self._end) if self._gpu else (time.time() - ts) * 1000.0

            return all_ts, result

        return wrapper

def rotate_yz_sin_cos(y: float, z: float, sin_: float, cos_: float) -> Tuple[float, float]:
    if math.isfinite(y) and math.isfinite(z):
        yy = y * cos_ - z * sin_
        zz = y * sin_ + z * cos_
        return yy, zz
    return math.nan, math.nan

# JIT compile for both CPU and GPU
rotate_yz_sin_cos_cpu: RotateCallable = numba.njit(nogil=True, cache=True)(rotate_yz_sin_cos)
rotate_yz_sin_cos_gpu: RotateCallable = cuda.jit(device=True, cache=True)(rotate_yz_sin_cos)

@cuda.jit
def rotate_pcl_kernel(pcl: DeviceNDArray, rot: float) -> DeviceNDArray:
    '''
    GPU kernel to rotate pointclouds with an angle rot
    Ref: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb
    '''
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape

    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y

    for i in range(startY, h, gridY):
        for j in range(startX, w, gridX):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_gpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)

@TimeMe(gpu=True)
def rotate_pcl_gpu(pcl: Union[cp.ndarray, np.ndarray],
                   rot: float,
                   h2d: bool = False) -> Union[cp.ndarray, np.ndarray]:
    blockdim = (16, 16)
    griddim = (16, 16)

    pcl = cuda.to_device(pcl) if h2d else pcl
    rotate_pcl_kernel[griddim, blockdim](pcl, rot)
    pcl = pcl.copy_to_host() if h2d else pcl

    # The rotation doesn't happen inplace only when H2D/D2H
    # transfers were requested; pcl originally on host memory 
    return pcl

@TimeMe()
@numba.njit(nogil=True, parallel=True, cache=True)
def rotate_pcl_cpu(pcl: np.ndarray, rot: float) -> None:
    sin_ = math.sin(rot)
    cos_ = math.cos(rot)
    h, w, _ = pcl.shape
    for i in numba.prange(h):
        for j in numba.prange(w):
            pcl[i][j][Y], pcl[i][j][Z] = rotate_yz_sin_cos_cpu(
                pcl[i][j][Y], pcl[i][j][Z], sin_, cos_)

if __name__ == '__main__':

    # TODO: add the Zed init and Mat creation
    pcl_gpu = pcl_mat.get_data(memory_type=sl.MEM.GPU)

    # Note that this will take "D2H" time
    # Ref: https://docs.cupy.dev/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.get
    pcl_cpu = pcl_gpu.get()

    # rotation will happen inplace since pcl is on Device memory
    rot_gpu_lat, _ = rotate_pcl_gpu(pcl_gpu, ROT)
    rot_cpu_lat, _ = rotate_pcl_cpu(pcl_cpu, ROT)

    # NOTE: Tests for correctness were performed to make sure the rotation is identical on GPU/CPU,
    # and that the rotation actually happened; inplace.

Benchmark

System info:

GPU: NVIDIA GeForce GTX 1650, 4096 MB
CPU: 11th Gen Intel® Core™ i7-11370H @ 3.30GHz × 8

GPU warmup iterations:

SVO had 198 frames, and we used a queue of length 100 for profiling

Numbers:

[ROT_GPU] Mean: 1114.543 us, Std: 103.850 us, Max: 2000.896 us, Min: 1052.640 us, N Samples: 100.
[ROT_CPU] Mean: 5875.752 us, Std: 2744.054 us, Max: 16989.231 us, Min: 4649.878 us, N Samples: 100.
[D2H] Mean: 31647.370 us, Std: 1649.754 us, Max: 42148.352 us, Min: 29398.680 us, N Samples: 100.

As you can see, the gain is significant, and we can write all pcl processing code in the same manner, thus, never having to bring the pcl to the host memory. Additionally, we can run DL inference on the frames using TRT without having to pay for the D2H-H2D transfers.

Notes about the cupy choice:

Cupy has great interoperability with most common frameworks (notably numba.cuda, Pytorch, and TensorRT) https://docs.cupy.dev/en/stable/user_guide/interoperability.html.

The same results might be achieved using numba.cuda DeviceNDArrays, but we didn't test it.

adujardin commented 10 months ago

Impressive! We need to think about how we could go forward with this. Ideally, I think it would be best to have the cupy dependency optional to limit complexity for most users. I'll try to test this soon

Rad-hi commented 10 months ago

When it comes to dependency management, I do agree that cupy being an optional dependency is the way to go.

I also want to note that we've seen an increase of the maximum FPS. We went from around 12 FPS (grab/retrieve on CPU), to around 15 when we started keeping the data on the GPU (even with 2 Zed mini cameras working at the same time, on an Orin 32Gb). This led us to believe that your specs are measured using the C++ API !?

Testing session's findings

I spent some time testing it in depth, and I saw some confusing behavior that I would love your help about. So, as I mentioned, I had code to test the correctness of the retrieved data. The code compares the CPU frame/pcl with its GPU counterpart (I considered the CPU data as the ground truth) element wise, and gives me back the match percentage:

@numba.njit(nogil=True, cache=True)
def almost_eq(a: Union[int, float], b: Union[int, float], e: Union[int, float]) -> bool: 
    if not (math.isfinite(a) and math.isfinite(b)):
        # If either is nan, consider them equal
        return True
    return abs(a - b) <= e

@numba.njit(nogil=True, cache=True)
def count_almost_eq(mat1: np.ndarray, mat2: np.ndarray, eupsilon: float) -> None:
    '''
    This method is needed (instead of np.count_nonzero(mat1 == mat2)) becauce
    np.nan == np.nan >>> False. Thus we need to manually check the elements.
    '''
    sh1 = mat1.shape
    sh2 = mat2.shape
    assert sh1 == sh2
    assert mat1.size == mat2.size

    h, w, _ = sh1

    eq = mat1.size

    for j in range(h):
        for i in range(w):
            x_eq = almost_eq(mat1[j][i][X], mat2[j][i][X], eupsilon)
            y_eq = almost_eq(mat1[j][i][Y], mat2[j][i][Y], eupsilon)
            z_eq = almost_eq(mat1[j][i][Z], mat2[j][i][Z], eupsilon)
            eq -= 1 * (not (x_eq and y_eq and z_eq))
    return eq

# Retrieve the image/measure
gpu_pcl = pcl_mat.get_data(memory_type=sl.MEM.GPU)
gpu_frame = frame_mat.get_data(memory_type=sl.MEM.GPU)

# Bring the data to Host memory to visualize it and test its correctness
gpu_pcl_np = gpu_pcl.get()
gpu_frame_np = gpu_frame.get()

print(f'PCL -- {(count_almost_eq(gpu_pcl, gpu_pcl_np, 0.000001) / pcl_cpu.size) * 100.0:.2f} % almost identical')
print(f'RGB -- {(count_almost_eq(cpu_frame, gpu_frame_np, 1) / cpu_frame.size) * 100.0:.2f} % almost identical')

I tested with live streams from a ZED Mini, a ZED 2, and SVOs, both on my laptop, and an Nvidia AGX Orin Dev kit, with the latest SDK (4.0).

The confusing behavior is that in HD1080 and HD720 (in all FPS options), everything works perfectly (100% match on both PCL, and Frame), but when I choose either HD2K or VGA (in all FPS options), the PCL still matches 100%, but the Frame only matches around 74 ±3 %.

Additionally, when I visualize the Frames (after converting Cupy arr to Numpy arr), the GPU one looks kinda shifted (for HD2K and VGA only): CPU_frame_HD2K GPU_frame_HD2K

I tried:

Adjusting the polling rate (how frequently I called grab/retrieve) by adding and removing some delays here and there,
Only retrieving the frame,

But nothing seems to fix this,

Am I missing something about how CUDA works, or how the frame data is “organized” in memory ? Is this related to a striding mechanism in the frame representation in GPU (I am just guessing, not really sure what I am asking) ?

Thank you for indulging me.

Rad-hi commented 9 months ago

@adujardin I was wondering if you had time to test this out, or maybe if you can point me towards why this issue might be happening. In order to further test the data retrieval, I created the image from the RGBA data in the PCL using this function:

@numba.njit(cache=True, parallel=True)
def xyzrgba2rgba(src: np.ndarray, dst: np.ndarray) -> None:
    '''
    Fill dst matrix with RGBA data from the XYZRGBA pcl.
    '''
    RGBA_AXIS = 3
    BYTE_MASK = 0xFF
    R, G, B = (0, 1, 2)
    RED_MASK = np.uint32(BYTE_MASK << (R * 8))
    GREEN_MASK = np.uint32(BYTE_MASK << (G * 8))
    BLUE_MASK = np.uint32(BYTE_MASK << (B * 8))

    h, w, _ = dst.shape
    for i in numba.prange(w):
        for j in numba.prange(h):
            if not np.isfinite(src[j][i][RGBA_AXIS]):
                continue
            rgba_bin = np.asarray(src[j][i][RGBA_AXIS], np.float32).view(np.uint32)
            dst[j][i][R] = np.uint8((rgba_bin & RED_MASK) >> (R * 8))
            dst[j][i][G] = np.uint8((rgba_bin & GREEN_MASK) >> (G * 8))
            dst[j][i][B] = np.uint8((rgba_bin & BLUE_MASK) >> (B * 8))
            dst[j][i][RGBA_AXIS] = BYTE_MASK

def image_from_pcl_rgba(pcl):
    im = np.ones_like(pcl, dtype=np.uint8)
    xyzrgba2rgba(pcl, im)
    return im

And it gave me the next two images (CPU retrieval VS image from GPU PCL), which points me towards the fact that the data is in fact being read correctly.

CPU frame_screenshot_28 11 2023 GPU frame_screenshot_28 11 2023

Do you have any ideas for me to try ?

Neeklow commented 4 months ago

Hi! I'm using just now the python API for a project, and I realized that one of the main bottlenecks is, also for me, the double passage of textures from gpu to cpu. I'd be super interested in testing your PR but I'm rather confused on how to test the repo. Could I ask you some directions? Thanks a lot!

Rad-hi commented 4 months ago

Hey, @Neeklow! I would appreciate another tester. So to test it, you need to build the Python API from source, for that:

clone this repo
navigate to src folder and follow the README for instructions for building the pluglin
Use the API as you are used to, but for the memory_type argument, supply sl.MEM.GPU instead of sl.MEM.CPU

It's to note that I only tested this with the ZED 4.0 SDK and with Python 3.8.

Neeklow commented 4 months ago

Thanks a lot @Rad-hi ! I'm having troubles to build the plugin. I'm currently on a python 3.8 virtual environment (I'm assuming that shouldn't be a problem?). After I install all the requirements, I run python setup.py build and I first receive

ZED SDK Version: OK
compilation flags: 
include dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/include', ``'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/include']
library dirs: ['C:\\Users\\neeklo\\Documents\\Repos\\ng-stories-people-room-mono\\workflow\\camera-feed-system\\.venv\\lib\\site-packages\\numpy\\core\\include', 'C:\\Program Files (x86)\\ZED SDK/lib', 'C:\\Users\\neeklo\\scoop\\apps\\cuda\\current/lib/x64']
libraries: ['sl_zed64']
Building module: ('pyzed.sl', ['pyzed/sl.pyx'])
Compiling pyzed/sl.pyx because it changed.
[1/1] Cythonizing pyzed/sl.pyx

Followed by a mega Cython error from (Cython.Compiler.Errors.CompileError: pyzed/sl.pyx ... the error is huge, so I'm attaching errors-build.txt printed to avoid cluttering here). The currently installed Cython is Cython 3.0.10 (not sure if this make any difference).

Any help is really appreciated!!

Rad-hi commented 4 months ago

Okay so after inspecting the build log, I think I can see the issue:

When running this line, from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString, python is not able to find sl_c.pxd:

from sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString
^
------------------------------------------------------------
pyzed\sl.pyx:28:0: 'sl_c.pxd' not found

this is from the errors-build.txt ~ line 21

Two possibilities:

you're not running the python3 setup.py build command under the src folder
this might be a path issue of some sort, where python is not able to locate the sl_c.pxd file, in which case you could help Python by adding the path to the pyzed folder to the system path like:

from libcpp.map cimport map

# Add the next two lines to sl.pyx, ~ line 27
import sys
sys.path.append('/ABSOLUTE/PATH/TO/ZED/API/zed-python-api/src/pyzed')

# And modify the import: note the pyzed.sl_c
from pyzed.sl_c cimport ( String, to_str, Camera as c_Camera, ERROR_CODE as c_ERROR_CODE, toString

Make sure to update the path with the appropriate value (I see you're using windows, and I have no idea how it works)

Hope this helps.

Neeklow commented 4 months ago

@Rad-hi thansk for the tips. That defineteyl helped but it seems I'm having troubles with MSCV compiler:

error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2

Is where I get to (i've uploaded the whole build error.txt again just in case). I've been bashing my head around this for the past few days, and I'm a bit short of ideas. If you have any experience/i

dea where to look for solution I'd really appreciate (but I understand you're not on Win and I seem to understand this is a strictly windows Issue)! Thanks a lot!

Rad-hi commented 4 months ago

I see @Neeklow, you seem to be encountering the same issues I encountered when trying to build this for the Orin (an NVIDIA board I am working on).

The actual error you're encountering is (line ~303 in the build log):

pyzed/sl.cpp(146250): error C2039: 'gnss_ignore_threshold': is not a member of 'sl::PositionalTrackingFusionParameters'

And the quickest fix for me was to comment out the gnss_ignore_threshold related code. You can find the changes in this branch https://github.com/Rad-hi/zed-python-api/tree/changes-for-orin.

Beware of the changes to the sl.pyx in the branch:

# This is added because sl_c.pxd wasn't being found
import os
import sys
HOME = os.path.expanduser('~')
sys.path.append(os.path.join(HOME, 'Desktop/zed-python-api/src'))

Since I placed the API under my desktop at the time of testing this. Apply the changes that you applied in the past fix here instead of keeping it this way.

I hope this helps and the API builds.

stereolabs / zed-python-api

Feature/get data gpu #230

Keep the `retrieved` data on GPU

Description

Notes

Tests

Disclaimers

Benchmark

System info:

GPU warmup iterations:

Numbers:

Notes about the cupy choice:

Testing session's findings

stereolabs / zed-python-api

Feature/get data gpu #230

Keep the retrieved data on GPU

Description

Notes

Tests

Disclaimers

Benchmark

System info:

GPU warmup iterations:

Numbers:

Notes about the cupy choice:

Testing session's findings

Keep the `retrieved` data on GPU