get_data on GPU using cupy

Previously, I tried to extend the Python API with the ability to keep the data on the GPU (https://github.com/stereolabs/zed-python-api/pull/230) and ran into some weird behavior. It seemed weird back then, but it is now obvious that it came from a lack of understanding of how the data is laid out in memory.

This PR, however, provides a fully functional extension.

NOTE: this change adds an extra dependency: cupy.

The targeted function is get_data(), and both modes of providing data (memory view / deep copy) were implemented for GPU as well.
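
To illustrate the difference between the two modes, here is a minimal sketch. It does not use the ZED API: `FakeMat` and its `get_data(deep_copy=...)` signature are hypothetical stand-ins, and NumPy is used instead of cupy so the snippet runs without a GPU. cupy arrays follow the same view/copy semantics.

```python
import numpy as np  # stand-in for cupy so the sketch runs without a GPU


class FakeMat:
    """Hypothetical stand-in for a Mat that owns an image buffer."""

    def __init__(self, height, width, channels=4):
        self._buf = np.zeros((height, width, channels), dtype=np.uint8)

    def get_data(self, deep_copy=False):
        # deep_copy=False: return a view that shares memory with the Mat.
        # deep_copy=True: return an independent array.
        return self._buf.copy() if deep_copy else self._buf.view()

    def fill(self, value):
        # Simulates the next grab overwriting the Mat's buffer in place.
        self._buf[...] = value


mat = FakeMat(2, 3)
view = mat.get_data(deep_copy=False)
copy = mat.get_data(deep_copy=True)
mat.fill(255)  # the "next frame" arrives

# The view follows the buffer, the deep copy keeps the old frame.
print(int(view[0, 0, 0]), int(copy[0, 0, 0]))
print(np.shares_memory(view, mat._buf), np.shares_memory(copy, mat._buf))
```

The practical consequence is that a memory view avoids a copy per frame, but its contents change when the underlying buffer is overwritten, so anything that must outlive the next grab needs a deep copy.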

This was tested on an NVIDIA AGX Orin 32GB with JetPack 5.1.2 and ZED SDK 4.1.4.

Shoutout to @andreacelani for the discussion that led to figuring out how to implement this correctly (see the closed PR #230 for details).

Benchmarking with an ML pipeline:

@andreacelani did some benchmarking with impressive results: https://github.com/stereolabs/zed-python-api/pull/230#issuecomment-2347310516

Additionally, I tested it myself using a real feed from a ZED Mini with a simple pipeline (see picture), and here are my findings:

TL;DR:

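grabbing takes about 60% less time (mean grab time drops from 21.7 ms to 8.5 ms at HD2K, and from 18.5 ms to 6.1 ms at HD1080)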
preprocessing on the GPU would be faster (when implemented correctly)
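
As a sanity check on these numbers, the sketch below recomputes the grab-time reduction from the mean timings listed in the details of this description. It also derives a rough upper bound on throughput, under the assumption (mine, not stated in the logs) that each iteration costs the mean STEP time plus the mean dummy sleep and nothing else.

```python
# Mean timings in ms, copied from the benchmark details in this description.
results = {
    "HD2K": {
        "gpu": {"grab": 8.531, "step": 39.537, "sleep": 30.065},
        "cpu": {"grab": 21.728, "step": 61.058, "sleep": 30.064},
    },
    "HD1080": {
        "gpu": {"grab": 6.146, "step": 37.785, "sleep": 0.005},
        "cpu": {"grab": 18.501, "step": 55.640, "sleep": 0.009},
    },
}


def grab_reduction(res):
    """Fraction of mean grab time saved by the GPU path relative to the CPU path."""
    return 1.0 - res["gpu"]["grab"] / res["cpu"]["grab"]


def max_rate(timing):
    """Upper bound on iterations per second, assuming only step + sleep cost time."""
    return 1000.0 / (timing["step"] + timing["sleep"])


for name, res in results.items():
    print(
        f"{name}: grab time reduced by {grab_reduction(res):.1%}, "
        f"rate bound GPU {max_rate(res['gpu']):.1f} iter/s, "
        f"CPU {max_rate(res['cpu']):.1f} iter/s"
    )
```

The reduction comes out at roughly 61% for HD2K and 67% for HD1080, and the reported throughputs all sit just below the computed bounds, which is what one would expect if a little per-iteration overhead is not captured by the STEP timer.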

Details:

"""
HD2K @15FPS:

GPU:
[GPU_GRAB]             Mean: 8.531 ms, Std: 2.563 ms, Max: 15.460 ms, Min: 5.205 ms, N Samples: 200.
[GPU_PREP_RESIZE]      Mean: 5.205 ms, Std: 1.553 ms, Max: 7.745 ms, Min: 2.473 ms, N Samples: 200.
[GPU_PREP]             Mean: 6.004 ms, Std: 1.554 ms, Max: 8.721 ms, Min: 3.259 ms, N Samples: 200.
[GPU_ROT]              Mean: 0.916 ms, Std: 0.061 ms, Max: 1.162 ms, Min: 0.827 ms, N Samples: 200.
[GPU_INF]              Mean: 24.066 ms, Std: 0.701 ms, Max: 28.860 ms, Min: 23.353 ms, N Samples: 200.
[GPU_STEP]             Mean: 39.537 ms, Std: 1.452 ms, Max: 44.720 ms, Min: 38.024 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 30.065 ms, Std: 0.003 ms, Max: 30.084 ms, Min: 30.046 ms, N Samples: 200.

Throughput: ~13 iter/s

CPU:
[CPU_GRAB]             Mean: 21.728 ms, Std: 1.193 ms, Max: 25.891 ms, Min: 20.530 ms, N Samples: 200.
[CPU_PREP_RESIZE]      Mean: 5.252 ms, Std: 0.167 ms, Max: 6.051 ms, Min: 5.183 ms, N Samples: 200.
[CPU_PREP_D2H]         Mean: 1.123 ms, Std: 0.066 ms, Max: 1.445 ms, Min: 0.772 ms, N Samples: 200.
[CPU_PREP]             Mean: 13.468 ms, Std: 0.468 ms, Max: 15.780 ms, Min: 13.130 ms, N Samples: 200.
[CPU_ROT]              Mean: 1.767 ms, Std: 0.475 ms, Max: 3.314 ms, Min: 1.053 ms, N Samples: 200.
[CPU_INF]              Mean: 24.054 ms, Std: 1.301 ms, Max: 31.345 ms, Min: 23.337 ms, N Samples: 200.
[CPU_STEP]             Mean: 61.058 ms, Std: 2.245 ms, Max: 70.546 ms, Min: 58.555 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 30.064 ms, Std: 0.012 ms, Max: 30.084 ms, Min: 30.016 ms, N Samples: 200.

Throughput: ~10 iter/s

HD1080 @30FPS:

GPU:
[GPU_GRAB]             Mean: 6.146 ms, Std: 1.574 ms, Max: 11.672 ms, Min: 4.429 ms, N Samples: 200.
[GPU_PREP_RESIZE]      Mean: 6.188 ms, Std: 1.396 ms, Max: 7.494 ms, Min: 1.917 ms, N Samples: 200.
[GPU_PREP]             Mean: 6.907 ms, Std: 1.404 ms, Max: 8.313 ms, Min: 2.610 ms, N Samples: 200.
[GPU_ROT]              Mean: 0.851 ms, Std: 0.051 ms, Max: 1.244 ms, Min: 0.795 ms, N Samples: 200.
[GPU_INF]              Mean: 23.864 ms, Std: 0.697 ms, Max: 30.536 ms, Min: 22.047 ms, N Samples: 200.
[GPU_STEP]             Mean: 37.785 ms, Std: 0.774 ms, Max: 44.811 ms, Min: 35.756 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 0.005 ms, Std: 0.003 ms, Max: 0.038 ms, Min: 0.003 ms, N Samples: 200.

Throughput: ~26 iter/s

CPU:
[CPU_GRAB]             Mean: 18.501 ms, Std: 1.092 ms, Max: 22.510 ms, Min: 17.040 ms, N Samples: 200.
[CPU_PREP_RESIZE]      Mean: 4.796 ms, Std: 0.139 ms, Max: 5.671 ms, Min: 4.714 ms, N Samples: 200.
[CPU_PREP_D2H]         Mean: 1.107 ms, Std: 0.062 ms, Max: 1.447 ms, Min: 0.901 ms, N Samples: 200.
[CPU_PREP]             Mean: 11.538 ms, Std: 0.361 ms, Max: 13.599 ms, Min: 11.297 ms, N Samples: 200.
[CPU_ROT]              Mean: 1.319 ms, Std: 0.350 ms, Max: 1.848 ms, Min: 0.862 ms, N Samples: 200.
[CPU_INF]              Mean: 24.247 ms, Std: 1.295 ms, Max: 31.933 ms, Min: 22.330 ms, N Samples: 200.
[CPU_STEP]             Mean: 55.640 ms, Std: 2.117 ms, Max: 69.769 ms, Min: 52.252 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 0.009 ms, Std: 0.011 ms, Max: 0.163 ms, Min: 0.003 ms, N Samples: 200.

Throughput: ~17 iter/s
"""

Notes:

I used the generic YOLO class (from ultralytics import YOLO) and a custom-trained PyTorch YOLOv8 model.
I added the sleep because, in the case of HD2K grabbing, my pipeline wasn't saturating the 15 FPS rate, so grabbing seemed slower on the GPU (a misleading reading, most likely because the grab call was waiting for the next frame).
The preprocessing includes 4-channel to 3-channel reduction, resizing (to meet the expected 640x640 input), and normalization.
There's a step that I didn't put in the pipeline, which is a rotation about the X axis of the point cloud (PCL), just to simulate real work (code details are here: https://github.com/stereolabs/zed-python-api/pull/230#issuecomment-1787067065).
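
For readers who want a concrete picture of the preprocessing and rotation steps described in these notes, here is a minimal sketch. It is not the code used for the benchmark. Several things are assumptions: the input is taken to be a BGRA uint8 image whose fourth channel is simply dropped, the resize is a plain nearest-neighbour squash without letterboxing, normalization means scaling to [0, 1], and the point cloud is an (N, 3) float array. NumPy is used so the snippet runs without a GPU; the array calls used here also exist in cupy.

```python
import numpy as np  # stand-in for cupy so the sketch runs without a GPU


def preprocess(bgra, size=640):
    """4-channel uint8 image (H, W, 4) -> float32 array (3, size, size) in [0, 1]."""
    h, w = bgra.shape[:2]
    bgr = bgra[..., :3]                     # assumption: drop the 4th channel
    rows = np.arange(size) * h // size      # nearest-neighbour source rows
    cols = np.arange(size) * w // size      # nearest-neighbour source columns
    resized = bgr[rows[:, None], cols[None, :]]   # shape (size, size, 3)
    scaled = resized.astype(np.float32) / 255.0   # normalize to [0, 1]
    return np.ascontiguousarray(scaled.transpose(2, 0, 1))  # channels first


def rotate_x(points, angle_rad):
    """Rotate an (N, 3) float point cloud about the X axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[1, 0, 0], [0, c, -s], [0, s, c]], dtype=points.dtype)
    return points @ rot.T


# Synthetic HD1080-sized frame and a one-point cloud, for illustration only.
frame = np.random.randint(0, 256, size=(1080, 1920, 4), dtype=np.uint8)
tensor = preprocess(frame)
cloud = np.array([[0.0, 1.0, 0.0]], dtype=np.float32)
rotated = rotate_x(cloud, np.pi / 2)  # +Y maps to +Z for a 90 degree turn
print(tensor.shape, tensor.dtype, rotated.round(3))
```

Keeping every step as an array operation is what allows the same code path to stay on the device when the input comes from the GPU, which avoids a device-to-host copy before inference.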

get_data on GPU using cupy #241