Closed pmj110119 closed 11 months ago
@agirault @grlee77 @jjomier Sorry to ping you directly — thanks so much if anyone can help!!
Hi @pmj110119, this is unfortunately not easy to do from the C++ API at the moment, but it is definitely possible. I can provide some example guidance later today.
If you need to get an existing image from the host to the device, you can use standard CUDA runtime APIs like `cudaMalloc` to allocate device memory and then `cudaMemcpy` to transfer data from the host to the device. Once you have a pointer to the device memory, it is possible to wrap it as a Tensor without making a copy. It is not currently very obvious or well documented how to do that, so let me find a concrete example to help.
One case that is currently fairly easy from C++ is when you are working with data that is already in a third-party library that supports exporting a DLPack `DLManagedTensor*`. An example of such a library is NVIDIA's MatX. In that case you can call a method that exports the pointer and pass it directly to the Tensor constructor, as in this public Holoscan SDK example:
https://github.com/nvidia-holoscan/holohub/blob/main/applications/multiai_endoscopy/cpp/post-proc-matx-gpu/multi_ai.cu#L131C69-L131C84
An example using `cudaMalloc`, `cudaMemcpy`, `cudaFree` and the underlying NVIDIA GXF library APIs is the following `compute` method, which generates synthetic data, copies it to the device, and emits a device tensor:
```cpp
void SendTensorTxOp::compute(InputContext&, OutputContext& op_output, ExecutionContext& context) {
  // Define the dimensions for the CUDA memory (768 x 1024 x 3, uint8).
  int rows = 768;
  int columns = 1024;
  int channels = 3;
  // Available element types are:
  //   kInt8, kUnsigned8, kInt16, kUnsigned16, kInt32, kUnsigned32,
  //   kInt64, kUnsigned64, kFloat32, kFloat64, kComplex64, kComplex128
  nvidia::gxf::PrimitiveType element_type = nvidia::gxf::PrimitiveType::kUnsigned8;
  int element_size = nvidia::gxf::PrimitiveTypeSize(element_type);
  // The shape does not have to be 3D; it could be 1D, 2D, etc. instead.
  nvidia::gxf::Shape shape = nvidia::gxf::Shape{rows, columns, channels};
  size_t nbytes = rows * columns * channels * element_size;

  // Create a shared pointer for the CUDA memory with a custom deleter that will
  // free the device memory via cudaFree when done. The wrapped pointer is
  // initialized to nullptr so the deleter is safe even if allocation fails.
  auto pointer = std::shared_ptr<void*>(new void*(nullptr), [](void** pointer) {
    if (pointer != nullptr) {
      if (*pointer != nullptr) { CUDA_TRY(cudaFree(*pointer)); }
      delete pointer;
    }
  });

  // Allocate the CUDA memory.
  CUDA_TRY(cudaMalloc(pointer.get(), nbytes));

  // Replace this initialization of synthetic host `data` with however your
  // application gets data into host memory.
  std::vector<uint8_t> data(nbytes);
  for (size_t index = 0; index < data.size(); ++index) {
    data[index] = (index_ + index) % 256;
  }

  // Copy the data from host to device.
  CUDA_TRY(cudaMemcpy(*pointer, data.data(), nbytes, cudaMemcpyKind::cudaMemcpyHostToDevice));

  // Holoscan Tensor doesn't support direct memory allocation.
  // Thus, create an Entity and use a GXF tensor to wrap the CUDA memory.
  auto out_message = nvidia::gxf::Entity::New(context.context());
  auto gxf_tensor = out_message.value().add<nvidia::gxf::Tensor>("out_tensor");
  gxf_tensor.value()->wrapMemory(shape,
                                 element_type,
                                 element_size,
                                 nvidia::gxf::ComputeTrivialStrides(shape, element_size),
                                 // change to nvidia::gxf::MemoryStorageType::kCPU if using CPU memory
                                 nvidia::gxf::MemoryStorageType::kDevice,
                                 *pointer,
                                 [orig_pointer = pointer](void*) mutable {
                                   orig_pointer.reset();  // decrement the reference count
                                   return nvidia::gxf::Success;
                                 });

  // Emit the tensor.
  op_output.emit(out_message.value(), "out");
}
```
where you would need to include at least the following up top to use the CUDA runtime APIs and the underlying `nvidia::gxf::Tensor` API:
```cpp
#include <cuda_runtime.h>  // probably also automatically pulled in by holoscan/holoscan.hpp
#include <holoscan/holoscan.hpp>
// #include "gxf/std/tensor.hpp"  // pulled in automatically by #include <holoscan/holoscan.hpp>

#define CUDA_TRY(stmt)                                                                    \
  ({                                                                                      \
    cudaError_t _holoscan_cuda_err = stmt;                                                \
    if (cudaSuccess != _holoscan_cuda_err) {                                              \
      HOLOSCAN_LOG_ERROR("CUDA Runtime call {} in line {} of file {} failed with '{}' ({}).", \
                         #stmt,                                                           \
                         __LINE__,                                                        \
                         __FILE__,                                                        \
                         cudaGetErrorString(_holoscan_cuda_err),                          \
                         _holoscan_cuda_err);                                             \
    }                                                                                     \
    _holoscan_cuda_err;                                                                   \
  })
```
It works, thanks!!!!!!
How to manually load an image into CUDA using C++ and display it with HolovizOp?
My main doubt is that I don't know which class I should use and how to store image data into it. The Python version can easily be displayed using a `numpy.ndarray`, but the C++ version is very difficult to learn. All the examples only show how to get the image directly from the `VideoStreamReplayerOp` operator. I've run into obstacles. Please help.