sayanmutd opened 1 month ago
@piotrm-nvidia Can you please suggest a minimal example to get started?
Let's start with a simple Linear model that takes a single input tensor and returns its negation:
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    # Negate every element of the batched input and return it as the single output.
    result = data * np.array([[-1]], dtype=np.float32)
    return [result]


triton = Triton()
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
)
triton.run()
This code creates a simple Triton model that takes a single input tensor named data and returns its negation as the output tensor named result. The infer_fn function processes the input data and produces the output, and the @batch decorator indicates that the model supports batching.
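Note that triton.run() starts the server in the background and returns, which is convenient in a notebook. If you run the example as a standalone script, you probably want to block until the server is stopped. A minimal sketch using PyTriton's context manager and the blocking serve() call:

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    # Negate every element of the batched input.
    return [data * np.array([[-1]], dtype=np.float32)]


# The context manager stops the server when the block exits.
with Triton() as triton:
    triton.bind(
        model_name="Linear",
        infer_func=infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
    )
    triton.serve()  # blocks until interrupted, e.g. with Ctrl+C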
You can test this model using the following code:

import numpy as np

from pytriton.client import ModelClient

client = ModelClient("localhost", "Linear")
data = np.array([1, 2], dtype=np.float32)
print(client.infer_sample(data=data))
The ModelClient class is a simple client for interacting with Triton models. If you need more advanced features, you can use the Triton client library directly; it provides a more flexible and powerful interface for working with Triton models and also supports shared memory for input and output data.
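For completeness, ModelClient can also send a whole batch in one call and can be used as a context manager. A small sketch, assuming the Linear server above is running:

import numpy as np

from pytriton.client import ModelClient

# The context manager closes the connection when the block exits.
with ModelClient("localhost", "Linear") as client:
    batch = np.array([[1, 2], [3, 4]], dtype=np.float32)  # first axis is the batch dimension
    result = client.infer_batch(data=batch)
    print(result["result"])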
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

# Configuration
url = "localhost:8000"
model_name = "Linear"
input_shape = (1, 2)  # batch of one sample with two elements
input_dtype = np.float32
data = np.array([[1, 2]], dtype=np.float32)
input_byte_size = int(np.prod(input_shape) * input_dtype().itemsize)

# Expected output shape and type
output_shape = (1, 2)
output_dtype = np.float32
output_byte_size = int(np.prod(output_shape) * output_dtype().itemsize)

# Create Triton client
try:
    triton_client = httpclient.InferenceServerClient(url=url, verbose=True)
except Exception as e:
    print("Channel creation failed: " + str(e))
    raise e

# Ensure no shared memory regions are registered with the server
triton_client.unregister_system_shared_memory()
triton_client.unregister_cuda_shared_memory()

# Create shared memory regions for input and output
shm_ip_handle = shm.create_shared_memory_region("input_data", "/input_simple", input_byte_size)
shm_op_handle = shm.create_shared_memory_region("output_data", "/output_simple", output_byte_size)

# Register the shared memory regions with the Triton server
triton_client.register_system_shared_memory("input_data", "/input_simple", input_byte_size)
triton_client.register_system_shared_memory("output_data", "/output_simple", output_byte_size)

# Put input data into shared memory
shm.set_shared_memory_region(shm_ip_handle, [data])

# Set up the inputs and outputs to use shared memory
inputs = []
inputs.append(httpclient.InferInput("data", list(input_shape), "FP32"))
inputs[-1].set_shared_memory("input_data", input_byte_size)

outputs = []
outputs.append(httpclient.InferRequestedOutput("result", binary_data=True))
outputs[-1].set_shared_memory("output_data", output_byte_size)

# Perform inference
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

# Read results from the shared memory
output = results.get_output("result")
output_data = shm.get_contents_as_numpy(
    shm_op_handle,
    output_dtype,
    output_shape,
)

# Print the results
print(f"Input data: {data}")
print(f"Output data: {output_data}")

# Clean up
triton_client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(shm_ip_handle)
shm.destroy_shared_memory_region(shm_op_handle)
This example demonstrates how to use system shared memory with the Triton client library. While shared memory can improve performance by reducing data transfer overhead, keep in mind that it requires the client and the server to run on the same machine, and that the regions have to be sized, registered, and cleaned up correctly.
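The same pattern extends to CUDA shared memory, where the region lives in GPU memory (and therefore requires a GPU visible to both client and server). A partial sketch below shows only the allocation and registration step, assuming the tritonclient.utils.cuda_shared_memory module; the request setup and result handling mirror the system shared memory example above:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

triton_client = httpclient.InferenceServerClient(url="localhost:8000")
byte_size = int(np.prod((1, 2)) * np.float32().itemsize)

# Allocate a CUDA shared memory region on GPU 0 and register it with the server.
shm_handle = cudashm.create_shared_memory_region("input_data_cuda", byte_size, 0)
triton_client.register_cuda_shared_memory(
    "input_data_cuda", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

# Copy the input into the GPU region; InferInput.set_shared_memory("input_data_cuda", byte_size)
# then points the request at it, just like in the system shared memory example.
cudashm.set_shared_memory_region(shm_handle, [np.array([[1, 2]], dtype=np.float32)])

# Clean up when done.
triton_client.unregister_cuda_shared_memory("input_data_cuda")
cudashm.destroy_shared_memory_region(shm_handle)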
Don't hesitate to ask if you have any questions or need further assistance.
I'm afraid that PyTorch support needs some fixes:
https://github.com/triton-inference-server/client/issues/789
Can you please provide a minimal example of using CUDA shared memory from a client application that streams preprocessed PyTorch tensors located on the GPU to the PyTriton server? The PyTriton server would consume the same PyTorch tensors through DLPack and run the inference there. I need examples for both the client and the server.