sayanmutd opened 1 month ago
@piotrm-nvidia Can you please suggest a minimal example to get started?
Let's start with a simple Linear model that takes a single input tensor and returns its negation:
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    # Negate every element of the batched input and return it as the single output.
    result = data * np.array([[-1]], dtype=np.float32)
    return [result]


triton = Triton()
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
)
triton.run()
This code creates a simple Triton model that takes a single input tensor named data and returns its negation as the output tensor named result. The infer_fn function processes the input data and produces the output, and the @batch decorator indicates that the model supports batching.
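Note that triton.run() starts the server in the background and returns, which is convenient in a notebook. If you run the example as a standalone script, you probably want to block until the server is stopped. A minimal sketch using PyTriton's context manager and the blocking serve() call:

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    # Negate every element of the batched input.
    return [data * np.array([[-1]], dtype=np.float32)]


# The context manager stops the server when the block exits.
with Triton() as triton:
    triton.bind(
        model_name="Linear",
        infer_func=infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
    )
    triton.serve()  # blocks until interrupted, e.g. with Ctrl+C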
You can test this model using the following code:

import numpy as np

from pytriton.client import ModelClient

client = ModelClient("localhost", "Linear")
data = np.array([1, 2], dtype=np.float32)
print(client.infer_sample(data=data))
The ModelClient class is a simple client for interacting with Triton models. If you need more advanced features, you can use the Triton client library directly; it provides a more flexible and powerful interface for working with Triton models and also supports shared memory for input and output data.
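For completeness, ModelClient can also send a whole batch in one call and can be used as a context manager. A small sketch, assuming the Linear server above is running:

import numpy as np

from pytriton.client import ModelClient

# The context manager closes the connection when the block exits.
with ModelClient("localhost", "Linear") as client:
    batch = np.array([[1, 2], [3, 4]], dtype=np.float32)  # first axis is the batch dimension
    result = client.infer_batch(data=batch)
    print(result["result"])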
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

# Configuration
url = "localhost:8000"
model_name = "Linear"
input_shape = (1, 2)  # batch of one sample with two elements
input_dtype = np.float32
data = np.array([[1, 2]], dtype=np.float32)
input_byte_size = int(np.prod(input_shape) * input_dtype().itemsize)

# Expected output shape and type
output_shape = (1, 2)
output_dtype = np.float32
output_byte_size = int(np.prod(output_shape) * output_dtype().itemsize)

# Create Triton client
try:
    triton_client = httpclient.InferenceServerClient(url=url, verbose=True)
except Exception as e:
    print("Channel creation failed: " + str(e))
    raise e

# Ensure no shared memory regions are registered with the server
triton_client.unregister_system_shared_memory()
triton_client.unregister_cuda_shared_memory()

# Create shared memory regions for input and output
shm_ip_handle = shm.create_shared_memory_region("input_data", "/input_simple", input_byte_size)
shm_op_handle = shm.create_shared_memory_region("output_data", "/output_simple", output_byte_size)

# Register the shared memory regions with the Triton server
triton_client.register_system_shared_memory("input_data", "/input_simple", input_byte_size)
triton_client.register_system_shared_memory("output_data", "/output_simple", output_byte_size)

# Put input data into shared memory
shm.set_shared_memory_region(shm_ip_handle, [data])

# Set up the inputs and outputs to use shared memory
inputs = []
inputs.append(httpclient.InferInput("data", list(input_shape), "FP32"))
inputs[-1].set_shared_memory("input_data", input_byte_size)

outputs = []
outputs.append(httpclient.InferRequestedOutput("result", binary_data=True))
outputs[-1].set_shared_memory("output_data", output_byte_size)

# Perform inference
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

# Read results from the shared memory
output = results.get_output("result")
output_data = shm.get_contents_as_numpy(
    shm_op_handle,
    output_dtype,
    output_shape,
)

# Print the results
print(f"Input data: {data}")
print(f"Output data: {output_data}")

# Clean up
triton_client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(shm_ip_handle)
shm.destroy_shared_memory_region(shm_op_handle)
This example demonstrates how to use system shared memory with the Triton client library. While shared memory can improve performance by reducing data transfer overhead, keep in mind that it requires the client and the server to run on the same machine, and that the regions have to be sized, registered, and cleaned up correctly.
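The same pattern extends to CUDA shared memory, where the region lives in GPU memory (and therefore requires a GPU visible to both client and server). A partial sketch below shows only the allocation and registration step, assuming the tritonclient.utils.cuda_shared_memory module; the request setup and result handling mirror the system shared memory example above:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

triton_client = httpclient.InferenceServerClient(url="localhost:8000")
byte_size = int(np.prod((1, 2)) * np.float32().itemsize)

# Allocate a CUDA shared memory region on GPU 0 and register it with the server.
shm_handle = cudashm.create_shared_memory_region("input_data_cuda", byte_size, 0)
triton_client.register_cuda_shared_memory(
    "input_data_cuda", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

# Copy the input into the GPU region; InferInput.set_shared_memory("input_data_cuda", byte_size)
# then points the request at it, just like in the system shared memory example.
cudashm.set_shared_memory_region(shm_handle, [np.array([[1, 2]], dtype=np.float32)])

# Clean up when done.
triton_client.unregister_cuda_shared_memory("input_data_cuda")
cudashm.destroy_shared_memory_region(shm_handle)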
Don't hesitate to ask if you have any questions or need further assistance.
I'm afraid that PyTorch support needs some fixes:
https://github.com/triton-inference-server/client/issues/789
Can you please provide a minimal example of using CUDA shared memory from a client application that streams preprocessed PyTorch tensors located on the GPU to the PyTriton server? The PyTriton server would consume the same PyTorch tensors through DLPack and run the inference there. I need examples for both the client and the server.