triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[Python Backend] Send PbTensor to cpu for calling as_numpy() or add a function as_cupy() #3547

Open zzk0 opened 2 years ago

zzk0 commented 2 years ago

Is your feature request related to a problem? Please describe. In the Python backend, I send an inference request to a served model and get the inference response. What I need is a NumPy array, so I call as_numpy(). But an error occurs: the tensor is stored on the GPU and cannot be converted to NumPy.

Describe the solution you'd like

  1. Send the tensor to the CPU so that as_numpy() can be called, or
  2. add a function as_cupy() so the result can be used on the GPU.
zzk0 commented 2 years ago

Here is the code in my Python backend:

The tensor returned by get_output_tensor_by_name cannot be converted with as_numpy(), because the tensor is stored on the GPU. It seems that no other method can get the data out of the tensor.

        inference_request = pb_utils.InferenceRequest(
            model_name='rnet',
            requested_output_names=[self.Rnet_outputs[0], self.Rnet_outputs[1]],
            inputs=[pb_utils.Tensor(self.Rnet_inputs[0], predict_24_batch)]
        )
        inference_response = inference_request.exec()
        cls_prob = pb_utils.get_output_tensor_by_name(inference_response, self.Rnet_outputs[0]).as_numpy()
        roi_prob = pb_utils.get_output_tensor_by_name(inference_response, self.Rnet_outputs[1]).as_numpy()
zzk0 commented 2 years ago

By the way, how do I debug a Python backend? Anyway, thanks for your help.

tanmayv25 commented 2 years ago

Have you read the section here?

Can you try the following? https://github.com/triton-inference-server/server/blob/main/qa/python_models/dlpack_io_identity/model.py#L85-L91

zzk0 commented 2 years ago

Yes, I tried that. But it has to convert to a PyTorch tensor first, and then to a NumPy array.

I wrote a function like this:

from torch.utils.dlpack import from_dlpack

def pb_tensor_to_numpy(pb_tensor):
    if pb_tensor.is_cpu():
        return pb_tensor.as_numpy()
    else:
        # Route GPU tensors through PyTorch via DLPack, then copy to host.
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        return pytorch_tensor.cpu().numpy()

It seems that Triton doesn't provide any method to get a NumPy array from a GPU tensor directly. Anyway, thanks for your advice.
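(Editor's note: for readers who want to avoid the PyTorch dependency discussed later in this thread, the same helper can be sketched with CuPy instead. This is a hypothetical variant: it assumes CuPy is installed in the backend environment and that the tensor exposes the same `is_cpu()` / `as_numpy()` / `to_dlpack()` methods used above.)

```python
def pb_tensor_to_numpy(pb_tensor):
    """Return a NumPy array from a pb_utils.Tensor, CPU- or GPU-resident."""
    if pb_tensor.is_cpu():
        return pb_tensor.as_numpy()
    # Imported lazily so CPU-only models never pay for the dependency.
    import cupy
    # Wrap the GPU buffer zero-copy via DLPack, then copy device-to-host.
    return cupy.from_dlpack(pb_tensor.to_dlpack()).get()
```

CuPy is considerably lighter to import than PyTorch, but it is still an extra dependency; the feature requested in this issue would remove the need for either.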

Tabrizian commented 2 years ago

It is not possible to copy a tensor from GPU to CPU in the Python backend directly. You need to use PyTorch (or any other framework that supports DLPack) to perform the conversion.

Edit: This feature makes sense and we've put it on the road map.

manastahir commented 1 year ago

PyTorch is not available in the Python backend.

tanmayv25 commented 1 year ago

@manastahir You can `pip install torch` in the container, and the backend process should be able to access the module.

manastahir commented 1 year ago

@tanmayv25 For deployments where we don't have direct access to the containers, is there any way to add it to the container start command?

tanmayv25 commented 1 year ago

You would have to capture the dependency in a custom execution environment as described here.
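(Editor's note: the Python backend locates such a custom execution environment through the `EXECUTION_ENV_PATH` parameter in the model's `config.pbtxt`; the archive is typically produced with conda-pack. The path below is illustrative.)

```
# config.pbtxt (illustrative): point the Python backend at a
# conda-pack'd environment archive that bundles torch or cupy.
parameters: {
  key: "EXECUTION_ENV_PATH"
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/custom_env.tar.gz"}
}
```

The `$$TRITON_MODEL_DIRECTORY` variable is expanded by Triton at load time, so the archive can ship inside the model repository itself.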

ShuaiShao93 commented 6 months ago

We don't want to add PyTorch to the package since it's too heavyweight, and importing PyTorch consumes memory. A more lightweight way to copy to the CPU, like as_numpy(), would work better for us.

Tabrizian commented 6 months ago

@ShuaiShao93 I understood your use-case and I have updated my comment above.

ShuaiShao93 commented 6 months ago

> @ShuaiShao93 I understood your use-case and I have updated my comment above.

Thank you! Please update here when it's done. Much appreciated!