triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

client silent failure - E0422 05:03:24.145960 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument #7148

Open jrcavani opened 4 months ago

jrcavani commented 4 months ago

Description

The model repo is an object detection ensemble, which consists of a preprocessor written with the Python backend and the main model as a TensorRT plan. The Python backend uses CuPy to allocate GPU tensors and passes them back to the Triton scheduler with pb_utils.Tensor.from_dlpack for the TensorRT model.

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor.from_dlpack(
                        self.output_name, preprocessed_full_batch.copy()
                    )
                ]
            )

The CuPy allocation during preprocessing looks like:

    def preprocess(self, batch):
        """
        batch is imgs in HWC uint8 BGR format.
        """
        import cupy as xp  # NumPy (np) is imported at module level

        batch = xp.asarray(np.array(batch))  # copy the uint8 array to the GPU
        input_blob = batch.astype(xp.float32)  # convert to float32 after the copy to GPU

        input_blob = input_blob[..., ::-1]  # BGR to RGB
        input_blob = input_blob.transpose(0, 3, 1, 2)  # NHWC to NCHW
        input_blob -= self.input_mean
        input_blob /= self.input_std

        return input_blob
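
For context, here is a minimal sketch of how the two snippets above might sit together in the preprocessor's execute() method. The output name "input.1" comes from the ensemble config below; the mean/std constants and the decode_images() helper are assumptions, and the real model.py additionally concatenates several requests into one full batch:

    import numpy as np
    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        def initialize(self, args):
            # Assumed constants; the real values are not shown in the issue.
            self.output_name = "input.1"  # matches the preprocessor output_map below
            self.input_mean = 127.5
            self.input_std = 128.0

        def execute(self, requests):
            responses = []
            for request in requests:
                raw = pb_utils.get_input_tensor_by_name(request, "image").as_numpy()
                # decode_images() is a hypothetical helper that turns the raw bytes
                # into HWC BGR uint8 images; the multi-request batching is omitted.
                imgs = decode_images(raw)
                preprocessed_full_batch = self.preprocess(imgs)  # CuPy array on the GPU
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[
                            # Hand the GPU tensor to Triton via DLPack (the failing path).
                            pb_utils.Tensor.from_dlpack(
                                self.output_name, preprocessed_full_batch.copy()
                            )
                        ]
                    )
                )
            return responses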

It works great unless the user submits a large input / big batch size that exceeds some CUDA buffer limit. The error on the server side looks like:

W0422 05:00:35.414218 1 memory.cc:212] Failed to allocate CUDA memory with byte size 78643200 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
E0422 05:00:35.457498 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument
E0422 05:00:35.537644 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument
E0422 05:00:35.560513 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument

However, the client does not get an error response or exception! I tried both the HTTP and gRPC Python clients, and they behaved the same - there is no error, but the output tensors were incorrect. This silent failure is very alarming, because the results are garbage but appear to be fine...
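
For reference, the client call that hits this looks roughly like the sketch below (the ensemble model name and the image loading are assumptions; "image" and "score_8" come from the config further down). Even when the server logs the pb_stub error above, infer() returns normally and as_numpy() yields incorrect values instead of raising:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Assumed: one encoded image loaded as raw bytes, sent as shape [1, N].
    data = np.fromfile("big_image.jpg", dtype=np.uint8).reshape(1, -1)
    inp = httpclient.InferInput("image", list(data.shape), "UINT8")
    inp.set_data_from_numpy(data)

    # No exception is raised even when the server logs the buffer-copy error;
    # the returned tensors are simply wrong.
    result = client.infer("detector_ensemble", inputs=[inp])
    scores = result.as_numpy("score_8")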

Triton Information: container 24.03 - tritonserver 2.44.0

Are you using the Triton container or did you build it yourself? NGC container

To Reproduce: The description above should be clear. By matching the error text, it looks like it comes from these lines:

https://github.com/triton-inference-server/python_backend/blob/r24.03/src/pb_stub.cc#L403-L404

It must have happened at or after pb_utils.Tensor.from_dlpack(). And why is an error within the Python backend not forwarded to the client?

If I convert the CuPy array back to NumPy and load it the usual way, it works:

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor(
                        self.output_name, cp.asnumpy(preprocessed_full_batch[start_offset:end_offset])
                    )
                ]
            )

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

ensemble model config:

platform: "ensemble"
max_batch_size: 16

input [
  {
    name: "image"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]

output [
  {
    name: "score_8"
    data_type: TYPE_FP32
    dims: [ 12800, 1 ]
  },
  {
    name: "bbox_8"
    data_type: TYPE_FP32
    dims: [ 12800, 4 ]
  },
  {
    name: "kps_8"
    data_type: TYPE_FP32
    dims: [ 12800, 10 ]
  },
  {
    name: "score_16"
    data_type: TYPE_FP32
    dims: [ 3200, 1 ]
  },
  {
    name: "bbox_16"
    data_type: TYPE_FP32
    dims: [ 3200, 4 ]
  },
  {
    name: "kps_16"
    data_type: TYPE_FP32
    dims: [ 3200, 10 ]
  },
  {
    name: "score_32"
    data_type: TYPE_FP32
    dims: [ 800, 1 ]
  },
  {
    name: "bbox_32"
    data_type: TYPE_FP32
    dims: [ 800, 4 ]
  },
  {
    name: "kps_32"
    data_type: TYPE_FP32
    dims: [ 800, 10 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "detector_preprocessor"
      model_version: -1
      input_map {
        key: "image"
        value: "image"
      }
      output_map {
        key: "input.1"
        value: "input.1"
      }
    },
    {
      model_name: "detector_main_model"
      model_version: -1
      input_map {
        key: "input.1"
        value: "input.1"
      }
      output_map {
        key: "score_8"
        value: "score_8"
      }
      output_map {
        key: "bbox_8"
        value: "bbox_8"
      }
      output_map {
        key: "kps_8"
        value: "kps_8"
      }
      output_map {
        key: "score_16"
        value: "score_16"
      }
      output_map {
        key: "bbox_16"
        value: "bbox_16"
      }
      output_map {
        key: "kps_16"
        value: "kps_16"
      }
      output_map {
        key: "score_32"
        value: "score_32"
      }
      output_map {
        key: "bbox_32"
        value: "bbox_32"
      }
      output_map {
        key: "kps_32"
        value: "kps_32"
      }
    }
  ]
}

Expected behavior

The server side error that invalidates the output should be a glaring error on the client side.

In addition, I would love to get some clarity on how cuda-memory-pool-byte-size is used when GPU tensors are queued from one model to another. What's the max queue size, and do all queued tensors take up space for this globally shared cuda-memory-pool-byte-size?

jbkyang-nvi commented 4 months ago

Hello, while we try to reproduce your issue, can you update your client + server to Triton 24.03? 23.04 is a year old and we don't really maintain containers that old.

jbkyang-nvi commented 4 months ago

cuda-memory-pool-byte-size is per GPU. As per the tritonserver CLI:

The total byte size that can be allocated as CUDA memory for the GPU device. If GPU support is enabled, the server will allocate CUDA memory to minimize data transfer between host and devices until it exceeds the specified byte size. This option will not affect the allocation conducted by the backend frameworks.

The "queued tensors" will take up space for all models
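
For concreteness, the pool is sized per GPU on the tritonserver command line (the flag can be repeated, once per GPU device; the default is 64 MB per device). A hypothetical 2 GB pool on GPU 0, with an assumed model repository path:

    tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:2147483648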

jrcavani commented 4 months ago

@jbkyang-nvi I am indeed using 24.03, the latest container.

On cuda-memory-pool-byte-size, it is exactly this last sentence that confused me:

This option will not affect the allocation conducted by the backend frameworks.

It sounds like cuda-memory-pool-byte-size only affects client -> server CUDA shared memory when the client and server are on the same host, and that the backend (in this case Python) allocation is not affected by this option.

But when I increased this value to 2GB, no errors were reported anymore. So maybe the correct understanding is that it affects the pool size for tensors passed between backends, and it does not affect how the backend code allocates GPU memory, such as using CuPy to allocate arrays in the Python backend. Is this right?

It's easy to get confused, because the CUDA pool and the pinned memory pool are used both between client and server and between backends in an ensemble.

The issue still exists IMO, as this only works around the problem - I'm sure that if the server errors again, the client will still get the silent treatment.

adrtsang commented 3 months ago

I am running into the same error as well and have been trying to find a way to propagate the error to the client when this happens. Is there a solution that anyone can suggest? Thanks

MarkoKostiv commented 3 months ago

We encounter the same error in the ensemble model when passing tensors from preprocessing (Python backend) to TensorRT for inference. Increasing cuda-memory-pool-byte-size saves the day, but it's unclear how to choose the correct limit and ensure it doesn't crash in production under more requests and longer queues.
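
For what it's worth, a rough lower bound for the pool can be estimated from the intermediate tensors that actually pass between the ensemble steps. The 78643200-byte allocation in the log above matches a 16 x 3 x 640 x 640 FP32 batch, which is consistent with the detector config earlier in the thread, so one such tensor per in-flight request times the expected number of in-flight requests gives a starting point (the concurrency figure below is purely an assumption):

    # The FP32 NCHW full batch consistent with the config above:
    # 16 * 3 * 640 * 640 * 4 = 78,643,200 bytes, matching the failed allocation in the log.
    bytes_per_full_batch = 16 * 3 * 640 * 640 * 4
    max_inflight_requests = 4  # assumption: depends on instance count and queue depth
    pool_lower_bound = bytes_per_full_batch * max_inflight_requests
    print(pool_lower_bound / 2**20, "MiB")  # 300 MiB, plus headroom for other models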

GuanLuo commented 3 months ago

@Tabrizian I think the root cause is that the error that happened in the Python backend results in invalid response data; the backend should return an error response instead (or fall back to system memory).
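
For illustration, model.py can already attach explicit errors to responses for failures it can catch, using pb_utils.TritonError (a minimal sketch; _run() is a hypothetical helper). The catch here is that the "failed to copy data" error happens in the stub after execute() has returned, so this pattern never sees it:

    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        def execute(self, requests):
            responses = []
            for request in requests:
                try:
                    out = self._run(request)  # hypothetical per-request processing
                    responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
                except Exception as exc:
                    # Only exceptions raised inside execute() land here; the GPU
                    # buffer-copy failure occurs later, in the C++ stub.
                    responses.append(
                        pb_utils.InferenceResponse(
                            output_tensors=[], error=pb_utils.TritonError(str(exc))
                        )
                    )
            return responses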

jrcavani commented 2 months ago

@GuanLuo Do you mean the C++ code should return the error, instead of model.py? If so, I agree. There is currently no way to get that error in Python code.

adrtsang commented 2 months ago

It would be very helpful to propagate the error to model.py instead of having it handled by the backend C++ code.