triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Regression from 23.07 to 24.05 on model count lifecycle/restarts #7347

Open sboudouk opened 3 weeks ago

sboudouk commented 3 weeks ago

Hello, thanks for the work being done here.

Description

I'm trying to debug multiple issues that happen in production, and upgrading our Triton Server to 24.05 is one of the solutions I'm testing.

I'm running an ensemble that chains multiple models. One of these models is called "large_3" and is configured like this:

instance_group [{ kind: KIND_GPU , count: 3 }]

I'm seeing many issues with unhealthy stubs in production, so I'm trying to reproduce this behaviour by killing 1 of the 3 instances of this "large_3" model with kill -9 <pid>. After I kill it, I see it disappear from the nvidia-smi and nvitop monitors, and Triton Server doesn't seem to care yet (no logs about it). As described in #5983, Triton Server should restart the missing instance on the next request, but that is not what happens: large_3_0_2 (which I assume is the one I killed) tries to pick up the job but keeps hanging.

With the exact same setup and test on 23.07, the instance is correctly detected as unhealthy and restarted, and then picks up the job successfully on the next request.

Can you help me understand what's happening and whether this behaviour is expected? Is there any way for me to detect unhealthy instances of my models and force a reload? This is a real blocker for upgrading our TS version.
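
For now, the kind of workaround I've been considering is to poll readiness from a sidecar and force an explicit reload when something looks off. A minimal sketch of what I mean, assuming the HTTP endpoint on localhost:8000 and the tritonclient Python package (readiness may not reflect the dead stub, since the server doesn't seem to notice it, but load_model at least gives a handle to force a reload):

import time

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

while True:
    try:
        # Readiness may stay True even with a dead stub, but load_model
        # gives a way to force a reload once an unhealthy instance is
        # detected by other means.
        if not client.is_model_ready("large_3"):
            client.load_model("large_3")
    except Exception as exc:  # connection errors, etc.
        print(f"health check failed: {exc}")
    time.sleep(5)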

Triton Information

What version of Triton are you using?

23.07 and 24.05

Are you using the Triton container or did you build it yourself?

We're using nvcr.io/nvidia/tritonserver:24.05-py3

I'm launching TS with these args:

            tritonserver
            --disable-auto-complete-config
            --model-load-thread-count=1
            --exit-timeout-secs=1
            --log-verbose=1
            --model-control-mode=explicit
            --model-repository=/common/models
            --load-model=large_3
            --load-model......

To Reproduce

Steps to reproduce the behavior:

Load a model with multiple instances (count > 1 in the config file), kill one of the instances, send an inference request that should use all the available instances, and observe the issue on 24.05.

The same setup gets the instance restarted correctly on 23.07 (I did not test any version in between, apart from 24.04).
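
For the kill step, this is roughly what I do, assuming the Python backend stub processes are visible as triton_python_backend_stub in the process list (the inference request itself is sent separately and is not shown here):

import os
import signal
import subprocess

# List the Python backend stub processes (one per model instance).
out = subprocess.run(
    ["pgrep", "-f", "triton_python_backend_stub"],
    capture_output=True, text=True, check=True,
)
pids = [int(p) for p in out.stdout.split()]
print("stub pids:", pids)

# SIGKILL a single stub; the other instances keep running.
os.kill(pids[0], signal.SIGKILL)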

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

large_3 config:

name: "large_3"
backend: "python"
max_batch_size: 1
dynamic_batching: {
    default_queue_policy: {
        timeout_action: REJECT
        default_timeout_microseconds: 300000000
    }
}

input [
  {
    name: "x"
    data_type: TYPE_FP32
    dims: [-1, 1280]
 }, {
    name: "y"
    data_type: TYPE_BOOL
    dims: [-1]
    optional: true
 }, {
    name: "z"
    data_type: TYPE_FP32
    dims: [-1]
    optional: true
 }, {
    name: "a"
    data_type: TYPE_FP32
    dims: [128, 3000]
    optional: true
 }, {
    name: "ab"
    data_type: TYPE_FP32
    dims: [-1, 1280]
 }, {
    name: "cx"
    data_type: TYPE_STRING
    dims: [-1]
    optional: true
 }, {
    name: "oo"
    data_type: TYPE_INT32
    dims: [-1]
    optional: true
 }, {
    name: "eee"
    data_type: TYPE_STRING
    dims: [-1]
    optional: true
 }, {
    name: "rrrt"
    data_type: TYPE_STRING
    dims: [-1]
    optional: true
  }, {
    name: "fdf"
    data_type: TYPE_BOOL
    dims: [-1]
    optional: true
  }, {
    name: "dfdf"
    data_type: TYPE_BOOL
    dims: [-1]
    optional: true
  }, {
    name: "dfdf"
    data_type: TYPE_STRING
    dims: [-1]
    optional: true
  }
]
output [
  {
    name: "t"
    data_type: TYPE_STRING
    dims: [-1]
  }, {
    name: "weg"
    data_type: TYPE_STRING
    dims: [-1]
  }, {
    name: "lxxx"
    data_type: TYPE_STRING
    dims: [-1]
  }, {
    name: "sss"
    data_type: TYPE_STRING
    dims: [-1]
  }
]

instance_group [{ kind: KIND_GPU , count: 3 }]

parameters {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
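
For context, a minimal Python backend model.py matching the shape of this config would look roughly like the sketch below (only the "x" input and the "t" output are handled; purely illustrative, not our actual model):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # "x" arrives with shape [batch, -1, 1280]; batch is 1 here
            # because max_batch_size is 1.
            x = pb_utils.get_input_tensor_by_name(request, "x").as_numpy()
            # Dummy string output shaped [batch, 1] to satisfy dims: [-1].
            # The real model fills all four string outputs.
            t = pb_utils.Tensor(
                "t", np.array([[f"rows={x.shape[1]}"]], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses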

v3_ensemble looks like this:

name: "v3_ensemble"
max_batch_size: 1
platform: "ensemble"

ensemble_scheduling {
  step [
    {
      model_name: "randomname2"
      model_version: -1
      input_map {
        key: "ewerwerw"
        value: "werwer"
      }
      output_map {
        key: "werwer"
        value: "ewrwer"
      }
    },
    {
      model_name: "large_3"
      model_version: -1

      ...
      input_map {
        key: "randomkey"
        value: "randomkeyx"
      }

      ...
      output_map {
        key: "xxx"
        value: "werewrwe"
      }

    }
  ]
}

input [
...
]
output [
...
]

I anonymized most of the names and values; if those are required for you to get a better understanding, I can provide a more carefully anonymized version.

Expected behavior

An unhealthy/killed model instance should be detected and restarted correctly when a new request arrives, instead of hanging or never being restarted.

sboudouk commented 2 weeks ago

@kthui @Tabrizian Any idea whether this is intended behaviour? I want to make sure that unhealthy processes are restarted correctly, but experimenting with kill it seems that they're not.

With kill -9: the process is killed and never restarts on new requests. The killed instance tries to execute the job but remains hung after this log:

I0620 11:06:12.646654 1013474 python_be.cc:1395] "model large_3, instance large_3_0_0, executing 1 requests"

With kill -11: the process is killed and tries to restart after the end of the next request, sometimes resulting in a segfault and the Triton server crashing with exit code 137 and "Signal 11 received". The other behaviour that can happen with this signal is the same as with -9: it starts executing one request and then hangs forever.
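
In the meantime, the closest thing I have to a client-side mitigation is a watchdog around the request: if an inference to large_3 hangs past a deadline, force an unload/load cycle through the repository API (we run with --model-control-mode=explicit). A rough sketch, with the request reduced to zero-filled placeholder tensors:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")


def build_inputs():
    # Placeholder request: just the two non-optional tensors, filled with zeros.
    inputs = []
    for name in ("x", "ab"):
        t = httpclient.InferInput(name, [1, 4, 1280], "FP32")
        t.set_data_from_numpy(np.zeros((1, 4, 1280), dtype=np.float32))
        inputs.append(t)
    return inputs


pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(client.infer, "large_3", build_inputs())
try:
    result = future.result(timeout=60)
except FuturesTimeout:
    # The request appears stuck on a dead instance; force a full reload.
    client.unload_model("large_3")
    client.load_model("large_3")
finally:
    # Don't block on the (possibly stuck) worker thread when tearing down.
    pool.shutdown(wait=False)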

sboudouk commented 1 week ago

Also, after compiling python_backend and forcing a specific model instance to restart inside TRITONBACKEND_ModelInstanceExecute (by forcing restart to true inside ProcessRequest), a segmentation fault is triggered that kills the whole Triton Server.

This is the closest way for me to simulate an unhealthy stub (it's the issue we're having in production when launching multiple Triton servers on the same node with Kubernetes).

sboudouk commented 10 hours ago


I see that in 24.06 Python restarts have been disabled. Any explanation of how we should deal with this? Is it tied to the issue I'm reporting?