@Pavloveuge Is there a specific backend that you are using? Did you try setting kind: KIND_GPU in the instance_group of your model configuration?
If you don't specify the instance_group setting, then Triton tries to auto-complete the config for you, and in doing so it looks for available GPUs in the system. If no GPUs are available, it sets kind to KIND_CPU. If you explicitly state in the model configuration that the model must be loaded on GPU and no GPU is available on the machine, then I believe you should see an appropriate error.
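For example, an explicit GPU placement in config.pbtxt might look like the following (a minimal sketch; the instance count and GPU index are illustrative):

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]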
Thank you for your answer!
I'm using ORT with this config.pbtxt:
platform: "onnxruntime_onnx"
max_batch_size: 16
dynamic_batching { max_queue_delay_microseconds: 1000000 }
parameters { key: "cudnn_conv_use_max_workspace" value: { string_value: "1" } }
parameters { key: "intra_op_thread_count" value: { string_value: "1" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
Yeah, I really don't specify the instance_group setting. Unfortunately, right now I don't have access to the machine that has problems with the GPU (such problems happen randomly, and so far we have gotten rid of them by rebooting the host). I'll add the instance_group setting to the config, and if the problem reproduces I'll check whether I get an error and report back.
Is your feature request related to a problem? Please describe.
I occasionally run into situations where I run Triton on a host with a GPU, but some issue occurs that makes the GPU unavailable to Triton. In such cases, Triton writes a message and switches to the CPU, after which it runs the model there.
Is it possible to disable automatic fallback to CPU?
Describe the solution you'd like
It might be worth adding a special flag for this.
Describe alternatives you've considered
I could also make some kind of intermediate wrapper model that tries to check that the model was actually launched on the GPU, but I would not like to do that myself for every model; a sketch of such a check is shown below.
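A minimal sketch of such a check, assuming the Python tritonclient package, Triton's default HTTP port 8000, and a placeholder model name ("my_onnx_model"):

import sys
import tritonclient.http as httpclient

# Ask Triton for the config of the loaded model and fail fast if no
# instance is reported as KIND_GPU. This checks what Triton actually
# reports after auto-completion, not just what config.pbtxt requested.
client = httpclient.InferenceServerClient(url="localhost:8000")
config = client.get_model_config("my_onnx_model")  # placeholder model name

groups = config.get("instance_group", [])
if not any(g.get("kind") == "KIND_GPU" for g in groups):
    print("Model is not running on GPU: instance_group=%s" % groups, file=sys.stderr)
    sys.exit(1)
print("Model has at least one GPU instance.")

This verifies placement after the server has loaded the model, which is close to the wrapper idea above but lives outside the model repository.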