triton-inference-server / fastertransformer_backend


Serving large models with FT backend keeps Triton server crashing and restarting #86

Open · RajeshThallam opened this issue 1 year ago

RajeshThallam commented 1 year ago

We are trying to run Triton with the FasterTransformer backend on a GKE cluster with A100 GPUs to serve models such as T5 and UL2 from a model repository hosted on Google Cloud Storage. We are using the BigNLP container (nvcr.io/ea-bignlp/bignlp-inference:22.08-py3) to run Triton.
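For reference, this is roughly how we launch the server; the bucket name, key path, and port mappings are illustrative placeholders rather than our exact values:

```bash
# Run the BigNLP inference container with all GPUs visible and point
# Triton at the GCS-hosted model repository. Triton picks up GCS
# credentials from GOOGLE_APPLICATION_CREDENTIALS.
docker run --gpus all --rm --shm-size=4g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -e GOOGLE_APPLICATION_CREDENTIALS=/keys/sa-key.json \
  -v /path/to/keys:/keys \
  nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 \
  tritonserver --model-repository=gs://<our-bucket>/triton-model-repo \
               --log-verbose=1
```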

Following the instructions, we are able to bring up the Triton inference server for the T5-small model. However, when the repository contains a large model such as T5-XXL or UL2, the Triton server keeps crashing and restarting without any meaningful logs to troubleshoot.
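In case it helps, this is roughly how we have been inspecting the restarting pod on GKE (the pod name is a placeholder); if the kubelet were OOM-killing the container while the large checkpoint loads, we would expect it to show up here:

```bash
# Show pod events and last container state; an out-of-memory kill
# appears as "Last State: Terminated, Reason: OOMKilled".
kubectl describe pod triton-ft-0

# Fetch logs from the previous (crashed) container instance, which
# often holds the last messages written before the restart.
kubectl logs triton-ft-0 --previous

# Watch restart counts across the deployment.
kubectl get pods -w
```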

Logs when serving the T5-XXL model: [image attachment]

Logs when serving the T5-small model: [image attachment]