triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

launch multi-gpu triton server Got Port already in use #243

Open yjjiang11 opened 9 months ago

yjjiang11 commented 9 months ago

When I launch a multi-GPU Triton server with

python scripts/launch_triton_server.py --world_size 4 --model_repo /path/to/model/repo

I get a "port already in use" error:

E1221 09:27:15.346696872 166 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2023-12-21T09:27:15.346653273+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2023-12-21T09:27:15.346644896+00:00", children:[UNKNOWN:Address family not supported by protocol {created_time:"2023-12-21T09:27:15.346609255+00:00", errno:97, os_error:"Address family not supported by protocol", syscall:"socket", target_address:"[::]:8001"}, UNKNOWN:Unable to configure socket {created_time:"2023-12-21T09:27:15.346637691+00:00", fd:113, children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2023-12-21T09:27:15.346634796+00:00"}]}]}]}
E1221 09:27:15.346824 166 main.cc:244] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
E1221 09:27:15.347306746 165 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2023-12-21T09:27:15.34725917+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2023-12-21T09:27:15.347249218+00:00", children:[UNKNOWN:Address family not supported by protocol {created_time:"2023-12-21T09:27:15.347199869+00:00", errno:97, os_error:"Address family not supported by protocol", syscall:"socket", target_address:"[::]:8001"}, UNKNOWN:Unable to configure socket {fd:113, created_time:"2023-12-21T09:27:15.347237464+00:00", children:[UNKNOWN:Address already in use {created_time:"2023-12-21T09:27:15.347234311+00:00", errno:98, os_error:"Address already in use", syscall:"bind"}]}]}]}
E1221 09:27:15.347460 165 main.cc:244] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use

kaiyux commented 9 months ago

Could you make sure there is no other process running tritonserver on the same node? If you still encounter the issue, please share the config.pbtxt.
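As an aside, one quick way to check whether the default Triton ports are already held by another process on the node is to try binding them. This is just a minimal sketch (not part of the repo), assuming the standard defaults of 8000/8001/8002 for HTTP/gRPC/metrics:

# Minimal sketch: attempt to bind the default Triton ports to see
# whether another process on this node already holds them.
import socket

for port in (8000, 8001, 8002):  # HTTP, gRPC, metrics defaults
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        print(f"port {port} looks free")
    except OSError as exc:
        print(f"port {port} already in use: {exc}")
    finally:
        s.close()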

nikhilshandilya commented 8 months ago

I ran into the same issue, and no other process is running on the same node. I'm not sure if multiple server instances across different GPUs are trying to access the same port?

alwayshalffull commented 8 months ago

Running into the same issue (on 2 GPUs); setting --grpc_port to a range of other values doesn't fix it. I'm running on v0.7.1.

I also noticed that this issue only occurs when --world_size is set to > 1.

sydnash commented 8 months ago

launch_triton_server.py has the code below:

def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver]
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}',
            f'--model-repository={model_repo}',
            '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd

It seems every process is started with the same grpc_port, http_port, and metrics_port when world_size > 1. Maybe you can add the --allow-grpc=false --allow-http=false --allow-metrics=false options to the processes whose rank is greater than zero.
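For illustration, a rough sketch of what that change to get_cmd could look like (untested; it assumes tritonserver accepts --allow-grpc=false / --allow-http=false / --allow-metrics=false as boolean flags):

def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver]
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        if i == 0:
            # Only rank 0 exposes the HTTP/gRPC/metrics frontends.
            cmd += [
                f'--grpc-port={grpc_port}', f'--http-port={http_port}',
                f'--metrics-port={metrics_port}',
            ]
        else:
            # Higher ranks skip the frontends so they don't contend for the same ports.
            cmd += [
                '--allow-grpc=false', '--allow-http=false',
                '--allow-metrics=false',
            ]
        cmd += [
            f'--model-repository={model_repo}',
            '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd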

kaiyux commented 8 months ago

Could you please share more detailed reproduction steps and logs? We are running multi-GPU tests and did not see such an issue. Thanks.

alwayshalffull commented 8 months ago

I think my models were mis-configured somehow. I made many changes to my configs, and am now no longer running into this issue. (I didn't change code versions or update the model.py files or anything else.) Sorry I wasn't able to pinpoint the exact config problem!

sydnash commented 8 months ago

I think it is because the tensorrt_llm backend has an infinite loop in ModelInstanceState::ModelInstanceState, like this:

  if (getCommWorldRank() != 0)
    {
        while (true)
        {
        }
    }
}

This loop appears to be present to block execution on ranks other than rank 0, potentially causing subsequent execution of the server to be halted. It seems that only the first rank node creates the HTTP and gRPC servers as expected.

That's why you don't get the error when the configuration is OK.

But how do the other servers shut down gracefully? There isn't a way to stop the infinite loop. @kaiyux

nikhilshandilya commented 8 months ago

With no change to any code, this issue disappeared briefly, but it's back; I'm not sure what the root cause is yet. I was thinking the same as @sydnash: the different processes are all starting the server with the same port, but clearly it's working for everyone else.

My config looks like this:

name: "tensorrt_llm"
backend: "python"
max_batch_size: 1

input [
  {
    name: "query_0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "query_0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]