Open yjjiang11 opened 9 months ago
Could you make sure there is no other process running tritonserver on the same node? If you still encounter this issue, please share the config.pbtxt.
I ran into the same issue, and no other process is running on the same node. Could multiple server instances across different GPUs be trying to bind the same port?
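To rule out a stale process before launching, one option is to try binding the port yourself. The sketch below is a hypothetical helper (`port_in_use` is not part of Triton or the backend scripts); it only assumes that a failed `bind` means the address is already taken, which matches the `errno:98` in the log further down this thread.

```python
import socket


def port_in_use(port, host='0.0.0.0'):
    """Return True if host:port cannot be bound (hypothetical helper).

    Any OSError is treated as "in use"; note that a permission error on a
    privileged port (<1024) would also be reported this way.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


if __name__ == '__main__':
    # 8001 is the gRPC port from the error log; 8000 and 8002 are Triton's
    # default HTTP and metrics ports.
    for candidate in (8000, 8001, 8002):
        print(candidate, 'in use' if port_in_use(candidate) else 'free')
```

If all three ports report free right before launch and the error still appears, the conflict is most likely between the ranks started by the same launch command rather than with an unrelated process.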
Running into the same issue (on 2 GPUs). Setting --grpc_port to a range of other values doesn't fix it. I'm running on v0.7.1.
I also noticed that this issue only occurs when --world_size is set to >1.
The launch_triton_server.py script has the code below:
def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver]
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}',
            f'--model-repository={model_repo}',
            '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
It seems every process will be started with the same grpc_port, http_port, and metrics_port when world_size > 1. Maybe you can add the --allow-grpc=false --allow-http=false --allow-metrics=false options to the processes whose rank is greater than zero.
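A minimal sketch of that suggestion, assuming the same `get_cmd` signature as in the script quoted in this thread. This is an untested illustration of the idea, not the upstream fix: rank 0 keeps the port flags, and every other rank gets the three `--allow-*=false` flags instead.

```python
def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file):
    """Build an mpirun command in which only rank 0 opens network endpoints."""
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver]
        if log and i == 0:
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        if i == 0:
            # Rank 0 is the only process that should bind the ports.
            cmd += [
                f'--grpc-port={grpc_port}', f'--http-port={http_port}',
                f'--metrics-port={metrics_port}',
            ]
        else:
            # Assumption from the comment above: other ranks start no endpoints.
            cmd += [
                '--allow-grpc=false', '--allow-http=false',
                '--allow-metrics=false',
            ]
        cmd += [
            f'--model-repository={model_repo}',
            '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
```

With `world_size=2`, the resulting command contains the port flags once and the `--allow-*=false` flags once, so only one process attempts to bind 8001.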
Could you please share more detailed reproduction steps and logs? We are running multi-GPU tests and have not seen this issue. Thanks.
I think my models were mis-configured somehow. I made many changes to my configs, and am now no longer running into this issue. (I didn't change code versions or update the model.py files or anything else.) Sorry I wasn't able to pinpoint the exact config problem!
I think it is because the tensorrt-llm backend has an infinite loop in the ModelInstanceState::ModelInstanceState constructor, like this:
    if (getCommWorldRank() != 0)
    {
        while (true)
        {
        }
    }
}
This loop appears to be present to block execution on ranks other than rank 0, potentially causing subsequent execution of the server to be halted. It seems that only the first rank node creates the HTTP and gRPC servers as expected.
That's the reason why you don't get the error when the configuration is OK.
But how do the other servers shut down gracefully? There isn't a way to stop the infinite loop. @kaiyux
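To make the shutdown question concrete, here is a Python analogy of the pattern, not the backend's C++ code. All names (`park_until_stopped`, `demo`, `stop_event`) are hypothetical. It shows one way non-zero ranks could be parked on a signal that can be released, instead of an unconditional `while (true)`.

```python
import threading


def park_until_stopped(rank, stop_event, results):
    """Rank 0 proceeds to serve; other ranks block until stop_event is set."""
    if rank != 0:
        stop_event.wait()          # blocks without spinning, returns once set
        results[rank] = 'released'
        return
    results[rank] = 'serving'      # stand-in for starting the HTTP/gRPC servers


def demo(world_size=3):
    """Run every rank as a thread, then release the parked ones."""
    stop_event = threading.Event()
    results = {}
    workers = [
        threading.Thread(target=park_until_stopped,
                         args=(rank, stop_event, results))
        for rank in range(world_size)
    ]
    for worker in workers:
        worker.start()
    workers[0].join()              # rank 0 finishes its (simulated) startup
    stop_event.set()               # shutdown signal: release the other ranks
    for worker in workers:
        worker.join(timeout=5)
    return results, [worker.is_alive() for worker in workers]
```

The difference from the quoted loop is only that the wait has an exit condition. Whether the backend can use something equivalent depends on how its ranks are coordinated, which this thread does not show.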
With no change to any code this issue disappeared briefly, but it's back; I'm not sure what the root cause is yet. I was thinking the same as @sydnash: the different processes are all starting the server with the same port, but clearly it's working for everyone else.
my config looks like:
name: "tensorrt_llm"
backend: "python"
max_batch_size: 1
input [
  {
    name: "query_0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "query_0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
When I launch a multi-GPU Triton server with:
python scripts/launch_triton_server.py --world_size 4 --model_repo /path/to/model/repo
I get a port-in-use error:
21 09:27:15.346696872 166 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2023-12-21T09:27:15.346653273+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2023-12-21T09:27:15.346644896+00:00", children:[UNKNOWN:Address family not supported by protocol {created_time:"2023-12-21T09:27:15.346609255+00:00", errno:97, os_error:"Address family not supported by protocol", syscall:"socket", target_address:"[::]:8001"}, UNKNOWN:Unable to configure socket {created_time:"2023-12-21T09:27:15.346637691+00:00", fd:113, children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2023-12-21T09:27:15.346634796+00:00"}]}]}]}
E1221 09:27:15.346824 166 main.cc:244] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
E1221 09:27:15.347306746 165 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2023-12-21T09:27:15.34725917+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2023-12-21T09:27:15.347249218+00:00", children:[UNKNOWN:Address family not supported by protocol {created_time:"2023-12-21T09:27:15.347199869+00:00", errno:97, os_error:"Address family not supported by protocol", syscall:"socket", target_address:"[::]:8001"}, UNKNOWN:Unable to configure socket {fd:113, created_time:"2023-12-21T09:27:15.347237464+00:00", children:[UNKNOWN:Address already in use {created_time:"2023-12-21T09:27:15.347234311+00:00", errno:98, os_error:"Address already in use", syscall:"bind"}]}]}]}
E1221 09:27:15.347460 165 main.cc:244] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use