triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

[tensorrt-llm backend] A question about launch_triton_server.py #455

Open victorsoda opened 4 months ago

victorsoda commented 4 months ago

Question

The code in launch_triton_server.py:

def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file, tensorrt_llm_model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        # If rank is not 0, skip loading of models other than `tensorrt_llm_model_name`
        if (i != 0):
            cmd += ['--model-control-mode=explicit']
            model_names = tensorrt_llm_model_name.split(',')
            for name in model_names:
                cmd += [f'--load-model={name}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}', '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
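
For reference, calling this function with world_size = 2 builds a single MPMD-style mpirun command in which ':' separates the per-rank sections. A quick way to print it, using placeholder arguments (the result mirrors the command shown in the Background section below):

if __name__ == '__main__':
    cmd = get_cmd(world_size=2,
                  tritonserver='/opt/tritonserver/bin/tritonserver',
                  grpc_port=8001, http_port=8000, metrics_port=8002,
                  model_repo='./model_repository',
                  log=False, log_file='',
                  tensorrt_llm_model_name='tensorrt_llm')
    print(' '.join(cmd))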

When world_size = 2, for example, two Triton servers will be launched with the same gRPC port (e.g., 8001). How could this be possible? When I tried to do something similar, I got the following error while launching the second server:

I0513 03:43:28.353306 21205 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0513 03:43:28.353458 21205 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
E0513 03:43:28.353559006   21206 chttp2_server.cc:1080]      UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2024-05-13T03:43:28.353510541+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-05-13T03:43:28.353503146+00:00", children:[UNKNOWN:Address family not supported by protocol {target_address:"[::]:8001", syscall:"socket", os_error:"Address family not supported by protocol", errno:97, created_time:"2024-05-13T03:43:28.353465612+00:00"}, UNKNOWN:Unable to configure socket {fd:6, created_time:"2024-05-13T03:43:28.353493367+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-05-13T03:43:28.353488259+00:00"}]}]}]}
E0513 03:43:28.353650 21206 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use

Background

I've been developing my own Triton backend, drawing on https://github.com/triton-inference-server/tensorrtllm_backend.

I have already built two engines (tensor parallel, tp_size = 2) for the llama2-7b model. Running something like mpirun -np 2 python3.8 run.py loads the two engines, runs tensor-parallel inference, and returns correct results.

My goal now is to serve the same two engines with the Triton server.

I have already implemented the run.py logic in the model.py (the initialize() and execute() functions) of my Python backend.

Following launch_triton_server.py, I tried the following command line:

mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ : -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository --model-control-mode=explicit --load-model=llama2_7b --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix1_ :

Then I got the error shown above.

Could you please tell me what I did wrong and how I can fix the error? Thanks a lot!

byshiue commented 3 months ago

In tensorrtllm_backend, when we launch several servers via MPI with world_size > 1, only rank 0 (the main process) receives and returns requests. The other ranks skip this step, so they do not run into the same-port issue. You need to do something similar if you want to use a self-defined backend.
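
A minimal sketch of that rank-gating pattern for a Python backend, assuming mpi4py is available inside the backend process; the engine-shard helpers (load_engine_shard, run_shard, build_response) are hypothetical placeholders, and a real model.py would also use triton_python_backend_utils to unpack request tensors and build responses:

from mpi4py import MPI

class TritonPythonModel:

    def initialize(self, args):
        self.comm = MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()
        # Every rank would load its own tensor-parallel engine shard here,
        # e.g. self.runner = load_engine_shard(self.rank)   # hypothetical
        if self.rank != 0:
            # Worker ranks never return from initialize(), so their tritonserver
            # process never finishes loading the model and never tries to bind
            # the HTTP/gRPC/metrics ports; only rank 0 serves requests.
            while True:
                payload = self.comm.bcast(None, root=0)
                if payload is None:                  # shutdown signal from rank 0
                    break
                # run_shard(self.runner, payload)    # hypothetical TP step

    def execute(self, requests):
        responses = []
        for request in requests:
            payload = {}                             # unpack request tensors here (omitted)
            self.comm.bcast(payload, root=0)         # fan the work out to worker ranks
            # result = run_shard(self.runner, payload)
            # responses.append(build_response(result))   # hypothetical
        return responses

With this shape, only the rank-0 tritonserver ever reaches the "Started GRPCInferenceService" log line, which is why launch_triton_server.py can pass the same ports to every rank without hitting the "Address already in use" failure.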

alokkrsahu commented 3 months ago

Any clue on how to resolve this issue? Please let me know.

dwq370 commented 2 months ago

I met the same error. Any solutions?

alokkrsahu commented 2 months ago

I used world size 4 and it worked



dwq370 commented 2 months ago

I used world size 4 but it did not work; world size 2 worked.

alokkrsahu commented 2 months ago

Okay



DefTruth commented 2 weeks ago

In tensorrtllm_backend, when we launch several servers via MPI with world_size > 1, only rank 0 (the main process) receives and returns requests. The other ranks skip this step, so they do not run into the same-port issue. You need to do something similar if you want to use a self-defined backend.

Any examples? We have the same problem. We need to run TensorRT-LLM in the Python backend with tp_size > 1 for a VLM model.