triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

RemoteDisconnected('Remote end closed connection without response') #386

Open trillionmonster opened 6 months ago

trillionmonster commented 6 months ago

System Info

CPU: x86, OS: Ubuntu, GPU: A100-SXM x 8, Driver Version: 470.161.03, CUDA Version: 12.2
trtllm: 0.6.1, triton: 2.1.0

Who can help?

No response

Information

Tasks

Reproduction

Folder: tensorrt/tensorrtllm_backend/tensorrt_llm/examples/llama


MAX_INPUT_LEN=10000
MAX_OUTPUT_LEN=10000
TP_SIZE=2
MAX_BATCH_SIZE=10

python build.py --model_dir "$ORIGINAL_MODEL_PATH" \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir "$COMPILED_MODEL_PATH" \
                --paged_kv_cache \
                --max_batch_size $MAX_BATCH_SIZE \
                --tp_size $TP_SIZE \
                --world_size $TP_SIZE \
                --max_input_len $MAX_INPUT_LEN \
                --max_output_len $MAX_OUTPUT_LEN;

cd tensorrt/tensorrtllm_backend

cp all_models/inflight_batcher_llm/* $MODEL_NAME -r;
python3 tools/fill_template.py -i $MODEL_NAME/preprocessing/config.pbtxt tokenizer_type:llama,tokenizer_dir:$ORIGINAL_MODEL_PATH,triton_max_batch_size:$MAX_BATCH_SIZE,preprocessing_instance_count:1
python3 tools/fill_template.py -i $MODEL_NAME/postprocessing/config.pbtxt tokenizer_type:llama,tokenizer_dir:$ORIGINAL_MODEL_PATH,triton_max_batch_size:$MAX_BATCH_SIZE,postprocessing_instance_count:1
python3 tools/fill_template.py -i $MODEL_NAME/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i $MODEL_NAME/ensemble/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE
python3 tools/fill_template.py -i $MODEL_NAME/tensorrt_llm/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,max_beam_width:1,kv_cache_free_gpu_mem_fraction:0.9,engine_dir:$COMPILED_MODEL_PATH,max_attention_window_size:2560,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:600
python3 scripts/launch_triton_server_block.py --world_size $TP_SIZE --model_repo=$MODEL_NAME/ 
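Before starting a multi-day batch job against a freshly launched server, a single short synchronous request confirms the endpoint answers at all. This is a sketch, not from the original report: the port follows the `127.0.0.1:18000` seen in the error logs below, and the `ensemble` model name and `/v2/models/.../generate` route are the usual Triton defaults, which may differ in this deployment.

```python
def build_payload(prompt, max_tokens=64):
    """Minimal request body for the ensemble model's generate endpoint."""
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }

def smoke_test(base_url="http://127.0.0.1:18000"):
    import requests  # deferred so build_payload stays dependency-free
    # One short request; if this already hangs or the connection drops,
    # the server is unhealthy before any concurrent load is applied.
    r = requests.post(f"{base_url}/v2/models/ensemble/generate",
                      json=build_payload("hello", max_tokens=16),
                      timeout=40)
    r.raise_for_status()
    return r.json()["text_output"]
```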

Python client code:

from concurrent.futures import ProcessPoolExecutor, as_completed

import requests
from tqdm import tqdm

def gen_call(prompt, max_tokens=512, temperature=0.4, top_p=0.8):
    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "end_id": 119204,
        "temperature": temperature,
        "top_p": top_p,
        "return_log_probs": False
    }

    # Initialize so the function returns None (rather than raising
    # NameError) when all three attempts fail.
    model_response = None
    for _ in range(3):
        try:
            model_response = requests.post(model_url, json=payload, timeout=40).json()[
                "text_output"]
            break
        except Exception as e:
            print(f"error requests.post {e} prompt-hash:{hash(prompt)}")

    return model_response
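The three-attempt loop in `gen_call` can be factored into a small retry helper with exponential backoff between attempts, which also gives the server a moment to recover instead of retrying immediately. This is a sketch of my own; the helper name and backoff parameters are not part of the original client:

```python
import time

def post_with_retry(do_post, retries=3, backoff_s=0.5):
    """Call do_post() up to `retries` times.

    Sleeps with exponential backoff between failed attempts and returns
    the first successful result, or None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return do_post()
        except Exception as e:
            print(f"attempt {attempt + 1}/{retries} failed: {e}")
            if attempt < retries - 1:
                time.sleep(backoff_s * (2 ** attempt))
    return None
```

`gen_call` could then reduce to `post_with_retry(lambda: requests.post(model_url, json=payload, timeout=40).json()["text_output"])`.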

def process_prompt(prompt: str, max_tokens=512):
    prompt = prompt[:20000]
    result = gen_call(prompt, max_tokens=max_tokens, temperature=0.4, top_p=0.8)
    # Fall back to the original prompt when the call fails or returns
    # an empty string.
    if result is None or result.strip() == "":
        return prompt
    return result

def batch_infer(prompts, task_name):
    answers = [""] * len(prompts)  
    with ProcessPoolExecutor(max_workers=10) as executor:

        future_to_index = {executor.submit(process_prompt, prompt): index for index, prompt in enumerate(prompts)}

        for future in tqdm(as_completed(future_to_index), total=len(prompts), desc=f"infer {task_name}"):
            index = future_to_index[future]
            try:
                result = future.result(timeout=300)
                answers[index] = result 
            except Exception as e:
                print(f"Task at index {index} failed with exception: {e} ,prompt = {prompts[index]}")

    return answers
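Since each worker in `batch_infer` only performs an HTTP round trip (I/O-bound work), a thread pool would suffice and avoids per-process startup and pickling overhead. A minimal order-preserving sketch, with the function name and worker count being my own choices rather than anything from the report:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_map(fn, items, max_workers=10):
    """Apply fn to each item concurrently, preserving input order.

    Failed items are reported and left as None in the result list,
    mirroring the error handling in batch_infer above.
    """
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {executor.submit(fn, item): i
                           for i, item in enumerate(items)}
        for future in as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = future.result()
            except Exception as e:
                print(f"item {i} failed: {e}")
    return results
```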

Expected behavior

The roughly 60-hour job with 10 processes finishes without error.

actual behavior

error requests.post HTTPConnectionPool(host='127.0.0.1', port=18000): Read timed out. (read timeout=40) prompt-hash:3747463518456110343
error requests.post HTTPConnectionPool(host='127.0.0.1', port=18000): Read timed out. (read timeout=40) prompt-hash:3747463518456110343
error requests.post HTTPConnectionPool(host='127.0.0.1', port=18000): Read timed out. (read timeout=40) prompt-hash:3747463518456110343
error requests.post HTTPConnectionPool(host='127.0.0.1', port=18000): Read timed out. (read timeout=40) prompt-hash:-9199190014300159609
error requests.post HTTPConnectionPool(host='127.0.0.1', port=18000): Read timed out. (read timeout=40) prompt-hash:-9199190014300159609
error requests.post HTTPConnectionPool(host='127.0.0.1', port=18000): Read timed out. (read timeout=40) prompt-hash:-9199190014300159609

This starts about 19 hours in, with 10 processes.

(WeCom screenshot attached)

and finally

I can't send any more requests and can't attach to the container, but the container is still running.

additional notes

During the inference task, GPU utilization is either 0% or 100%, never anything in between. When requests time out it sits at 0%; when a retry succeeds it is at 100%.

lkm2835 commented 6 months ago

Same error on the latest version.