triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

How to enable tensor parallelism #179

Open sfireworks opened 7 months ago

sfireworks commented 7 months ago

I built my model with --tp_size 2 --world_size 2, put the two generated engine files into the backend directory, and used the default config.pbtxt. Then I ran script/launch_triton_server.py --model_repo /all_models/mymodel/ --world_size 2, and the server reported an error message like this: [error screenshot]. Is there anything wrong with my configuration?
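For context, launch_triton_server.py wraps tritonserver in an MPI launch with one process per rank, so a --world_size 2 deployment is roughly equivalent to the sketch below (an assumption about the generated command; the real script adds extra per-rank flags):

```sh
# Simplified sketch of the mpirun line built by launch_triton_server.py
# for --world_size 2: one tritonserver process per tensor-parallel rank.
# (Assumption: the actual script also sets per-rank flags such as distinct
# shared-memory region prefixes.)
mpirun --allow-run-as-root \
    -n 1 tritonserver --model-repository=/all_models/mymodel/ : \
    -n 1 tritonserver --model-repository=/all_models/mymodel/
```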

byshiue commented 6 months ago

Please share your build script and the config file of the generated engine.

sfireworks commented 6 months ago

> Please share your build script and the config file of the generated engine.

I built the model with https://github.com/NVIDIA/TensorRT-LLM/blob/rel/examples/llama/build.py . The command was:

```sh
python3 build.py --model_dir ./tmp/20b_hf/ --dtype float16 \
    --use_gpt_attention_plugin float16 --use_gemm_plugin float16 \
    --remove_input_padding --use_inflight_batching --paged_kv_cache \
    --parallel_build --world_size 2 --tp_size 2 \
    --output_dir ./tmp/20b-trt-llm/
```
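With --tp_size 2, the build should emit one engine per rank plus a config.json. A quick sanity check of the output directory (the rank-engine naming shown is an assumption and varies by TensorRT-LLM version):

```sh
# Both rank engines plus config.json should be present; exact file names
# are version-dependent (assumption), e.g. llama_float16_tp2_rank0.engine
# and llama_float16_tp2_rank1.engine.
ls ./tmp/20b-trt-llm/
```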

sfireworks commented 6 months ago

The config file is:

```json
{
  "builder_config": {
    "fp8": false,
    "hidden_act": "silu",
    "hidden_size": 5120,
    "int8": false,
    "max_batch_size": 8,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 4096,
    "name": "llama",
    "num_heads": 40,
    "num_kv_heads": 40,
    "num_layers": 60,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 0,
    "tensor_parallel": 2,
    "use_refit": false,
    "vocab_size": 103168
  },
  "plugin_config": {
    "attention_qk_half_accumulation": false,
    "bert_attention_plugin": false,
    "context_fmha_type": 0,
    "gemm_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "identity_plugin": false,
    "layernorm_plugin": false,
    "layernorm_quantization_plugin": false,
    "lookup_plugin": false,
    "nccl_plugin": "float16",
    "paged_kv_cache": true,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "remove_input_padding": true,
    "rmsnorm_plugin": false,
    "rmsnorm_quantization_plugin": false,
    "smooth_quant_gemm_plugin": false,
    "tokens_per_block": 64,
    "use_custom_all_reduce": false,
    "weight_only_groupwise_quant_matmul_plugin": false,
    "weight_only_quant_matmul_plugin": false
  }
}
```

byshiue commented 6 months ago

Are you using the latest main branch? If not, please give it a try, and remember to rebuild both TensorRT-LLM and the engine.
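A minimal sketch of that update cycle, assuming a git checkout of tensorrtllm_backend with the TensorRT-LLM submodule (exact build steps vary by release; see each repo's README):

```sh
# Update to the latest main branch, including the TensorRT-LLM submodule.
git checkout main && git pull
git submodule update --init --recursive
# Rebuild TensorRT-LLM per its README, then rebuild the engine with the
# same build.py flags as above before relaunching the Triton server.
```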