triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.

chore: Add Llama3.1-8B support for vLLM and use KIND_MODEL for vLLM config by default #82

Closed. rmccorm4 closed this 1 month ago.

rmccorm4 commented 1 month ago

Add Llama3.1-8B support for vLLM (not TRT-LLM yet), and use KIND_MODEL by default in the vLLM-generated config.pbtxt to avoid multi-GPU issues. A sketch of the relevant config change is shown below.
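For context, the change affects the `instance_group` in the generated `config.pbtxt`. A minimal sketch of what a vLLM model config looks like with this default (the `count` value and any other fields are illustrative, not taken from this PR):

```
# config.pbtxt (sketch; the backend and instance_group kind are the relevant parts)
backend: "vllm"
instance_group [
  {
    # KIND_MODEL lets the backend (vLLM) handle device placement itself,
    # instead of Triton pinning the instance to a single GPU (KIND_GPU).
    count: 1
    kind: KIND_MODEL
  }
]
```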

Note: Llama3.1 support requires a newer vLLM version than what ships in the 24.07 release. Running `pip install "vllm==0.5.3.post1"` in the 24.07 vLLM container worked, though.
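For example, the upgrade can be done inside the 24.07 vLLM container like this (the image tag below is assumed from the standard NGC naming convention; the version pin comes from the note above):

```bash
# Start the 24.07 vLLM container (image tag assumed, not stated in this PR)
docker run --rm -it --gpus all nvcr.io/nvidia/tritonserver:24.07-vllm-python-py3 bash

# Inside the container, upgrade vLLM to a Llama 3.1-capable version
pip install "vllm==0.5.3.post1"
```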

rmccorm4 commented 1 month ago

CC @kthui: changing the default to KIND_MODEL here may resolve the test failures seen when tests land on multi-GPU nodes, and it is generally friendlier to users who raise their tensor/pipeline parallelism settings (TP/PP > 1) after generating a model as a starting point. A hypothetical example of that customization follows.
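As a concrete illustration of that customization, a user might edit the vLLM backend's `model.json` (which holds vLLM engine arguments) after generation to shard the model across GPUs; with `KIND_MODEL`, Triton does not pin the instance to one device, so vLLM is free to place the shards itself. All values below are illustrative assumptions, not defaults generated by the CLI:

```json
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "tensor_parallel_size": 2,
  "disable_log_requests": true
}
```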