The profiler script was using a hard-coded TRT-LLM backend and sending the wrong payload (sampling parameters) to the vLLM backend, causing issues with calculations and the expected number of tokens.
vLLM has a default "max_tokens" of 16, which wasn't being overridden correctly because the override logic assumed backend == "trtllm".
Added logic to detect the model's backend and pass it to the profiler script for context (see the sketch below).
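As a rough illustration of the fix, the request payload has to be shaped per backend. A minimal sketch, assuming the field names used by the vLLM and TRT-LLM Triton backends ("sampling_parameters" as a JSON string vs. a top-level "max_tokens") and a hypothetical `build_payload` helper, not the actual profiler code:

```python
import json

def build_payload(prompt: str, backend: str, max_tokens: int) -> dict:
    """Hypothetical helper: shape the generate request per backend."""
    if backend == "vllm":
        # The vLLM backend reads sampling options from a JSON string and
        # defaults max_tokens to 16, so it must be overridden explicitly.
        return {
            "text_input": prompt,
            "stream": False,
            "sampling_parameters": json.dumps({"max_tokens": max_tokens}),
        }
    # TRT-LLM takes max_tokens as a top-level request field.
    return {"text_input": prompt, "stream": False, "max_tokens": max_tokens}
```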
Add fallback logic to "triton server start"
Rather than deciding between a default of "local" or "docker" (the better default depends on the environment), the command now defaults to "fallback" logic: it tries "local" first, and if it fails to find a "tritonserver" binary, it tries "docker" mode.
If you explicitly specify a mode, it will only try that one.
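A minimal sketch of this fallback behavior; the helper structure, model-repository path, and image tag are illustrative assumptions, not the actual Triton CLI implementation:

```python
import shutil
import subprocess

def start_server(mode: str | None = None) -> subprocess.Popen:
    """Sketch: resolve the server mode, then launch only that mode."""
    if mode is None:
        # Fallback logic: prefer "local" when a tritonserver binary is on
        # PATH, otherwise fall back to "docker".
        mode = "local" if shutil.which("tritonserver") else "docker"
    if mode == "local":
        return subprocess.Popen(["tritonserver", "--model-repository=/models"])
    if mode == "docker":
        return subprocess.Popen([
            "docker", "run", "--rm", "--gpus=all",
            "-v", "/models:/models",
            "nvcr.io/nvidia/tritonserver:24.01-py3",  # image tag is illustrative
            "tritonserver", "--model-repository=/models",
        ])
    raise ValueError(f"Unknown server mode: {mode}")
```

Probing PATH with shutil.which keeps the check cheap: no process is launched just to watch it fail before falling back.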
Unify logic and helper functions between "bench" and other commands
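Conceptually, the unification means "bench" composes the same helpers the individual subcommands call, so the two paths can't drift apart. A minimal sketch with hypothetical handler names, not the actual Triton CLI functions:

```python
def handle_repo_clear(args) -> None: ...
def handle_repo_add(args) -> None: ...
def handle_server_start(args) -> None: ...
def handle_model_profile(args) -> None: ...

def handle_bench(args) -> None:
    # "bench" is just the individual subcommands run in sequence,
    # so both workflows exercise identical code paths.
    handle_repo_clear(args)
    handle_repo_add(args)
    handle_server_start(args)
    handle_model_profile(args)
```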
Add more defaults to argparse help texts
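argparse has two standard mechanisms for surfacing defaults in help text, shown together here only for illustration (in practice you'd pick one per parser); the "--mode" flag is just an example argument:

```python
import argparse

parser = argparse.ArgumentParser(
    prog="triton",
    # Appends "(default: ...)" to the help of every argument that has one.
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
    "--mode",
    choices=["local", "docker"],
    default=None,
    # Alternatively, interpolate the default into a single help string.
    help="server mode; unset means try local, then docker (default: %(default)s)",
)
```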
Locally fixed and verified that the "all-in-one" bench workflow and the individual subcommand workflow behave the same:
```
triton bench -m gpt2
```

and

```
triton repo clear
triton repo add -m gpt2
triton server start
triton model profile -m gpt2
```