meta-llama / llama-stack

Composable building blocks to build Llama Apps

VLLM / OpenAI Compatible endpoint support #152

Open matbee-eth opened 1 month ago

matbee-eth commented 1 month ago

The current local implementation does not support sharding or tensor parallelism, and it refuses to run on my dual RTX 4090 setup. How do I enable multi-GPU inference, or how do I point llama-stack at a proper serving system like vLLM?
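For context, this is the kind of setup I am after: vLLM's OpenAI-compatible server can shard a model across both GPUs with tensor parallelism, and any OpenAI client can then talk to it. A rough sketch, assuming a recent vLLM (the model name, port, and flags are illustrative placeholders):

```python
# Start vLLM's OpenAI-compatible server across both GPUs first, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 --port 8000
# (substitute the model you actually want to serve; flags vary by vLLM version)

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Hello from a dual-GPU setup"}],
)
print(response.choices[0].message.content)
```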

JamesAntisdel commented 1 month ago

I am also running into issues with both vLLM and llama-stack. vLLM seems to allocate too much KV-cache memory and hits CUDA out-of-memory (OOM) errors.
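For the OOM side, lowering vLLM's GPU memory utilization and capping the context length usually keeps the KV cache in check; the same knobs exist on the server CLI as `--gpu-memory-utilization` and `--max-model-len`. A sketch with illustrative values (not a verified fix for the 90B vision model):

```python
# Rough sketch of the vLLM knobs that usually tame KV-cache OOMs
# (model id and values are placeholders; tune for your GPUs and context length).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; substitute your model
    tensor_parallel_size=2,        # shard across both GPUs
    gpu_memory_utilization=0.85,   # leave headroom instead of the ~0.9 default
    max_model_len=8192,            # cap the context so the KV cache fits
)

out = llm.generate(["Describe this setup in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```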

llama-stack seems to rely on models/sku_list.py, which hard-codes the model parallelism and does not look easy to modify (code ref).

When I run Llama3.2-90B-Vision-Instruct on two A100 GPUs, it initializes with a model parallel size of 8:

```
initializing model parallel with size 8
```

which results in:

```
RuntimeError: CUDA error: invalid device ordinal
```
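The mismatch is easy to confirm: with a model parallel size of 8, ranks beyond the first two try to grab CUDA devices that do not exist on a two-GPU node, which is exactly what "invalid device ordinal" means. A small, hypothetical sanity check (not llama-stack code):

```python
# Quick sanity check (hypothetical snippet): the "invalid device ordinal" error
# appears when the model parallel size requested for the checkpoint exceeds the
# number of GPUs that are actually visible.
import torch

requested_model_parallel_size = 8          # what sku_list.py requests for the 90B model
visible_gpus = torch.cuda.device_count()   # 2 on a dual-A100 node

if requested_model_parallel_size > visible_gpus:
    print(
        f"Model parallel size {requested_model_parallel_size} > {visible_gpus} visible GPUs; "
        "any rank assigned a device index >= the GPU count fails with 'invalid device ordinal'."
    )
```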

Is there a way to set the model parallelism as a config item in llama-stack? Thanks!