meta-llama / llama-stack

Composable building blocks to build Llama Apps

VLLM / OpenAI Compatible endpoint support #152

Open matbee-eth opened 1 month ago

matbee-eth commented 1 month ago

The current local implementation does not support sharding or tensor parallelism, and it refuses to run on my dual RTX 4090 setup. How do I enable multi-GPU inference, or how do I point llama-stack at a proper serving system like vLLM?
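For context, this is the kind of setup I am after: vLLM's OpenAI-compatible server can shard a model across both GPUs with tensor parallelism, and any OpenAI client can then talk to it. A rough sketch, assuming a recent vLLM (the model name, port, and flags are illustrative placeholders):

```python
# Start vLLM's OpenAI-compatible server across both GPUs first, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 --port 8000
# (substitute the model you actually want to serve; flags vary by vLLM version)

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Hello from a dual-GPU setup"}],
)
print(response.choices[0].message.content)
```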

JamesAntisdel commented 1 month ago

I am also running into issues with both vLLM and llama-stack. vLLM seems to allocate too much KV-cache memory and hits CUDA out-of-memory (OOM) errors.
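For the OOM side, lowering vLLM's GPU memory utilization and capping the context length usually keeps the KV cache in check; the same knobs exist on the server CLI as `--gpu-memory-utilization` and `--max-model-len`. A sketch with illustrative values (not a verified fix for the 90B vision model):

```python
# Rough sketch of the vLLM knobs that usually tame KV-cache OOMs
# (model id and values are placeholders; tune for your GPUs and context length).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; substitute your model
    tensor_parallel_size=2,        # shard across both GPUs
    gpu_memory_utilization=0.85,   # leave headroom instead of the ~0.9 default
    max_model_len=8192,            # cap the context so the KV cache fits
)

out = llm.generate(["Describe this setup in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```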

llama-stack seems to rely on models/sku_list.py, which hard-codes the model parallelism and does not look easy to modify (code ref).

When I run Llama3.2-90B-Vision-Instruct on two A100 GPUs, it initializes with a model parallel size of 8:

```
initializing model parallel with size 8
```

which results in:

```
RuntimeError: CUDA error: invalid device ordinal
```
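The mismatch is easy to confirm: with a model parallel size of 8, ranks beyond the first two try to grab CUDA devices that do not exist on a two-GPU node, which is exactly what "invalid device ordinal" means. A small, hypothetical sanity check (not llama-stack code):

```python
# Quick sanity check (hypothetical snippet): the "invalid device ordinal" error
# appears when the model parallel size requested for the checkpoint exceeds the
# number of GPUs that are actually visible.
import torch

requested_model_parallel_size = 8          # what sku_list.py requests for the 90B model
visible_gpus = torch.cuda.device_count()   # 2 on a dual-A100 node

if requested_model_parallel_size > visible_gpus:
    print(
        f"Model parallel size {requested_model_parallel_size} > {visible_gpus} visible GPUs; "
        "any rank assigned a device index >= the GPU count fails with 'invalid device ordinal'."
    )
```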

Is there a way to set the model parallelism as a config item in llama-stack? Thanks!