matbee-eth opened this issue 1 month ago
I am also facing issues with both vLLM and Llama-stack. vLLM seems to allocate too much memory for the KV cache and hits CUDA OOM errors.
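On the vLLM side, the knobs that are supposed to bound the KV-cache allocation look roughly like this (a minimal sketch; the model path and parameter values below are illustrative, not a verified working config):

```python
from vllm import LLM

# Illustrative values only: cap how much of each GPU vLLM reserves (default is 90%
# of VRAM) and shrink the pre-allocated KV cache via a smaller context window.
llm = LLM(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",  # or a local checkpoint path
    tensor_parallel_size=2,       # shard across the two A100s
    gpu_memory_utilization=0.85,  # default is 0.9; lower further if OOMs persist
    max_model_len=8192,           # smaller context window -> smaller KV cache
    max_num_seqs=8,               # fewer concurrent sequences -> less KV pressure
)
```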
Llama-stack seems to rely on models/sku_list.py, which hard-codes the model parallelism and does not seem easy to modify (code ref).
When I run Llama3.2-90B-Vision-Instruct on two A100 GPUs, it tries to initialize with a model parallel size of 8:
initializing model parallel with size 8
which results in:
RuntimeError: CUDA error: invalid device ordinal
Is there a way to set the model parallel size as a config item in llama-stack? Thanks!
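As far as I can tell, the reference loader ties the model parallel size to the number of checkpoint shards, so the 90B checkpoint (which appears to ship as 8 .pth shards) always asks for 8 ranks. A quick way to confirm what a downloaded checkpoint expects, assuming the usual consolidated.*.pth layout (the path below is a guess at the default `llama download` location):

```python
from pathlib import Path

# Hypothetical path: adjust to wherever `llama download` placed the checkpoint.
ckpt_dir = Path.home() / ".llama" / "checkpoints" / "Llama3.2-90B-Vision-Instruct"

# The reference loader generally expects one model-parallel rank per shard,
# which is where the "size 8" in the log above comes from.
shards = sorted(ckpt_dir.glob("consolidated.*.pth"))
print(f"{len(shards)} shard(s) found -> expected model parallel size {len(shards)}")
```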
The current implementation of the local provider does no sharding/tensor parallelism, etc., and refuses to work on my dual-4090 setup. How do I enable multi-GPU inference, or how do I plug in a proper system like vLLM to run inference?
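As a point of comparison, multi-GPU inference with plain vLLM looks roughly like this (a sketch only: the 90B model will not fit in 2x24 GB even when sharded, so this assumes a smaller checkpoint, and the model name and values are illustrative). vLLM's OpenAI-compatible server exposes the same tensor-parallel setting as a flag, which may be the easier thing to put behind a stack.

```python
from vllm import LLM, SamplingParams

# Sketch: shard weights and KV cache across both 4090s with tensor parallelism.
# Swap in whatever checkpoint actually fits your VRAM budget.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    tensor_parallel_size=2,       # one shard per 4090
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Briefly explain tensor parallelism."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```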