microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Limit VRAM usage in serving the model #453

Open · risedangel opened this issue 3 months ago

risedangel commented 3 months ago

Is it possible to limit "max_memory" while serving the model?
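For context, "max_memory" here presumably refers to the Hugging Face Transformers loading option of the same name, which caps per-device memory when loading a model directly. A minimal sketch of that option outside of MII (the model name and size limits are illustrative only):

```python
from transformers import AutoModelForCausalLM

# Hugging Face/Accelerate loading option the question presumably refers to:
# cap the weights placed on GPU 0 at ~10 GiB and spill the rest to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # illustrative model name
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},
)
```

Whether MII exposes an equivalent knob on its serving path is what this issue asks.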

risedangel commented 3 months ago

Both for standard and OpenAI-compatible serving.

risedangel commented 3 months ago

The problem is, I have tried serving the model on two different cards: an RTX 3090 and an RTX 6000 Ada Generation. Model serving ate up all the VRAM in both scenarios. I want to run an embedding model on the same GPU, but it leaves no space.
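One possible workaround, sketched under assumptions rather than as a documented MII feature: PyTorch's per-process allocator cap can leave headroom for a second model on the same GPU. The cap only affects the current process, so it fits the non-persistent `mii.pipeline` path; a persistent `mii.serve` deployment runs in separate worker processes, where the cap would have to be applied instead. It is also untested whether DeepSpeed-FastGen's KV-cache pre-allocation shrinks to fit the cap or simply raises an out-of-memory error.

```python
import torch
import mii

# Workaround sketch (assumption, not a documented MII option): cap this
# process's CUDA caching allocator to ~70% of GPU 0, leaving ~30% of VRAM
# free for a separate embedding model on the same card.
torch.cuda.set_per_process_memory_fraction(0.7, device=0)

# Non-persistent, in-process pipeline; the model name is illustrative only.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["Test prompt"], max_new_tokens=64)
print(response)
```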