risedangel opened this issue 7 months ago
Both for standard and OpenAI-compatible serving.
The problem is, I have tried to serve the model on two different cards, a 3090 and an RTX 6000 Ada Generation, and model serving ate up all the VRAM in both cases. I want to run an embedding model on the same GPU, but serving leaves no room for it.
Is it possible to limit "max_memory" while serving the model?
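
For illustration, here is a minimal sketch of the kind of cap I am after, assuming this is vLLM (my guess from the mention of both the standard and OpenAI-compatible servers); the model name is just a placeholder:

```python
from vllm import LLM, SamplingParams

# Cap vLLM's VRAM pre-allocation at 50% of the GPU (the default is ~0.9),
# so the rest of the card stays free for a separate embedding model.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model, not from this issue
    gpu_memory_utilization=0.5,         # fraction of total GPU memory to claim
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

If that assumption holds, the OpenAI-compatible server should accept the same knob as a CLI flag, e.g. `--gpu-memory-utilization 0.5`.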