I've got a bit of an issue here. I'm running two GPUs - one Nvidia card with 8GB of VRAM and an AMD card with 4GB. When loading a model, I'm limited to 8GB of VRAM in total, which isn't too bad. However, if there's any load on the AMD card, the model won't load at all.
I've also noticed that even when the Nvidia card has more than enough memory to hold the whole model, about 3.5GB still ends up on the AMD card and only the rest stays on the Nvidia card. This seems like a waste of resources - why not just let the model use all the VRAM available on the Nvidia card?
I'm wondering if this is a bug or if there's some specific reason for how OpenWebUI handles GPU memory. If it's intentional, maybe we could revisit the memory allocation strategy? I'd love to get your thoughts on this.
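For reference, this is the kind of single-GPU pinning I was hoping to be able to do. It's only a sketch of a workaround, assuming the backend is Ollama and that it respects the standard CUDA/ROCm visibility variables - I haven't confirmed this is the intended configuration path:

```python
# Hypothetical workaround sketch: restrict inference to the Nvidia card only.
# Assumes the inference backend (e.g. Ollama) honors CUDA_VISIBLE_DEVICES /
# HIP_VISIBLE_DEVICES; whether OpenWebUI exposes this is exactly my question.
import os
import subprocess

env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first Nvidia GPU to CUDA
env["HIP_VISIBLE_DEVICES"] = "-1"  # assumption: a bogus id hides the AMD GPU from ROCm

# Launch the backend with the restricted GPU set.
subprocess.run(["ollama", "serve"], env=env)
```

If the split across both cards is intentional, it would be great to know whether something like the above is supported, or whether the allocation strategy could take per-GPU headroom into account.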
Thanks in advance!