lmstudio-ai / lmstudio-bug-tracker

Bug tracking for the LM Studio desktop application

Multi Model Performance #67

Open delinx32 opened 3 months ago

delinx32 commented 3 months ago

I need to run two models at the same time: one for logical tasks and one for conversational tasks. Both are in the 7B/8B range (Neural Chat 3.3 for logic and Llama 3 for conversation), and they excel at different things. Llama 3 isn't very good at producing accurate output, but Neural Chat is; Neural Chat produces junk dialog, but Llama 3 excels at it.

Each of these models, when run on its own, gives me 60 t/s. When I load both models in multi-model mode, they both drop to 20-30 t/s, which isn't acceptable for my use case. I'm making sequential, not concurrent, requests, and my VRAM is only two-thirds used.
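
Here's a minimal sketch of how I'm timing the sequential requests, assuming LM Studio's OpenAI-compatible server on its default port and that each response includes a `usage` block; the model identifiers below are placeholders for whatever your model list shows:

```python
import time
import requests

BASE_URL = "http://localhost:1234/v1/chat/completions"
# Placeholder model IDs -- substitute the identifiers from your own model list.
MODELS = ["neural-chat-7b-v3-3", "meta-llama-3-8b-instruct"]

def measure_tps(model: str, prompt: str) -> float:
    """Send one non-streaming request and return completion tokens per second."""
    start = time.time()
    resp = requests.post(BASE_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    })
    resp.raise_for_status()
    elapsed = time.time() - start
    tokens = resp.json()["usage"]["completion_tokens"]
    return tokens / elapsed

# Strictly sequential: each request finishes before the next one starts.
for model in MODELS:
    tps = measure_tps(model, "Write a short paragraph about benchmarking.")
    print(f"{model}: {tps:.1f} t/s")
```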

Why does this performance drop happen, and is there anything I can do to mitigate it?