There is a noticeable delay in loading the model on Windows machines when the user sends the first message. This delay is even more pronounced on GPUs like the RTX 4070, where it can take up to 10 seconds before generating a response. In some cases, the “Generating Response” bar gets stuck at 80% for up to 10 seconds, giving the impression that the software is hanging.
Some early proposed solutions:
Option 1: Pre-load the model as soon as the user selects it, before input is allowed.
(This may not be a good idea, though: the user might accidentally click a model that needs more RAM than the device has and cause it to hang.)
Option 2: Trigger model loading when the user clicks into the input box, or on other proactive actions that imply intent to send a message.
Option 3: ... @imtuyethan
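Option 2 could be sketched as an idempotent background preloader that fires on input focus. This is only an illustrative sketch; `ModelPreloader`, `loadModel`, and the state names are assumptions, not Jan's actual API.

```typescript
// Hypothetical sketch of Option 2: start loading the model in the
// background the first time the user focuses the input box.
// `loadModel` is an injected stand-in for the real loading call.

type LoadState = "idle" | "loading" | "ready" | "failed";

class ModelPreloader {
  private state: LoadState = "idle";
  private pending: Promise<void> | null = null;

  constructor(private loadModel: (id: string) => Promise<void>) {}

  getState(): LoadState {
    return this.state;
  }

  // Called on focus/typing; idempotent, so repeated triggers are cheap.
  preload(modelId: string): Promise<void> {
    if (this.pending) return this.pending;
    this.state = "loading";
    this.pending = this.loadModel(modelId)
      .then(() => {
        this.state = "ready";
      })
      .catch(() => {
        this.state = "failed";
        this.pending = null; // allow a retry on the next trigger
      });
    return this.pending;
  }
}
```

Because `preload` returns the same in-flight promise on repeated calls, wiring it to both the focus and keypress events costs nothing extra.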
Key Scenarios Affected (To consider before proposing a solution):
Model setting adjustments: Adjusting context length or other settings causes the model to reload unnecessarily.
Thread switching: Switching threads without changing the model still triggers a reload.
Retry on load failure: Model load may fail, and reloading isn’t handled smoothly.
Users often select a model without paying much attention until they actually intend to use it.
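The first two scenarios above (settings changes and thread switching) boil down to the same check: only reload when the model or its settings actually changed. A minimal sketch, assuming hypothetical names (`ModelConfig`, `ensureLoaded`) rather than Jan's real internals:

```typescript
// Hypothetical sketch: skip the reload when nothing relevant changed.

interface ModelConfig {
  modelId: string;
  contextLength: number;
}

class ModelSession {
  private current: ModelConfig | null = null;
  public reloadCount = 0; // stand-in for the real (expensive) reload work

  // Returns true if a reload was actually performed.
  ensureLoaded(next: ModelConfig): boolean {
    const unchanged =
      this.current !== null &&
      this.current.modelId === next.modelId &&
      this.current.contextLength === next.contextLength;
    if (unchanged) return false; // e.g. thread switch with the same model
    this.current = { ...next };
    this.reloadCount += 1;
    return true;
  }
}
```

With this shape, switching threads calls `ensureLoaded` with the same config and is a no-op, while a genuine context-length change still triggers exactly one reload.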
UX Considerations:
Perceived wait time: Reduce it by loading the model in the background, triggered before the user sends a message.
Transparency: Improve the user’s awareness of model loading states to prevent confusion.
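For the transparency point, one option is to expose an explicit load-state machine to the UI so the user always sees what is happening instead of a bar stuck at 80%. A sketch with assumed state names (not Jan's actual states):

```typescript
// Hypothetical sketch: map explicit model states to user-facing labels
// so the UI never silently hangs during a long load.

type ModelState = "idle" | "loading" | "ready" | "failed";

function statusLabel(state: ModelState, modelName: string): string {
  switch (state) {
    case "idle":
      return `${modelName} not loaded`;
    case "loading":
      return `Loading ${modelName}…`;
    case "ready":
      return `${modelName} ready`;
    case "failed":
      return `Failed to load ${modelName}, click to retry`;
  }
}
```

The "failed" label doubles as the retry affordance, which also covers the retry-on-load-failure scenario listed above.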
Problem Statement