Closed: dezoito closed this issue 7 months ago
Upon further investigation, it looks like the Ollama server might be at fault for some reason.

Sending 4 SEQUENTIAL requests and looking at the logs generated by the Ollama server:

It looks like it started a new context for request 1 (which returned a full response), but not for request 2 (partial response with no `final_data`).

The pattern repeated above: a full response for request 3, but not for request 4.
For reference: before this project, I used the following code to issue multiple generation requests with no issues (note that these are blocking calls):
```rust
use std::time::Duration;

use anyhow::{anyhow, Result};
use reqwest::blocking::{Client, Response};

fn send_request(
    url: &str,
    body: RequestObject,
    test_timeout: u64, // in seconds
) -> Result<Response> {
    let timeout = Duration::from_secs(test_timeout);
    let client = Client::new();

    // Send the POST request with the provided body
    let response = client.post(url).json(&body).timeout(timeout).send()?;

    // Process the response
    if response.status().is_success() {
        Ok(response)
    } else {
        // Handle an unsuccessful response (e.g., print the status code)
        eprintln!("Request failed with status: {:?}", response.status());
        // Create a custom error using anyhow
        Err(anyhow!(
            "Request failed with status: {:?}",
            response.status()
        ))
    }
}
```
Again, this looks like an issue with Ollama itself (perhaps introduced in a newer version), but any ideas on how to "force" it to start a new context whenever it handles a non-chat call?
Fixed it (or at least worked around the issue) by adding the `keep_alive` param to the completion call, as:

```rust
let req = GenerationRequest::new(model, prompt)
    .options(options)
    .system(system_prompt)
    .keep_alive(KeepAlive::UnloadOnCompletion);
```
The old code I posted was working against a previous version of Ollama that didn't have the `keep_alive` option, so that was a clue ;) Another clue: I noticed this only happened on sequential calls to the same model (switching to a different model restarted the context).

This workaround unfortunately adds some overhead (the same model has to be unloaded and reloaded into memory on every call), but I'll take it ;)
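A complementary client-side guard (not part of ollama-rs, just a sketch) would be to treat a response whose `done` flag is set but whose final statistics are missing as incomplete, and retry the call a bounded number of times. The sketch below is stdlib-only; `FakeResponse` is a simplified, hypothetical stand-in for the real response type, and the real `call` closure would wrap the actual generation request.

```rust
// Hypothetical stand-in for the relevant fields of a completion response.
struct FakeResponse {
    done: bool,
    final_data: Option<String>,
}

// Retry a fallible generation call until `is_complete` accepts the result,
// giving up after `max_attempts` tries.
fn retry_until_complete<T, F, P>(mut call: F, is_complete: P, max_attempts: u32) -> Option<T>
where
    F: FnMut() -> T,
    P: Fn(&T) -> bool,
{
    for _ in 0..max_attempts {
        let resp = call();
        if is_complete(&resp) {
            return Some(resp);
        }
        // Incomplete response: fall through and try again.
    }
    None
}

fn main() {
    // Simulate the flaky pattern from the logs: the first call is partial
    // (done == true but final_data missing), the second is complete.
    let mut attempt = 0;
    let result = retry_until_complete(
        || {
            attempt += 1;
            FakeResponse {
                done: true,
                final_data: if attempt >= 2 { Some("stats".into()) } else { None },
            }
        },
        |r| r.done && r.final_data.is_some(),
        3,
    );
    assert!(result.is_some());
    assert_eq!(attempt, 2);
    println!("completed after {} attempts", attempt);
}
```

The retry costs an extra round-trip on a bad response, which may still be cheaper than unloading and reloading the model on every call.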
I'm getting some erratic behavior when using `GenerationRequest`. Some responses are marked as `done` but do not include the data expected in `final_data`. Here's the output from the call for one of those events:
Although `done` is true, we don't get the rest of the data. Same prompt submission, seconds before, but the response was complete:
To my understanding, these failures are completely random and occur regardless of the options passed to the call. `stream: false` is being passed in all requests too. Looking at the source code for `generate()`, I can't spot what should be fixed, but at a glance it looks like it's not implementing a consistency check before returning.