pepperoni21 / ollama-rs

A Rust library for interacting with the Ollama API.
MIT License

Response's `final_data` is `None` when the response is `done`. #27

Closed dezoito closed 7 months ago

dezoito commented 7 months ago

I'm getting some erratic behavior when using GenerationRequest.

Some responses are marked as done but do not include the data expected in final_data.
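
The calls look roughly like this (a minimal sketch; the model name and prompt are placeholders, and a tokio runtime is assumed):

use ollama_rs::{generation::completion::request::GenerationRequest, Ollama};

// Minimal sketch of how the requests are issued; the model and prompt are
// placeholders, and the real calls also set options and a system prompt.
#[tokio::main]
async fn main() {
    let ollama = Ollama::default(); // local server at http://localhost:11434
    let req = GenerationRequest::new(
        "gemma:2b-instruct".to_string(),
        "Explain what Git is in one sentence.".to_string(),
    );
    let res = ollama.generate(req).await.unwrap();
    // Sometimes `done` is true here but `final_data` is `None`.
    dbg!(&res);
}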

Here's the output from the call for one of those events:

[src/commands.rs:228] &res = GenerationResponse {
    model: "gemma:2b-instruct",
    created_at: "2024-02-24T20:45:52.674216365Z",
    response: "Git is a command-line version control system that allows users to track changes in code and revert or merge them into different locations on the local machine or repository.",
    done: true,
    final_data: None,
}

Although done is true, we don't get the rest of the data.

The same prompt was submitted seconds earlier, and that response was complete:

[src/commands.rs:228] &res = GenerationResponse {
    model: "gemma:2b-instruct",
    created_at: "2024-02-24T20:45:46.237804801Z",
    response: "Git is a command-line version control system that allows users to track changes in source code and revert to previous versions.",
    done: true,
    final_data: Some(
        GenerationFinalResponseData {
            context: GenerationContext(
                [
                    106,
                    ...
                ],
            ),
            total_duration: 6024419654,
            prompt_eval_count: 8,
            prompt_eval_duration: 1395994000,
            eval_count: 25,
            eval_duration: 4627294000,
        },
    ),
}

To my understanding, these failures are completely random and occur regardless of the options passed to the call. `stream: false` is being passed in all requests too.

Looking at the source code for `generate()`, I can't spot what should be fixed, but at a glance it doesn't seem to perform any consistency check before returning.
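
To illustrate the kind of check I mean, here is a rough caller-side sketch (the helper name and retry count are mine, not part of ollama-rs): re-issue the request whenever a response claims `done` but carries no `final_data`.

use ollama_rs::{
    generation::completion::{request::GenerationRequest, GenerationResponse},
    Ollama,
};

// Hypothetical caller-side workaround: retry a few times when the response
// is marked `done` but `final_data` is missing.
async fn generate_checked(
    ollama: &Ollama,
    model: &str,
    prompt: &str,
) -> Option<GenerationResponse> {
    for attempt in 0..3 {
        let req = GenerationRequest::new(model.to_string(), prompt.to_string());
        match ollama.generate(req).await {
            Ok(res) if res.final_data.is_some() => return Some(res),
            Ok(_) => eprintln!("attempt {attempt}: done but final_data is None, retrying"),
            Err(e) => eprintln!("attempt {attempt}: request failed: {e}"),
        }
    }
    None
}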

dezoito commented 7 months ago

Upon further investigation, it looks like the Ollama server itself might be at fault.

Sending 4 SEQUENTIAL requests and looking at the logs generated by the Ollama server:

[screenshot: Ollama server logs for requests 1 and 2]

Looks like it started a new context for request 1 (which returned a full response), but not for request 2 (partial response with no final_data).

[screenshot: Ollama server logs for requests 3 and 4]

The pattern repeats above, with a full response for request 3 but not for request 4.

For reference, before this project I used the following code to send multiple generation requests with no issues (note that these are blocking calls):

use std::time::Duration;

use anyhow::{anyhow, Result};
use reqwest::blocking::Client;

fn send_request(
    url: &str,
    body: RequestObject,
    test_timeout: u64,
) -> Result<reqwest::blocking::Response> {
    let timeout = Duration::from_secs(test_timeout); // in seconds
    let client = Client::new();

    // Send the POST request with the provided body
    let response = client.post(url).json(&body).timeout(timeout).send()?;

    // Process the response
    if response.status().is_success() {
        Ok(response)
    } else {
        // Handle unsuccessful response (e.g., print status code)
        eprintln!("Request failed with status: {:?}", response.status());

        // Create a custom error using anyhow
        Err(anyhow!(
            "Request failed with status: {:?}",
            response.status()
        ))
    }
}
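
It was called roughly like this (the `RequestObject` definition here is illustrative; my real struct has more fields, but the point is that the calls are blocking and sequential):

use serde::Serialize;

// Illustrative request body; the actual struct in my project has more options.
#[derive(Serialize)]
struct RequestObject {
    model: String,
    prompt: String,
    stream: bool,
}

fn run_sequential(prompts: &[&str]) -> Result<()> {
    for prompt in prompts {
        let body = RequestObject {
            model: "gemma:2b-instruct".to_string(),
            prompt: prompt.to_string(),
            stream: false,
        };
        // One blocking call per prompt, in order.
        let res = send_request("http://localhost:11434/api/generate", body, 120)?;
        println!("{}", res.text()?);
    }
    Ok(())
}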

Again, this looks like an issue with Ollama itself (perhaps introduced in a newer version), but does anyone have ideas on how to "force" it to start a new context whenever it handles a non-chat call?

dezoito commented 7 months ago

Fixed (or at least worked around) the issue by adding the `keep_alive` param to the completion call:

    let req = GenerationRequest::new(model, prompt)
        .options(options)
        .system(system_prompt)
        .keep_alive(KeepAlive::UnloadOnCompletion);

The old code I posted was working against a previous version of Ollama that didn't have the "keep_alive" option, so that was a clue ;) Another clue was noticing that this only happened when I made sequential calls to the same model (switching to a different model restarted the context).

This workaround will unfortunately add some overhead (the same model has to be unloaded and reloaded into memory for every call), but I'll take it ;)
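
For completeness, here's a fuller sketch of the working call (import paths reflect my reading of ollama-rs and may differ slightly between versions; the model, prompt, and options are placeholders):

use ollama_rs::{
    generation::{
        completion::request::GenerationRequest,
        options::GenerationOptions,
        parameters::KeepAlive,
    },
    Ollama,
};

// Sketch of the workaround: UnloadOnCompletion forces the model to be
// unloaded after each call, so the next request starts a fresh context
// and `final_data` comes back populated.
#[tokio::main]
async fn main() {
    let ollama = Ollama::default();
    let options = GenerationOptions::default(); // placeholder options
    let req = GenerationRequest::new(
        "gemma:2b-instruct".to_string(),
        "Explain what Git is in one sentence.".to_string(),
    )
    .options(options)
    .system("You are a concise assistant.".to_string())
    .keep_alive(KeepAlive::UnloadOnCompletion);

    let res = ollama.generate(req).await.unwrap();
    assert!(res.final_data.is_some());
}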