patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

Ollama improvements #686

Open easydatawarehousing opened 4 months ago

easydatawarehousing commented 4 months ago

Temperature and seed parameters should be part of 'options'

According to the Ollama docs, temperature and seed should be passed inside 'options':

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "options": {
    "seed": 101,
    "temperature": 0
  }
}'

In the current implementation these are passed at the top level of the request, alongside parameters like 'model'.
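
For illustration, a call like the one below (the constructor argument is an assumption; the chat call matches the chat method signature shown further down) currently produces the flat payload, while Ollama expects the nested form from the curl example:

llm = Langchain::LLM::Ollama.new(url: "http://localhost:11434")

llm.chat(
  messages: [{ role: "user", content: "Hello!" }],
  model: "llama3",
  seed: 101,
  temperature: 0
)

# Payload currently sent (flat):
#   { "model" => "llama3", "messages" => [...], "seed" => 101, "temperature" => 0 }
# Payload Ollama expects (nested, as in the curl example):
#   { "model" => "llama3", "messages" => [...], "options" => { "seed" => 101, "temperature" => 0 } }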

Changing the code of Langchain::LLM::Ollama like this works, but it is probably not the best place to implement the fix.

def chat(messages:, model: nil, **params, &block)
  parameters = chat_parameters.to_params(params.merge(messages:, model:, stream: block.present?))

  # Relocate seed and temperature into the nested "options" hash Ollama expects
  if parameters.key?(:seed) || parameters.key?(:temperature)
    parameters[:options] = {}
    parameters[:options][:seed] = parameters.delete(:seed) if parameters.key?(:seed)
    parameters[:options][:temperature] = parameters.delete(:temperature) if parameters.key?(:temperature)
  end

  # ...
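
As a sketch of an alternative (assuming the same parameters hash), the option-level keys could be kept in one list so that more Ollama options (e.g. top_p) are easy to add later:

OLLAMA_OPTION_KEYS = %i[seed temperature].freeze

def move_option_params!(parameters)
  # Collect the keys Ollama wants nested under "options"
  options = parameters.slice(*OLLAMA_OPTION_KEYS)
  return parameters if options.empty?

  OLLAMA_OPTION_KEYS.each { |key| parameters.delete(key) }
  parameters[:options] = options
  parameters
end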

Non-streaming response chunks should be joined before parsing?

I am using Ollama 0.1.45. When requesting a non-streaming response (i.e. not passing a block to the chat method) and the response is large (more than ~4000 characters), Ollama sends the data in multiple chunks.

In the current implementation each chunk is JSON.parse'd separately. For smaller responses that fit in a single chunk this is not a problem, but for larger responses all chunks need to be joined first and then parsed as one JSON document.

Changing the code of Langchain::LLM::Ollama like this works for me.

def chat(messages:, model: nil, **params, &block)
  parameters = chat_parameters.to_params(params.merge(messages:, model:, stream: block.present?))
  responses_stream = []

  if parameters[:stream]
    # Existing code
    client.post("api/chat", parameters) do |req|
      req.options.on_data = json_responses_chunk_handler do |parsed_chunk|
        responses_stream << parsed_chunk

        block&.call(OllamaResponse.new(parsed_chunk, model: parameters[:model]))
      end
    end

    generate_final_chat_completion_response(responses_stream, parameters)
    # /Existing code
  else
    # Non-streaming request: collect the raw chunks and parse them as one JSON document
    client.post("api/chat", parameters) do |req|
      req.options.on_data = proc do |chunk, size, _env|
        puts "RECEIVED #{size} CHARS, LAST CHAR IS: '#{chunk[-1]}'" # DEBUG
        responses_stream << chunk
      end
    end

    OllamaResponse.new(
      {
        "message" => {
          "role"    => "assistant",
          "content" => JSON.parse(responses_stream.join).dig("message", "content")
        }
      },
      model: parameters[:model]
    )
  end
end

The Ollama docs say nothing about this behavior; it might be a bug in Ollama, or a feature. It happens at least with the llama3-8b-q8 and phi3-14b-q5 models. Should langchainrb code around this by checking whether the received chunks form a complete JSON document?
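
One possible way to code around it, purely as a sketch (the handler below is made up, not library code): append each chunk to a buffer and only accept the buffer once it parses as complete JSON. This avoids guessing at chunk boundaries, at the cost of re-parsing the buffer on every chunk.

require "json"

buffer = +""
parsed = nil

chunk_handler = proc do |chunk, _size, _env|
  buffer << chunk

  begin
    # Succeeds only once the full response body has arrived
    parsed = JSON.parse(buffer)
  rescue JSON::ParserError
    # Incomplete JSON so far; wait for the next chunk
  end
end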

Inherit from Langchain::LLM::OpenAI ?

Since Ollama is compatible with OpenAI's API, wouldn't it be easier to let Langchain::LLM::Ollama inherit from Langchain::LLM::OpenAI, overriding default values where needed?
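
Roughly, the idea could look like the sketch below. This is only an illustration; the constructor arguments, the default keys, and the uri_base option are assumptions, not the library's actual API.

module Langchain
  module LLM
    class Ollama < OpenAI
      # Hypothetical defaults; key names are assumptions
      DEFAULTS = {
        chat_completion_model_name: "llama3",
        temperature: 0.0
      }.freeze

      def initialize(url: "http://localhost:11434/v1", default_options: {})
        # Ollama's OpenAI-compatible endpoint lives under /v1 and ignores
        # the API key, so a dummy key is passed here (assumption)
        super(
          api_key: "ollama",
          llm_options: { uri_base: url },
          default_options: DEFAULTS.merge(default_options)
        )
      end
    end
  end
end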

Fodoj commented 2 months ago

I confirm the bug with chunks: once too large an input is sent, I get `parse': 451: unexpected token at '' (JSON::ParserError), simply because the chunk ends in a way that makes the line invalid JSON.
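
A minimal way to reproduce the failure mode outside the library, simulating ~4000-character transport chunks:

require "json"

full = { "message" => { "role" => "assistant", "content" => "x" * 5_000 } }.to_json
chunks = full.scan(/.{1,4000}/m) # simulate how the body arrives in pieces

JSON.parse(chunks.join)  # works: the joined body is valid JSON
JSON.parse(chunks.first) # raises JSON::ParserError: unexpected token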