parakeet-nest / parakeet

🦜🪺 Parakeet is a GoLang library, made to simplify the development of small generative AI applications with Ollama 🦙.
https://parakeet-nest.github.io/parakeet/

Support openai compatibility mode #17

Closed codefromthecrypt closed 2 months ago

codefromthecrypt commented 2 months ago

Right now, the completions code relies on Ollama's /api/generate endpoint. This means we have to use Ollama, which is good, but it also limits applicability.

I would like to be able to use either the /v1/completions or /v1/chat/completions endpoints as presented by OpenAI and by the things that emulate it, including Ollama itself, but also llama.cpp's llama-server (which Ollama uses internally).

For example, if you run this

$ llama-server --log-disable \
  --hf-repo Qwen/Qwen2-0.5B-Instruct-GGUF \
  --hf-file qwen2-0_5b-instruct-q5_k_m.gguf

The following will work:

Legacy (explicit prompt)

$ curl -s -X POST localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>user\nWhich ocean contains the falkland islands?\n<|im_end|>\n<|im_start|>assistant\n"
}' | jq -r '.content'

Current (messages)

$ curl -s -X POST localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{ "role": "user","content": "Which ocean contains the falkland islands?"}]}'|jq .

I'm not sure of the code impact, but this could, at least for me, better pitch Ollama: even though it uses llama.cpp under the hood, it adds features such as model selection.

See https://github.com/openai/openai-openapi

codefromthecrypt commented 2 months ago

p.s. for many tools, the base URL is given inclusive of the '/v1' suffix, so that could be a heuristic for knowing whether the caller wants the OpenAI-style or the Ollama endpoint (even on Ollama, which exposes both).
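
To make the heuristic concrete, here is a minimal sketch in Go using only the standard library; the helper name and the native endpoint choice are illustrative, not part of Parakeet:

// Minimal sketch of the "/v1 suffix" heuristic described above.
// The helper name is illustrative and not part of Parakeet's API.
package main

import (
	"fmt"
	"strings"
)

// chatEndpoint picks the OpenAI-compatible path when the base URL already
// ends in /v1, and Ollama's native chat endpoint otherwise.
func chatEndpoint(baseURL string) string {
	base := strings.TrimRight(baseURL, "/")
	if strings.HasSuffix(base, "/v1") {
		return base + "/chat/completions"
	}
	return base + "/api/chat"
}

func main() {
	fmt.Println(chatEndpoint("http://localhost:11434"))    // native Ollama API
	fmt.Println(chatEndpoint("http://localhost:11434/v1")) // Ollama's OpenAI-compatible API
	fmt.Println(chatEndpoint("http://localhost:8080/v1"))  // llama.cpp llama-server
}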

codefromthecrypt commented 2 months ago

incidentally, there is portability in the chat completions endpoint, but not the legacy one (port 8080 below is llama.cpp's llama-server, port 11434 is Ollama).

/v1/completions isn't consistent in response format

$ curl -s -X POST localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2:0.5b",
  "prompt": "<|im_start|>user\nWhich ocean contains the falkland islands?\n<|im_end|>\n<|im_start|>assistant\n"
}' |jq -r .content
The Falkland Islands are located in the South Atlantic Ocean.
$ curl -s -X POST localhost:11434/v1/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2:0.5b",
  "prompt": "<|im_start|>user\nWhich ocean contains the falkland islands?\n<|im_end|>\n<|im_start|>assistant\n"
}' |jq -r .content
null
$ curl -s -X POST localhost:11434/v1/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2:0.5b",
  "prompt": "<|im_start|>user\nWhich ocean contains the falkland islands?\n<|im_end|>\n<|im_start|>assistant\n"
}' |jq -r .choices[0].text
The Falkland Islands are in the Southern Ocean.
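
For what it's worth, a small Go sketch of what that inconsistency means for a client: the same legacy request has to be decoded differently per server. The struct and sample payloads below are illustrative, trimmed from the responses above:

// Sketch of decoding the two legacy /v1/completions response shapes seen above.
package main

import (
	"encoding/json"
	"fmt"
)

// One struct that tolerates both shapes: llama-server's top-level "content"
// and Ollama's standard "choices[0].text".
type legacyResponse struct {
	Content string `json:"content"`
	Choices []struct {
		Text string `json:"text"`
	} `json:"choices"`
}

func answerText(body []byte) (string, error) {
	var r legacyResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	if len(r.Choices) > 0 {
		return r.Choices[0].Text, nil
	}
	return r.Content, nil
}

func main() {
	llamaServer := []byte(`{"content": "The Falkland Islands are located in the South Atlantic Ocean."}`)
	ollama := []byte(`{"choices": [{"text": "The Falkland Islands are in the Southern Ocean."}]}`)

	for _, body := range [][]byte{llamaServer, ollama} {
		text, _ := answerText(body)
		fmt.Println(text)
	}
}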

Trivia: In Ollama you don't need to specify the ChatML template for it to work!

$ curl -s -X POST localhost:11434/v1/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2:0.5b",
  "prompt": "Which ocean contains the falkland islands?"
}' |jq -r .choices[0].text
The Falklands Islands are in the Atlantic Ocean.

/v1/chat/completions is consistent

$ curl -s -X POST localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen2:0.5b", "messages": [{ "role": "user","content": "Which ocean contains the falkland islands?"}]}'|jq -r .choices[0].message.content
The South Atlantic Ocean.
$ curl -s -X POST localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen2:0.5b", "messages": [{ "role": "user","content": "Which ocean contains the falkland islands?"}]}'|jq -r .choices[0].message.content
The Falklands Islands can't be part of the Atlantic Ocean.

codefromthecrypt commented 2 months ago

p.s. Ollama is correct on /v1/completions, but I asked why llama-server has a different format. In any case, /v1/chat/completions is what's typically used.
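
As a concrete illustration of that portability, here is a minimal /v1/chat/completions call in Go with only the standard library; the type names are mine, only the JSON field names matter:

// Minimal sketch of the portable /v1/chat/completions call shown above.
// Type names are illustrative; the JSON field names follow the OpenAI shape.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

func main() {
	body, _ := json.Marshal(chatRequest{
		Model:    "qwen2:0.5b",
		Messages: []message{{Role: "user", Content: "Which ocean contains the falkland islands?"}},
	})

	// Point this at Ollama (localhost:11434/v1) or llama-server (localhost:8080/v1);
	// the request and response shapes are the same.
	resp, err := http.Post("http://localhost:11434/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var answer chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&answer); err != nil {
		panic(err)
	}
	fmt.Println(answer.Choices[0].Message.Content)
}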

k33g commented 2 months ago

@codefromthecrypt Initially, Parakeet is only for Ollama (I don't want to recreate a LangChain):

🦜🪺 Parakeet is a GoLang library, made to simplify the development of small generative AI applications with Ollama 🦙.

But I will have a look soon - I need to estimate the impact: if the other completion payloads are compliant with the Ollama completion, why not?

https://github.com/orgs/parakeet-nest/projects/1/views/1

codefromthecrypt commented 2 months ago

thanks for considering. in some ways this is just a demo thought, in other ways a discussion of where scope starts and ends. it is a tricky balance also because Ollama will fix the prompt for you even if you use the old completions endpoint ;)

k33g commented 2 months ago

@codefromthecrypt I did some tests:

I will keep it in mind, but not for tomorrow.

k33g commented 2 months ago

I will add new methods and new types:

func ChatWithOpenAI(url string, query llm.OpenAIQuery) (llm.OpenAIAnswer, error) {}
func ChatWithOpenAIStream(url string, query llm.OpenAIQuery, onChunk func(llm.OpenAIAnswer) error) error {}
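
For illustration, a rough sketch of how the non-streaming variant might be called; the fields on llm.OpenAIQuery and llm.OpenAIAnswer below are guesses that mirror the OpenAI chat payload, not final Parakeet definitions:

// Hypothetical usage of the proposed ChatWithOpenAI (fragment);
// the query and answer fields are assumptions, not the final API.
query := llm.OpenAIQuery{
	Model: "qwen2:0.5b",
	Messages: []llm.Message{
		{Role: "user", Content: "Which ocean contains the falkland islands?"},
	},
}

answer, err := ChatWithOpenAI("http://localhost:11434/v1", query)
if err != nil {
	log.Fatal(err)
}
fmt.Println(answer.Choices[0].Message.Content)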

Ollama provides experimental compatibility with parts of the OpenAI API https://github.com/ollama/ollama/blob/main/docs/openai.md

As it's experimental, I prefer to keep the completion methods of Ollama and OpenAI "separated."