pythops opened this issue 4 months ago (status: Open)
Great suggestion! If others are interested, please add an emoji reaction above and we'll prioritize this :)
Just for the update: llama.cpp added support for Gemma models: https://github.com/ggerganov/llama.cpp/pull/5631
Also, with 💎Gemma in 🦙Llama.CPP you get CUDA, Neon, and AMD GPU support! And, in theory, it can run in the browser if you can compile to WASM.
Adding API-like support would be great; these models can be used on CPU for smaller tasks. +1 for this.
I have a question: why use HTTP rather than WebSocket?
As I understand it, the answer is generated token by token, and HTTP seems to have no way to send multiple responses for one call. That would mean the server has to gather the whole answer before sending it back.
WebSocket is more suitable for instant messenger style UI but may not be ideal for other UI types. And I think it is better to integrate gemma.cpp as a module into the web backend framework than to implement the HTTP/WebSocket API directly.
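The module-integration idea above can be sketched as one generation core plus thin transport adapters, so the same token stream serves either an HTTP/SSE response or WebSocket messages. All names here (`fake_generate`, `sse_events`, `ws_messages`) are hypothetical illustrations, not part of gemma.cpp's actual API:

```python
# One generation core, two thin transport adapters. The backend framework
# picks the adapter; the model code stays transport-agnostic.

def fake_generate(prompt):
    """Stand-in for the model: yields tokens one at a time."""
    for tok in ["The", " answer", " is", " 42"]:
        yield tok

def sse_events(tokens):
    """Adapt a token stream to Server-Sent Events lines for an HTTP response."""
    return [f"data: {tok}\n\n" for tok in tokens]

def ws_messages(tokens):
    """Adapt the same token stream to individual WebSocket text messages."""
    return list(tokens)

tokens = list(fake_generate("hello"))
```

The point is the separation: whether the frontend wants SSE, WebSocket, or a plain buffered response becomes a framework-level choice layered over the same generator.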
Here is my WebSocket online demo solution, and you can try it here or via this Kaggle notebook. In this solution gemma.cpp is a module of OpenResty which makes it easy to implement WebSocket or HTTP API.
For more info, see https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
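For reference, the llama.cpp server linked above streams completions as SSE-style `data: {json}` lines when the request sets `"stream": true`. A sketch of a client-side collector for that format (the `content`/`stop` field names are assumed from that README, and the sample lines are canned, not a real server response):

```python
import json

def collect_answer(lines):
    """Concatenate the 'content' field of each streamed data line until stop."""
    answer = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = json.loads(line[len("data: "):])
        answer.append(payload.get("content", ""))
        if payload.get("stop"):
            break
    return "".join(answer)

# Canned sample of what a streamed response might look like:
sample = [
    'data: {"content": "Hel", "stop": false}',
    'data: {"content": "lo", "stop": true}',
]
result = collect_answer(sample)
```

Something similar would apply to a gemma.cpp server if it adopted the same streaming API.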