pythops opened this issue 4 months ago (status: Open)
Great suggestion! If others are interested, please add an emoji reaction above and we'll prioritize this :)
Just for the update: llama.cpp added support for Gemma models: https://github.com/ggerganov/llama.cpp/pull/5631
Also, with 💎Gemma in 🦙Llama.CPP you get CUDA, Neon, and AMD GPU support! And, in theory, it can run in the browser if you can compile to WASM.
Adding API-like support would be great; these models can be used on CPU for smaller tasks. +1 for this.
I have a question: why use HTTP rather than WebSocket?
As I understand it, the answer is generated token by token, and HTTP seems to have no way to send multiple responses for one call. That would mean the server has to gather the whole answer before sending it back.
WebSocket is more suitable for instant messenger style UI but may not be ideal for other UI types. And I think it is better to integrate gemma.cpp as a module into the web backend framework than to implement the HTTP/WebSocket API directly.
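The module-integration idea above can be sketched as one generation core plus thin transport adapters, so the same token stream serves either an HTTP/SSE response or WebSocket messages. All names here (`fake_generate`, `sse_events`, `ws_messages`) are hypothetical illustrations, not part of gemma.cpp's actual API:

```python
# One generation core, two thin transport adapters. The backend framework
# picks the adapter; the model code stays transport-agnostic.

def fake_generate(prompt):
    """Stand-in for the model: yields tokens one at a time."""
    for tok in ["The", " answer", " is", " 42"]:
        yield tok

def sse_events(tokens):
    """Adapt a token stream to Server-Sent Events lines for an HTTP response."""
    return [f"data: {tok}\n\n" for tok in tokens]

def ws_messages(tokens):
    """Adapt the same token stream to individual WebSocket text messages."""
    return list(tokens)

tokens = list(fake_generate("hello"))
```

The point is the separation: whether the frontend wants SSE, WebSocket, or a plain buffered response becomes a framework-level choice layered over the same generator.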
Here is my WebSocket online demo solution, and you can try it here or via this Kaggle notebook. In this solution gemma.cpp is a module of OpenResty which makes it easy to implement WebSocket or HTTP API.
For more info, see https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
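For reference, the llama.cpp server linked above streams completions as SSE-style `data: {json}` lines when the request sets `"stream": true`. A sketch of a client-side collector for that format (the `content`/`stop` field names are assumed from that README, and the sample lines are canned, not a real server response):

```python
import json

def collect_answer(lines):
    """Concatenate the 'content' field of each streamed data line until stop."""
    answer = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = json.loads(line[len("data: "):])
        answer.append(payload.get("content", ""))
        if payload.get("stop"):
            break
    return "".join(answer)

# Canned sample of what a streamed response might look like:
sample = [
    'data: {"content": "Hel", "stop": false}',
    'data: {"content": "lo", "stop": true}',
]
result = collect_answer(sample)
```

Something similar would apply to a gemma.cpp server if it adopted the same streaming API.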