feloy opened 2 weeks ago
There's #15 with a `--server` flag by @srogmann; it still needs some work, but the idea is similar.
What kind of API would you expose, and why? Please note that I'm a noob on this side of things.
Yes, this PR seems to be what I was thinking about. The API I would expect is a llama.cpp-compatible one, as in the PR.
The use case would be to have the choice between different inference servers in https://github.com/containers/podman-desktop-extension-ai-lab
Having a compliant OpenAI (chat) REST API would be amazing. This would allow many tools (including LangChain4J) to integrate with Llama3.java without any extra code. See also https://platform.openai.com/docs/api-reference/chat
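For reference, the minimal request shape such a compatible `/v1/chat/completions` endpoint would need to accept (field names per the OpenAI chat API; the model name below is just illustrative):

```json
{
  "model": "llama-3.2-3b-instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "Tell me a joke."}
  ],
  "stream": true
}
```

Supporting just `model`, `messages`, and `stream` already covers most client integrations.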
I personally think it makes more sense for this project to be usable as a library (which requires making the API clear) which then can be embedded inside other libraries / frameworks to provide a REST API (compatibility with OpenAI makes 100% sense to me).
Agreed, similar to what I've done as an experiment @ https://github.com/stephanj/Llama3JavaChatCompletionService But then better 😂
I have another question.
Say someone has opted to use mukel/Llama-3.2-3B-Instruct-GGUF. In that case, which quantization should be the default, or is it expected that users need to provide that as well?
GGUF files come pre-quantized. ollama has a notion of "default" quantization that varies across models, e.g. some smaller models use Q8_0, while larger models default to Q4_K_M... It is more complicated because a model may state that it is quantized with Q8_0, yet contain tensors quantized with other methods, e.g. most Q4_0 quantizations on HuggingFace include some tensors quantized with Q6_K (see the initial implementation in #12). IMHO, the "default" should be the smallest "acceptable" quantization for that model ... but then we have to define what "acceptable" means.
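To make the size trade-off concrete, here is a back-of-the-envelope estimate for a 3B-parameter model, using the GGUF block layouts (Q4_0: 18 bytes per 32 weights = 4.5 bits/weight; Q8_0: 34 bytes per 32 weights = 8.5 bits/weight). The helper below is just illustrative arithmetic, not project code:

```java
public class QuantSize {
    // Q4_0 blocks store 32 weights as 4-bit nibbles plus an fp16 scale:
    //   2 + 16 = 18 bytes per 32 weights -> 4.5 bits per weight.
    // Q8_0 blocks store 32 int8 weights plus an fp16 scale:
    //   2 + 32 = 34 bytes per 32 weights -> 8.5 bits per weight.
    static double approxGiB(long params, double bitsPerWeight) {
        return params * bitsPerWeight / 8.0 / (1L << 30);
    }

    public static void main(String[] args) {
        long params = 3_000_000_000L; // roughly a "3B" model
        System.out.printf("Q4_0: ~%.2f GiB%n", approxGiB(params, 4.5)); // ~1.57 GiB
        System.out.printf("Q8_0: ~%.2f GiB%n", approxGiB(params, 8.5)); // ~2.97 GiB
    }
}
```

This ignores embeddings and output layers (which are often kept at higher precision), so real files are somewhat larger.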
I see, thanks for the input!
So I guess it makes sense to have the user choose which quantization they want?
Another question if I may:
Say we obtain a list of request/response messages from chat history and want Llama3.java to be aware of those. What is the proper way to interact with Llama3.java in this case? Should we use `encodeDialogPrompt` for this?
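For context, re-encoding a multi-turn history means flattening it into the Llama 3 instruct chat template. The sketch below is string-level only; the real `encodeDialogPrompt` presumably works on tokens, and `Message`/`render` are made-up names:

```java
import java.util.List;

public class DialogPrompt {
    record Message(String role, String content) {}

    // Flattens a chat history into the Llama 3 instruct template.
    static String render(List<Message> history) {
        StringBuilder sb = new StringBuilder("<|begin_of_text|>");
        for (Message m : history) {
            sb.append("<|start_header_id|>").append(m.role).append("<|end_header_id|>\n\n")
              .append(m.content).append("<|eot_id|>");
        }
        // Leave the prompt open for the assistant's next reply.
        sb.append("<|start_header_id|>assistant<|end_header_id|>\n\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(List.of(
            new Message("system", "You are helpful."),
            new Message("user", "Hi!"))));
    }
}
```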
Yes, but ingesting all the tokens again and again is wasteful. Note that this is not a problem for cloud providers, because you pay per token and token ingestion is very fast (more so on GPUs); if they keep the KV caches around for a while, the savings are theirs. I'd like to have transparent caching for prompts and conversations: when you create the model in e.g. LangChain4j, you could specify a caching strategy for prompts/conversations. It's not clear to me yet what a good way (API) would be to specify what must be cached and how (persist to disk, keep KV caches in memory, ...).
Also, #16 introduces prompt caching to disk.
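In case it helps the discussion, one possible shape for such a transparent cache is to key the model's KV state by the longest already-ingested token prefix, so a follow-up turn only needs to ingest the new suffix. All names below are hypothetical, not the Llama3.java or #16 API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PromptCache {
    // Stand-in for an opaque KV-cache snapshot; only tracks how many
    // tokens it already covers, for illustration.
    static final class State {
        final int tokensIngested;
        State(int tokensIngested) { this.tokensIngested = tokensIngested; }
    }

    private final Map<List<Integer>, State> cache = new HashMap<>();

    // Returns the cached state for the longest prefix of `tokens`
    // that was previously stored, or null if nothing matches.
    State lookupLongestPrefix(List<Integer> tokens) {
        for (int len = tokens.size(); len > 0; len--) {
            State s = cache.get(tokens.subList(0, len));
            if (s != null) return s;
        }
        return null;
    }

    void store(List<Integer> tokens, State state) {
        cache.put(List.copyOf(tokens), state);
    }
}
```

An eviction policy (LRU, size-bounded) and an optional spill-to-disk path, as in #16, would sit behind the same interface; the linear prefix scan here is only for clarity.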
Oh, that's very interesting to know.
Thanks for this amazing work!
Would you be interested in having a `--service` mode, to be able to run llama3.java as a service and have a third-party chat client communicate with it?