mukel / llama3.java

Practical Llama 3 inference in Java

Running as a service #17

Open · feloy opened this issue 2 weeks ago

feloy commented 2 weeks ago

Thanks for this amazing work!

Would you be interested in adding a --service mode, so that llama3.java could run as a service and a third-party chat client could communicate with it?

mukel commented 2 weeks ago

There's #15 with a --server flag by @srogmann; it still needs some work, but the idea is similar. What kind of API would you expose, and why? Please note that I'm a noob on this side of things.

feloy commented 2 weeks ago

Yes, this PR seems to be what I was thinking about. The API I would expect is a llama.cpp-compatible one, as in the PR.

The use case would be to have the choice between different inference servers in https://github.com/containers/podman-desktop-extension-ai-lab

stephanj commented 2 weeks ago

Having an OpenAI-compliant (chat) REST API would be amazing. This would allow many tools (including LangChain4J) to integrate with Llama3.java without any extra code. See also https://platform.openai.com/docs/api-reference/chat
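
For reference, a minimal sketch of what a client call against such an endpoint could look like. The local URL, port and path are assumptions; llama3.java does not expose this server today, this only illustrates the OpenAI chat-completions request shape that tools would send:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OpenAiChatSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical local endpoint; llama3.java does not ship this server (yet).
        String endpoint = "http://localhost:8080/v1/chat/completions";

        // Minimal OpenAI-style chat completion request body.
        String body = """
                {
                  "model": "llama3",
                  "messages": [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Hello!"}
                  ]
                }
                """;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Any OpenAI-compatible client (LangChain4j, the OpenAI SDKs, curl) sends
        // essentially this same payload, which is what makes the format attractive.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```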

geoand commented 5 days ago

I personally think it makes more sense for this project to be usable as a library (which requires making the API clear), which can then be embedded inside other libraries / frameworks to provide a REST API (compatibility with OpenAI makes 100% sense to me).
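
Purely to make the idea concrete, here is one possible shape for such a library-facing API. All names below are hypothetical and are not the current llama3.java API; a framework could wrap something like this behind an OpenAI-compatible REST endpoint:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical library-facing API, sketched only to show the shape a framework could embed.
public interface LlamaService extends AutoCloseable {

    /** A single chat turn; role is "system", "user" or "assistant". */
    record Message(String role, String content) {}

    /** Load a GGUF model from disk with a given context length. */
    static LlamaService load(Path ggufFile, int contextLength) {
        throw new UnsupportedOperationException("illustrative sketch only");
    }

    /** Run a blocking chat completion over the whole dialog. */
    String chat(List<Message> dialog);

    /** Stream tokens to the caller as they are generated. */
    void chat(List<Message> dialog, Consumer<String> onToken);

    @Override
    void close();
}
```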

stephanj commented 5 days ago

Agreed, similar to what I've done as an experiment at https://github.com/stephanj/Llama3JavaChatCompletionService, but then better 😂

geoand commented 4 days ago

I have another question. Say someone has opted to use mukel/Llama-3.2-3B-Instruct-GGUF. In that case, which quantization should be the default, or are users expected to provide that as well?

mukel commented 4 days ago

GGUF files come pre-quantized. ollama has a notion of a "default" quantization that varies across models, e.g. some smaller models use Q8_0, while larger models default to Q4_K_M...

It is more complicated than that, because a model may state that it is quantized with Q8_0 yet contain tensors quantized with other methods, e.g. most Q4_0 quantizations on HuggingFace include some tensors quantized with Q6_K (see the initial implementation in #12).

IMHO, the "default" should be the smallest possible "acceptable" quantization for that model ... but then we have to define what "acceptable" means.
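
As a toy illustration of that "smallest acceptable" idea, a heuristic keyed on parameter count could look like the sketch below. The enum, thresholds and choices are invented for illustration and are not llama3.java (or ollama) behavior:

```java
// Toy heuristic only: thresholds and choices below are made up for illustration.
public class DefaultQuantizationSketch {

    enum Quantization { Q8_0, Q4_K_M }

    static Quantization defaultFor(long parameterCount) {
        // "Smallest acceptable" is a policy decision; these cut-offs are invented.
        return parameterCount <= 4_000_000_000L
                ? Quantization.Q8_0    // small models: quality loss of Q8_0 is negligible
                : Quantization.Q4_K_M; // larger models: memory savings matter more
    }

    public static void main(String[] args) {
        System.out.println("3B -> " + defaultFor(3_000_000_000L)); // Q8_0
        System.out.println("8B -> " + defaultFor(8_000_000_000L)); // Q4_K_M
    }
}
```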

geoand commented 4 days ago

I see, thanks for the input!

So I guess it makes sense to have the user choose which quantization they want?

geoand commented 2 days ago

Another question, if I may:

Say we obtain a list of request-response messages from the chat history and want Llama3.java to be aware of them. What is the proper way to interact with Llama3.java in this case? Should we use encodeDialogPrompt?

mukel commented 2 days ago

Yes, but ingesting all the tokens again and again is wasteful. Note that this is not a problem for cloud providers, because you pay per token and token ingestion is very fast (even more so on GPUs); if they keep the KV caches around for a bit, the savings are theirs. I'd like to have transparent caching for prompts and conversations: when you create the model in e.g. LangChain4j, you could specify a caching strategy for prompts/conversations. It's not clear to me yet what a good way (API) would be to specify what must be cached and how (persist to disk, keep KV caches in memory, ...).
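
To make the "ingesting all the tokens again" point concrete, here is a plain-string rendering of the Llama 3 instruct chat template for a whole dialog; encodeDialogPrompt does the equivalent at the token level (exact llama3.java signatures may differ), and every new turn currently means re-ingesting all of this:

```java
import java.util.List;

// String-level sketch of the Llama 3 instruct template; the real code works on tokens.
public class DialogTemplateSketch {

    record Message(String role, String content) {}

    static String render(List<Message> dialog) {
        StringBuilder sb = new StringBuilder("<|begin_of_text|>");
        for (Message m : dialog) {
            sb.append("<|start_header_id|>").append(m.role()).append("<|end_header_id|>\n\n")
              .append(m.content()).append("<|eot_id|>");
        }
        // Leave the prompt open for the assistant's next turn.
        sb.append("<|start_header_id|>assistant<|end_header_id|>\n\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every new user turn re-renders (and re-ingests) the entire history,
        // which is exactly the waste that KV-cache reuse would avoid.
        System.out.println(render(List.of(
                new Message("system", "You are a helpful assistant."),
                new Message("user", "Hi!"),
                new Message("assistant", "Hello! How can I help?"),
                new Message("user", "Summarize our chat so far."))));
    }
}
```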

Also, #16 introduces prompt caching to disk.
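
On the open API question, one hypothetical shape for declaring a caching strategy could be an interface like the one below. Nothing like this exists in llama3.java today; it is only meant to make the discussion concrete:

```java
import java.nio.file.Path;

// Entirely hypothetical: one way a caller (e.g. a LangChain4j integration) could
// declare what gets cached and where.
public interface PromptCacheStrategy {

    /** How the KV cache for a prompt/conversation is kept around. */
    enum Storage { NONE, IN_MEMORY, ON_DISK }

    Storage storage();

    /** Where to persist caches when storage() == ON_DISK (prompt caching to disk, as in #16). */
    Path cacheDirectory();

    /** Evict caches for conversations that have been idle longer than this. */
    long maxIdleSeconds();
}
```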

geoand commented 2 days ago

Oh, that's very interesting to know!