I am going to mark this as ready for review as I believe it's good to go as is, and the OpenAI compatibility (which is very limited in Ollama currently anyway) can be introduced in a follow-up.
So the idea is to allow using Ollama, Podman AI Lab, AI Studio, etc. to "host" the same model (or the closest approximation)?
And not necessarily tied to a container runtime (Docker/Podman)?
Definitely +1 :)
Another interesting fact is that InstructLab supports serving the trained models using an OpenAI compatible API.
@cescoffier have you had a chance to test this one? I'm really interested in your experience with it as well (not least because you opened the original issue).
I'm not sure how I should be using it.
I tried with Ollama (which works OOTB for me, nothing to do), but the dev service was ignored.
Also, it seems that there is some API mismatch:
Ollama is using http://localhost:53977/v1/api/chat
while I would have expected: http://localhost:53977/v1/chat/completions
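For reference, a minimal sketch of the two request shapes, assuming the port from the URLs above and phi3 as an example model; Ollama's native chat endpoint is /api/chat, while its OpenAI-compatible one is /v1/chat/completions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatEndpointSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ollama's native chat API (request body shape per Ollama's docs)
        HttpRequest ollamaNative = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:53977/api/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"model\":\"phi3\",\"messages\":[{\"role\":\"user\",\"content\":\"Hi\"}],\"stream\":false}"))
                .build();

        // OpenAI-style chat completions API (what an OpenAI client would call)
        HttpRequest openAiStyle = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:53977/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"model\":\"phi3\",\"messages\":[{\"role\":\"user\",\"content\":\"Hi\"}]}"))
                .build();

        System.out.println(client.send(ollamaNative, HttpResponse.BodyHandlers.ofString()).body());
        System.out.println(client.send(openAiStyle, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```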
You need to update the version if you are using the samples
Ollama is using http://localhost:53977/v1/api/chat while I would have expected: http://localhost:53977/v1/chat/completions
For the time being, I have not used the OpenAI compatibility stuff, for the reasons I explained above.
If I use:
quarkus.langchain4j.ollama.chat-model.temperature=0.8
quarkus.langchain4j.openai.timeout=60s
it works, but it was working already (the model is already pulled and the server is running). What change should I expect?
You can try to use something like quarkus.langchain4j.ollama.chat-model.model-id=phi3 in order to work with a new model (the default is llama3, which I guess you already have locally).
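For completeness, a minimal application.properties sketch combining the properties already mentioned in this thread (phi3 is just an example model name):

```properties
# The dev service will pull this model if it is not already present locally
quarkus.langchain4j.ollama.chat-model.model-id=phi3
# Sampling temperature for the chat model
quarkus.langchain4j.ollama.chat-model.temperature=0.8
```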
Ah ok, it pulls the model, that's what I was missing!
Yeah, and it's the dev experience one expects, where you only need to configure what is necessary (unlike the existing dev service, where you need multiple things configured).
Hum, something seems to be broken.
It pulls the model, but then the application was not calling it; actually nothing was called:
Listening for transport dt_socket at address: 5005
[Quarkus ASCII-art startup banner]
2024-05-13 15:58:38,675 WARN [io.qua.config] (Quarkus Main Thread) Unrecognized configuration key "quarkus.langchain4j.openai.timeout" was provided; it will be ignored; verify that the dependency extension for this configuration is set or that you did not make a typo
2024-05-13 15:58:39,043 INFO [io.quarkus] (Quarkus Main Thread) quarkus-langchain4j-sample-review-triage 1.0-SNAPSHOT on JVM (powered by Quarkus 3.8.2) started in 72.372s. Listening on: http://localhost:8080
2024-05-13 15:58:39,044 INFO [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated.
2024-05-13 15:58:39,044 INFO [io.quarkus] (Quarkus Main Thread) Installed features: [cdi, langchain4j, langchain4j-ollama, qute, rest-client-reactive, rest-client-reactive-jackson, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertx]
2024-05-13 16:00:59,858 INFO [io.quarkus] (Shutdown thread) quarkus-langchain4j-sample-review-triage stopped in 0.008s
Worked with a restart.
BTW, when I said it worked, it is a bit weird. The response was the following:
- body: {"model":"phi3","created_at":"2024-05-13T14:01:17.342015Z","message":{"role":"assistant","content":"{\n \"evaluation\": \"POSITIVE\",\n \"message\": \"Thank you for your kind words!\"\n}\n\nTo determine the sentiment analysis and language identification programmatically, one would typically use a natural language processing library capable of multi-language support, such as `spaCy`, with additional preprocessing to handle transliterations or direct translation equivalents. However, since this is an example response format without actual code execution, I've manually categorized the sentiment and provided an appropriate message in English for clarity."},"done_reason":"stop","done":true,"total_duration":6125478875,"load_duration":2639789458,"prompt_eval_count":348,"prompt_eval_duration":881733000,"eval_count":112,"eval_duration":2595494000}
That's an issue with the model though, no?
@geoand Removed llama3 and it re-pulled it and then it worked perfectly!
If things work as expected for you, I would like to get this in so I can proceed to improve on it later without having the PR go stale (as it has a lot of small changes).
The basic idea behind this is that the LLM model being run and the API being exposed are not always tightly coupled - for example, the Mistral models can be run in Ollama, not just in Mistral's SaaS offering. This PR sets the foundation for having an inference server (for now only Ollama) run inference for the LLM models the user has configured. In the Ollama implementation we are able to instruct Ollama to pull the selected model name (if it exists and is not already present) and apply the necessary configuration behind the scenes so each configured LangChain4j chat model will use the inference server.
It is important to note that for this to work, the configuration of the model to be used must now become a build-time property (which probably makes sense regardless).
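As an illustration of the pull step (this is not the dev service code; the endpoint and field names follow Ollama's public REST API, and the base URL and model name are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: ask a locally running Ollama server to pull a model so it is
// available for inference. The dev service does something of this nature behind
// the scenes for the configured model name.
public class OllamaPullSketch {
    public static void main(String[] args) throws Exception {
        String baseUrl = "http://localhost:11434"; // Ollama's default port (assumption)
        String model = "phi3";                     // example model name

        HttpRequest pull = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/pull"))
                .header("Content-Type", "application/json")
                // "name" is the field documented by Ollama's /api/pull endpoint;
                // stream=false makes it return a single status object when done
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"name\":\"" + model + "\",\"stream\":false}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(pull, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"status":"success"}
    }
}
```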
The following remains to be done:
- Use the same idea for the rest of the LangChain4j models (like embedding models)
- Update the documentation

The way the PR has been done, this would allow other inference servers to be added in the future with minimal changes (the most significant of which would be a way to resolve the conflict where multiple servers can serve a model)