I am going to mark this as ready for review as I believe it's good to go as is, and the OpenAI compatibility (which is very limited in Ollama currently anyway) can be introduced in a follow-up.
So the idea is to allow using Ollama, Podman AI Lab, AI Studio, etc. to "host" the same model (or the closest approximation)?
And not necessarily tied to a container runtime (Docker/Podman)?
Definitely +1 :)
Another interesting fact is that InstructLab supports serving the trained models using an OpenAI compatible API.
@cescoffier have you had a chance to test this one? I'm really interested in your experience with it as well (not least because you opened the original issue).
I'm not sure how I should be using it.
I tried with Ollama (which works OOTB for me, nothing to do), but the dev service was ignored.
Also, it seems that there is some API mismatch:
Ollama is using http://localhost:53977/v1/api/chat
while I would have expected: http://localhost:53977/v1/chat/completions
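For reference, a minimal sketch of the two request shapes, assuming the port from the URLs above and phi3 as an example model; Ollama's native chat endpoint is /api/chat, while its OpenAI-compatible one is /v1/chat/completions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatEndpointSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ollama's native chat API (request body shape per Ollama's docs)
        HttpRequest ollamaNative = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:53977/api/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"model\":\"phi3\",\"messages\":[{\"role\":\"user\",\"content\":\"Hi\"}],\"stream\":false}"))
                .build();

        // OpenAI-style chat completions API (what an OpenAI client would call)
        HttpRequest openAiStyle = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:53977/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"model\":\"phi3\",\"messages\":[{\"role\":\"user\",\"content\":\"Hi\"}]}"))
                .build();

        System.out.println(client.send(ollamaNative, HttpResponse.BodyHandlers.ofString()).body());
        System.out.println(client.send(openAiStyle, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```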
You need to update the version if you are using the samples
Ollama is using http://localhost:53977/v1/api/chat while I would have expected: http://localhost:53977/v1/chat/completions
For the time being, I have not used the OpenAI compatibility stuff, for the reasons I explained above.
If I use:
quarkus.langchain4j.ollama.chat-model.temperature=0.8
quarkus.langchain4j.openai.timeout=60s
it works, but it was working already (the model is already pulled and the server is running). What change should I expect?
You can try to use something like quarkus.langchain4j.ollama.chat-model.model-id=phi3 in order to work with a new model (the default is llama3, which I guess you already have locally).
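For completeness, a minimal application.properties sketch combining the properties already mentioned in this thread (phi3 is just an example model name):

```properties
# The dev service will pull this model if it is not already present locally
quarkus.langchain4j.ollama.chat-model.model-id=phi3
# Sampling temperature for the chat model
quarkus.langchain4j.ollama.chat-model.temperature=0.8
```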
Ah ok, it pulls the model, that's what I was missing!
Yeah, and it's the dev experience one expects, where you only need to configure what is necessary (unlike the existing dev service, where you need multiple things configured).
Hum, something seems to be broken.
It pulls the model, but then the application was not calling it; actually nothing was called:
Listening for transport dt_socket at address: 5005
[Quarkus ASCII-art startup banner]
2024-05-13 15:58:38,675 WARN [io.qua.config] (Quarkus Main Thread) Unrecognized configuration key "quarkus.langchain4j.openai.timeout" was provided; it will be ignored; verify that the dependency extension for this configuration is set or that you did not make a typo
2024-05-13 15:58:39,043 INFO [io.quarkus] (Quarkus Main Thread) quarkus-langchain4j-sample-review-triage 1.0-SNAPSHOT on JVM (powered by Quarkus 3.8.2) started in 72.372s. Listening on: http://localhost:8080
2024-05-13 15:58:39,044 INFO [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated.
2024-05-13 15:58:39,044 INFO [io.quarkus] (Quarkus Main Thread) Installed features: [cdi, langchain4j, langchain4j-ollama, qute, rest-client-reactive, rest-client-reactive-jackson, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertx]
2024-05-13 16:00:59,858 INFO [io.quarkus] (Shutdown thread) quarkus-langchain4j-sample-review-triage stopped in 0.008s
Worked with a restart.
BTW, when I said it worked, it is a bit weird. The response was the following:
- body: {"model":"phi3","created_at":"2024-05-13T14:01:17.342015Z","message":{"role":"assistant","content":"{\n \"evaluation\": \"POSITIVE\",\n \"message\": \"Thank you for your kind words!\"\n}\n\nTo determine the sentiment analysis and language identification programmatically, one would typically use a natural language processing library capable of multi-language support, such as `spaCy`, with additional preprocessing to handle transliterations or direct translation equivalents. However, since this is an example response format without actual code execution, I've manually categorized the sentiment and provided an appropriate message in English for clarity."},"done_reason":"stop","done":true,"total_duration":6125478875,"load_duration":2639789458,"prompt_eval_count":348,"prompt_eval_duration":881733000,"eval_count":112,"eval_duration":2595494000}
That's an issue with the model though, no?
@geoand Removed llama3 and it re-pulled it and then it worked perfectly!
If things work as expected for you, I would like to get this in so I can proceed to improve on it later without having the PR go stale (as it has a lot of small changes).
The basic idea behind this is that the LLM model being run and the API being exposed are not always tightly coupled - for example, the Mistral models can be run in Ollama, not just in Mistral's SaaS offering. This PR sets the foundation for having an inference server (for now only Ollama) run inference for the LLM models the user has configured. In the Ollama implementation we are able to instruct Ollama to pull the selected model name (if it exists and is not already present) and apply the necessary configuration behind the scenes so each configured LangChain4j chat model will use the inference server.
It is important to note that for this to work, the configuration of the model to be used must now become a build-time property (which probably makes sense regardless).
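As an illustration of the pull step (this is not the dev service code; the endpoint and field names follow Ollama's public REST API, and the base URL and model name are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: ask a locally running Ollama server to pull a model so it is
// available for inference. The dev service does something of this nature behind
// the scenes for the configured model name.
public class OllamaPullSketch {
    public static void main(String[] args) throws Exception {
        String baseUrl = "http://localhost:11434"; // Ollama's default port (assumption)
        String model = "phi3";                     // example model name

        HttpRequest pull = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/pull"))
                .header("Content-Type", "application/json")
                // "name" is the field documented by Ollama's /api/pull endpoint;
                // stream=false makes it return a single status object when done
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"name\":\"" + model + "\",\"stream\":false}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(pull, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"status":"success"}
    }
}
```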
The following remains to be done:
- Use the same idea for the rest of the LangChain4j models (like embedding models)
- Update the documentation

The way the PR has been done, this would allow other inference servers to be added in the future with minimal changes (the most significant of which would be a way to resolve the conflict where multiple servers can serve a model)