What if I want to use OpenRouter etc? Can I just use an endpoint instead of local inference?

FellowTraveler commented 3 weeks ago

title says it all

remichu-ai commented 3 weeks ago

I believe openrouter is a library that helps routing your request to an online LLM provider with lowest price. Meaning the service provider must already hosted the model.

This library is for host LLM locally instead so it kinda serves difference purpose.

If you have further question, feel free to let me know

FellowTraveler commented 3 weeks ago

What I mean is, if this library adds additional agentic abilities, then what if I want to use those agentic abilities, regardless of where the model is running? Furthermore, what if I want to use Ollama as my backend? I would like to be able to access agentic functionality regardless of where I'm running inferences. I'm just wondering why those agentic capabilities are so tightly-coupled with the inference engine. But if your focus is just on providing another inference engine, fair enough.

remichu-ai commented 3 weeks ago

The reason why it is tightly-coupled is from my frustrating experience with using agentic workflow so far.

Agentic capabilities basically boil down to having a LLM that can generate in the way that we want it to.

So the current landscape to achieve this is: a generic LLM Inference + Frontend Agentic framework. A generic LLM Inference:

Open Source: ollama, Tabby, silly tarven, web generation UI
Closed Source: OpenAI, Claude

Frontend Agentic framework:

Autogen, Langgraph, crewAI

All of the open source inference engine currently is unoppionated (neutral) and the task of guiding the LLM generation is handled at frontend mostly via prompting.

While this neutrality sounds nice in theory and let us swap OpenAI to closed model easily; in practice it just doesnt work well from my experience. Even the basic example on their website just straight up doesnt work when you swap the backend from close source to open source model, e.g. the LLM will goes into loop etc.

Of course, we can attribute this to local model is not as smart as closed source; but that is the card we are dealt with and will always be the case.

And by contrast, closed source LLM is actually "oppionated", e.g. claude < antThinking> < antArtifact>.

For simple example, Chain of Thought prompting has been out for a long time, but there is no standardized way to do it using open source and the common practice is to do it in the frontend (e.g. langchain react agent). But Claude use hidden < antThinking> tag during generation and i am sure Openai does something behind the scene too.

So for gallama, i am trying to to make the open source LLM backend have capability to better guide the LLM to generate how we want. And that comes with a price, which is the engine is no longer neutral and unoppionated.

All these additional features that I implemented will straight up not be accepted to most of the current Open source inference engine because it will make the engine no longer neutral. A lot of these feature required integration to the generation itself, and the is prompting involved behind the scene. There is no proven research to back it up and even for thing with proven search e.g. CoT, there is also no standardized way to do it like I mentioned.

I by no mean claim that the way i am doing is correct nor my engine is better than any of the available engine today; it is just something different that i use for myself after lots of frustration with current agentic landscape.

teddybear082 commented 2 weeks ago

This repo was sorely needed, especially openai style function / tools calling which doesn’t seem supported anywhere for local models. I don’t have enough VRAM to use exllama models but hopefully someday will look forward to trying with llamacpp!

remichu-ai commented 2 weeks ago

Thank you for the kind words. The repo does work with llama cpp however I use exllamav2 most of the time so there might be hidden bug somewhere for llama cpp. Feel free to report bug if you come across any.

teddybear082 commented 2 weeks ago

Oh wow I see backend llamacpp now. I thought it was still in development and not released yet. Thanks! I am not quite sure the exact command line prompt I should use to get it working with a gguf but I will experiment!

remichu-ai / gallama

What if I want to use OpenRouter etc? Can I just use an endpoint instead of local inference? #17