Open rapidarchitect opened 6 months ago
I personally use LM Studio for my local LLM server and would love to use it with this as well. Hopefully the dev can use this example Python code for future development:
```python
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",  # this field is currently unused
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."},
    ],
    temperature=0.7,
)

print(completion.choices[0].message)
```
We're about to add Ollama support to LlamaIndexTS first, see https://github.com/run-llama/LlamaIndexTS/pull/305 - then it could be used in chat-llamaindex
Some things are missing for the `Ollama` class to fit into the current implementation, at least the `maxTokens` metadata entry and the `tokens()` method.
I've managed to make it work with LiteLLM (and Ollama behind it) by setting `export OPENAI_BASE_URL=https://litellm.mydomain.tld/v1` and changing:
```diff
diff --git app/client/platforms/llm.ts app/client/platforms/llm.ts
index ddc316f..a907ee7 100644
--- app/client/platforms/llm.ts
+++ app/client/platforms/llm.ts
@@ -36,6 +36,9 @@ export const ALL_MODELS = [
   "gpt-4-vision-preview",
   "gpt-3.5-turbo",
   "gpt-3.5-turbo-16k",
+  "mixtral_default",
+  "mistral",
+  "phi",
 ] as const;
 
 export type ModelType = (typeof ALL_MODELS)[number];
```
And then creating a new bot using one of those models, and adjusting its params.
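For reference, the LiteLLM side of this setup could route those added model names to a local Ollama with a `model_list` along these lines (a sketch only; the hostname and model tags here are assumptions, adjust to your deployment):

```yaml
model_list:
  - model_name: mixtral_default
    litellm_params:
      model: ollama/mixtral
      api_base: http://localhost:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
  - model_name: phi
    litellm_params:
      model: ollama/phi
      api_base: http://localhost:11434
```

Each `model_name` is what chat-llamaindex sends; `litellm_params.model` is what LiteLLM actually calls on the Ollama backend.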
@m0wer thanks! Cool hack! Yes, `Ollama` currently doesn't have `tokens()` implemented, that's why `SummaryChatHistory` is not working with it. But `SimpleChatHistory` should. You can try setting `sendMemory` to `false` for your bot, see:
https://github.com/run-llama/chat-llamaindex/blob/aeee808134a9b267d22d3d48900ba7393e37cdbc/app/api/llm/route.ts#L167C1-L169
Thanks @marcusschiesser! But I found additional problems when uploading files and building for production. In LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails. Maybe renaming mixtral as gpt-4 in LiteLLM would do the trick.
I needed to set `sendMemory` to `false` as you said to get it to work, but that was it for the development version.
Now I don't know if I should go in the direction of extending the `Ollama` class or do the rename in LiteLLM, to be able to reuse most of the current chat-llamaindex code.
Any advice or ideas are welcome ;-)
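For the rename idea, a LiteLLM proxy config roughly like this (LiteLLM `model_list` syntax; the alias and backend model here are just examples) should make mixtral answer to an OpenAI model name that passes the type check:

```yaml
model_list:
  - model_name: gpt-4              # alias accepted by LlamaIndexTS's OpenAI model check
    litellm_params:
      model: ollama/mixtral        # the model actually served
      api_base: http://localhost:11434
```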
> In LlamaIndexTS there is a list of valid OpenAI model names so the type check fails.

This example doesn't work with mixtral? https://github.com/run-llama/LlamaIndexTS/blob/main/examples/ollama.ts It should work, as it's using `model: string`.
Yes, that works. But what I'm trying to do is get the complete https://chat.llamaindex.ai/ to work with local LLMs. So either I fake the OpenAI API with LiteLLM, or I extend the `Ollama` class from LlamaIndexTS to support the missing methods.
To make it work with Ollama I had to adapt my reverse proxy settings and do the following changes:
```diff
diff --git app/api/llm/route.ts app/api/llm/route.ts
index aa3066c..de9a806 100644
--- app/api/llm/route.ts
+++ app/api/llm/route.ts
@@ -4,7 +4,8 @@ import {
   DefaultContextGenerator,
   HistoryChatEngine,
   IndexDict,
-  OpenAI,
+  LLMMetadata,
+  Ollama,
   ServiceContext,
   SimpleChatHistory,
   SummaryChatHistory,
@@ -120,6 +121,33 @@ function createReadableStream(
   return responseStream.readable;
 }
 
+class OllamaCustom extends Ollama {
+  maxTokens: number;
+
+  constructor(init: Partial<OllamaCustom> & {
+    model: string;
+  }) {
+    super(init);
+    this.maxTokens = init.maxTokens || 2048;
+  }
+
+  get metadata(): LLMMetadata {
+    return {
+      model: this.model,
+      temperature: this.temperature,
+      topP: this.topP,
+      maxTokens: this.maxTokens,
+      contextWindow: this.contextWindow,
+      tokenizer: undefined,
+    };
+  }
+
+  tokens(messages: ChatMessage[]): number {
+    let tokens = 10;
+    return tokens;
+  }
+}
+
 export async function POST(request: NextRequest) {
   try {
     const body = await request.json();
@@ -146,11 +174,14 @@ export async function POST(request: NextRequest) {
       );
     }
 
-    const llm = new OpenAI({
+    const llm = new OllamaCustom({
+      baseURL: "https://ollama.mydomain.tld",
       model: config.model,
       temperature: config.temperature,
       topP: config.topP,
+      contextWindow: config.maxTokens,
       maxTokens: config.maxTokens,
+      requestTimeout: 5 * 60 * 1000,
     });
 
     const serviceContext = serviceContextFromDefaults({
diff --git app/client/platforms/llm.ts app/client/platforms/llm.ts
index ddc316f..33273c9 100644
--- app/client/platforms/llm.ts
+++ app/client/platforms/llm.ts
@@ -31,11 +31,11 @@ export interface ResponseMessage {
 }
 
 export const ALL_MODELS = [
-  "gpt-4",
-  "gpt-4-1106-preview",
-  "gpt-4-vision-preview",
-  "gpt-3.5-turbo",
-  "gpt-3.5-turbo-16k",
+  "mistral",
+  "mixtral_default",
+  "dolphin-mixtral:8x7b-v2.7-q4_K_M",
+  "llava",
+  "phi",
 ] as const;
 
 export type ModelType = (typeof ALL_MODELS)[number];
diff --git package.json package.json
index 0ba2c1b..b6dfdfb 100644
--- package.json
+++ package.json
@@ -38,7 +38,7 @@
     "dotenv": "^16.3.1",
     "emoji-picker-react": "^4.4.12",
     "encoding": "^0.1.13",
-    "llamaindex": "0.0.0-20231110031459",
+    "llamaindex": "0.0.44",
     "lucide-react": "^0.277.0",
     "mermaid": "^10.3.1",
     "nanoid": "^5.0.2",
```
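One caveat on the `tokens()` override in the diff above: it returns a constant, so `SummaryChatHistory`'s token budgeting would be meaningless. A slightly better placeholder (still a rough heuristic, not Ollama's real tokenizer) is a character-based estimate of roughly 4 characters per token, sketched here standalone:

```typescript
// Standalone sketch of a character-based token estimator.
// The ~4 chars/token ratio is a rough English-text heuristic, not exact.
type ChatMessage = { role: string; content: string };

function estimateTokens(messages: ChatMessage[]): number {
  // Join all message contents and estimate tokens from character count.
  const text = messages.map((m) => m.content).join(" ");
  return Math.ceil(text.length / 4);
}
```

A real implementation would query the model's tokenizer; this just keeps the summary-history budgeting in a plausible range instead of a fixed value.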
see https://github.com/run-llama/chat-llamaindex/pull/77 for how to use Ollama
Would like to be able to run this with local LLM stacks like LiteLLM or Ollama. Could you provide a parameter to specify the LLM and base URL?
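For example, configuration along these lines (the variable names here are hypothetical, nothing like this is implemented yet):

```shell
# Hypothetical environment variables -- an illustration of the requested
# configuration surface, NOT currently supported by chat-llamaindex.
export LLM_PROVIDER=ollama
export LLM_BASE_URL=http://localhost:11434
export LLM_MODEL=mixtral
```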