run-llama / chat-llamaindex

https://chat.llamaindex.ai
MIT License

[Feature] Local LLM Support #53

Open rapidarchitect opened 6 months ago

rapidarchitect commented 6 months ago

I would like to be able to run this with local LLM stacks like LiteLLM or Ollama.

Could you provide a parameter to specify the LLM and base URL?

kagevazquez commented 6 months ago

I personally use LM Studio as my local LLM server and would love to use it with this as well.

Hopefully the dev can use this example Python code for future development:

# Example: reuse your existing OpenAI setup
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",  # this field is currently unused
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."},
    ],
    temperature=0.7,
)

print(completion.choices[0].message)
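
For reference, the same request in TypeScript with the openai npm package (v4-style client; the URL and model name are just placeholders for a local server) would look roughly like this:

import OpenAI from "openai";

// Point the client at the local server instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1", // LM Studio's default local endpoint
  apiKey: "not-needed", // most local servers ignore the key
});

const completion = await client.chat.completions.create({
  model: "local-model", // currently unused by LM Studio
  messages: [
    { role: "system", content: "Always answer in rhymes." },
    { role: "user", content: "Introduce yourself." },
  ],
  temperature: 0.7,
});

console.log(completion.choices[0].message);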

marcusschiesser commented 6 months ago

We're about to add Ollama support to LlamaIndexTS first; see https://github.com/run-llama/LlamaIndexTS/pull/305. Once that lands, it can be used in chat-llamaindex.
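
Once it's available, wiring it into chat-llamaindex should look roughly like the sketch below. This is illustrative only: the constructor options mirror the ones used later in this thread, and the exact API may still change before release.

import { Ollama, serviceContextFromDefaults } from "llamaindex";

// Sketch: point the Ollama LLM at a local server and use it in place of OpenAI.
const llm = new Ollama({
  baseURL: "http://localhost:11434", // default Ollama port; adjust as needed
  model: "mistral",
  temperature: 0.7,
});

// The route would then build its service context from this LLM instead of OpenAI.
const serviceContext = serviceContextFromDefaults({ llm });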

m0wer commented 5 months ago

> We're about to add Ollama support to LlamaIndexTS first; see run-llama/LlamaIndexTS#305. Once that lands, it can be used in chat-llamaindex.

Some things are missing for the Ollama class to fit into the current implementation: at least the maxTokens metadata entry and the tokens() method.

m0wer commented 5 months ago

I've managed to make it work with LiteLLM (and Ollama behind it) by setting export OPENAI_BASE_URL=https://litellm.mydomain.tld/v1 and changing:

diff --git app/client/platforms/llm.ts app/client/platforms/llm.ts
index ddc316f..a907ee7 100644
--- app/client/platforms/llm.ts
+++ app/client/platforms/llm.ts
@@ -36,6 +36,9 @@ export const ALL_MODELS = [
   "gpt-4-vision-preview",
   "gpt-3.5-turbo",
   "gpt-3.5-turbo-16k",
+  "mixtral_default",
+  "mistral",
+  "phi",
 ] as const;

 export type ModelType = (typeof ALL_MODELS)[number];

Then I created a new bot using one of those models and adjusted its parameters.
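
A quick note on why the environment variable alone is enough on the server side: the OpenAI client used under the hood picks up OPENAI_BASE_URL as its default base URL (which is presumably why the export above works), so all of the server's OpenAI requests end up at the LiteLLM proxy without touching the code that constructs the client. The explicit equivalent, shown with the openai package purely for illustration, is roughly:

import OpenAI from "openai";

// Explicit equivalent of `export OPENAI_BASE_URL=https://litellm.mydomain.tld/v1`:
// all OpenAI-style requests go to the LiteLLM proxy instead of api.openai.com.
const client = new OpenAI({
  baseURL: "https://litellm.mydomain.tld/v1",
  apiKey: process.env.OPENAI_API_KEY ?? "not-needed", // LiteLLM can ignore the key
});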

marcusschiesser commented 5 months ago

@m0wer thanks! Cool hack! Yes, Ollama currently doesn't have tokens() implemented; that's why SummaryChatHistory isn't working with it, but SimpleChatHistory should. You can try setting sendMemory to false for your bot; see: https://github.com/run-llama/chat-llamaindex/blob/aeee808134a9b267d22d3d48900ba7393e37cdbc/app/api/llm/route.ts#L167C1-L169
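
For context, those lines boil down to a choice between the two history implementations, roughly like the paraphrased sketch below (not the exact source). SummaryChatHistory calls llm.tokens() to decide when to summarize older messages, while SimpleChatHistory just keeps the raw message list and needs no token counting:

import { SimpleChatHistory, SummaryChatHistory } from "llamaindex";

// `config`, `llm` and `messages` come from the surrounding request handler.
const chatHistory = config.sendMemory
  ? new SummaryChatHistory({ llm, messages }) // summarizes, needs llm.tokens()
  : new SimpleChatHistory({ messages }); // plain history, works without tokens()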

m0wer commented 5 months ago

Thanks @marcusschiesser! But I found additional problems when uploading files and building for production. In LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails. Maybe renaming mixtral to gpt-4 in LiteLLM would do the trick.

I needed to set sendMemory to false as you said to get it to work, but that was all for the development version.

Now I don't know whether I should go in the direction of extending the Ollama class or do the rename in LiteLLM so I can reuse most of the current chat-llamaindex code.

Any advice or ideas are welcome ;-)

marcusschiesser commented 5 months ago

> In LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails.

This example doesn't work with mixtral? https://github.com/run-llama/LlamaIndexTS/blob/main/examples/ollama.ts It should work, as it's using model: string.

m0wer commented 5 months ago

> > In LlamaIndexTS there is a list of valid OpenAI model names, so the type check fails.
>
> This example doesn't work with mixtral? https://github.com/run-llama/LlamaIndexTS/blob/main/examples/ollama.ts It should work, as it's using model: string.

Yes, that works. But what I'm trying to do is to get the complete https://chat.llamaindex.ai/ to work with local LLMs. So either I fake the OpenAI API with LiteLLM, or I extend the Ollama class from LlamaIndexTS to support the missing methods.

m0wer commented 5 months ago

To make it work with Ollama, I had to adapt my reverse proxy settings and make the following changes:

diff --git app/api/llm/route.ts app/api/llm/route.ts
index aa3066c..de9a806 100644
--- app/api/llm/route.ts
+++ app/api/llm/route.ts
@@ -4,7 +4,8 @@ import {
   DefaultContextGenerator,
   HistoryChatEngine,
   IndexDict,
-  OpenAI,
+  LLMMetadata,
+  Ollama,
   ServiceContext,
   SimpleChatHistory,
   SummaryChatHistory,
@@ -120,6 +121,33 @@ function createReadableStream(
   return responseStream.readable;
 }

+class OllamaCustom extends Ollama {
+  maxTokens: number;
+
+  constructor(init: Partial<OllamaCustom> & {
+    model: string;
+  }) {
+    super(init);
+    this.maxTokens = init.maxTokens || 2048;
+  }
+
+  get metadata(): LLMMetadata {
+    return {
+      model: this.model,
+      temperature: this.temperature,
+      topP: this.topP,
+      maxTokens: this.maxTokens,
+      contextWindow: this.contextWindow,
+      tokenizer: undefined,
+    };
+  }
+
+  tokens(messages: ChatMessage[]): number {
+    // Crude stub: no tokenizer for Ollama models, so return a small constant.
+    return 10;
+  }
+}
+
 export async function POST(request: NextRequest) {
   try {
     const body = await request.json();
@@ -146,11 +174,14 @@ export async function POST(request: NextRequest) {
       );
     }

-    const llm = new OpenAI({
+    const llm = new OllamaCustom({
+      baseURL: "https://ollama.mydomain.tld",
       model: config.model,
       temperature: config.temperature,
       topP: config.topP,
+      contextWindow: config.maxTokens,
       maxTokens: config.maxTokens,
+      requestTimeout: 5 * 60 * 1000,
     });

     const serviceContext = serviceContextFromDefaults({
diff --git app/client/platforms/llm.ts app/client/platforms/llm.ts
index ddc316f..33273c9 100644
--- app/client/platforms/llm.ts
+++ app/client/platforms/llm.ts
@@ -31,11 +31,11 @@ export interface ResponseMessage {
 }

 export const ALL_MODELS = [
-  "gpt-4",
-  "gpt-4-1106-preview",
-  "gpt-4-vision-preview",
-  "gpt-3.5-turbo",
-  "gpt-3.5-turbo-16k",
+  "mistral",
+  "mixtral_default",
+  "dolphin-mixtral:8x7b-v2.7-q4_K_M",
+  "llava",
+  "phi",
 ] as const;

 export type ModelType = (typeof ALL_MODELS)[number];
diff --git package.json package.json
index 0ba2c1b..b6dfdfb 100644
--- package.json
+++ package.json
@@ -38,7 +38,7 @@
     "dotenv": "^16.3.1",
     "emoji-picker-react": "^4.4.12",
     "encoding": "^0.1.13",
-    "llamaindex": "0.0.0-20231110031459",
+    "llamaindex": "0.0.44",
     "lucide-react": "^0.277.0",
     "mermaid": "^10.3.1",
     "nanoid": "^5.0.2",
marcusschiesser commented 4 months ago

See https://github.com/run-llama/chat-llamaindex/pull/77 for how to use Ollama.