mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. It can generate text, audio, video, and images, and also has voice cloning capabilities.
https://localai.io
MIT License

LocalAI slow response #1215

Closed: mydeveloperplanet closed this issue 8 months ago

mydeveloperplanet commented 8 months ago

We started experimenting with LocalAI and are very enthusiastic about it. However, we encounter slow responses when we want to chat based on documents. The example we use is based on this LangChain4j example. The slightly adapted source code we used is added below.

If we run the example against OpenAI, we receive a response in 10 seconds. If we run it against LocalAI, we receive a response in 138 seconds.

We checked the advice at https://localai.io/faq/.

We cannot figure out what causes the slow response.

If more data or information is needed, please let us know.

    // Imports reconstructed here for completeness (a sketch: package paths follow
    // the LangChain4j version current at the time and may differ in newer releases;
    // the snippet below sits inside a Main class, hence the toPath helper):
    import dev.langchain4j.data.document.Document;
    import dev.langchain4j.data.document.DocumentSplitter;
    import dev.langchain4j.data.document.splitter.DocumentSplitters;
    import dev.langchain4j.data.embedding.Embedding;
    import dev.langchain4j.data.message.AiMessage;
    import dev.langchain4j.data.segment.TextSegment;
    import dev.langchain4j.model.chat.ChatLanguageModel;
    import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
    import dev.langchain4j.model.embedding.EmbeddingModel;
    import dev.langchain4j.model.input.Prompt;
    import dev.langchain4j.model.input.PromptTemplate;
    import dev.langchain4j.model.localai.LocalAiChatModel;
    import dev.langchain4j.model.openai.OpenAiChatModel;
    import dev.langchain4j.model.openai.OpenAiTokenizer;
    import dev.langchain4j.store.embedding.EmbeddingMatch;
    import dev.langchain4j.store.embedding.EmbeddingStore;
    import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

    import java.net.URISyntaxException;
    import java.net.URL;
    import java.nio.file.Path;
    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import static dev.langchain4j.data.document.FileSystemDocumentLoader.loadDocument;
    import static dev.langchain4j.model.openai.OpenAiModelName.GPT_3_5_TURBO;
    import static java.util.stream.Collectors.joining;

    static class If_You_Need_More_Control {

        public static void main(String[] args) {

            long starttime = System.currentTimeMillis();

            // Load the document that includes the information you'd like to "chat" about with the model.
            Document document = loadDocument(toPath("story-about-happy-carrot.txt"));

            // Split document into segments 100 tokens each
            DocumentSplitter splitter = DocumentSplitters.recursive(
                    100,
                    0,
                    new OpenAiTokenizer(GPT_3_5_TURBO)
            );
            List<TextSegment> segments = splitter.split(document);

            // Embed segments (convert them into vectors that represent the meaning) using embedding model
            EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
            List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

            // Store embeddings into embedding store for further search / retrieval
            EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
            embeddingStore.addAll(embeddings, segments);

            // Specify the question you want to ask the model
            String question = "Who is Charlie?";

            // Embed the question
            Embedding questionEmbedding = embeddingModel.embed(question).content();

            // Find relevant embeddings in embedding store by semantic similarity
            // You can play with parameters below to find a sweet spot for your specific use case
            int maxResults = 3;
            double minScore = 0.7;
            List<EmbeddingMatch<TextSegment>> relevantEmbeddings
                    = embeddingStore.findRelevant(questionEmbedding, maxResults, minScore);

            // Create a prompt for the model that includes question and relevant embeddings
            PromptTemplate promptTemplate = PromptTemplate.from(
                    "Answer the following question to the best of your ability:\n"
                            + "\n"
                            + "Question:\n"
                            + "{{question}}\n"
                            + "\n"
                            + "Base your answer on the following information:\n"
                            + "{{information}}");

            String information = relevantEmbeddings.stream()
                    .map(match -> match.embedded().text())
                    .collect(joining("\n\n"));

            Map<String, Object> variables = new HashMap<>();
            variables.put("question", question);
            variables.put("information", information);

            Prompt prompt = promptTemplate.apply(variables);

            // Send the prompt to the OpenAI chat model
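            // To benchmark LocalAI instead of OpenAI, uncomment the LocalAiChatModel
            // builder below and comment out the OpenAiChatModel line.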
            ChatLanguageModel chatModel = /*LocalAiChatModel.builder()
                    .baseUrl("http://localhost:8080/")
                    .modelName("lunademo")
                    .timeout(Duration.ofSeconds(500))
                    .maxRetries(1)
                    .temperature(0.3)
                    .logRequests(false)
                    .logResponses(false)
                    .build();*/
                    OpenAiChatModel.builder().apiKey("demo").timeout(Duration.ofSeconds(30)).build();
            AiMessage aiMessage = chatModel.generate(prompt.toUserMessage()).content();

            // See an answer from the model
            String answer = aiMessage.text();
            System.out.println(answer); // Charlie is a cheerful carrot living in VeggieVille...
            System.out.println("Completed in: " + (System.currentTimeMillis() - starttime)/1000 + " seconds");
        }
    }

    private static Path toPath(String fileName) {
        try {
            URL fileUrl = Main.class.getResource(fileName);
            return Path.of(fileUrl.toURI());
        } catch (URISyntaxException e) {
            throw new RuntimeException(e);
        }
    }

lunamidori5 commented 8 months ago

@mydeveloperplanet hello, sorry for the delay. Would you be willing to post the docker-compose and the model yamls here? (I have been hard at work updating the yamls on the site, so I am unsure which one you pulled.)

Note: an updated .env only takes effect if you wipe the docker setup and bring it back up (docker-compose down --rmi all, then docker-compose up --pull always). A change I'll be making to the site is to move the CPU cores setting from the .env to the models yaml.
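
For reference, the full wipe-and-restart sequence looks like this (a sketch, assuming a standard docker-compose setup in the LocalAI directory):

    # Remove the containers and images so the updated .env is picked up
    docker-compose down --rmi all
    # Recreate everything, pulling the latest images
    docker-compose up --pull always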

mydeveloperplanet commented 8 months ago

No problem, your reply is fast enough :-)

Here are the docker-compose and model yamls: localai.tar.gz

If there is anything I can do from my side, do not hesitate to ask, I am willing to help.

lunamidori5 commented 8 months ago

Okay yea, so you're running the older yaml. To fix it, remove the f16 and gpu_layers entries from the yaml and add threads: X, where X is the number of CPU threads (see the sketch below). That said, you would then be running on CPU only, and that will always be around 25x slower than GPU for this. You can use both at the same time by keeping f16 (set it to true) and changing gpu_layers to whatever number of layers your GPU supports. - @mydeveloperplanet
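
Something like this in the model yaml (a sketch only; the model file name is a placeholder and the numbers are examples to adjust to your hardware):

    name: lunademo
    parameters:
      model: your-model-file.gguf   # placeholder, point this at your actual gguf file
    threads: 8                      # CPU-only: set to the number of threads you have
    # For GPU offload, keep f16 and set gpu_layers instead:
    # f16: true
    # gpu_layers: 35                # however many layers your GPU can hold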

Small reminder: each time you change a yaml, you need to restart the docker containers: docker-compose restart (Windows) or docker compose restart (Linux).

Link - https://localai.io/howtos/easy-setup-docker-gpu/
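
For reference, exposing an NVIDIA GPU to the container in docker-compose generally takes a device reservation along these lines (a sketch; the service name is a placeholder, and the linked howto has LocalAI's actual compose file):

    services:
      api:                          # placeholder service name
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]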

mydeveloperplanet commented 8 months ago

I forgot to mention: we did some tests on Thursday/Friday with a GPU, and the results are indeed much better. I did not think it would make that kind of difference. This issue can be closed; my conclusion is that you should always make use of a GPU, and CPU is more for testing purposes. And thanks for your support!

harshsing2891 commented 8 months ago

Can you suggest how I can replicate the same setup in a Kubernetes cluster?