mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. It can generate text, audio, video, and images, and also has voice cloning capabilities.
https://localai.io
MIT License

LocalAI slow response #1215

Closed: mydeveloperplanet closed this issue 8 months ago

mydeveloperplanet commented 8 months ago

We started experimenting with LocalAI and are very enthusiastic about it. However, we encounter slow responses when we want to chat based on documents. The example we use is based on this LangChain4j example. The slightly adapted source code we used is added below.

If we run the example against OpenAI, we receive a response in 10 seconds. If we run it against LocalAI, we receive a response in 138 seconds.

We checked the advice at https://localai.io/faq/.

We cannot figure out what causes the slow response.

If more data or information is needed, please let us know.

    // Imports reconstructed here for completeness (a sketch: package paths follow
    // the LangChain4j version current at the time and may differ in newer releases;
    // the snippet below sits inside a Main class, hence the toPath helper):
    import dev.langchain4j.data.document.Document;
    import dev.langchain4j.data.document.DocumentSplitter;
    import dev.langchain4j.data.document.splitter.DocumentSplitters;
    import dev.langchain4j.data.embedding.Embedding;
    import dev.langchain4j.data.message.AiMessage;
    import dev.langchain4j.data.segment.TextSegment;
    import dev.langchain4j.model.chat.ChatLanguageModel;
    import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
    import dev.langchain4j.model.embedding.EmbeddingModel;
    import dev.langchain4j.model.input.Prompt;
    import dev.langchain4j.model.input.PromptTemplate;
    import dev.langchain4j.model.localai.LocalAiChatModel;
    import dev.langchain4j.model.openai.OpenAiChatModel;
    import dev.langchain4j.model.openai.OpenAiTokenizer;
    import dev.langchain4j.store.embedding.EmbeddingMatch;
    import dev.langchain4j.store.embedding.EmbeddingStore;
    import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

    import java.net.URISyntaxException;
    import java.net.URL;
    import java.nio.file.Path;
    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import static dev.langchain4j.data.document.FileSystemDocumentLoader.loadDocument;
    import static dev.langchain4j.model.openai.OpenAiModelName.GPT_3_5_TURBO;
    import static java.util.stream.Collectors.joining;

    static class If_You_Need_More_Control {

        public static void main(String[] args) {

            long starttime = System.currentTimeMillis();

            // Load the document that includes the information you'd like to "chat" about with the model.
            Document document = loadDocument(toPath("story-about-happy-carrot.txt"));

            // Split document into segments 100 tokens each
            DocumentSplitter splitter = DocumentSplitters.recursive(
                    100,
                    0,
                    new OpenAiTokenizer(GPT_3_5_TURBO)
            );
            List<TextSegment> segments = splitter.split(document);

            // Embed segments (convert them into vectors that represent the meaning) using embedding model
            EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
            List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

            // Store embeddings into embedding store for further search / retrieval
            EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
            embeddingStore.addAll(embeddings, segments);

            // Specify the question you want to ask the model
            String question = "Who is Charlie?";

            // Embed the question
            Embedding questionEmbedding = embeddingModel.embed(question).content();

            // Find relevant embeddings in embedding store by semantic similarity
            // You can play with parameters below to find a sweet spot for your specific use case
            int maxResults = 3;
            double minScore = 0.7;
            List<EmbeddingMatch<TextSegment>> relevantEmbeddings
                    = embeddingStore.findRelevant(questionEmbedding, maxResults, minScore);

            // Create a prompt for the model that includes question and relevant embeddings
            PromptTemplate promptTemplate = PromptTemplate.from(
                    "Answer the following question to the best of your ability:\n"
                            + "\n"
                            + "Question:\n"
                            + "{{question}}\n"
                            + "\n"
                            + "Base your answer on the following information:\n"
                            + "{{information}}");

            String information = relevantEmbeddings.stream()
                    .map(match -> match.embedded().text())
                    .collect(joining("\n\n"));

            Map<String, Object> variables = new HashMap<>();
            variables.put("question", question);
            variables.put("information", information);

            Prompt prompt = promptTemplate.apply(variables);

            // Send the prompt to the OpenAI chat model
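            // To benchmark LocalAI instead of OpenAI, uncomment the LocalAiChatModel
            // builder below and comment out the OpenAiChatModel line.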
            ChatLanguageModel chatModel = /*LocalAiChatModel.builder()
                    .baseUrl("http://localhost:8080/")
                    .modelName("lunademo")
                    .timeout(Duration.ofSeconds(500))
                    .maxRetries(1)
                    .temperature(0.3)
                    .logRequests(false)
                    .logResponses(false)
                    .build();*/
                    OpenAiChatModel.builder().apiKey("demo").timeout(Duration.ofSeconds(30)).build();
            AiMessage aiMessage = chatModel.generate(prompt.toUserMessage()).content();

            // See an answer from the model
            String answer = aiMessage.text();
            System.out.println(answer); // Charlie is a cheerful carrot living in VeggieVille...
            System.out.println("Completed in: " + (System.currentTimeMillis() - starttime)/1000 + " seconds");
        }
    }

    private static Path toPath(String fileName) {
        try {
            URL fileUrl = Main.class.getResource(fileName);
            return Path.of(fileUrl.toURI());
        } catch (URISyntaxException e) {
            throw new RuntimeException(e);
        }
    }

lunamidori5 commented 8 months ago

@mydeveloperplanet hello, sorry for the delay. Would you be willing to post the docker-compose and the model yamls here? (I have been hard at work updating the yamls on the site, so I am unsure which one you pulled.)

Note: an updated .env only takes effect if you wipe the docker setup and bring it back up (docker-compose down --rmi all, then docker-compose up --pull always). A change I'll be making to the site is to move the CPU cores setting from the .env to the models yaml.
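
For reference, the full wipe-and-restart sequence looks like this (a sketch, assuming a standard docker-compose setup in the LocalAI directory):

    # Remove the containers and images so the updated .env is picked up
    docker-compose down --rmi all
    # Recreate everything, pulling the latest images
    docker-compose up --pull always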

mydeveloperplanet commented 8 months ago

No problem, your reply is fast enough :-)

Here are the docker-compose and model yamls: localai.tar.gz

If there is anything I can do from my side, do not hesitate to ask, I am willing to help.

lunamidori5 commented 8 months ago

Okay yea, so you're running the older yaml. To fix it, remove the f16 and gpu_layers entries from the yaml and add threads: X, where X is the number of CPU threads (see the sketch below). That said, you would then be running on CPU only, and that will always be around 25x slower than GPU for this. You can use both at the same time by keeping f16 (set it to true) and changing gpu_layers to whatever number of layers your GPU supports. - @mydeveloperplanet
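
Something like this in the model yaml (a sketch only; the model file name is a placeholder and the numbers are examples to adjust to your hardware):

    name: lunademo
    parameters:
      model: your-model-file.gguf   # placeholder, point this at your actual gguf file
    threads: 8                      # CPU-only: set to the number of threads you have
    # For GPU offload, keep f16 and set gpu_layers instead:
    # f16: true
    # gpu_layers: 35                # however many layers your GPU can hold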

Small reminder: each time you change a yaml, you need to restart the docker containers: docker-compose restart (Windows) or docker compose restart (Linux).

Link - https://localai.io/howtos/easy-setup-docker-gpu/
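
For reference, exposing an NVIDIA GPU to the container in docker-compose generally takes a device reservation along these lines (a sketch; the service name is a placeholder, and the linked howto has LocalAI's actual compose file):

    services:
      api:                          # placeholder service name
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]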

mydeveloperplanet commented 8 months ago

I forgot to mention: we did some tests on Thursday/Friday with a GPU, and the results are indeed much better. I did not think it would make that kind of difference. This issue can be closed; my conclusion is that you should always make use of a GPU, and CPU is more for testing purposes. And thanks for your support!

harshsing2891 commented 8 months ago

Can you suggest how I can replicate the same setup in a Kubernetes cluster?