rsaryev / talk-codebase

Tool for chatting with your codebase and docs using OpenAI, LlamaCpp, and GPT4All
MIT License

llama_print_timings: .. lines show up before the last word of the answer #15

Closed: chunhualiao closed this issue 10 months ago

chunhualiao commented 11 months ago

Model selected is Mini Orca (Small) | orca-mini-3b.ggmlv3.q4_0.bin | 1928446208 | 3 billion | q4_0 | OpenLLaMa

👉 what does this code do?
🤖  The given code creates a BaseLLM object which is an implementation of linearized language models for text classification tasks. It also calls the `embedding_search` method to search for similar vectors using a given query and a search range. Finally, it loads the vector store with the specified embeddings, index, and root directory and returns the corresponding vector based on the search
llama_print_timings:        load time =   493.57 ms
llama_print_timings:      sample time =    60.73 ms /    78 runs   (    0.78 ms per token,  1284.44 tokens per second)
llama_print_timings: prompt eval time = 21908.13 ms /   602 tokens (   36.39 ms per token,    27.48 tokens per second)
llama_print_timings:        eval time =  6591.89 ms /    77 runs   (   85.61 ms per token,    11.68 tokens per second)
llama_print_timings:       total time = 28967.39 ms
 range.
📄 /Users/liao6/workspace/talk-codebase/talk_codebase/llm.py:

As you can see, the timing information is injected before the last word of the answer "range."

llama_print_timings:        load time =   493.57 ms
llama_print_timings:      sample time =    60.73 ms /    78 runs   (    0.78 ms per token,  1284.44 tokens per second)
llama_print_timings: prompt eval time = 21908.13 ms /   602 tokens (   36.39 ms per token,    27.48 tokens per second)
llama_print_timings:        eval time =  6591.89 ms /    77 runs   (   85.61 ms per token,    11.68 tokens per second)
llama_print_timings:       total time = 28967.39 ms
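For context, these timing lines are printed by the llama.cpp backend itself: llama-cpp-python emits the llama_print_timings block after each call whenever its verbose flag is on, so it ends up interleaved with the tokens the streaming callback is still flushing. A minimal sketch of the same behaviour outside talk-codebase, assuming llama-cpp-python and a local copy of the same model file (the path is illustrative):

    from llama_cpp import Llama

    # verbose=True (the default) prints the llama_print_timings block after each call;
    # verbose=False suppresses it, so streamed output is not interrupted.
    llm = Llama(model_path="./orca-mini-3b.ggmlv3.q4_0.bin", verbose=False)

    out = llm("What does this code do?", max_tokens=64)
    print(out["choices"][0]["text"])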
chunhualiao commented 10 months ago

I have a fix:

diff --git a/talk_codebase/llm.py b/talk_codebase/llm.py
index 9a26c4a..cb3b462 100644
--- a/talk_codebase/llm.py
+++ b/talk_codebase/llm.py
@@ -94,6 +94,7 @@ class LocalLLM(BaseLLM):
         model_n_batch = int(self.config.get("n_batch"))
         callbacks = CallbackManager([StreamStdOut()])
         llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False)
+        llm.client.verbose = False
         return llm
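For anyone applying this locally before an updated release ships, here is a self-contained sketch of the same fix, assuming LangChain's LlamaCpp wrapper (import paths vary by LangChain version; talk_codebase's StreamStdOut callback is replaced here by LangChain's stock streaming handler, and the config lookups by plain arguments):

    from langchain.callbacks.manager import CallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
    from langchain.llms import LlamaCpp

    def create_local_llm(model_path: str, n_ctx: int = 2048, n_batch: int = 512) -> LlamaCpp:
        callbacks = CallbackManager([StreamingStdOutCallbackHandler()])
        llm = LlamaCpp(model_path=model_path, n_ctx=n_ctx, n_batch=n_batch,
                       callbacks=callbacks, verbose=False)
        # verbose=False on the wrapper alone is not enough: the timing lines are printed
        # by the wrapped llama_cpp.Llama instance (exposed as .client), so clear its
        # verbose flag too, which suppresses the llama_print_timings output.
        llm.client.verbose = False
        return llm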
rsaryev commented 10 months ago

Please create a PR: https://github.com/rsaryev/talk-codebase/pulls

rsaryev commented 10 months ago

Please update talk-codebase: pip install --upgrade talk-codebase==0.1.46