Closed: lambda7xx closed this issue 4 months ago
Hey there, @lambda7xx! Fancy seeing you here again. 🙂 I'm diving into your issue and will get back to you with some insights soon. Hang tight!
I added a timer in `llama_index/llms/huggingface/base.py`:
```python
@llm_completion_callback()
def complete(
    self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
    """Completion endpoint."""
    import time

    def getlen(s):
        # Rough length in whitespace-separated words, not tokens.
        return len(s.split())

    full_prompt = prompt
    print(f"1 complete, len(prompt): {getlen(prompt)}")

    start = time.time()
    if not formatted:
        if self.query_wrapper_prompt:
            full_prompt = self.query_wrapper_prompt.format(query_str=prompt)
        if self.system_prompt:
            full_prompt = f"{self.system_prompt} {full_prompt}"

    inputs = self._tokenizer(full_prompt, return_tensors="pt")
    inputs = inputs.to(self._model.device)
    end = time.time()
    print("2 complete, the time of tokenizer: ", end - start)

    # remove keys from the tokenizer if needed, to avoid HF errors
    for key in self.tokenizer_outputs_to_remove:
        if key in inputs:
            inputs.pop(key, None)

    start = time.time()
    tokens = self._model.generate(
        **inputs,
        max_new_tokens=self.max_new_tokens,
        stopping_criteria=self._stopping_criteria,
        **self.generate_kwargs,
    )  # (lambda): insert torch.cuda.nvtx markers here for profiling
    end = time.time()
    print("3 complete, the time of model.generate: ", end - start)

    completion_tokens = tokens[0][inputs["input_ids"].size(1):]
    completion = self._tokenizer.decode(completion_tokens, skip_special_tokens=True)
    print("4 complete, len(completion): ", getlen(completion))
    print("************finish complete function************\n\n")

    return CompletionResponse(text=completion, raw={"model_output": tokens})
```
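The `torch.cuda.nvtx` note in the comment above refers to NVTX ranges for GPU profiling with Nsight Systems. As a minimal sketch (assuming a CUDA device and that profiling is run under `nsys`), the `generate` call in the method above could be wrapped like this:

```python
import torch

# NVTX ranges show up as named spans in the Nsight Systems timeline,
# making it easy to see how long generate() actually spends on the GPU.
torch.cuda.nvtx.range_push("model.generate")
tokens = self._model.generate(
    **inputs,
    max_new_tokens=self.max_new_tokens,
    stopping_criteria=self._stopping_criteria,
    **self.generate_kwargs,
)
torch.cuda.synchronize()  # wait for the GPU so the wall-clock timing is meaningful
torch.cuda.nvtx.range_pop()
```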
I found that if I run `query_engine.query(tmp_query)`, the latency of the original query is larger than that of the new (rewritten) query. Part of the log is below. My original query is: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed? For the original query, the query length is 36 words, the number of new tokens is 149, and the latency is 12.3 seconds, while for the new query the prompt length is 923 words, the number of new tokens is 29, and the latency is 2.75 seconds. Why is the new query's latency smaller than the original query's?
```
<IPython.core.display.Markdown object>
type(query1):<class 'str'> and type(query_engine):<class 'llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine'>
*****retriever_query_engine.py tart _query*****
retriever_query_engine.py retriever time: 0.017117977142333984
1 Refine::get_response, get_response, the prev_response is: None
2 Refine::get_response, get_response, the len(text_chunk) is: 877 and len(query_str) is: 24
1 _give_response_single, the type(text_qa_template): <class 'llama_index.core.prompts.base.SelectorPromptTemplate'> and the type(text_chunks): <class 'list'>
1 Refine::_default_program_factory, self._structured_answer_filtering: False
2 _give_response_single, the responese:None and the self._streaming: False and type(cur_text_chunk): <class 'str'>
3 _give_response_single, the response is None and not self._streaming
DefaultRefineProgram::__call, the self._output_cls: None
DefaultRefineProgram::__call, call llm::predict
*******start llm predict
1 complete, len(prompt): 923
2 complete, the time of tokenizer: 0.0053899288177490234
3 complete, the time of model.generate: 2.725148916244507
4 complete, len(completion): 29
************finish complete function************
LLM::predict, the predict time 2.7312374114990234
Refine::get_response, self._give_response_single, the time is: 2.7343428134918213
retriever_query_engine.py self._response_synthesizer.synthesize: 2.7391698360443115
query+generate duration:2.7564148902893066
****finish one query****
*****retriever_query_engine.py tart _query*****
retriever_query_engine.py retriever time: 0.01692509651184082
1 Refine::get_response, get_response, the prev_response is: None
2 Refine::get_response, get_response, the len(text_chunk) is: 877 and len(query_str) is: 36
1 _give_response_single, the type(text_qa_template): <class 'llama_index.core.prompts.base.SelectorPromptTemplate'> and the type(text_chunks): <class 'list'>
1 Refine::_default_program_factory, self._structured_answer_filtering: False
2 _give_response_single, the responese:None and the self._streaming: False and type(cur_text_chunk): <class 'str'>
3 _give_response_single, the response is None and not self._streaming
DefaultRefineProgram::__call, the self._output_cls: None
DefaultRefineProgram::__call, call llm::predict
*******start llm predict
1 complete, len(prompt): 935
2 complete, the time of tokenizer: 0.00542449951171875
3 complete, the time of model.generate: 12.346087455749512
4 complete, len(completion): 172
************finish complete function************
LLM::predict, the predict time 12.352328300476074
Refine::get_response, self._give_response_single, the time is: 12.35537576675415
retriever_query_engine.py self._response_synthesizer.synthesize: 12.360142707824707
query_wo_rewrite_time:12.377184391021729
```
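As a cross-check on these numbers, a rough standalone benchmark (a sketch only; the model name and prompt below are placeholders, not taken from this setup) can show whether the `model.generate` time is driven mainly by the number of generated tokens rather than by the prompt length:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any local causal LM can be substituted.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Fixed, moderately long prompt so that only max_new_tokens varies.
prompt = "Summarize the abstract of the Llama 2 paper. " * 20
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for max_new in (32, 128, 256):
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"max_new_tokens={max_new}: {time.time() - start:.2f}s")
```

If the time grows roughly linearly with `max_new_tokens` at a fixed prompt, that is consistent with the log above, where the run that produced 29 new tokens took about 2.7 s and the run that produced 172 new tokens took about 12.3 s.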
Based on the information you've provided, it seems like the latency issue you're experiencing might be due to several factors inherent in the design and execution flow of the LlamaIndex query processing mechanism. Here are some potential reasons:
1. **Recursive Query Processing**: The query processing mechanism is recursive, especially for non-leaf nodes. This means that for each level of the tree, the system potentially makes multiple calls to either the `_query_with_selected_node` or `_query_level` methods, which in turn may involve calls to external services (e.g., `self._service_context.llm.predict`). The recursion depth and the number of recursive calls can significantly increase the latency, particularly if the original query requires deeper traversal of the tree structure compared to a new query.
2. **External Service Calls**: The methods `_query_with_selected_node` and `_query_level` involve calls to external services for prediction (`self._service_context.llm.predict`). The latency of these external calls can vary based on several factors such as network latency, the load on the external service, and the complexity of the query. If the original query involves more complex or numerous external service calls than a new query, this could lead to higher latency.
3. **Dynamic Template Selection and Text Splitting**: The system dynamically selects query templates (`query_template` or `query_template_multiple`) and performs text splitting based on the number of child nodes and the branching factor. This dynamic behavior, including the construction of numbered node text and the handling of multiple chunks of text, can introduce additional computational overhead, especially if the original query results in a more complex branching structure.
4. **Node Selection and Number Extraction**: The process of selecting nodes based on the responses from the external service (`extract_numbers_given_response`) and the subsequent logic to handle invalid responses or out-of-range numbers can add to the latency. This is more pronounced if the original query leads to responses that require additional handling for number extraction and validation.
5. **Verbose Logging and Debugging**: The verbose logging and debugging statements, although useful for development and troubleshooting, can introduce I/O overhead, especially if the logging level is set to debug and the system is logging detailed information for each step of the query processing.
In summary, the higher latency for the original query compared to a new query could be due to the recursive nature of the query processing, the reliance on external service calls, the dynamic selection of query templates and text splitting, the complexity of node selection and number extraction, and the potential overhead from verbose logging. Each of these factors can contribute to the overall latency, and their impact can vary based on the specifics of the original query and the structure of the LlamaIndex.
You can find more details about the query processing mechanism in the select_leaf_retriever.py file.
To resolve this issue, you might want to consider optimizing the factors mentioned above. For instance, you could try to reduce the recursion depth or the number of external service calls, simplify the dynamic template selection and text splitting, streamline the node selection and number extraction, or adjust the logging level to reduce I/O overhead. However, please note that these optimizations should be done carefully as they could potentially affect the functionality and performance of other parts of the LlamaIndex.
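For the logging point specifically, here is a minimal sketch of turning down log verbosity with the standard `logging` module (the logger name `llama_index` is an assumption; adjust it to match your installed version):

```python
import logging
import sys

# Keep only warnings and errors so per-step debug output does not add I/O overhead.
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger("llama_index").setLevel(logging.WARNING)
```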
I hope this helps! If you have any further questions or need more clarification, feel free to ask.
Does LlamaIndex support batch queries? Currently my code handles a single query; if I want to run batch queries, how should I do it?
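One possible approach, shown only as a hedged sketch (I am not aware of a dedicated batch API; the queries below are placeholders, and `aquery` is assumed to be available on the query engine in your version): loop over the queries, or fan them out concurrently through the async interface.

```python
import asyncio

# Placeholder queries; query_engine is the RetrieverQueryEngine from the log above.
queries = [
    "What are the two primary objectives of the Llama 2 work?",
    "What is the parameter range of the Llama 2 models?",
]

async def run_batch(engine, qs):
    # Issue all queries concurrently via the async query interface.
    tasks = [engine.aquery(q) for q in qs]
    return await asyncio.gather(*tasks)

responses = asyncio.run(run_batch(query_engine, queries))
for q, r in zip(queries, responses):
    print(q, "->", str(r))
```

Note that with a local HuggingFace model the underlying `generate` call is synchronous, so the requests will likely still execute one at a time; the concurrency mainly helps with remote LLM backends, and otherwise a plain loop over `query_engine.query(q)` is equivalent.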
Question Validation
Question
I run the query rewrite with the following code.