run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: multiple-PDF with local LLM llama3 #14424

Closed. yanwun closed this issue 1 month ago

yanwun commented 4 months ago

Question

Hello team, I use the Ollama service as my LLM server, with Llama 3 as the model. I followed the multi-document Retrieval-Augmented Generation (RAG) example with agents here: https://docs.llamaindex.ai/en/stable/examples/agent/multi_document_agents-v1/, using wiki documents as the data source for RAG. However, my responses are very slow regardless of whether I have one document or two. Additionally, I have encountered the following issue:

from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMMultiSelector
import time

s = time.time()
# custom_obj_retriever is the object-index retriever over the per-document
# tools, built earlier by following the multi-document agents example
test_retriever = custom_obj_retriever.retrieve("Tell me about Tokyo?")
query_engine = RouterQueryEngine(
    selector=LLMMultiSelector.from_defaults(),
    query_engine_tools=test_retriever,
    verbose=True,
)
response = query_engine.query("Tell me more about Tokyo?")
print(str(response))
print(str(response.metadata["selector_result"]))
e = time.time()
print(e - s)  # total wall-clock time for the query

Outputs:

Selecting query engine 0: The summary provided in option 1 directly answers the question 'Tell me more about Tokyo?' and provides information about Tokyo's culture, history, and modern infrastructure..
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: vector_tool_data_Tokyo
Action Input: {'input': 'Tokyo'}
Observation: The city that ceased to be a mere figurehead and became both the de facto and de jure ruler of the country after the overthrow of the Tokugawa shogunate, recognising the advantages of the existing infrastructure and the vastness of the Kanto Plain.
Thought: I can use more information from the tool to answer the question.
Action: vector_tool_data_Tokyo
Action Input: {'input': 'Tokyo', 'num_results': 5}
Observation: Tokyo.
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: Tokyo is a city in Japan that played a significant role in the country's history, serving as both the de facto and de jure ruler of Japan after the overthrow of the Tokugawa shogunate.
Tokyo is a city in Japan that played a significant role in the country's history, serving as both the de facto and de jure ruler of Japan after the overthrow of the Tokugawa shogunate.
selections=[SingleSelection(index=0, reason="The summary provided in option 1 directly answers the question 'Tell me more about Tokyo?' and provides information about Tokyo's culture, history, and modern infrastructure.")]
28.08979320526123
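
For context, the Ollama/Llama 3 side of the setup is not shown above. A minimal configuration in LlamaIndex looks roughly like this (a sketch, assuming the llama-index-llms-ollama integration is installed; the model name and timeout are illustrative):

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Point LlamaIndex at a locally running Ollama server that serves Llama 3.
# The model name and request_timeout below are illustrative assumptions.
Settings.llm = Ollama(model="llama3", request_timeout=120.0)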

logan-markewich commented 4 months ago

Open-source LLMs are pretty bad in general at acting as agents right now. Also, Ollama is slow in general compared to other APIs.

yanwun commented 4 months ago

@logan-markewich thanks for helping answer the question. What I mean is that my query engine sometimes goes through multiple observations before the LLM produces a response; I want to control it so that it responds after the first observation. How can I achieve this?

Thanks.
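
One possible knob here (a sketch, not a confirmed fix from this thread; it assumes the per-document agents are built as ReActAgents, as in the linked example, and vector_tool, summary_tool, and llm stand in for the objects created there) is ReActAgent's max_iterations, which caps the Thought/Action/Observation loop:

from llama_index.core.agent import ReActAgent

doc_agent = ReActAgent.from_tools(
    [vector_tool, summary_tool],  # placeholder per-document tools from the example
    llm=llm,                      # placeholder, e.g. the Ollama LLM configured earlier
    verbose=True,
    max_iterations=3,             # cap the number of reasoning/tool-call rounds (default 10)
)

Note that if the cap is reached before the agent decides it can answer, it raises an error rather than returning a partial answer, so setting it to 1 will usually fail instead of cleanly stopping after the first observation.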

MikeDepies commented 4 months ago

From what I can see, the implementation of the Ollama wrapper instantiates a new httpx client each time it runs a query. This is where I'm seeing all of the time cost for responses.
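
A rough way to check that locally (a sketch; the endpoint and payload follow the public Ollama REST API, and the model name is an illustrative assumption) is to compare a fresh httpx.Client per request against a single reused client:

import time
import httpx

payload = {"model": "llama3", "prompt": "Say hi", "stream": False}

# Fresh client per request, roughly the pattern described above
t0 = time.time()
for _ in range(3):
    with httpx.Client(timeout=120.0) as client:
        client.post("http://localhost:11434/api/generate", json=payload)
print("new client per call:", time.time() - t0)

# One reused client, which keeps the connection alive between calls
t0 = time.time()
with httpx.Client(timeout=120.0) as client:
    for _ in range(3):
        client.post("http://localhost:11434/api/generate", json=payload)
print("reused client:", time.time() - t0)

The difference between the two loops isolates client/connection setup overhead from the model's actual generation time.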