run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: multiple-PDF with local LLM llama3 #14424

Closed. yanwun closed this issue 1 month ago

yanwun commented 4 months ago

Question

Hello team, I use the Ollama service as my LLM server, with Llama 3 as the model. I followed the multi-document Retrieval-Augmented Generation (RAG) example with agents here: https://docs.llamaindex.ai/en/stable/examples/agent/multi_document_agents-v1/, using wiki documents as the data source for RAG. However, my responses are very slow regardless of whether I have one document or two. Additionally, I have encountered the following issue:

from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMMultiSelector
import time

s = time.time()
# custom_obj_retriever is the object-index retriever over the per-document
# tools, built earlier by following the multi-document agents example
test_retriever = custom_obj_retriever.retrieve("Tell me about Tokyo?")
query_engine = RouterQueryEngine(
    selector=LLMMultiSelector.from_defaults(),
    query_engine_tools=test_retriever,
    verbose=True,
)
response = query_engine.query("Tell me more about Tokyo?")
print(str(response))
print(str(response.metadata["selector_result"]))
e = time.time()
print(e - s)  # total wall-clock time for the query

Outputs:

Selecting query engine 0: The summary provided in option 1 directly answers the question 'Tell me more about Tokyo?' and provides information about Tokyo's culture, history, and modern infrastructure..
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: vector_tool_data_Tokyo
Action Input: {'input': 'Tokyo'}
Observation: The city that ceased to be a mere figurehead and became both the de facto and de jure ruler of the country after the overthrow of the Tokugawa shogunate, recognising the advantages of the existing infrastructure and the vastness of the Kanto Plain.
Thought: I can use more information from the tool to answer the question.
Action: vector_tool_data_Tokyo
Action Input: {'input': 'Tokyo', 'num_results': 5}
Observation: Tokyo.
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: Tokyo is a city in Japan that played a significant role in the country's history, serving as both the de facto and de jure ruler of Japan after the overthrow of the Tokugawa shogunate.
Tokyo is a city in Japan that played a significant role in the country's history, serving as both the de facto and de jure ruler of Japan after the overthrow of the Tokugawa shogunate.
selections=[SingleSelection(index=0, reason="The summary provided in option 1 directly answers the question 'Tell me more about Tokyo?' and provides information about Tokyo's culture, history, and modern infrastructure.")]
28.08979320526123
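
For context, the Ollama/Llama 3 side of the setup is not shown above. A minimal configuration in LlamaIndex looks roughly like this (a sketch, assuming the llama-index-llms-ollama integration is installed; the model name and timeout are illustrative):

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Point LlamaIndex at a locally running Ollama server that serves Llama 3.
# The model name and request_timeout below are illustrative assumptions.
Settings.llm = Ollama(model="llama3", request_timeout=120.0)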

logan-markewich commented 4 months ago

Open-source LLMs are pretty bad in general at acting as agents right now. Also, Ollama is slow in general compared to other APIs.

yanwun commented 4 months ago

@logan-markewich thanks for helping answer the question. What I mean is that my query engine sometimes goes through multiple observations before the LLM produces a response; I want to control it so that it responds after the first observation. How can I achieve this?

Thanks.
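
One possible knob here (a sketch, not a confirmed fix from this thread; it assumes the per-document agents are built as ReActAgents, as in the linked example, and vector_tool, summary_tool, and llm stand in for the objects created there) is ReActAgent's max_iterations, which caps the Thought/Action/Observation loop:

from llama_index.core.agent import ReActAgent

doc_agent = ReActAgent.from_tools(
    [vector_tool, summary_tool],  # placeholder per-document tools from the example
    llm=llm,                      # placeholder, e.g. the Ollama LLM configured earlier
    verbose=True,
    max_iterations=3,             # cap the number of reasoning/tool-call rounds (default 10)
)

Note that if the cap is reached before the agent decides it can answer, it raises an error rather than returning a partial answer, so setting it to 1 will usually fail instead of cleanly stopping after the first observation.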

MikeDepies commented 4 months ago

From what I can see, the implementation of the Ollama wrapper instantiates a new httpx client each time it runs a query. This is where I'm seeing all of the time cost for responses.
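
A rough way to check that locally (a sketch; the endpoint and payload follow the public Ollama REST API, and the model name is an illustrative assumption) is to compare a fresh httpx.Client per request against a single reused client:

import time
import httpx

payload = {"model": "llama3", "prompt": "Say hi", "stream": False}

# Fresh client per request, roughly the pattern described above
t0 = time.time()
for _ in range(3):
    with httpx.Client(timeout=120.0) as client:
        client.post("http://localhost:11434/api/generate", json=payload)
print("new client per call:", time.time() - t0)

# One reused client, which keeps the connection alive between calls
t0 = time.time()
with httpx.Client(timeout=120.0) as client:
    for _ in range(3):
        client.post("http://localhost:11434/api/generate", json=payload)
print("reused client:", time.time() - t0)

The difference between the two loops isolates client/connection setup overhead from the model's actual generation time.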