run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Why when I query my index, does it only index 2 files when I have 10 csvs? #14374

Open liamkavfc opened 6 days ago

liamkavfc commented 6 days ago

Question

I have an index built from 10 CSVs in my data directory.

But when I query, only 2 files show up as indexed, and it's not always the same 2 files; it can be 2 different files each time.

Here is my code:

```python
# Import paths here assume a pre-0.10 llama-index release (which still
# shipped LangChainLLM at the top level); newer versions move these modules.
from langchain.llms import OpenAI
from llama_index import (
    PromptHelper,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.llms import LangChainLLM

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 1.0
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap,
                                 chunk_size_limit=chunk_size_limit)
    llm_predictor = LangChainLLM(OpenAI(temperature=0.7, model="gpt-3.5-turbo",
                                        max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data(show_progress=True)
    index = VectorStoreIndex(documents, llm_predictor=llm_predictor,
                             prompt_helper=prompt_helper, store_nodes_override=True)
    index.storage_context.persist(persist_dir='./index_dir')
    return index

def chatbot(input_text):
    storage_context = StorageContext.from_defaults(persist_dir="./index_dir")
    index = load_index_from_storage(storage_context)

    query_engine = index.as_query_engine()
    response = query_engine.query(input_text)
    return response.response
```
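As a sanity check on the loading step, it can help to confirm that the reader actually sees all ten files before suspecting the index. A minimal, stdlib-only sketch (the toy directory below stands in for the real data directory; `SimpleDirectoryReader` produces at least one `Document` per file, so the file count is a lower bound on `len(documents)`):

```python
from pathlib import Path
import tempfile

# Build a toy data directory with 10 CSV files (stand-in for the real one)
data_dir = Path(tempfile.mkdtemp())
for i in range(10):
    (data_dir / f"table_{i}.csv").write_text("col_a,col_b\n1,2\n")

# Count what the loader would see; compare this to len(documents)
csv_files = sorted(data_dir.glob("*.csv"))
print(len(csv_files))  # 10
```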
logan-markewich commented 6 days ago

How do you know it's 2 files? What are you checking to confirm this?

liamkavfc commented 6 days ago

In the response I am looking at the source nodes. I'll attach some screenshots.

[screenshot: source nodes in the response]

And here are the files in my index:

[screenshot: files in the index]

logan-markewich commented 6 days ago

The default top-k is 2, so the source nodes will always be the top 2 nodes retrieved from your index.

This is why it changes between queries, and why it's always 2.
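The mechanism can be sketched without llama-index at all: a vector retriever scores every node against the query embedding and keeps only the top k, so with k=2 only two sources can ever appear no matter how many files were indexed. (The node names and 2-d vectors below are made up for illustration.)

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_vec, node_vecs, top_k=2):
    # Score every node, then keep only the top_k best matches --
    # everything else is dropped, regardless of how many files were indexed.
    scored = sorted(node_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]

nodes = {
    "file_a": [1.0, 0.0],
    "file_b": [0.9, 0.1],
    "file_c": [0.0, 1.0],
    "file_d": [0.5, 0.5],
}
print(retrieve([1.0, 0.0], nodes))             # ['file_a', 'file_b']
print(retrieve([1.0, 0.0], nodes, top_k=3))    # ['file_a', 'file_b', 'file_d']
```

With the default `top_k=2` the same query always surfaces exactly two sources, and a slightly different query embedding can surface a different pair.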

logan-markewich commented 6 days ago

You can increase the top-k, e.g. `index.as_query_engine(similarity_top_k=3)`

liamkavfc commented 6 days ago

Ah makes sense, thank you, I will try that!