run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: PyMuPDF loader not working with SimpleDirectoryReader #16758

Closed. vikrantdeshpande09876 closed this issue 4 weeks ago

vikrantdeshpande09876 commented 4 weeks ago

Bug Description

Trying to execute:

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PyMuPDFReader

reader_config = SimpleDirectoryReader(
    input_dir="../data/raw/rag-data",
    recursive=True,
    file_extractor={
        '.pdf' : PyMuPDFReader
    }
)
documents = reader_config.load_data()

Version

llama-index==0.11.19
llama-index-agent-openai==0.3.4
llama-index-cli==0.3.1
llama-index-core==0.11.19
llama-index-embeddings-huggingface==0.3.1
llama-index-embeddings-openai==0.2.5
llama-index-indices-managed-llama-cloud==0.4.0
llama-index-legacy==0.9.48.post3
llama-index-llms-langchain==0.4.2
llama-index-llms-openai==0.2.16
llama-index-llms-openai-like==0.2.0
llama-index-llms-openllm==0.3.1
llama-index-multi-modal-llms-openai==0.2.3
llama-index-program-openai==0.2.0
llama-index-question-gen-openai==0.2.0
llama-index-readers-file==0.2.2
llama-index-readers-llama-parse==0.3.0
llama-index-vector-stores-postgres==0.2.6

Steps to Reproduce

Shouldn't this line be file: Union[Path, str], instead? Am I missing something obvious here, or is there some version inconsistency?

Relevant Logs/Tracebacks

Failed to load file /Users/myPC/../data/raw/rag-data/file1(en).pdf with error: PyMuPDFReader.load_data() missing 1 required positional argument: 'file_path'. Skipping...
Failed to load file /Users/myPC/../data/raw/rag-data/file2.pdf with error: PyMuPDFReader.load_data() missing 1 required positional argument: 'file_path'. Skipping...
Failed to load file /Users/myPC/../data/raw/rag-data/file3.pdf with error: PyMuPDFReader.load_data() missing 1 required positional argument: 'file_path'. Skipping...
Failed to load file /Users/myPC/../data/raw/rag-data/file4.pdf with error: PyMuPDFReader.load_data() missing 1 required positional argument: 'file_path'. Skipping...
dosubot[bot] commented 4 weeks ago

A similar issue was reported where the SimpleDirectoryReader had problems with PDF reading, and it was resolved by updating to version 0.10.3 of the library [1]. However, since you're using a later version, the problem might be different.

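The error message says that PyMuPDFReader.load_data() is missing the file_path argument. This happens when load_data() is called on the PyMuPDFReader class itself rather than on an instance: the file path is bound to self, which leaves file_path unfilled. Make sure the file_extractor maps '.pdf' to an instance, PyMuPDFReader(), instead of the bare class PyMuPDFReader. Here's how you can modify your code: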

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PyMuPDFReader

reader_config = SimpleDirectoryReader(
    input_dir="../data/raw/rag-data",
    recursive=True,
    file_extractor={
        '.pdf': PyMuPDFReader()  # Instantiate the reader
    }
)
documents = reader_config.load_data()

This should resolve the issue: once the extractor is an instance, load_data() receives the file path as its file_path argument.
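
For anyone hitting the same traceback, the failure can be reproduced without llama_index at all. The sketch below is a minimal illustration, not the library's code: FakeReader and call_reader are hypothetical stand-ins, and it only assumes that the directory reader invokes reader.load_data(path) on whatever object was placed in file_extractor.

```python
class FakeReader:
    """Hypothetical stand-in for a file reader; not the llama_index class."""

    def load_data(self, file_path, extra_info=None):
        return [f"loaded {file_path}"]


def call_reader(reader, path):
    # Hypothetical helper: mimics a directory reader calling the
    # extractor that was configured for a file extension.
    return reader.load_data(path)


# Instance: `self` is bound automatically, so `path` fills `file_path`.
ok = call_reader(FakeReader(), "file1.pdf")

# Bare class: nothing is bound, so `path` fills `self`
# and `file_path` is left without a value.
try:
    call_reader(FakeReader, "file1.pdf")
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

The second call raises a TypeError about the missing file_path argument, the same shape of error as in the logs above, which is why instantiating the reader is the fix.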

logan-markewich commented 4 weeks ago

Yea pretty sure you need to instantiate the reader like dosu suggested

vikrantdeshpande09876 commented 4 weeks ago

Man 🤦. Okay closing this out.