run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: ImportError when trying to read a PDF file using SimpleDirectoryReader #12254

Closed arashaga closed 7 months ago

arashaga commented 7 months ago

Bug Description

When trying to load a PDF file using SimpleDirectoryReader from the llama_index library, an ImportError is raised. The error message indicates that 'DocxReader' cannot be imported from 'llama_index.readers.file', even though I was only trying to read a PDF file. The traceback below also says that the llama-index-readers-file package was not found, but pip reports it as already installed.

Version

1.10.19 and 0.10.23

Steps to Reproduce

1. Import SimpleDirectoryReader from llama_index.core.
2. Try to load a PDF file using SimpleDirectoryReader.

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=["data/pdfs/example.pdf"]).load_data()

Relevant Logs/Tracebacks

ImportError                               Traceback (most recent call last)
File /anaconda/envs/lama_env/lib/python3.10/site-packages/llama_index/core/readers/file/base.py:25, in _try_loading_included_file_formats()
     24 try:
---> 25     from llama_index.readers.file import (
     26         DocxReader,
     27         EpubReader,
     28         HWPReader,
     29         ImageReader,
     30         IPYNBReader,
     31         MarkdownReader,
     32         MboxReader,
     33         PandasCSVReader,
     34         PDFReader,
     35         PptxReader,
     36         VideoAudioReader,
     37     )  # pants: no-infer-dep
     38 except ImportError:

ImportError: cannot import name 'DocxReader' from 'llama_index.readers.file' (unknown location)

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Cell In[6], line 3
      1 from llama_index.core import SimpleDirectoryReader
----> 3 documents = SimpleDirectoryReader(input_files=["data/pdfs/ISLR Seventh Printing.pdf"]).load_data()

File /anaconda/envs/lama_env/lib/python3.10/site-packages/llama_index/core/readers/file/base.py:550, in SimpleDirectoryReader.load_data(self, show_progress, num_workers, fs)
    545         files_to_process = tqdm(
    546             self.input_files, desc="Loading files", unit="file"
    547         )
    548     for input_file in files_to_process:
    549         documents.extend(
--> 550             SimpleDirectoryReader.load_file(
    551                 input_file=input_file,
    552                 file_metadata=self.file_metadata,
    553                 file_extractor=self.file_extractor,
    554                 filename_as_id=self.filename_as_id,
    555                 encoding=self.encoding,
    556                 errors=self.errors,
    557                 fs=fs,
    558             )
    559         )
    561 return self._exclude_metadata(documents)

File /anaconda/envs/lama_env/lib/python3.10/site-packages/llama_index/core/readers/file/base.py:384, in SimpleDirectoryReader.load_file(input_file, file_metadata, file_extractor, filename_as_id, encoding, errors, fs)
    353 """Static method for loading file.
    354 
    355 NOTE: necessarily as a static method for parallel processing.
   (...)
    381     List[Document]: loaded documents
    382 """
    383 # TODO: make this less redundant
--> 384 default_file_reader_cls = SimpleDirectoryReader.supported_suffix_fn()
    385 default_file_reader_suffix = list(default_file_reader_cls.keys())
    386 metadata: Optional[dict] = None

File /anaconda/envs/lama_env/lib/python3.10/site-packages/llama_index/core/readers/file/base.py:39, in _try_loading_included_file_formats()
     25     from llama_index.readers.file import (
     26         DocxReader,
     27         EpubReader,
   (...)
     36         VideoAudioReader,
     37     )  # pants: no-infer-dep
     38 except ImportError:
---> 39     raise ImportError("`llama-index-readers-file` package not found")
     41 default_file_reader_cls: Dict[str, Type[BaseReader]] = {
     42     ".hwp": HWPReader,
     43     ".pdf": PDFReader,
   (...)
     57     ".ipynb": IPYNBReader,
     58 }
     59 return default_file_reader_cls

ImportError: `llama-index-readers-file` package not found
dosubot[bot] commented 7 months ago

Hey @arashaga! 👋 I'm Dosu, a bot here to help you solve bugs, answer your questions, and guide you on becoming a contributor while you're waiting for a human maintainer. I'm diving into your issue and will get back to you with a full answer shortly. 🤖💡

logan-markewich commented 7 months ago

@arashaga this tells me that either you did not install the llama-index-readers-file package (which is included in the llama-index package by default), or your venv is corrupted.

I would try a new venv.

In a new terminal:

pip uninstall llama-index  # remove any global install
python -m venv venv 
source venv/bin/activate
pip install llama-index
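
Before recreating the environment, it can help to confirm the mismatch from inside the affected interpreter (for example, in the same notebook kernel). The following is a minimal sketch using only the standard library; the `diagnose` helper and its report keys are hypothetical names chosen for illustration, not part of llama_index. It reports which interpreter is running, whether pip's metadata for the distribution is visible to it, and whether the module actually exposes the expected class.

```python
import importlib
import importlib.metadata
import sys


def diagnose(dist_name: str, module_name: str, attr_name: str) -> dict:
    """Compare installed-package metadata with what the interpreter can import.

    Hypothetical helper for illustration only.
    """
    report = {
        "interpreter": sys.executable,  # which Python is actually running
        "dist_version": None,           # version from package metadata, if any
        "module_file": None,            # None can indicate a namespace package
        "has_attr": False,              # is the expected name exported?
        "import_error": None,
    }
    try:
        report["dist_version"] = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        pass  # metadata for this distribution is not visible to this interpreter
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        report["import_error"] = str(exc)
    else:
        report["module_file"] = getattr(module, "__file__", None)
        report["has_attr"] = hasattr(module, attr_name)
    return report


if __name__ == "__main__":
    result = diagnose(
        "llama-index-readers-file", "llama_index.readers.file", "PDFReader"
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

If `interpreter` points somewhere other than the environment where pip was run, or `dist_version` is set while `has_attr` is False, that would be consistent with the broken or mixed install described above, though it is not proof of it.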
arashaga commented 7 months ago

Thank you @logan-markewich, that was the case. I created a new env and it worked. Thank you.