run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Understanding the PDFLoader #7610

Closed: Data-drone closed this issue 1 year ago

Data-drone commented 1 year ago

Question

How do I find the default settings for PDFLoader?

dosubot[bot] commented 1 year ago

🤖

Hello,

Based on the information available in the LlamaIndex repository, there is no PDFLoader class or module. The repository does contain a PDFReader class, which parses PDF files. It exposes no explicit default settings. Its load_data method loads and parses the PDF; it takes a required file parameter, a Path object pointing to the PDF to read, and an optional extra_info parameter (default None), a dictionary of additional metadata that is merged into each page's metadata.

Here is the relevant code snippet from the PDFReader class:

def load_data(
    self, file: Path, extra_info: Optional[Dict] = None
) -> List[Document]:
    """Parse file."""
    try:
        import pypdf
    except ImportError:
        raise ImportError(
            "pypdf is required to read PDF files: `pip install pypdf`"
        )
    with open(file, "rb") as fp:
        # Create a PDF object
        pdf = pypdf.PdfReader(fp)

        # Get the number of pages in the PDF document
        num_pages = len(pdf.pages)

        # Iterate over every page
        docs = []
        for page in range(num_pages):
            # Extract the text from the page
            page_text = pdf.pages[page].extract_text()
            page_label = pdf.page_labels[page]

            metadata = {"page_label": page_label, "file_name": file.name}
            if extra_info is not None:
                metadata.update(extra_info)

            docs.append(Document(text=page_text, metadata=metadata))
        return docs

This code is located in the llama_index/readers/file/docs_reader.py file in the repository.
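
The only defaults visible in this snippet are the two metadata keys attached to each page. Below is a minimal sketch of that merge step, written as a standalone function so it runs without llama_index or pypdf installed. The function name build_page_metadata and the sample values are hypothetical and not part of the library:

```python
from typing import Dict, Optional


def build_page_metadata(
    page_label: str, file_name: str, extra_info: Optional[Dict] = None
) -> Dict:
    """Mirror the metadata logic of the quoted load_data snippet (illustrative only)."""
    # Keys the quoted snippet sets for every page
    metadata = {"page_label": page_label, "file_name": file_name}
    # extra_info is merged last, so its keys win on collision
    if extra_info is not None:
        metadata.update(extra_info)
    return metadata


# Hypothetical sample values
default_meta = build_page_metadata("1", "report.pdf")
custom_meta = build_page_metadata(
    "1", "report.pdf", {"source": "upload", "file_name": "alias.pdf"}
)
```

Because extra_info is applied with dict.update, a key such as file_name passed in extra_info replaces the value the reader would otherwise set.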

If you were referring to a different PDFLoader class or module, could you please provide more context or details? This will help me provide a more accurate answer.
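
Whichever class you meant, one general way to check its defaults is to inspect the method signature with Python's standard inspect module. Below is a minimal sketch using a hypothetical stand-in class, StubPDFReader, that copies only the load_data signature quoted above so the example runs without llama_index installed. With the library installed, the same helper could presumably be pointed at the real PDFReader.load_data:

```python
import inspect
from pathlib import Path
from typing import Dict, List, Optional


class StubPDFReader:
    """Hypothetical stand-in with the same load_data signature as quoted above."""

    def load_data(self, file: Path, extra_info: Optional[Dict] = None) -> List:
        return []


def parameter_defaults(func) -> Dict[str, object]:
    """Return {parameter name: default} for parameters that declare a default."""
    signature = inspect.signature(func)
    return {
        name: param.default
        for name, param in signature.parameters.items()
        if param.default is not inspect.Parameter.empty
    }


defaults = parameter_defaults(StubPDFReader.load_data)

# Parameters without a default must be supplied by the caller
required = [
    name
    for name, param in inspect.signature(StubPDFReader.load_data).parameters.items()
    if param.default is inspect.Parameter.empty and name != "self"
]
```

For the signature quoted above, this reports extra_info as the only parameter with a default (None) and file as the only required argument.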

Best regards, Dosu

logan-markewich commented 1 year ago

The bot is correct. If you want more details, click the link to the source file.