run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Why does a list of documents appear when I load a multi-page pdf file? #16768

Open NguyenDinhTiem opened 6 days ago

NguyenDinhTiem commented 6 days ago

Question

Why does a list of documents appear when I load a multi-page PDF file?

    documents = SimpleDirectoryReader(input_files=FILES_PATH, filename_as_id=True).load_data()

dosubot[bot] commented 6 days ago

When you load a multi-page PDF with SimpleDirectoryReader and call load_data(), you get a list because each page of the PDF becomes its own document. SimpleDirectoryReader hands PDF files to the PDFReader class, which creates one Document object per page unless its return_full_document option is set to True. That option defaults to False, so the result is a list of Document objects, each representing one page of the PDF [1][2].
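If a single document per file is what you want, one possible approach is to pass a PDFReader configured with return_full_document=True through the file_extractor argument of SimpleDirectoryReader. The sketch below assumes that the installed llama_index version exposes PDFReader in llama_index.readers.file and that it accepts this option as described above. The helper name load_pdf_as_single_document is hypothetical.

```python
from pathlib import Path


def load_pdf_as_single_document(pdf_path: str):
    """Load one PDF so that it comes back as a single Document, not one per page."""
    # Imported lazily so this module can be imported even when
    # llama_index is not installed.
    from llama_index.core import SimpleDirectoryReader
    from llama_index.readers.file import PDFReader

    # Assumption: PDFReader accepts return_full_document, as described
    # in the answer above. Check this against your installed version.
    pdf_reader = PDFReader(return_full_document=True)

    reader = SimpleDirectoryReader(
        input_files=[str(Path(pdf_path))],
        filename_as_id=True,
        # Use the configured reader for .pdf files instead of the default.
        file_extractor={".pdf": pdf_reader},
    )
    # load_data() still returns a list of Document objects.
    return reader.load_data()
```

Note that load_data() returns a list either way. With this option the list should contain a single entry per PDF instead of one entry per page.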

logan-markewich commented 6 days ago

PDFs are split by page by default, to help with citations
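To illustrate why per-page splitting helps with citations, here is a minimal sketch that uses a stand-in class, not the real llama_index Document. PageDoc, cite, and merge_pages are hypothetical names, and the file_name and page_label metadata keys are assumptions about what a page-level document carries.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a per-page document (not the llama_index class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def cite(doc: PageDoc) -> str:
    """Build a source citation from a page-level document's metadata."""
    return f"{doc.metadata['file_name']}, p. {doc.metadata['page_label']}"


def merge_pages(docs: list[PageDoc]) -> PageDoc:
    """Join page-level documents into one; the page-level metadata is dropped."""
    file_name = docs[0].metadata["file_name"]
    return PageDoc(
        text="\n\n".join(d.text for d in docs),
        metadata={"file_name": file_name},
    )


pages = [
    PageDoc("Intro text", {"file_name": "report.pdf", "page_label": "1"}),
    PageDoc("Results text", {"file_name": "report.pdf", "page_label": "2"}),
]

# With one document per page, a retrieved chunk can point at an exact page.
print(cite(pages[1]))  # report.pdf, p. 2

# After merging, the page number is no longer available for citation.
merged = merge_pages(pages)
print("page_label" in merged.metadata)  # False
```

The trade-off is that a merged document is simpler to handle but can only be cited at the file level.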