run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: PDFReader how to specify chunk size? #7509

Closed: xrtechspot closed this issue 1 year ago

xrtechspot commented 1 year ago

Question

Is there a way to specify the chunk size to use in the loader?

PDFReader = download_loader("PDFReader")
loader = PDFReader()
chunks = loader.load_data(file=Path(file_path))

Thank you

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for your question. Currently, the PDFReader class in the LlamaIndex codebase does not have a parameter to specify the chunk size when loading data from a PDF file. The load_data method in the PDFReader class only takes a file parameter of type Path and an extra_info parameter of type Optional[Dict].

Here is the method signature for reference:

def load_data(
    self, file: Path, extra_info: Optional[Dict] = None
) -> List[Document]:

In the method body, the PDF file is opened and read in its entirety without any chunking:

with open(file, "rb") as fp:
    # Create a PDF object
    pdf = pypdf.PdfReader(fp)

Therefore, there is no provision for specifying a chunk size when loading data from a PDF file in the current implementation of the PDFReader class.

If you need to process the PDF file in chunks, you might need to implement this functionality yourself. You could, for example, modify the load_data method to take an additional parameter for the chunk size and then read the PDF file in chunks of that size.
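If it helps, here is a minimal sketch of that approach. The ChunkedPDFReader class and its chunk_size parameter are hypothetical (not part of LlamaIndex), and it splits on characters rather than tokens for simplicity:

from pathlib import Path
from typing import Dict, List, Optional

import pypdf
from llama_index import Document

class ChunkedPDFReader:
    """Hypothetical loader that splits each page's text into fixed-size chunks."""

    def __init__(self, chunk_size: int = 1024):
        self.chunk_size = chunk_size

    def load_data(self, file: Path, extra_info: Optional[Dict] = None) -> List[Document]:
        docs: List[Document] = []
        with open(file, "rb") as fp:
            pdf = pypdf.PdfReader(fp)
            for page_num, page in enumerate(pdf.pages):
                text = page.extract_text() or ""
                # Emit one Document per fixed-size slice of the page text,
                # keeping the page number in the metadata for citations
                for start in range(0, len(text), self.chunk_size):
                    metadata = {"page_label": page_num + 1, **(extra_info or {})}
                    docs.append(Document(text=text[start:start + self.chunk_size], metadata=metadata))
        return docs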

I hope this answers your question. If you have any further questions, please feel free to ask.

Best regards, Dosu


logan-markewich commented 1 year ago

@xrtechspot There is not. Right now it just creates one Document object per page (this helps with citations, since the page number goes into each document's metadata).
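For illustration, here is how the per-page behavior can be observed. This is a sketch: "example.pdf" is a placeholder path, and the exact metadata keys (such as page_label) depend on the loader version.

from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path("example.pdf"))

print(len(documents))         # one Document per page
print(documents[0].metadata)  # per-page metadata, e.g. the page number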

logan-markewich commented 1 year ago

You can specify the chunk size when you actually convert the documents into nodes

xrtechspot commented 1 year ago

You can specify the chunk size when you actually convert the documents into nodes

How do I do that?

logan-markewich commented 1 year ago

Usually, you just set the chunk size in the service context:

from llama_index import ServiceContext, VectorStoreIndex

# Set the default chunk size used when documents are parsed into nodes
service_context = ServiceContext.from_defaults(chunk_size=512)

# The documents are split into 512-token nodes as the index is built
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
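Alternatively, the conversion to nodes can be done explicitly with a node parser. A minimal sketch, assuming a 0.8-era llama_index where SimpleNodeParser is importable from llama_index.node_parser and documents is the list returned by the loader:

from llama_index import VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# Split each Document into ~512-token nodes with a small overlap
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# Build the index from the pre-chunked nodes
index = VectorStoreIndex(nodes)

Either way, the chunk size takes effect when documents are parsed into nodes, not at PDF load time.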