Closed xrtechspot closed 1 year ago
🤖
Hello,
Thank you for your question. Currently, the PDFReader class in the LlamaIndex codebase does not have a parameter to specify the chunk size when loading data from a PDF file. The load_data method in the PDFReader class only takes a file parameter of type Path and an extra_info parameter of type Optional[Dict].
Here is the method signature for reference:
def load_data(
    self, file: Path, extra_info: Optional[Dict] = None
) -> List[Document]:
In the method body, the PDF file is opened and read in its entirety without any chunking:
with open(file, "rb") as fp:
    # Create a PDF object
    pdf = pypdf.PdfReader(fp)
Therefore, there is no provision for specifying a chunk size when loading data from a PDF file in the current implementation of the PDFReader class.
If you need to process the PDF file in chunks, you might need to implement this functionality yourself. You could, for example, modify the load_data method to take an additional parameter for the chunk size and then read the PDF file in chunks of that size.
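As a rough illustration of that idea, here is a minimal sketch of splitting already-extracted page text into fixed-size pieces. The chunk_text helper and its overlap parameter are hypothetical and not part of LlamaIndex or pypdf, and the sketch counts characters, which may differ from how the library measures chunk size.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper for illustration; not a LlamaIndex or pypdf API.
    Consecutive pieces share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this piece reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Stand-in for the text a loader extracted from one PDF page.
page_text = "abcdefghij"
print(chunk_text(page_text, chunk_size=4))
print(chunk_text(page_text, chunk_size=4, overlap=1))
```

You would call something like this on the text of each document returned by load_data, rather than changing how the PDF file itself is read.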
I hope this answers your question. If you have any further questions, please feel free to ask.
Best regards, Dosu
@xrtechspot There is not. Right now it just creates one document object per page (this helps with citations, since the page number goes into the metadata).
You can specify the chunk size when you actually convert the documents into nodes.
How do I do that?
Usually, you just set the chunk size in the service context:
from llama_index import ServiceContext, VectorStoreIndex
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
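To show conceptually what "one document per page, then split into nodes" means, here is a dependency-free sketch. It is not the library's actual implementation: PageDoc and split_into_nodes are hypothetical names, the "page_label" metadata key is an assumption, and the sketch counts whitespace-separated words as a rough stand-in for however the library measures chunk size.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageDoc:
    """Hypothetical stand-in for a per-page document or a node."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_into_nodes(docs: List[PageDoc], chunk_size: int) -> List[PageDoc]:
    """Split each page into chunks of at most chunk_size words.

    Every chunk gets a copy of its page's metadata, so the page
    number stays available for citations.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    nodes: List[PageDoc] = []
    for doc in docs:
        words = doc.text.split()
        for start in range(0, len(words), chunk_size):
            piece = " ".join(words[start:start + chunk_size])
            nodes.append(PageDoc(text=piece, metadata=dict(doc.metadata)))
    return nodes


# Two fake "pages", as if produced by a per-page PDF loader.
docs = [
    PageDoc("one two three four five", {"page_label": "1"}),
    PageDoc("six seven", {"page_label": "2"}),
]
nodes = split_into_nodes(docs, chunk_size=2)
for node in nodes:
    print(node.metadata["page_label"], node.text)
```

The point is that chunking happens after loading, so a chunk never spans two pages and each one still knows which page it came from.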
Question Validation
Question
Is there a way to specify the chunk size for the loader to use?
PDFReader = download_loader("PDFReader")
loader = PDFReader()
chunks = loader.load_data(file=Path(file_path))
Thank you