Open fniu opened 1 month ago
Hey, @fniu! I'm here to help you with bugs, questions, and becoming a contributor. Let's work on solving this issue together!
I found a similar unsolved discussion that might be relevant to your issue: Get the page number of a PDF when fetching document segments [1].
Regarding your issue, the function responsible for extracting and adding the page number to the metadata is parse within the PdfExtractor class. This function is called within the load method of the PdfExtractor class. The page number is accessed and added to the metadata via metadata = {"source": blob.source, "page": page_number} [2].
Here is the relevant code snippet:
def parse(self, blob: Blob) -> Iterator[Document]:
    """Lazily parse the blob."""
    import pypdfium2

    with blob.as_bytes_io() as file_path:
        pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
        try:
            for page_number, page in enumerate(pdf_reader):
                text_page = page.get_textpage()
                content = text_page.get_text_range()
                text_page.close()
                page.close()
                metadata = {"source": blob.source, "page": page_number}
                yield Document(page_content=content, metadata=metadata)
        finally:
            pdf_reader.close()
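Note that enumerate starts at 0, so the snippet above stores a zero-based page index. The following is a minimal sketch of the same metadata shape using only the standard library; the Document dataclass and the parse_pages helper are hypothetical stand-ins for illustration, not the repository's actual classes.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for the real Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def parse_pages(source: str, pages: list[str]) -> Iterator[Document]:
    """Yield one Document per page, mirroring the metadata shape of parse()."""
    # enumerate() starts at 0, so the first page is stored as page 0.
    for page_number, content in enumerate(pages):
        metadata = {"source": source, "page": page_number}
        yield Document(page_content=content, metadata=metadata)


docs = list(parse_pages("example.pdf", ["first page text", "second page text"]))
for doc in docs:
    # Add 1 when showing the page number to a human reader.
    print(doc.metadata["page"] + 1, doc.page_content)
```

If the retrieval results are meant to show human-readable page numbers, the offset of 1 would need to be applied wherever the value is displayed.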
If the page number is not appearing in the metadata, it might be due to a configuration issue or a bug. Ensure that the PdfExtractor class is being used correctly and that the metadata is being processed and stored as expected.
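One way to check whether the metadata survives processing is to scan the retrieved segments for the "page" key. This is only a sketch: the missing_page_key helper and the dictionary layout of each segment are assumptions made for illustration, and the actual retrieval response may be shaped differently.

```python
def missing_page_key(segments: list[dict]) -> list[int]:
    """Return the positions of segments whose metadata has no "page" key."""
    return [
        index
        for index, segment in enumerate(segments)
        if "page" not in segment.get("metadata", {})
    ]


# Hypothetical retrieval results; real responses may use other field names.
segments = [
    {"content": "intro", "metadata": {"source": "example.pdf", "page": 0}},
    {"content": "details", "metadata": {"source": "example.pdf"}},
]

print(missing_page_key(segments))
```

If every segment is reported, the key is likely dropped somewhere after extraction rather than missing for a single page.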
If you have verified that the PdfExtractor is correctly implemented and the issue persists, it might be worth checking if there are any specific configurations or flags that need to be enabled to include the page number in the metadata. Unfortunately, I couldn't find specific information on this within the repository [2].
If the problem continues, you might want to follow up on the related discussion or create a new issue for further investigation.
It's reverted.
Dify version
0.8.2
Cloud or Self Hosted
Cloud
Steps to reproduce
When I create a knowledge base from PDF files, I expect to find the page number in the metadata, since #7749 has been merged. However, I can't find "page" in the metadata, as shown below. Am I missing anything needed to "enable" that feature?
✔️ Expected Behavior
I expect to find the page number in the metadata returned by retrieval.
❌ Actual Behavior
The "page" key is missing from the metadata.