PDF page number is absent in knowledge retrieval

fniu commented 1 month ago

Self Checks

[X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.2

Cloud or Self Hosted

Cloud

Steps to reproduce

When I create a knowledge base from PDF files, I expect to find the page number in the metadata, since #7749 has been merged. However, I can't find "page" in the metadata, as shown below. Am I missing anything needed to "enable" that feature?

    {
      "metadata": {
        "_source": "knowledge",
        "position": 1,
        "dataset_id": "f392217b-73ae-421f-8fea-207cafa9c926",
        "dataset_name": "Krav med raadstext...",
        "document_id": "0d7ba751-0ffd-4c81-8efa-be1d1c8e14ba",
        "document_name": "Krav med raadstexter TRVINFRA-00304 Plankorsningar v15_0.pdf",
        "document_data_source_type": "upload_file",
        "segment_id": "621a7608-9da0-4dac-8473-f644e9db8f35",
        "retriever_from": "workflow",
        "score": 0.32195564148658923,
        "segment_hit_count": 2,
        "segment_word_count": 2408,
        "segment_position": 60,
        "segment_index_node_hash": "4f1a192d11f56fa99fd1cb5f9a0b50ca25db417dccd32e035a0499d8aadb3da3"
      },

✔️ Expected Behavior

I expect to find the page number in the metadata returned by retrieval.

❌ Actual Behavior

The "page" key is missing from the metadata.
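
For anyone who wants to check this programmatically, here is a minimal sketch. The helper name `missing_metadata_keys` and the trimmed `sample_result` are my own illustration (not part of Dify); the sample only copies a few fields from the retrieval result shown above.

```python
from typing import Any, Iterable


def missing_metadata_keys(
    result: dict[str, Any], required: Iterable[str] = ("page",)
) -> list[str]:
    """Return the required keys that are absent from result["metadata"]."""
    metadata = result.get("metadata", {})
    return [key for key in required if key not in metadata]


# Trimmed copy of the retrieval result from this report (illustrative only).
sample_result = {
    "metadata": {
        "_source": "knowledge",
        "position": 1,
        "document_data_source_type": "upload_file",
        "retriever_from": "workflow",
        "segment_position": 60,
    }
}

missing = missing_metadata_keys(sample_result)
print(missing)
```

With the metadata above, the helper reports "page" as the only missing key.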

dosubot[bot] commented 1 month ago

Hey, @fniu! I'm here to help you with bugs, questions, and becoming a contributor. Let's work on solving this issue together!

I found a similar unsolved discussion that might be relevant to your issue: Get the page number of a PDF when fetching document segments ^[1].

Regarding your issue, the function responsible for extracting and adding the page number to the metadata is parse within the PdfExtractor class. This function is called within the load method of the PdfExtractor class. The page number is accessed and added to the metadata via metadata = {"source": blob.source, "page": page_number} ^[2].

Here is the relevant code snippet:

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
        import pypdfium2

        with blob.as_bytes_io() as file_path:
            pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
            try:
                for page_number, page in enumerate(pdf_reader):
                    text_page = page.get_textpage()
                    content = text_page.get_text_range()
                    text_page.close()
                    page.close()
                    metadata = {"source": blob.source, "page": page_number}
                    yield Document(page_content=content, metadata=metadata)
            finally:
                pdf_reader.close()

If the page number is not appearing in the metadata, it might be due to a configuration issue or a bug. Ensure that the PdfExtractor class is being used correctly and that the metadata is being processed and stored as expected.
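
To look at the extraction pattern in isolation, the sketch below mimics the loop from the snippet above using plain strings instead of pypdfium2 pages. It rests on assumptions: `Document` here is a stand-in dataclass rather than Dify's own class, and `pages_to_documents` is a hypothetical helper name. Note that `enumerate` starts at 0, so the stored "page" value is zero-based.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator


@dataclass
class Document:
    """Stand-in for the real Document class (hypothetical, for illustration)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def pages_to_documents(page_texts: Iterable[str], source: str) -> Iterator[Document]:
    """Yield one Document per page, tagged with its zero-based page index."""
    for page_number, content in enumerate(page_texts):
        metadata = {"source": source, "page": page_number}
        yield Document(page_content=content, metadata=metadata)


docs = list(pages_to_documents(["first page", "second page"], source="example.pdf"))
pages = [doc.metadata["page"] for doc in docs]
print(pages)
```

If the key is present at this stage but absent from the retrieval result, the loss would have to happen in a later step, such as segmentation or indexing.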

If you have verified that the PdfExtractor is correctly implemented and the issue persists, it might be worth checking if there are any specific configurations or flags that need to be enabled to include the page number in the metadata. Unfortunately, I couldn't find specific information on this within the repository ^[2].

If the problem continues, you might want to follow up on the related discussion or create a new issue for further investigation.


soulzzz commented 1 month ago

It's been reverted.
