langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

PDF page number is absent in knowledge retrieval #8502

Open fniu opened 3 weeks ago

fniu commented 3 weeks ago

Dify version

0.8.2

Cloud or Self Hosted

Cloud

Steps to reproduce

When I create a knowledge base from PDF files, I expect to find the page number in the metadata, since #7749 has been merged. However, I can't find "page" in the metadata, as shown below. Am I missing something needed to enable that feature?

    {
      "metadata": {
        "_source": "knowledge",
        "position": 1,
        "dataset_id": "f392217b-73ae-421f-8fea-207cafa9c926",
        "dataset_name": "Krav med raadstext...",
        "document_id": "0d7ba751-0ffd-4c81-8efa-be1d1c8e14ba",
        "document_name": "Krav med raadstexter TRVINFRA-00304 Plankorsningar v15_0.pdf",
        "document_data_source_type": "upload_file",
        "segment_id": "621a7608-9da0-4dac-8473-f644e9db8f35",
        "retriever_from": "workflow",
        "score": 0.32195564148658923,
        "segment_hit_count": 2,
        "segment_word_count": 2408,
        "segment_position": 60,
        "segment_index_node_hash": "4f1a192d11f56fa99fd1cb5f9a0b50ca25db417dccd32e035a0499d8aadb3da3"
      },

✔️ Expected Behavior

I expect the page number to be present in the metadata returned by retrieval.

❌ Actual Behavior

The "page" key is missing from the metadata.
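For anyone who wants to verify this on their own retrieval output, here is a minimal sketch of the check. The helper name `missing_metadata_keys` is hypothetical and not part of Dify; the sample dict is a shortened copy of the result above.

```python
def missing_metadata_keys(result: dict, expected: tuple = ("page",)) -> list:
    """Return the expected keys that are absent from result["metadata"]."""
    metadata = result.get("metadata", {})
    return [key for key in expected if key not in metadata]


# Shortened copy of the retrieval result shown above (illustrative only).
sample_result = {
    "metadata": {
        "_source": "knowledge",
        "position": 1,
        "document_data_source_type": "upload_file",
        "retriever_from": "workflow",
        "segment_position": 60,
    }
}

missing = missing_metadata_keys(sample_result)
print(missing)
```

An empty list would mean every expected key is present; here the list contains "page".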

dosubot[bot] commented 3 weeks ago

Hey, @fniu! I'm here to help you with bugs, questions, and becoming a contributor. Let's work on solving this issue together!

I found a similar unsolved discussion that might be relevant to your issue: Get the page number of a PDF when fetching document segments [1].

Regarding your issue, the function responsible for extracting and adding the page number to the metadata is parse within the PdfExtractor class. This function is called within the load method of the PdfExtractor class. The page number is accessed and added to the metadata via metadata = {"source": blob.source, "page": page_number} [2].

Here is the relevant code snippet:

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
        import pypdfium2

        # Despite its name, file_path is an in-memory bytes stream, not a path.
        with blob.as_bytes_io() as file_path:
            pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
            try:
                # enumerate() starts at 0, so "page" is a zero-based index.
                for page_number, page in enumerate(pdf_reader):
                    text_page = page.get_textpage()
                    content = text_page.get_text_range()
                    # Release the page resources before yielding.
                    text_page.close()
                    page.close()
                    metadata = {"source": blob.source, "page": page_number}
                    yield Document(page_content=content, metadata=metadata)
            finally:
                pdf_reader.close()
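The snippet above attaches the page index to each page-level document, and because it comes from `enumerate`, that index is zero-based. A minimal, library-free sketch of the same pattern follows. The `Document` dataclass is a stand-in and `pages_to_documents` is a hypothetical helper, neither is Dify's actual code; it only illustrates how the "page" value relates to the page number a reader would see.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Stand-in for the Document type used by the extractor (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(pages: List[str], source: str) -> Iterator[Document]:
    """Yield one Document per page, tagging each with a zero-based page index."""
    for page_number, content in enumerate(pages):
        yield Document(
            page_content=content,
            metadata={"source": source, "page": page_number},
        )


docs = list(pages_to_documents(["first page", "second page"], "example.pdf"))

# Convert the zero-based index to the 1-based number shown in a PDF viewer.
human_page_numbers = [doc.metadata["page"] + 1 for doc in docs]
```

Whether this value survives into the retrieval metadata depends on later processing steps, which this sketch does not model.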

If the page number is not appearing in the metadata, it might be due to a configuration issue or a bug. Ensure that the PdfExtractor class is being used correctly and that the metadata is being processed and stored as expected.

If you have verified that the PdfExtractor is correctly implemented and the issue persists, it might be worth checking if there are any specific configurations or flags that need to be enabled to include the page number in the metadata. Unfortunately, I couldn't find specific information on this within the repository [2].

If the problem continues, you might want to follow up on the related discussion or create a new issue for further investigation.


soulzzz commented 3 weeks ago

It's been reverted.