run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: PDFReader ignores extra_info/metadata from the file_metadata function #6646

Closed hawktang closed 1 year ago

hawktang commented 1 year ago

Bug Description

PDFReader ignores the extra_info/metadata returned by the file_metadata function passed to SimpleDirectoryReader.

Version

0.6.35

Steps to Reproduce

def file_metadata(input_file: str) -> dict:
    path = Path(input_file)
    return dict(
        path=input_file,
        parent=str(path.parent.name),
        name=path.name,
        stem=path.stem,
        suffix=path.suffix,
    )

app = typer.Typer()

@app.command()
def ingest(path_input: str) -> None:
    """
    Ingests the documents in the given directory and creates an index using the specified embedding model.

    Args:
        path_input (str): The path to the directory containing the documents to be ingested.
    """

    path_input = Path(path_input)
    collection_name = path_input.name
    logger.debug(f'collection_name: {collection_name}')
    logger.debug(f'ingesting {path_input}')

    vector_store = QdrantVectorStore(client=client, collection_name=collection_name)

    documents = SimpleDirectoryReader(
        path_input,
        recursive=True,
        filename_as_id=True,
        file_metadata=file_metadata
    ).load_data()
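
The behaviour expected from the reader can be sketched without LlamaIndex at all. This is an illustration only: `build_page_metadata` is a hypothetical helper, not part of the library, and the file path is made up. It assumes the reader-generated keys (`page_label`, `file_name`) act as defaults and the dict returned by `file_metadata` is layered on top.

```python
from pathlib import Path
from typing import Dict, Optional


def file_metadata(input_file: str) -> dict:
    """Same callback as in the reproduction steps above."""
    path = Path(input_file)
    return dict(
        path=input_file,
        parent=str(path.parent.name),
        name=path.name,
        stem=path.stem,
        suffix=path.suffix,
    )


def build_page_metadata(
    file: Path, page_label: str, extra_info: Optional[Dict] = None
) -> Dict:
    """Hypothetical helper: merge per-page keys with caller-supplied metadata."""
    # Build a fresh dict for every page so pages never share state.
    metadata = {"page_label": page_label, "file_name": file.name}
    if extra_info is not None:
        # Caller-supplied keys are added; they win if a key clashes.
        metadata.update(extra_info)
    return metadata


# Illustrative path only; no file is opened.
extra = file_metadata("docs/reports/q1.pdf")
merged = build_page_metadata(Path("docs/reports/q1.pdf"), "1", extra)
```

With this merge, each page's metadata carries both the reader's keys and the `parent`, `name`, `stem`, and `suffix` values from the callback, which is what the report expects to see on the loaded documents.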

Relevant Logs/Tracebacks

```python
for page in range(num_pages):
    # Extract the text from the page
    page_text = pdf.pages[page].extract_text()
    page_label = pdf.page_labels[page]
    # metadata is overwritten here: the dict passed in by the caller is
    # discarded, and the update below merges the new dict with itself
    metadata = {"page_label": page_label, "file_name": file.name}
    if metadata is not None:
        metadata.update(metadata)

    docs.append(Document(text=page_text, metadata=metadata))
```

hawktang commented 1 year ago


def load_data(
        self, file: Path, metadata: Optional[Dict] = None
    ) -> List[Document]:

....

if metadata is None:
    metadata = {}
metadata.update({"page_label": page_label, "file_name": file.name})

hawktang commented 1 year ago

Subject: [PATCH] fix metadata error in pdf reader
---
Index: llama_index/readers/file/docs_reader.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/llama_index/readers/file/docs_reader.py b/llama_index/readers/file/docs_reader.py
--- a/llama_index/readers/file/docs_reader.py   (revision d394ffd5b57b976192f002f52fc9315401b4aa09)
+++ b/llama_index/readers/file/docs_reader.py   (revision 4109d5165fc79e68014702cbad91b1692fe4ad79)
@@ -14,7 +14,7 @@
     """PDF parser."""

     def load_data(
-        self, file: Path, extra_info: Optional[Dict] = None
+        self, file: Path, metadata: Optional[Dict] = None
     ) -> List[Document]:
         """Parse file."""
         try:
@@ -36,10 +36,9 @@
                 # Extract the text from the page
                 page_text = pdf.pages[page].extract_text()
                 page_label = pdf.page_labels[page]
-
-                metadata = {"page_label": page_label, "file_name": file.name}
-                if extra_info is not None:
-                    metadata.update(extra_info)
+                if metadata is None:
+                    metadata = {}
+                metadata.update({"page_label": page_label, "file_name": file.name})

                 docs.append(Document(text=page_text, metadata=metadata))
             return docs
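
One thing to watch for in a patch of this shape: it updates the dict the caller passed in, once per page, instead of building a new dict for each page. The sketch below uses plain dicts rather than LlamaIndex objects to show the difference. `load_pages_shared` and `load_pages_fresh` are hypothetical names, and this thread does not establish whether the real `Document` copies its metadata, so treat this as a general Python illustration rather than a statement about the library.

```python
from typing import Dict, List, Optional


def load_pages_shared(
    labels: List[str], metadata: Optional[Dict] = None
) -> List[Dict]:
    """Mirrors the shape of the patch: one dict updated in place per page."""
    docs = []
    for label in labels:
        if metadata is None:
            metadata = {}
        metadata.update({"page_label": label})
        docs.append(metadata)  # every entry refers to the same dict object
    return docs


def load_pages_fresh(
    labels: List[str], extra_info: Optional[Dict] = None
) -> List[Dict]:
    """Builds a new dict per page, then layers the caller's keys on top."""
    docs = []
    for label in labels:
        metadata = {"page_label": label}
        if extra_info is not None:
            metadata.update(extra_info)
        docs.append(metadata)
    return docs


caller_info = {"stem": "q1"}
shared = load_pages_shared(["1", "2"], caller_info)
fresh = load_pages_fresh(["1", "2"], {"stem": "q1"})
```

In the shared variant both list entries are the same object, so they end up with the last page's label, and `caller_info` itself gains a `page_label` key. The fresh variant keeps one label per page and leaves the caller's dict untouched.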
hawktang commented 1 year ago

I saw that you have already fixed the issue in the repo. However, wouldn't it be better to provide metadata even when no explicit file_metadata function is given?

logan-markewich commented 1 year ago

@hawktang not sure what you mean here. Is there an issue with the fix I made the other day? https://github.com/jerryjliu/llama_index/blob/d394ffd5b57b976192f002f52fc9315401b4aa09/llama_index/readers/file/docs_reader.py#L40

hawktang commented 1 year ago

With your fix it now behaves the way I want. Thank you for the fix.

When will the fix be published to pip? ;-)