run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: SimpleFileNodeParser() KeyError: 'extension' #11547

Open SawyerCzupka opened 4 months ago

SawyerCzupka commented 4 months ago

Bug Description

When trying to use the IngestionPipeline to load documents with SimpleFileNodeParser(), it errors because the parser looks for a metadata key ('extension') that doesn't exist on the documents.

Version

0.10.12 & 0.10.14.post1

Steps to Reproduce

Use this pipeline on documents generated with a default SimpleDirectoryReader:

import os

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter, SimpleFileNodeParser
from llama_index.core.vector_stores.types import BasePydanticVectorStore
from llama_index.embeddings.together import TogetherEmbedding


def get_pipeline(
    vector_store: BasePydanticVectorStore | None,
    together_embed_model_name: str = "togethercomputer/m2-bert-80M-32k-retrieval",
) -> IngestionPipeline:
    if os.environ.get("TOGETHER_API_KEY") is None:
        raise ValueError("TOGETHER_API_KEY environment variable must be set")

    transformations = [
        SimpleFileNodeParser(),
        SentenceSplitter(
            chunk_size=384, chunk_overlap=64
        ),  # Ensures that the chunks are not too large
        TogetherEmbedding(
            model_name=together_embed_model_name,
        ),
    ]

    return IngestionPipeline(
        transformations=transformations,
        vector_store=vector_store,
    )
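
For completeness, a minimal driver along these lines reproduces the error; the directory path is a placeholder and the vector store is omitted:

from llama_index.core import SimpleDirectoryReader

# Sketch: load PDFs with a default SimpleDirectoryReader (path is a
# placeholder), then run the pipeline defined above.
documents = SimpleDirectoryReader("../data/to_ingest/10087").load_data()
pipeline = get_pipeline(vector_store=None)
pipeline.run(documents=documents)  # raises KeyError: 'extension'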

Relevant Logs/Tracebacks

DEBUG:httpx:load_ssl_context verify=True cert=None trust_env=True http2=False
DEBUG:httpx:load_verify_locations cafile='/home/sawyer/git/gef-ml/.venv/lib/python3.10/site-packages/certifi/cacert.pem'
DEBUG:httpcore.connection:connect_tcp.started host='localhost' port=6333 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7fe9f35db8e0>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'GET']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'GET']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'GET']>
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 404, b'Not Found', [(b'transfer-encoding', b'chunked'), (b'vary', b'accept-encoding, Origin, Access-Control-Request-Method, Access-Control-Request-Headers'), (b'content-type', b'application/json'), (b'content-encoding', b'gzip'), (b'date', b'Fri, 01 Mar 2024 18:15:50 GMT')])
INFO:httpx:HTTP Request: GET http://localhost:6333/collections/temp "HTTP/1.1 404 Not Found"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'GET']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:httpcore.connection:close.started
DEBUG:httpcore.connection:close.complete
INFO:gef_ml.ingestion:Found 1 projects in directory ../data/to_ingest/.
  0%|                                                                 | 0/1 [00:00<?, ?it/s]INFO:gef_ml.ingestion:Ingesting project 10087: Started...
DEBUG:llama_index.core.readers.file.base:> [SimpleDirectoryReader] Total files added: 4
DEBUG:fsspec.local:open file: /home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf
DEBUG:fsspec.local:open file: /home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc1.pdf
DEBUG:fsspec.local:open file: /home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc2.pdf
DEBUG:fsspec.local:open file: /home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc3.pdf
DEBUG:gef_ml.ingestion:First Doc Metadata: {'page_label': '1', 'file_name': '/home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf', 'file_path': '/home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf', 'file_type': 'application/pdf', 'file_size': 558431, 'creation_date': '2024-03-01', 'last_modified_date': '2024-03-01'}
DEBUG:gef_ml.ingestion:Loaded 190 documents for project 10087.
ERROR:gef_ml.ingestion:Ingesting project 10087: Failed.
ERROR:gef_ml.ingestion:Traceback: 'extension'
Traceback (most recent call last):
  File "/home/sawyer/git/gef-ml/gef_ml/ingestion.py", line 102, in ingest
    self._ingest_project_id(project_id, show_progress=False)
  File "/home/sawyer/git/gef-ml/gef_ml/ingestion.py", line 77, in _ingest_project_id
    self.pipeline.run(
  File "/home/sawyer/git/gef-ml/.venv/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py", line 750, in run
    nodes = run_transformations(
  File "/home/sawyer/git/gef-ml/.venv/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py", line 124, in run_transformations
    nodes = transform(nodes, **kwargs)
  File "/home/sawyer/git/gef-ml/.venv/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 116, in __call__
    return self.get_nodes_from_documents(nodes, **kwargs)
  File "/home/sawyer/git/gef-ml/.venv/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 76, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
  File "/home/sawyer/git/gef-ml/.venv/lib/python3.10/site-packages/llama_index/core/node_parser/file/simple_file.py", line 68, in _parse_nodes
    ext = document.metadata["extension"]
KeyError: 'extension'

Metadata for the first loaded document:

{'page_label': '1', 'file_name': '/home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf', 'file_path': '/home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf', 'file_type': 'application/pdf', 'file_size': 558431, 'creation_date': '2024-03-01', 'last_modified_date': '2024-03-01'}

dosubot[bot] commented 4 months ago

The KeyError: 'extension' you're encountering is due to the SimpleFileNodeParser expecting an 'extension' key in the metadata of each document it processes. This key is used to determine the file type of the document and select the appropriate parser.

To resolve this issue, you should ensure that every document processed by the SimpleFileNodeParser has an 'extension' key in its metadata. If modifying the document generation or reading process is not feasible, you can add a check and handling mechanism directly in the _parse_nodes method to either skip documents without an 'extension' key or to infer the extension based on the document's content or name.

Here's a modified version of the _parse_nodes method with added handling for documents missing the 'extension' key in their metadata:

def _parse_nodes(
    self,
    nodes: Sequence[BaseNode],
    show_progress: bool = False,
    **kwargs: Any,
) -> List[BaseNode]:
    """Parse document into nodes.

    Args:
        nodes (Sequence[BaseNode]): nodes to parse
    """
    all_nodes: List[BaseNode] = []
    documents_with_progress = get_tqdm_iterable(
        nodes, show_progress, "Parsing documents into nodes"
    )

    for document in documents_with_progress:
        # Skip documents that lack the 'extension' metadata key instead of
        # raising a KeyError; log a warning so the skip is visible.
        if "extension" not in document.metadata:
            print(
                f"Warning: document {document.doc_id} has no 'extension' key "
                "in its metadata; skipping."
            )
            continue

        ext = document.metadata["extension"]
        if ext in FILE_NODE_PARSERS:
            parser = FILE_NODE_PARSERS[ext](
                include_metadata=self.include_metadata,
                include_prev_next_rel=self.include_prev_next_rel,
                callback_manager=self.callback_manager,
            )
            # Avoid shadowing the `nodes` argument
            parsed_nodes = parser.get_nodes_from_documents([document], show_progress)
            all_nodes.extend(parsed_nodes)
        else:
            # File type isn't supported by a dedicated parser yet;
            # pass the document through unchanged (append, not extend).
            all_nodes.append(document)

    return all_nodes

This modification adds a check for the existence of the 'extension' key and skips documents where it is missing, preventing the KeyError. Depending on your requirements, you may choose to handle documents without an 'extension' key differently, such as attempting to infer the file type or logging a warning.
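
If skipping documents is too lossy, another option (not part of the method above) is to fall back to inferring the extension from the 'file_name' metadata that SimpleDirectoryReader does populate. A sketch of that fallback, usable inside the loop:

from pathlib import Path

# Hypothetical fallback: when 'extension' is missing, derive it from the
# 'file_name' metadata; Path.suffix returns e.g. '.pdf' for PDF files.
ext = document.metadata.get("extension")
if not ext and "file_name" in document.metadata:
    ext = Path(document.metadata["file_name"]).suffix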

You can find the original implementation of the _parse_nodes method in the llama-index-core/llama_index/core/node_parser/file/simple_file.py file in the LlamaIndex repository.

SawyerCzupka commented 4 months ago

@dosu-bot does the 'extension' metadata key get set by llamaindex anywhere?

dosubot[bot] commented 4 months ago

Based on the available context, there doesn't appear to be a function or method in the LlamaIndex codebase that sets the 'extension' metadata key. The BaseIndex class, the parent of all index types, has no method for setting it, and neither do index classes such as KeywordTableIndex.

The SimpleDirectoryReader class sets several metadata keys in its default_file_metadata_func (file_name, file_path, file_type, file_size, and dates, as seen in your logs), but 'extension' is not among them.

It's possible the key is set elsewhere in the codebase, but without more information it's difficult to give a definitive answer. If you can share where you expect this to happen, that would help narrow it down.
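
That said, since SimpleDirectoryReader does set 'file_path' in each document's metadata (as your logs show), one workaround is to populate the missing key yourself before running the pipeline. A sketch, with the directory path as a placeholder:

import os
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("../data/to_ingest/10087").load_data()

# Workaround sketch: derive the 'extension' key that SimpleFileNodeParser
# expects from the 'file_path' metadata that SimpleDirectoryReader sets.
for doc in documents:
    doc.metadata["extension"] = os.path.splitext(doc.metadata["file_path"])[1]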

MrLion7 commented 3 months ago

same here

dino-su commented 2 months ago

same here