Open SawyerCzupka opened 4 months ago
The KeyError: 'extension' you're encountering is due to the SimpleFileNodeParser expecting an 'extension' key in the metadata of each document it processes. This key is used to determine the file type of the document and select the appropriate parser.
To resolve this issue, you should ensure that every document processed by the SimpleFileNodeParser has an 'extension' key in its metadata. If modifying the document generation or reading process is not feasible, you can add a check and handling mechanism directly in the _parse_nodes method to either skip documents without an 'extension' key or to infer the extension based on the document's content or name.
Here's a modified version of the _parse_nodes method with added handling for documents missing the 'extension' key in their metadata:
    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        """Parse document into nodes.

        Args:
            nodes (Sequence[BaseNode]): nodes to parse
        """
        all_nodes: List[BaseNode] = []
        documents_with_progress = get_tqdm_iterable(
            nodes, show_progress, "Parsing documents into nodes"
        )
        for document in documents_with_progress:
            # Check if 'extension' key exists in document metadata
            if "extension" not in document.metadata:
                # Handle documents without 'extension' key
                # For example, skip or log a warning
                print(f"Warning: Document {document} does not have an 'extension' key in its metadata.")
                continue  # Skip this document
            ext = document.metadata["extension"]
            if ext in FILE_NODE_PARSERS:
                parser = FILE_NODE_PARSERS[ext](
                    include_metadata=self.include_metadata,
                    include_prev_next_rel=self.include_prev_next_rel,
                    callback_manager=self.callback_manager,
                )
                # Use a separate name so the `nodes` argument is not shadowed
                parsed_nodes = parser.get_nodes_from_documents([document], show_progress)
                all_nodes.extend(parsed_nodes)
            else:
                # What to do when file type isn't supported yet?
                # Pass the document through unchanged. append() adds the single
                # node; extend() would try to iterate over the document itself.
                all_nodes.append(document)
        return all_nodes
This modification adds a check for the existence of the 'extension' key and skips documents where it is missing, preventing the KeyError. Depending on your requirements, you may choose to handle documents without an 'extension' key differently, such as attempting to infer the file type or logging a warning.
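If you would rather infer the extension than skip the document, a minimal sketch is below. It assumes the metadata carries a 'file_path' or 'file_name' entry (as in the metadata shown in the bug report) and that the parser lookup expects a suffix with a leading dot such as '.md'; check FILE_NODE_PARSERS in your installed version before relying on that. The helper names infer_extension and ensure_extension are hypothetical and not part of LlamaIndex.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def infer_extension(metadata: Dict[str, Any]) -> Optional[str]:
    """Return a lowercased suffix such as '.pdf', or None if none can be inferred."""
    # Assumption: one of these keys holds a path or file name.
    for key in ("file_path", "file_name"):
        value = metadata.get(key)
        if value:
            suffix = Path(str(value)).suffix.lower()
            if suffix:
                return suffix
    return None


def ensure_extension(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Add an 'extension' entry in place when it is missing and can be inferred."""
    if "extension" not in metadata:
        ext = infer_extension(metadata)
        if ext is not None:
            metadata["extension"] = ext
    return metadata
```

You could call ensure_extension(doc.metadata) on each loaded document before handing the documents to the pipeline. Documents whose extension cannot be inferred are left untouched, so the skip logic above still applies to them.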
You can find the original implementation of the _parse_nodes method in the llama-index-core/llama_index/core/node_parser/file/simple_file.py file in the LlamaIndex repository.
@dosu-bot does the 'extension' metadata key get set by llamaindex anywhere?
Based on the provided context, there doesn't appear to be a specific function or method within the LlamaIndex codebase that sets the 'extension' metadata key for each document. The BaseIndex class, which is the parent class for all index types, does not contain a method for setting an 'extension' metadata key. Similarly, neither the KeywordTableIndex class nor the other index classes seem to have this functionality based on the provided context.
The SimpleDirectoryReader class, which is part of the file reading and metadata extraction process, sets several metadata keys in the default_file_metadata_func function. However, the 'extension' metadata key is not among the keys it sets in the provided context.
It's possible that the 'extension' metadata key is set elsewhere in the codebase or in a different way, but without additional context or information, it's difficult to provide a definitive answer. If you could provide more details about where you expect this functionality to be or any additional context, it would be helpful in providing a more accurate response.
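One possible workaround is to set the key yourself at read time. The sketch below assumes that SimpleDirectoryReader accepts a file_metadata callable that takes a file path and returns a dict, and that supplying one replaces the default metadata function rather than extending it; please verify both against your installed version. The function name file_metadata_with_extension is hypothetical.

```python
import os
from typing import Any, Dict


def file_metadata_with_extension(file_path: str) -> Dict[str, Any]:
    """Build per-file metadata that includes a lowercased 'extension' entry."""
    return {
        "file_path": file_path,
        "file_name": os.path.basename(file_path),
        # splitext keeps the leading dot, e.g. '.pdf'; empty string if no suffix
        "extension": os.path.splitext(file_path)[1].lower(),
    }


# Assumed usage, not verified against the versions in this issue:
# reader = SimpleDirectoryReader(
#     input_dir="data", file_metadata=file_metadata_with_extension
# )
# documents = reader.load_data()
```

Because this sketch returns only three keys, defaults such as 'file_type' or 'creation_date' would need to be added by hand if you depend on them.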
same here
same here
Bug Description
When trying to use the IngestionPipeline to load some documents with the SimpleFileNodeParser(), it errors because it's looking for a metadata key that doesn't exist.
Version
0.10.12 & 0.10.14.post1
Steps to Reproduce
Use this pipeline on documents generated with naive SimpleDirectoryReader:
Relevant Logs/Tracebacks
Metadata for the first loaded document:
```shell
First Doc Metadata: {'page_label': '1', 'file_name': '/home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf', 'file_path': '/home/sawyer/git/gef-ml/scripts/../data/to_ingest/10087/p10087_doc0.pdf', 'file_type': 'application/pdf', 'file_size': 558431, 'creation_date': '2024-03-01', 'last_modified_date': '2024-03-01'}