run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License

Extract metadata from Azure BLOB #804

Closed raccoonex closed 8 months ago

raccoonex commented 8 months ago

Description

Extract useful system-generated metadata, user-defined metadata, and tags from Azure Blob entities.

This PR extracts metadata from Azure Blob entities. Previously, llama_index.readers.file.base.default_file_metadata_func was used for metadata extraction. However, because the blobs are first downloaded to the host system, that function reads the local copy, so values such as the creation date may describe the downloaded file rather than the original blob.
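To illustrate the problem, here is a minimal sketch that simulates the download step with a temporary file. It does not call Azure; blob_last_modified is a hypothetical value standing in for what the storage service would report, and the file contents are placeholder bytes.

```python
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical: the blob's real last-modified time as the storage service would report it.
blob_last_modified = datetime(2023, 12, 21, tzinfo=timezone.utc)

# Simulate the download step: the payload is written to a fresh temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "some_file_name.pdf")
    with open(local_path, "wb") as fh:
        fh.write(b"%PDF-1.4 placeholder bytes")

    # A path-based metadata function can only see the local copy's timestamps.
    local_mtime = datetime.fromtimestamp(os.path.getmtime(local_path), tz=timezone.utc)

# The local timestamp records when the file was downloaded, not when the blob changed.
dates_differ = local_mtime.date() != blob_last_modified.date()
print("local copy modified:", local_mtime.date())
print("blob last modified: ", blob_last_modified.date())
```

The local modification time is the moment of the download, which is why metadata read from the host file system can disagree with the blob's own properties.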

This PR implements metadata extraction directly from Azure Blob properties, which consist of system metadata (e.g. creation_time) and user-defined metadata (as metadata and tags).

The new metadata set is the one produced by the default metadata extractor, plus additional Azure system metadata and the user-defined metadata.
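The merge described above could look roughly like the sketch below. This is not the PR's actual code: build_blob_metadata, FakeBlobProperties and FakeContentSettings are hypothetical names, and the stand-in classes only mimic a few attributes of azure.storage.blob.BlobProperties (name, container, size, creation_time, last_modified, last_accessed_on, content_settings.content_type, metadata) so the example runs without the Azure SDK. With the real SDK, the properties would presumably come from BlobClient.get_blob_properties() and the tags from BlobClient.get_blob_tags().

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class FakeContentSettings:
    """Hypothetical stand-in for the SDK's content settings object."""
    content_type: Optional[str] = None


@dataclass
class FakeBlobProperties:
    """Hypothetical stand-in mimicking a few BlobProperties attributes."""
    name: str
    container: str
    size: int
    creation_time: Optional[datetime] = None
    last_modified: Optional[datetime] = None
    last_accessed_on: Optional[datetime] = None
    content_settings: FakeContentSettings = field(default_factory=FakeContentSettings)
    metadata: Dict[str, str] = field(default_factory=dict)


def _date(value: Optional[datetime]) -> Optional[str]:
    """Format a timestamp as YYYY-MM-DD, keeping None for missing values."""
    return value.strftime("%Y-%m-%d") if value is not None else None


def build_blob_metadata(props: Any, tags: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Combine default-style keys, Azure system metadata, and user-defined metadata."""
    meta: Dict[str, Any] = {
        # Keys matching the default file metadata extractor.
        "file_name": props.name,
        "file_type": props.content_settings.content_type,
        "file_size": props.size,
        "creation_date": _date(props.creation_time),
        "last_modified_date": _date(props.last_modified),
        "last_accessed_date": _date(props.last_accessed_on),
        # Additional Azure system metadata.
        "container": props.container,
    }
    # Assumption: user-defined metadata and tags are merged last, so they win on key collisions.
    meta.update(props.metadata or {})
    meta.update(tags or {})
    return meta


props = FakeBlobProperties(
    name="some_file_name.pdf",
    container="the-container-name",
    size=1093814,
    creation_time=datetime(2023, 12, 20),
    last_modified=datetime(2023, 12, 21),
    content_settings=FakeContentSettings(content_type="application/pdf"),
    metadata={"source_url": "this-is-user-defined-meta"},
)
metadata = build_blob_metadata(props)
print(metadata)
```

Merging user-defined values last is a design choice made for this sketch; the PR description does not say how key collisions are handled.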

No dependencies have been changed.

Type of Change


How Has This Been Tested?

This has been tested against Azure Blob Storage (authenticated with a connection string) using a Python script. Example:

from llama_hub.azstorage_blob import AzStorageBlobReader

# Authenticate with a connection string, then load the container's blobs as documents.
reader = AzStorageBlobReader(container_name="the-container-name", connection_string="conn-str")
documents = reader.load_data()

# Metadata of the first document; 'source_url' is user-defined metadata set on the blob.
documents[0].metadata
{'page_label': '1', 'file_name': 'some_file_name.pdf', 'file_type': 'application/pdf', 'file_size': 1093814, 'creation_date': '2023-12-20', 'last_modified_date': '2023-12-21', 'last_accessed_date': None, 'container': 'the-container-name', 'source_url': 'this-is-user-defined-meta'}

Suggested Checklist: