run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License

[Bug]: azstorage_blob incorrect file path when creating metadata dictionary #865

Closed Matt-Scheetz closed 6 months ago

Matt-Scheetz commented 6 months ago

Bug Description

This may just be a Windows issue.

When importing files with AzStorageBlobReader, the variable download_file_path is set as follows:

download_file_path = f"{temp_dir}/{stream.name}"

The blob metadata is then added with this file path as the key:

blob_meta[download_file_path] = blob_client.get_blob_properties()

When SimpleDirectoryReader then tries to look up the metadata via extract_blob_meta, a KeyError is thrown on the first line of the function:

meta: dict = blob_meta[file_path]

The reason is that SimpleDirectoryReader does not build these paths itself while iterating over the directory it is given; it takes them from the directory contents, so they use the OS-native separator.

In my local test, download_file_path is C:\\Users\\Me\\AppData\\Local\\Temp\\tmpfrm_02oi/myfile.pdf (notice the / before the file name). This is the key under which the entry in blob_meta is stored. When SimpleDirectoryReader then executes extract_blob_meta, it passes in the path C:\\Users\\Me\\AppData\\Local\\Temp\\tmpfrm_02oi\\myfile.pdf (notice the \\ before the file name), so the lookup misses.
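The mismatch can be reproduced without Azure. This is only a minimal sketch: it uses the standard-library ntpath module to get Windows path semantics on any OS, and the dictionary value is a placeholder rather than real blob properties.

```python
import ntpath  # Windows flavour of os.path, importable on every platform

temp_dir = r"C:\Users\Me\AppData\Local\Temp\tmpfrm_02oi"
name = "myfile.pdf"

# Key as the reader builds it today: hard-coded "/" separator.
key_written = f"{temp_dir}/{name}"

# Path as it comes back from a Windows directory listing: "\" separator.
key_read = ntpath.join(temp_dir, name)

# Placeholder metadata; the real value would be the blob properties.
blob_meta = {key_written: {"name": name}}

# The two strings name the same file but are different dictionary keys,
# so blob_meta[key_read] would raise KeyError.
mismatch = key_read not in blob_meta
print(key_written)
print(key_read)
print(mismatch)
```

On Linux or macOS the f-string and the joined path happen to be identical, which is presumably why the problem only shows up on Windows.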

I suggest changing lines 92 and 110 to:

download_file_path = os.path.join(temp_dir, stream.name)
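A rough sketch of why this helps, assuming the directory reader ends up with paths that use the native separator. The helper index_by_joined_path and the metadata values are hypothetical and only stand in for the loader's own code.

```python
import os
import tempfile


def index_by_joined_path(temp_dir, names):
    """Hypothetical helper: key metadata by the os.path.join'd download path."""
    return {os.path.join(temp_dir, n): {"blob_name": n} for n in names}


with tempfile.TemporaryDirectory() as temp_dir:
    names = ["myfile.pdf"]

    # Stand-in for downloading each blob into the temp directory.
    for n in names:
        with open(os.path.join(temp_dir, n), "wb"):
            pass

    blob_meta = index_by_joined_path(temp_dir, names)

    # Stand-in for the reader discovering files from the directory contents.
    discovered = [os.path.join(temp_dir, entry) for entry in os.listdir(temp_dir)]

    # Every discovered path is now a valid key, on Windows and elsewhere.
    all_found = all(path in blob_meta for path in discovered)
    print(all_found)
```

Because both the key and the lookup path are built with the same separator rules, the lookup no longer depends on the platform.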

llama-hub v 0.0.70

Version

0.9.30

Steps to Reproduce

  1. Save a file to Azure Blob Storage
  2. Execute the Blob Storage reader:

    loader = AzStorageBlobReader(container='scrabble-dictionary',
                                 blob='dictionary.txt',
                                 account_url='https://<storage account name>.blob.core.windows.net',
                                 credential=default_credential)

    documents = loader.load_data()

  3. Error thrown during code execution

Relevant Logs/Tracebacks

No response

Matt-Scheetz commented 6 months ago

The PR has been merged.