run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

LlamaParse feature `take_screenshot` does not work with AzStorageBlobReader #376

Open galvangoh opened 1 month ago

galvangoh commented 1 month ago

Describe the bug: It is not possible to parse a document through `AzStorageBlobReader` when the `take_screenshot=True` feature of LlamaParse is enabled. Also, the `AzStorageBlobReader` class does not provide any interface to download screenshots of the document.

Reproducible example:

from llama_parse import LlamaParse
from llama_index.readers.azstorage_blob import AzStorageBlobReader

import os
from dotenv import load_dotenv
load_dotenv()
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")

instructions = """The provided document is an invoice. Please extract all basic
               information of the document, customer & supplier. The document
               also contains table of line items that needs to be extracted as
               well."""

container = 'CONTAINER_NAME'
folder_path = 'SUBDIR_1/SUBDIR_2'
connection_string = 'my_connection_string'
blob_name = 'MultiPageInvoice.pdf'

# parameters for LlamaParse
parser_params = {
    'api_key': LLAMA_CLOUD_API_KEY,
    'result_type': 'markdown',
    'parsing_instruction': instructions,
    'invalidate_cache': True,
    'do_not_cache': True,
    'skip_diagonal_text': True,
    'num_workers': 5,
    'ignore_errors': False,
    'use_vendor_multimodal_model': True,
    'vendor_multimodal_model_name': 'openai-gpt4o',
    'take_screenshot': True
}

# instantiate the parser
parser = LlamaParse(**parser_params)

file_extractor = {'.pdf': parser}

azure_loader = AzStorageBlobReader(
    container_name=f'{container}/{folder_path}',
    connection_string=connection_string,
    blob=blob_name,
    file_extractor=file_extractor,
)

# begin parsing
document = azure_loader.load_data()  # errors out here

Error message:

Started parsing the file under job_id 2e1f4eb4-2025-4a23-9297-a71fd979de62
Error while parsing the file '<bytes/buffer>': Failed to parse the file: 2e1f4eb4-2025-4a23-9297-a71fd979de62, status: ERROR
Failed to load file file:///C:/####/####/####/####/####/####/MultiPageInvoice.pdf with error: Failed to parse the file: 2e1f4eb4-2025-4a23-9297-a71fd979de62, status: ERROR. Skipping...

Files: MultiPageInvoice.pdf

Job ID: 2e1f4eb4-2025-4a23-9297-a71fd979de62

Screenshots: tempsnip

Client:

Additional context: llama-parse==0.5.1, llama-index-readers-azstorage-blob==0.2.0
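
Possible way to narrow this down: bypass `AzStorageBlobReader`, download the blob directly, and call LlamaParse on the local file. That would show whether the reader is involved at all, and it gives access to the screenshots that the reader does not expose. The following is a minimal sketch, not a confirmed fix; the job may still end in `status: ERROR` if the failure is on the parsing side. The helper names (`download_blob_to_temp`, `parse_with_screenshots`, `run_example`) and the `screenshots` output folder are hypothetical. It assumes that the folder path belongs in the blob name rather than in the container name, and that `get_json_result` / `get_images` in the installed llama-parse version return the page screenshots when `take_screenshot=True`.

```python
import os
import tempfile


def download_blob_to_temp(blob_client, suffix=".pdf"):
    """Write a blob's bytes to a named temp file and return its path.

    `blob_client` is expected to behave like azure.storage.blob.BlobClient,
    i.e. to expose download_blob().readall().
    """
    data = blob_client.download_blob().readall()
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def parse_with_screenshots(parser, file_path, download_path):
    """Parse a local file and save the images LlamaParse reports for it.

    `parser` is expected to behave like llama_parse.LlamaParse.
    Returns the raw JSON result and the list of downloaded image entries.
    """
    os.makedirs(download_path, exist_ok=True)
    json_results = parser.get_json_result(file_path)
    images = parser.get_images(json_results, download_path=download_path)
    return json_results, images


def run_example():
    """Hypothetical end-to-end run; needs azure-storage-blob and llama-parse."""
    from azure.storage.blob import BlobClient
    from llama_parse import LlamaParse

    # Assumption: the sub-folders are part of the blob name, not the container.
    blob_client = BlobClient.from_connection_string(
        conn_str="my_connection_string",
        container_name="CONTAINER_NAME",
        blob_name="SUBDIR_1/SUBDIR_2/MultiPageInvoice.pdf",
    )
    parser = LlamaParse(
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
        result_type="markdown",
        take_screenshot=True,
    )

    local_path = download_blob_to_temp(blob_client)
    try:
        return parse_with_screenshots(parser, local_path, "screenshots")
    finally:
        os.remove(local_path)  # clean up the temporary copy
```

Calling `run_example()` with a real connection string and API key should either reproduce the error without the reader in the loop or leave the downloaded images in the `screenshots` folder. The remaining options from `parser_params` can be added back one at a time to see which one triggers the failure.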