langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

AzureAIDocumentIntelligenceLoader does not load all PDF pages #22775

Open bbest31 opened 2 months ago

bbest31 commented 2 months ago

Example Code

I'm using the code from the LangChain docs verbatim:

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode="page",
)

documents = loader.load()

Error Message and Stack Trace (if applicable)

No response

Description

System Info

langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
langchain-text-splitters==0.2.1
platform: mac
python version: 3.12.3

emddarn commented 2 months ago

Are you using the free tier service by chance? https://learn.microsoft.com/en-us/answers/questions/1154480/more-than-2-pages-not-getting-read-by-azure-forms

I got around this by calling begin_analyze_document in a loop and setting its pages parameter so that each call reads 2 pages at a time (see the Quickstart).
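For anyone who wants to try the same workaround, here is a rough sketch of the loop. It is not the code emddarn used. The helper names (page_ranges, analyze_in_chunks) are made up for illustration, and the total page count is assumed to be known up front (for example, from a PDF library). The commented-out wiring assumes the azure-ai-formrecognizer SDK, where begin_analyze_document accepts a pages keyword such as "1-2".

```python
from typing import Callable, Iterator, List


def page_ranges(total_pages: int, chunk_size: int = 2) -> Iterator[str]:
    """Yield 1-based page-range strings, e.g. 5 pages -> "1-2", "3-4", "5"."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        yield str(start) if start == end else f"{start}-{end}"


def analyze_in_chunks(
    analyze: Callable[[str], object],
    total_pages: int,
    chunk_size: int = 2,
) -> List[object]:
    """Call analyze(pages) once per page range and collect the results in order."""
    return [analyze(pages) for pages in page_ranges(total_pages, chunk_size)]


# Hypothetical wiring with azure-ai-formrecognizer (assumption, not run here):
#
# from azure.ai.formrecognizer import DocumentAnalysisClient
# from azure.core.credentials import AzureKeyCredential
#
# client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
#
# def analyze(pages: str):
#     with open(file_path, "rb") as f:
#         poller = client.begin_analyze_document(
#             "prebuilt-layout", document=f, pages=pages
#         )
#         return poller.result()
#
# results = analyze_in_chunks(analyze, total_pages=10)
```

Passing the analyze call in as a function keeps the chunking logic separate from the Azure client, so the page-range logic can be checked without a live endpoint. Each call is a separate request, so a long PDF means many round trips.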

bbest31 commented 2 months ago

@emddarn Yes, I was using the free tier. I ended up using a different service for PDF document loading, but figured I'd point this out for anyone choosing to use this loader. A callout about this limitation in the LangChain docs might be helpful.