AzureAIDocumentIntelligenceLoader does not load all PDF pages

bbest31 commented 2 months ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

I'm using the code from the LangChain docs verbatim:

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode="page",
)

documents = loader.load()

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use the Azure Document Intelligence loader to read my PDF files.
Using the markdown mode, I only get the first page of the PDF loaded.
If I use any other mode (page, single), I get at most pages 1 and 2.
I expect all pages within a PDF to be returned as Document objects.

System Info

langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
langchain-text-splitters==0.2.1
platform: mac
python version: 3.12.3

emddarn commented 2 months ago

Are you using the free tier service by chance? https://learn.microsoft.com/en-us/answers/questions/1154480/more-than-2-pages-not-getting-read-by-azure-forms

The way I got around this was to call begin_analyze_document in a loop, setting the pages parameter so that each call reads 2 pages at a time. See the Quickstart.
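A minimal sketch of that loop, for anyone who wants to try it. The helper names (page_ranges, analyze_in_batches) are hypothetical. The total page count is assumed to be known in advance (for example, from a PDF library). The Azure call is passed in as a callable rather than hard-coded, so the exact client wiring shown in the trailing comment is an assumption and is not run here.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def page_ranges(total_pages: int, batch_size: int = 2) -> Iterator[str]:
    """Yield 1-based page ranges such as "1-2", "3-4", "5"."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A single trailing page is written as "5" rather than "5-5".
        yield str(start) if start == end else f"{start}-{end}"


def analyze_in_batches(
    analyze: Callable[[str], T], total_pages: int, batch_size: int = 2
) -> List[T]:
    """Call analyze(pages) once per page range and collect the results in order."""
    return [analyze(pages) for pages in page_ranges(total_pages, batch_size)]


# Hypothetical wiring with azure-ai-formrecognizer (assumption, not executed here):
#
#   client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
#
#   def analyze(pages: str):
#       with open(file_path, "rb") as f:
#           poller = client.begin_analyze_document(
#               "prebuilt-layout", document=f, pages=pages
#           )
#           return poller.result()
#
#   results = analyze_in_batches(analyze, total_pages=5)
```

Each call then stays within the 2-page limit described above, at the cost of one request per batch.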

bbest31 commented 2 months ago

@emddarn yes, I was using the free tier. I ended up using a different service for PDF document loading, but figured I'd point it out for anyone choosing to use this loader. A callout in the LangChain docs about this limitation might be helpful for folks.
