run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Smart PDF Loader - Failed to establish a new connection #14902

Open RGalkin opened 1 month ago

RGalkin commented 1 month ago

Bug Description

I'm running the example from the website:

from llama_index.readers.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data(pdf_url)

I'm getting the following error:

An exception occurred: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002671A13C910>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002671A13C910>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

The above exception was the direct cause of the following exception:

raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002671A13C910>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

Version

llama-index 0.10.54
llama-index-readers-smart-pdf-loader 0.1.4

Steps to Reproduce

from llama_index.readers.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data(pdf_url)

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

Hi, @RGalkin! I'm here to help you with this issue 🤖

This connection error may occur because the llmsherpa API URL is incorrect or because the service is down. Check that llmsherpa_api_url is correct and that the service is accessible.

Here is the example usage provided in the documentation:

from llama_index.readers.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data(pdf_url)

Make sure that the llmsherpa_api_url is correct and that you can access it from your network. If the URL or the service is not reachable, you will encounter a connection error.

Additionally, ensure that the pdf_path_or_url parameter is correctly specified as either a URL or a file path. The SmartPDFLoader class uses the LayoutPDFReader from llmsherpa.readers to read and chunk the PDF, so any issues with the URL or file path can also lead to connection errors [1][2].
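To check whether the host is reachable from your network at all, independently of SmartPDFLoader, a plain TCP connection test is one option. The following is a minimal sketch using only the Python standard library; the helper name can_connect and its default timeout are illustrative assumptions, not part of llama_index or llmsherpa.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of llama_index or llmsherpa.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example: can_connect("readers.llmsherpa.com")
```

If this returns False from more than one network, the problem is more likely on the server side than in your code. A True result only shows that the port accepts connections, not that the parsing API itself is working.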

To continue talking to Dosu, mention @dosu.

RGalkin commented 1 month ago

I thought something in my code might be interfering with the connection, so I ran the code in Colab. I'm still getting the same error.

dosubot[bot] commented 1 month ago

The connection error you're encountering is likely due to issues with the llmsherpa API service itself or network-related problems. Here are a few steps you can take to troubleshoot and resolve the issue:

  1. Check API Service Status: Ensure that the llmsherpa API service is up and running. Sometimes, the service might be temporarily down for maintenance or other reasons.

  2. Network Configuration: Verify that there are no network issues or firewall settings blocking the connection to https://readers.llmsherpa.com. You can try accessing the URL directly in a web browser to see if it is reachable.

  3. Rate Limiting: Be aware that the llmsherpa API has rate limiting in place. If you exceed the allowed number of requests per second, your IP might be temporarily banned. Ensure you are not making too many requests in a short period [1].

  4. Proxy Settings: If you are behind a proxy, ensure that your proxy settings are correctly configured in your environment. You might need to set the proxy settings in your code as well.

  5. Retry Mechanism: Implement a retry mechanism in your code to handle transient network issues. This can help in cases where the connection fails intermittently.

Here is an example of how you can implement a simple retry mechanism:

import time
from llama_index.readers.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # also allowed is a file path e.g. /home/downloads/xyz.pdf

def load_pdf_with_retries(pdf_url, retries=3, delay=5):
    # Make up to `retries` attempts, sleeping `delay` seconds between them.
    pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
    for attempt in range(retries):
        try:
            return pdf_loader.load_data(pdf_url)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < retries - 1:
                time.sleep(delay)
            else:
                # Out of attempts: re-raise the last exception to the caller.
                raise

documents = load_pdf_with_retries(pdf_url)

This code makes up to three attempts, waiting five seconds between them, and re-raises the last exception if every attempt fails.
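A fixed delay can be swapped for exponential backoff, which waits longer after each failure. The sketch below is a generic variant, not anything provided by llama_index or llmsherpa; the name call_with_backoff and its parameters (base_delay, factor, sleep) are hypothetical and chosen for illustration.

```python
import time


def call_with_backoff(func, retries=3, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call func() until it succeeds or `retries` attempts have failed.

    Hypothetical helper for illustration. The wait after failed attempt n
    (counting from 0) is base_delay * factor**n seconds.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            if attempt == retries - 1:
                raise  # out of attempts: propagate the last error
            delay = base_delay * factor ** attempt
            print(f"Attempt {attempt + 1} failed: {exc}; retrying in {delay:.1f}s")
            sleep(delay)
```

With the loader from the example above, it could be called as call_with_backoff(lambda: pdf_loader.load_data(pdf_url)). Retries only help with transient failures; if the service is down for an extended period, every attempt will fail and the last error is raised.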

If the issue persists, it might be worth reaching out to the maintainers of the llmsherpa API to check if there are any known issues or additional configurations required.


cognetics-ai commented 3 weeks ago

The listener/server seems to be down:

MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7e835823e920>, 'Connection to readers.llmsherpa.com timed out. (connect timeout=None)'))