llmsherpa url error - Githubissues

nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects

https://www.nlmatics.com

MIT License

1.37k stars, 134 forks

llmsherpa url error #34

Status: Closed (aclaudiadavid closed this issue 9 months ago)

aclaudiadavid commented 9 months ago

Hi! I'm trying to parse a PDF like the example:

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

But I'm receiving the error: MaxRetryError: HTTPSConnectionPool(host='arxiv.org', port=443): Max retries exceeded with url: /pdf/1910.13461.pdf (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000001BEE59F6410>: Failed to resolve 'arxiv.org' ([Errno 11002] getaddrinfo failed)"))

I've also tried with local files, so I think my problem is related to the API. Does anyone know how to solve this?

ansukla commented 9 months ago

I don't think this is an error on llmsherpa's end. It looks like a connectivity issue between the machine where you are running the code and arxiv.org. It could be throttling on their end that restricts too many downloads, or a temporary connectivity problem.
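One way to test the connectivity explanation is to check name resolution directly, outside of llmsherpa. The snippet below is only a sketch using the Python standard library; `can_resolve` is a hypothetical helper name, not part of llmsherpa. It performs the same kind of lookup that failed with "getaddrinfo failed" in the traceback above.

```python
import socket


def can_resolve(hostname, port=443):
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Same failure class as "[Errno 11002] getaddrinfo failed" in the traceback.
        return False
```

Calling `can_resolve("arxiv.org")` on the affected machine should show whether the failure reproduces without llmsherpa involved. If it returns False, the likely culprits are local DNS, proxy, VPN, or firewall settings rather than the API.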

imSrbh commented 8 months ago

MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

JackOfAllSkills commented 8 months ago

I am getting the same issue. I am using llmsherpa to chunk a PDF, but I always get this SSLCertVerificationError. I am using Python 3.12 with this simple code:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

It looks like this is a known issue that could be worked around by disabling the SSL check, but I could not find any way to do that: the connections are made by LayoutPDFReader, which exposes no handle for disabling the check. Please guide me.
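Before disabling verification, it may help to confirm where the failure comes from. "unable to get local issuer certificate" usually means the local trust store does not contain the CA that issued the server's certificate, or that a proxy is intercepting TLS. The sketch below uses only the standard library; `describe_ca_configuration` and `check_tls` are hypothetical helper names, not part of llmsherpa. It reports where this Python build looks for CA certificates and attempts a verified handshake with a given host.

```python
import os
import socket
import ssl


def describe_ca_configuration():
    """Report where this Python build looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Name of the environment variable OpenSSL consults for a CA bundle.
        "cafile_env_var": paths.openssl_cafile_env,
        "cafile_env_value": os.environ.get(paths.openssl_cafile_env),
    }


def check_tls(hostname, port=443, timeout=5.0):
    """Attempt a verified TLS handshake and return (ok, detail)."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return True, "certificate chain verified"
    except ssl.SSLCertVerificationError as exc:
        return False, f"certificate verification failed: {exc.verify_message}"
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"connection failed: {exc}"
```

If `check_tls("readers.llmsherpa.com")` reports a verification failure, pointing the environment variable named in `cafile_env_var` at an up-to-date CA bundle before starting Python is one thing to try. Whether the HTTP client inside LayoutPDFReader honors that variable is an assumption here, not something confirmed in this thread.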