nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

read_pdf fails on specific pdf locally, not through hosted api #63

Open Ianpwest opened 3 months ago

Ianpwest commented 3 months ago

PDF in question: JTR.pdf

This API call works great:

    from llmsherpa.readers import LayoutPDFReader

    llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    pdf_url = "JTR.pdf"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf(pdf_url)

This local call fails:

    from llmsherpa.readers import LayoutPDFReader

    llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
    pdf_url = "JTR.pdf"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest Docker build. Other PDFs work fine. Is there a way to get a better error message? Currently receiving: KeyError: 'return_dict'

I noticed there are other open issues about this error, but none matching this case, where the same PDF works through the hosted API and fails locally.

I appreciate your time and any insight. Thanks!
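On the question of a better error message: a KeyError: 'return_dict' suggests that the response coming back from the local server had no top-level return_dict key, which is what one would expect if the server returned an error payload instead of a parse result. A minimal sketch for inspecting such a response is below. It assumes you have already captured the raw response body and HTTP status yourself (for example by posting the PDF to the local endpoint with any HTTP client). The helper name summarize_parse_response is hypothetical and is not part of llmsherpa.

```python
import json


def summarize_parse_response(raw_body: bytes, status: int) -> str:
    """Describe a parse response without assuming it succeeded.

    Hypothetical helper, not part of llmsherpa. The only assumption about the
    response shape is the one implied by the KeyError: a successful response
    is a JSON object with a top-level 'return_dict' key.
    """
    text = raw_body.decode("utf-8", errors="replace")
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        # The server may have returned an HTML or plain-text error page.
        return f"HTTP {status}: non-JSON body: {text[:200]!r}"
    if not isinstance(payload, dict):
        return f"HTTP {status}: unexpected JSON type {type(payload).__name__}"
    if "return_dict" not in payload:
        # Show what the server actually sent instead of a bare KeyError.
        return (
            f"HTTP {status}: no 'return_dict'; "
            f"keys present: {sorted(payload)}; body: {text[:200]!r}"
        )
    return f"HTTP {status}: 'return_dict' present"
```

Printing this summary for the failing PDF shows whatever the server sent in place of a result, which is usually more informative than the KeyError raised on the client side. The server-side cause should also appear in the container's logs.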

wolfassi123 commented 2 months ago

Hey, @Ianpwest did you manage to solve this?

Ianpwest commented 2 months ago

> Hey, @Ianpwest did you manage to solve this?

@wolfassi123 No, there were also some other parsing issues with different character sets. The library is promising but seemingly under-supported. No movement on my tickets.

kiran-nlmatics commented 2 months ago

Hello @Ianpwest, @wolfassi123, I have fixed the issue, and it seems to be working with the sample PDF provided here. Can you pull the main branch of nlm-ingestor and verify?