nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

LayoutPDFReader._parse_pdf returns error when pdf contains empty pages #59

Open aleksvercau opened 4 months ago

aleksvercau commented 4 months ago

I tried processing a PDF file with the LayoutPDFReader.read_pdf() method, but got a KeyError for response_json['return_dict']['result']['blocks']: the response contained no results because there was an error. (On a side note: it would be nice to get a specific error in this case instead of a KeyError, one that clearly states that the file could not be processed and why.)
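To illustrate the kind of error I mean, here is a minimal sketch. It is not llmsherpa's actual code: `PDFParseError` and `extract_blocks` are hypothetical names, and I am only assuming that a successful response has the `return_dict` / `result` / `blocks` nesting seen in the KeyError above. I don't know what an error response looks like, so the sketch just echoes the raw payload.

```python
class PDFParseError(Exception):
    """Hypothetical error raised when the parser response has no blocks."""


def extract_blocks(response_json):
    """Return the parsed blocks, or raise PDFParseError with the raw payload."""
    try:
        return response_json["return_dict"]["result"]["blocks"]
    except (KeyError, TypeError) as exc:
        # The shape of an error response is an unknown here, so the
        # (truncated) payload is echoed verbatim instead of reading a
        # guessed field such as a status or message key.
        payload = repr(response_json)[:500]
        raise PDFParseError(
            f"PDF could not be processed ({exc!r}); response was: {payload}"
        ) from exc
```

With something like this, a caller would catch `PDFParseError` and see the reason in the message rather than a bare `KeyError: 'result'`.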

I split my PDF into pages and processed each page separately to understand what the issue was. It turns out that the error occurred every time an empty page was processed. I am not sure whether this happens for empty pages in all types of PDFs or only in some (there are small differences between text PDFs depending on how they were created). It only occurred on one of the PDFs I was processing, but that was also the only PDF with empty pages...

Better still: if a document contains an empty page, simply skip that page instead of failing the whole document.
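Until that is handled in the library, the per-page workaround can be sketched roughly as follows. This is only an illustration under assumptions: `parse_pages` is a hypothetical helper, `pages` is whatever per-page input you already have (for example, single-page files produced by splitting the PDF), and `parse_page` stands in for the call that parses one page and raises the KeyError described above when the page is empty.

```python
def parse_pages(pages, parse_page):
    """Parse each page independently and skip the ones that fail.

    Returns (parsed, skipped): parsed is a list of (page_number, result)
    pairs, skipped is a list of 1-based page numbers that raised KeyError.
    """
    parsed, skipped = [], []
    for page_number, page in enumerate(pages, start=1):
        try:
            parsed.append((page_number, parse_page(page)))
        except KeyError:
            # Assumption: the failure shows up as the KeyError reported
            # in this issue; record the page and carry on.
            skipped.append(page_number)
    return parsed, skipped
```

Returning the skipped page numbers keeps the failure visible, so empty pages are not silently dropped.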

jaavedd9 commented 3 months ago

I am facing this issue too.

mgrabmayr commented 2 months ago

Me too. Any intelligent fixes so far?

madhuprakash19 commented 2 days ago

I am facing the same issue