nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
922 stars, 112 forks

Bug #48

Open aman-vink opened 3 months ago

aman-vink commented 3 months ago

KeyError                                  Traceback (most recent call last)
in <cell line: 15>()
     13 llmsherpa_api_url = llmsherpa_api_url + "&applyOcr=yes"
     14 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
---> 15 doc = pdf_reader.read_pdf(pdf_url)

/usr/local/lib/python3.10/dist-packages/llmsherpa/readers/file_reader.py in read_pdf(self, path_or_url, contents)
     71 parser_response = self._parse_pdf(pdf_file)
     72 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 73 blocks = response_json['return_dict']['result']['blocks']
     74 return Document(blocks)

KeyError: 'return_dict'

This happens for the following URL:

https://s201.q4cdn.com/262069030/files/doc_financials/2023/ar/Walmart-10K-Reports-Optimized.pdf

kiran-nlmatics commented 2 months ago

@aman-vink, please pull from the main branch and let me know if the issue persists.

erikbijl commented 1 month ago

Same issue for me on a long PDF (>200 pages).

almariscal commented 2 weeks ago

Hello @kiran-nlmatics, I am facing the same issue on a fresh setup: pip install nlm-ingestor plus the LLM Sherpa Docker server.

EDIT: Here is the complete error


KeyError                                  Traceback (most recent call last)
Cell In[21], line 6
      4 llmsherpa_api_url = llmsherpa_api_url + "&applyOcr=yes"
      5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)

File ~/miniconda3/envs/mariscal-env-310/lib/python3.10/site-packages/llmsherpa/readers/file_reader.py:73, in LayoutPDFReader.read_pdf(self, path_or_url, contents)
     71 parser_response = self._parse_pdf(pdf_file)
     72 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 73 blocks = response_json['return_dict']['result']['blocks']
     74 return Document(blocks)

KeyError: 'return_dict'

madhuprakash19 commented 1 week ago

File "/Users/tpmpraka/miniconda3/envs/grm/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py", line 74, in read_pdf
    blocks = response_json['return_dict']['result']['blocks']
KeyError: 'return_dict'

The response JSON was {'reason': "'style'", 'status': 'fail'}
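
For anyone debugging this, the tracebacks above show that read_pdf indexes response_json['return_dict'] directly, so a failure payload from the server surfaces only as KeyError: 'return_dict'. Below is a minimal sketch of a caller-side check that reports the server's status and reason instead. The names extract_blocks and ParserResponseError are hypothetical and are not part of llmsherpa; the only assumptions about the payload are the two shapes seen in this thread.

```python
from typing import Any


class ParserResponseError(RuntimeError):
    """Hypothetical error type: the parser response carried no block payload."""


def extract_blocks(response_json: Any) -> list:
    """Return response_json['return_dict']['result']['blocks'].

    If that path is missing, raise an error that includes the 'status' and
    'reason' fields, which is the failure shape reported in this thread.
    """
    try:
        return response_json["return_dict"]["result"]["blocks"]
    except (KeyError, TypeError) as exc:
        # Fall back to placeholders when the payload is not a dict at all.
        fields = response_json if isinstance(response_json, dict) else {}
        status = fields.get("status", "unknown")
        reason = fields.get("reason", "not provided")
        raise ParserResponseError(
            f"Parser returned no blocks (status={status!r}, reason={reason!r})"
        ) from exc


if __name__ == "__main__":
    # The failure payload quoted in the comment above.
    failed = {"reason": "'style'", "status": "fail"}
    try:
        extract_blocks(failed)
    except ParserResponseError as err:
        print(err)
```

The reason value "'style'" has the same shape as a stringified Python KeyError, which may mean the failure happens on the server while parsing the document rather than in the client, but nothing in this thread confirms that.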