nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

KeyError: 'left' #18

Open opiethehokie opened 7 months ago

opiethehokie commented 7 months ago

Seeing the following error for one of my PDFs:

127.0.0.1 - - [13/Feb/2024 14:05:43] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 14:05:43] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: 6dbf73f5-9d13-4d29-b330-898e98d755c2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmp4n1214e0.pdf with name 6dbf73f5-9d13-4d29-b330-898e98d755c2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 109.6956ms on workspace
processing page: 0
Number of p_tags.... 141
processing page: 1
Number of p_tags.... 52
processing blocks in page: 1
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/main.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 562, in parse
    self.json_dict = block_renderer.BlockRenderer(self).render_json()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 347, in render_json
    table_block["left"],
KeyError: 'left'

Sorry I'm unable to share the file. Updating the condition in block_renderer.py line 351 to check if "left" is in table_block has allowed me to get past this, but it's not clear to me if that's the best fix or not.
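For anyone wanting to try the same workaround, here is a minimal sketch of that kind of guard. It is not the actual code from block_renderer.py: the helper name `table_bbox` and the keys "top", "right" and "bottom" are assumptions made for illustration, since only "left" appears in the traceback.

```python
from typing import Any, Mapping, Optional

# Only "left" is confirmed by the traceback; the other three key names
# are assumed here purely for illustration.
BBOX_KEYS = ("left", "top", "right", "bottom")


def table_bbox(table_block: Mapping[str, Any]) -> Optional[list]:
    """Return [left, top, right, bottom], or None if any coordinate is missing."""
    # Check for the keys first instead of indexing directly, which is
    # what raises KeyError: 'left' when a table block has no coordinates.
    if not all(key in table_block for key in BBOX_KEYS):
        return None
    return [table_block[key] for key in BBOX_KEYS]


# Hypothetical table blocks: one complete, one without coordinates.
blocks = [
    {"left": 10.0, "top": 20.0, "right": 110.0, "bottom": 80.0},
    {},
]
bboxes = [table_bbox(block) for block in blocks]
```

A guard like this only avoids the crash. Whether a table block without coordinates should be skipped, or indicates a parsing problem further upstream, is a separate question.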

stefanknegt commented 7 months ago

I am experiencing the same issue, any update on this?

yparwani commented 5 months ago

Have the same issue for some PDFs

kiran-nlmatics commented 5 months ago

Please pull from the main branch and let me know if the issue is still observed.

stefanknegt commented 4 months ago

Seems to be fixed for me! Thanks