Seeing the following error for one of my PDFs:
127.0.0.1 - - [13/Feb/2024 14:05:43] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 14:05:43] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: 6dbf73f5-9d13-4d29-b330-898e98d755c2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmp4n1214e0.pdf with name 6dbf73f5-9d13-4d29-b330-898e98d755c2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 109.6956ms on workspace
processing page: 0 Number of p_tags.... 141
processing page: 1 Number of p_tags.... 52
processing blocks in page: 1
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/main.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 562, in parse
    self.json_dict = block_renderer.BlockRenderer(self).render_json()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 347, in render_json
    table_block["left"],
KeyError: 'left'
Sorry, I'm unable to share the file. Updating the condition at line 351 of block_renderer.py to also check that "left" is in table_block lets me get past this, but it's not clear to me whether that's the best fix.
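For reference, the kind of guard I mean looks roughly like the sketch below. This is not the actual code from block_renderer.py: the helper name `safe_bbox` is hypothetical, and only the "left" key comes from the traceback; the other key names are assumed for illustration.

```python
from typing import Any, Dict, List, Optional

# Only "left" is confirmed by the traceback; the remaining keys are assumptions.
BBOX_KEYS = ("left", "top", "right", "bottom")


def safe_bbox(table_block: Dict[str, Any]) -> Optional[List[Any]]:
    """Return the table's bounding box, or None if any coordinate is missing.

    Checking for the keys up front avoids the KeyError raised by
    table_block["left"] when a table block carries no position data.
    """
    if any(key not in table_block for key in BBOX_KEYS):
        return None
    return [table_block[key] for key in BBOX_KEYS]


if __name__ == "__main__":
    complete = {"left": 10, "top": 20, "right": 110, "bottom": 220}
    incomplete = {"top": 20}  # no "left", as in the failing PDF

    print(safe_bbox(complete))
    print(safe_bbox(incomplete))
```

A guard like this only hides the symptom: the table is rendered without a bounding box instead of failing. Whether table_block should ever reach render_json without "left" is the part I can't judge.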