Seeing the following error for one of my PDFs with the new indent parser:
127.0.0.1 - - [13/Feb/2024 15:55:29] "POST /api/parseDocument?renderFormat=all&useNewIndentParser=yes HTTP/1.1" 200 -
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': True, 'parse_pages': (), 'apply_ocr': False}
processing page: 0 Number of p_tags.... 15
processing page: 1 Number of p_tags.... 19
processing page: 2 Number of p_tags.... 19
processing page: 3 Number of p_tags.... 6
processing page: 4 Number of p_tags.... 7
processing page: 5 Number of p_tags.... 12
processing page: 6 Number of p_tags.... 15
processing page: 7 Number of p_tags.... 3
processing page: 8 Number of p_tags.... 14
processing page: 9 Number of p_tags.... 5
processing page: 10 Number of p_tags.... 16
processing page: 11 Number of p_tags.... 5
processing page: 12 Number of p_tags.... 11
processing page: 13 Number of p_tags.... 14
processing page: 14 Number of p_tags.... 11
processing page: 15 Number of p_tags.... 7
processing page: 16 Number of p_tags.... 11
processing page: 17 Number of p_tags.... 12
processing page: 18 Number of p_tags.... 14
processing page: 19 Number of p_tags.... 18
processing page: 20 Number of p_tags.... 39
processing page: 21 Number of p_tags.... 1
processing page: 22 Number of p_tags.... 1
processing page: 23 Number of p_tags.... 1
processing page: 24 Number of p_tags.... 1
processing page: 25 Number of p_tags.... 1
processing page: 26 Number of p_tags.... 1
processing blocks in page: 1
processing blocks in page: 2
processing blocks in page: 3
processing blocks in page: 4
processing blocks in page: 5
processing blocks in page: 6
processing blocks in page: 8
processing blocks in page: 9
processing blocks in page: 10
processing blocks in page: 11
processing blocks in page: 12
processing blocks in page: 13
processing blocks in page: 14
processing blocks in page: 15
processing blocks in page: 16
processing blocks in page: 17
processing blocks in page: 18
processing blocks in page: 19
processing blocks in page: 20
error uploading file, stacktrace: Traceback (most recent call last):
File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/main.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 175, in parse_blocks
indent_parser.indent()
File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/new_indent_parser.py", line 254, in indent
self.indent_leafs()
File "/root/git/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/new_indent_parser.py", line 244, in indent_leafs
block['level'] = curr_header['level'] + 1
TypeError: 'NoneType' object is not subscriptable
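
The last frame suggests that curr_header is still None when indent_leafs reaches a block, presumably because no header has been seen before it. Below is a minimal sketch of a guard for that case. It is not the actual nlm-ingestor code: the function body, the "block_type" key, and the choice of level 0 as the fallback are all assumptions made for illustration.

```python
def indent_leafs(blocks):
    """Hypothetical stand-in for the parser's indent_leafs method.

    Gives each non-header block a level one deeper than the nearest
    preceding header. The "block_type" key is an assumption; only the
    "level" key appears in the traceback.
    """
    curr_header = None
    for block in blocks:
        if block.get("block_type") == "header":
            curr_header = block
            continue
        if curr_header is None:
            # No header seen yet: indexing None here is what raises
            # "TypeError: 'NoneType' object is not subscriptable".
            # Assumed fallback: treat the block as top level.
            block["level"] = 0
        else:
            block["level"] = curr_header["level"] + 1
    return blocks


# Example input: a paragraph that appears before any header.
sample = [
    {"block_type": "para"},
    {"block_type": "header", "level": 2},
    {"block_type": "para"},
]
result = indent_leafs(sample)
```

Whether level 0 is the right fallback depends on how the rest of the parser treats blocks without a header, so the maintainers may prefer a different default.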