nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
971 stars 124 forks

TypeError: 'NoneType' object is not subscriptable #21

Open opiethehokie opened 5 months ago

opiethehokie commented 5 months ago

Seeing the following error for one of my PDFs with the new indent parser:

127.0.0.1 - - [13/Feb/2024 15:55:29] "POST /api/parseDocument?renderFormat=all&useNewIndentParser=yes HTTP/1.1" 200 -
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': True, 'parse_pages': (), 'apply_ocr': False}
processing page: 0
Number of p_tags.... 15
processing page: 1
Number of p_tags.... 19
processing page: 2
Number of p_tags.... 19
processing page: 3
Number of p_tags.... 6
processing page: 4
Number of p_tags.... 7
processing page: 5
Number of p_tags.... 12
processing page: 6
Number of p_tags.... 15
processing page: 7
Number of p_tags.... 3
processing page: 8
Number of p_tags.... 14
processing page: 9
Number of p_tags.... 5
processing page: 10
Number of p_tags.... 16
processing page: 11
Number of p_tags.... 5
processing page: 12
Number of p_tags.... 11
processing page: 13
Number of p_tags.... 14
processing page: 14
Number of p_tags.... 11
processing page: 15
Number of p_tags.... 7
processing page: 16
Number of p_tags.... 11
processing page: 17
Number of p_tags.... 12
processing page: 18
Number of p_tags.... 14
processing page: 19
Number of p_tags.... 18
processing page: 20
Number of p_tags.... 39
processing page: 21
Number of p_tags.... 1
processing page: 22
Number of p_tags.... 1
processing page: 23
Number of p_tags.... 1
processing page: 24
Number of p_tags.... 1
processing page: 25
Number of p_tags.... 1
processing page: 26
Number of p_tags.... 1
processing blocks in page: 1
processing blocks in page: 2
processing blocks in page: 3
processing blocks in page: 4
processing blocks in page: 5
processing blocks in page: 6
processing blocks in page: 8
processing blocks in page: 9
processing blocks in page: 10
processing blocks in page: 11
processing blocks in page: 12
processing blocks in page: 13
processing blocks in page: 14
processing blocks in page: 15
processing blocks in page: 16
processing blocks in page: 17
processing blocks in page: 18
processing blocks in page: 19
processing blocks in page: 20
error uploading file, stacktrace:
Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/main.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 175, in parse_blocks
    indent_parser.indent()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/new_indent_parser.py", line 254, in indent
    self.indent_leafs()
  File "/root/git/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/new_indent_parser.py", line 244, in indent_leafs
    block['level'] = curr_header['level'] + 1
TypeError: 'NoneType' object is not subscriptable
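The last frame of the traceback shows `curr_header` being `None` when `block['level'] = curr_header['level'] + 1` runs. One plausible cause (an assumption, not confirmed from the repository) is a leaf block that appears before any header has been seen. The sketch below reproduces that situation with a guard. It is not the code in `new_indent_parser.py`: the function name `assign_leaf_levels`, the `block_type` key, and the `default_level` fallback are all hypothetical and chosen only for illustration.

```python
def assign_leaf_levels(blocks, default_level=0):
    """Hypothetical stand-in for the failing loop; NOT the library's code.

    Each non-header block gets the level of the most recent header plus one.
    Blocks that appear before any header fall back to default_level instead
    of indexing into None.
    """
    curr_header = None  # no header seen yet
    for block in blocks:
        if block.get("block_type") == "header":  # "block_type" is an assumed key
            block.setdefault("level", default_level)
            curr_header = block
        elif curr_header is None:
            # Guard: without this branch, curr_header["level"] raises
            # TypeError: 'NoneType' object is not subscriptable
            block["level"] = default_level
        else:
            block["level"] = curr_header["level"] + 1
    return blocks


# Made-up sample: a paragraph that precedes the first header.
sample = [
    {"block_type": "para", "text": "Text before any header"},
    {"block_type": "header", "text": "Section 1", "level": 0},
    {"block_type": "para", "text": "Body text"},
]
levels = [b["level"] for b in assign_leaf_levels(sample)]
```

Whether falling back to a default level is the right behavior for the new indent parser is a design question for the maintainers. The sketch only shows where a `None` check would prevent the crash.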

EzequielAlejandroLastra commented 2 weeks ago

I have the same error.