Open opiethehokie opened 5 months ago
I'm getting the same error on a different scenario, related to tables.
Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
return_dict, _ = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
self.parse(pages)
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
self.organize_and_indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 2676, in organize_and_indent_blocks
block_idx, footer_count = self.build_table(block_idx, organized_blocks, table_start_idx,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3070, in build_table
tr_block = organized_blocks[table_start_idx]
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
IndexError: list index out of range
Might be related to a couple of sequential .pop calls without updating table_start_idx.
https://github.com/nlmatics/nlm-ingestor/blob/465e6a1a72619015d194d92ca290176b87f1afe7/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py#L2072
A dirty fix was to replace the index used at visual_ingestor.py#L3070 with min(table_start_idx, len(organized_blocks) - 1).
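To illustrate the suspected failure mode and the clamp workaround, here is a minimal sketch. It assumes the cause really is a saved index going stale after pops. The helper `clamp_index` and the toy list contents are hypothetical and are not taken from nlm-ingestor; only the names `organized_blocks` and `table_start_idx` come from the traceback.

```python
def clamp_index(idx, items):
    """Clamp idx into the valid index range of items (hypothetical helper)."""
    if not items:
        raise ValueError("cannot index into an empty list")
    return max(0, min(idx, len(items) - 1))


# Toy stand-ins for the real data; not the library's actual structures.
organized_blocks = ["b0", "b1", "b2", "b3"]
table_start_idx = 3  # saved while the list still had four items

# Two sequential pops shrink the list to two items,
# but the saved index is never adjusted.
organized_blocks.pop()
organized_blocks.pop()

try:
    organized_blocks[table_start_idx]
    stale_index_failed = False
except IndexError:
    stale_index_failed = True  # same error type as in the traceback

# The "dirty fix": fall back to the last valid position instead of raising.
tr_block = organized_blocks[clamp_index(table_start_idx, organized_blocks)]
```

Note that clamping only avoids the exception. If the index is stale, the block it lands on may not be the real start of the table, so adjusting `table_start_idx` wherever the pops happen is probably the cleaner fix.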
Seeing the following error for one of my PDFs:
127.0.0.1 - - [13/Feb/2024 14:58:59] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 14:58:59] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:__main__:Parsing document: 3f367d70-dccc-47ce-a17d-c6689fcb88d2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmpt5dhdwuq.pdf with name 3f367d70-dccc-47ce-a17d-c6689fcb88d2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 45.7038ms on workspace
processing page: 1
Number of p_tags.... 2
processing page: 4
Number of p_tags.... 4
group buf still has: 1 •
processing blocks in page: 4
error uploading file, stacktrace: Traceback (most recent call last):
File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
title_page_fonts = top_pages_info(parsed_doc)
File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 254, in top_pages_info
temp, title_candidates = retrieve_title_candidates(i)
File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 237, in retrieve_title_candidates
for freq in sorted_freq[list(sorted_freq.keys())[key_idx]]:
IndexError: list index out of range
Sorry, I'm unable to share the file. Updating the condition in pdf_ingestor.py line 35 to check whether len(sorted_freq) is greater than key_idx, instead of greater than 0, has allowed me to get past this, but it's not clear to me whether that's the best fix.
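For reference, the guard described above could look roughly like the sketch below. The function name `fonts_at_rank` and the shape of the data (a dict mapping a frequency to a list of fonts) are assumptions made for illustration. Only `sorted_freq`, `key_idx`, and the indexing expression come from the traceback.

```python
def fonts_at_rank(sorted_freq, key_idx):
    """Return the entries stored under the key at position key_idx.

    Hypothetical stand-in for the failing loop. Returns an empty list
    when key_idx does not point at an existing key, instead of raising.
    """
    keys = list(sorted_freq.keys())
    # Checking len(sorted_freq) > 0 is not enough: a dict with one key
    # still fails for key_idx == 1. Compare against key_idx itself.
    if not 0 <= key_idx < len(keys):
        return []
    return list(sorted_freq[keys[key_idx]])


# Assumed toy data: frequency -> fonts seen with that frequency.
sample = {12: ["font-a", "font-b"], 5: ["font-c"]}
second = fonts_at_rank(sample, 1)   # existing key at position 1
missing = fonts_at_rank(sample, 2)  # out of range, no exception
```

Whether silently returning nothing is the right behavior depends on what the caller expects when a page has fewer distinct font frequencies than it asks for, so this only shows the bounds check itself.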