nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

IndexError: list index out of range #19

Open opiethehokie opened 5 months ago

opiethehokie commented 5 months ago

Seeing the following error for one of my PDFs:

127.0.0.1 - - [13/Feb/2024 14:58:59] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 14:58:59] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: 3f367d70-dccc-47ce-a17d-c6689fcb88d2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmpt5dhdwuq.pdf with name 3f367d70-dccc-47ce-a17d-c6689fcb88d2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 45.7038ms on workspace
processing page: 1
Number of p_tags.... 2
processing page: 4
Number of p_tags.... 4
group buf still has: 1 •
processing blocks in page: 4
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/main.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    title_page_fonts = top_pages_info(parsed_doc)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 254, in top_pages_info
    temp, title_candidates = retrieve_title_candidates(i)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 237, in retrieve_title_candidates
    for freq in sorted_freq[list(sorted_freq.keys())[key_idx]]:
IndexError: list index out of range

Sorry, I'm unable to share the file. Updating the condition in pdf_ingestor.py line 35 to check whether len(sorted_freq) is greater than key_idx, instead of greater than 0, has allowed me to get past this, but it's not clear to me whether that's the best fix.
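
For anyone hitting the same thing, here is a minimal sketch of the guard described above. It is not the code from pdf_ingestor.py: the helper name pick_candidates is hypothetical, and it assumes sorted_freq is a dict whose values are lists, which is only inferred from the failing line in the traceback.

```python
def pick_candidates(sorted_freq, key_idx):
    """Return the entries under the key at position key_idx, or [] if there is no such key.

    Hypothetical helper: sorted_freq is assumed to be a dict whose values are lists.
    """
    keys = list(sorted_freq.keys())
    # Guard on key_idx itself, not just on the mapping being non-empty.
    if len(keys) > key_idx:
        return list(sorted_freq[keys[key_idx]])
    return []


if __name__ == "__main__":
    freq = {12: ["Arial-Bold"], 9: ["Arial"]}
    print(pick_candidates(freq, 1))  # second key exists
    print(pick_candidates(freq, 2))  # no third key, so no IndexError
```

A check like len(sorted_freq) > 0 only protects key_idx == 0. Any larger key_idx can still run past the end of the key list, which would match the IndexError above.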

vitorhirota commented 1 month ago

I'm getting the same error in a different scenario, related to tables.

Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
    return_dict, _ = ingestor_api.ingest_document(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 2676, in organize_and_indent_blocks
    block_idx, footer_count = self.build_table(block_idx, organized_blocks, table_start_idx,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3070, in build_table
    tr_block = organized_blocks[table_start_idx]
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Might be related to a couple of sequential .pops without updating table_start_idx. https://github.com/nlmatics/nlm-ingestor/blob/465e6a1a72619015d194d92ca290176b87f1afe7/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py#L2072
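
To illustrate the suspected failure mode, here is a toy sketch of how popping from a list can leave a saved index stale. None of this is code from visual_ingestor.py: drop_blocks, blocks, and tracked_idx are hypothetical names, and whether the real code behaves this way is unconfirmed.

```python
def drop_blocks(blocks, drop_positions, tracked_idx):
    """Pop the given positions from blocks in place and return the adjusted tracked index.

    Hypothetical helper: every removal that happens before the tracked position
    shifts the tracked element one slot to the left.
    """
    # Pop from the highest position down so earlier pops do not shift later ones.
    for pos in sorted(drop_positions, reverse=True):
        blocks.pop(pos)
        if pos < tracked_idx:
            tracked_idx -= 1
    return tracked_idx


if __name__ == "__main__":
    blocks = ["a", "b", "c", "d", "e", "f"]
    idx = 4  # points at "e"
    idx = drop_blocks(blocks, [1, 2], idx)
    print(blocks[idx])  # still "e"; the stale index 4 would now be out of range
```

This sketch does not handle the case where the tracked element itself is popped; that would need its own rule.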

Dirty fix was to update visual_ingestor.py#L3070 with min(table_start_idx, len(organized_blocks) - 1)
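
A minimal sketch of that clamp, written as a standalone helper so it can be tried in isolation. The name clamped_get is hypothetical and this is not the project's code. One caveat: with an empty list, len(items) - 1 is -1, and indexing an empty list with -1 still raises IndexError, so the sketch handles that case separately.

```python
def clamped_get(items, idx):
    """Return items[min(idx, len(items) - 1)], or None when items is empty.

    Hypothetical helper mirroring the workaround above.
    """
    if not items:
        # min(idx, -1) would index an empty list and still raise IndexError.
        return None
    return items[min(idx, len(items) - 1)]


if __name__ == "__main__":
    rows = ["header", "row 1", "row 2"]
    print(clamped_get(rows, 1))   # in range, returned as usual
    print(clamped_get(rows, 10))  # out of range, clamped to the last element
```

Clamping avoids the crash, but an out-of-range index then silently resolves to the last element, which may not be the block that was intended.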