nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

Error when parsing a PDF #44

Open kaulshashank opened 6 months ago

kaulshashank commented 6 months ago

I am running the development server using docker on my local machine.

The API URL I'm using is:

http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes
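
For anyone reproducing this, here is a minimal sketch of how that request URL can be assembled in Python using only the standard library. The query parameter names and values come from the URL above; the helper name `build_parse_url` and its arguments are hypothetical. Uploading the PDF itself is deliberately left out, since the thread does not show how the file is posted.

```python
from urllib.parse import urlencode


def build_parse_url(
    base="http://localhost:5010/api/parseDocument",
    render_format="all",
    apply_ocr=True,
    new_indent_parser=True,
):
    """Build the parseDocument URL (hypothetical helper, not part of nlm-ingestor)."""
    # Parameter names mirror the URL quoted in this issue.
    params = {"renderFormat": render_format}
    if apply_ocr:
        params["applyOcr"] = "yes"
    if new_indent_parser:
        params["useNewIndentParser"] = "yes"
    return f"{base}?{urlencode(params)}"


url = build_parse_url()
print(url)
```

Dropping one flag at a time (for example `build_parse_url(new_indent_parser=False)`) may help narrow down whether the failure depends on a particular option.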

When posting my PDF to the server, I receive the following error in logs:

Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
    indent.indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
    indent, level_stack, indent_reason = get_level(class_name)
                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
    parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
                      ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''

testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': True, 'parse_pages': (), 'apply_ocr': True}
processing page:  0  Number of p_tags....  5
processing page:  1  Number of p_tags....  17
processing page:  2  Number of p_tags....  199
processing page:  3  Number of p_tags....  184
processing page:  4  Number of p_tags....  14
splitting line: 700+ 300+
processing page:  5  Number of p_tags....  28
processing page:  6  Number of p_tags....  43
processing page:  7  Number of p_tags....  46
processing page:  8  Number of p_tags....  33
processing page:  9  Number of p_tags....  15
processing page:  10  Number of p_tags....  24
processing page:  11  Number of p_tags....  68
processing page:  12  Number of p_tags....  28
processing page:  13  Number of p_tags....  40
processing page:  14  Number of p_tags....  42
processing page:  15  Number of p_tags....  47
processing page:  16  Number of p_tags....  47
processing page:  17  Number of p_tags....  59
processing page:  18  Number of p_tags....  39
processing page:  19  Number of p_tags....  42
processing page:  20  Number of p_tags....  49
processing page:  21  Number of p_tags....  49
processing page:  22  Number of p_tags....  55
processing page:  23  Number of p_tags....  34
processing page:  24  Number of p_tags....  20
processing page:  25  Number of p_tags....  42
processing page:  26  Number of p_tags....  49
processing page:  27  Number of p_tags....  40
processing page:  28  Number of p_tags....  47
processing page:  29  Number of p_tags....  34
processing page:  30  Number of p_tags....  21
processing page:  31  Number of p_tags....  42
processing page:  32  Number of p_tags....  47
processing page:  33  Number of p_tags....  48
processing page:  34  Number of p_tags....  20
processing page:  35  Number of p_tags....  34
processing page:  36  Number of p_tags....  16
processing page:  37  Number of p_tags....  31
processing page:  38  Number of p_tags....  34
processing page:  39  Number of p_tags....  35
processing page:  40  Number of p_tags....  46
processing page:  41  Number of p_tags....  50
processing page:  42  Number of p_tags....  45
processing page:  43  Number of p_tags....  39
processing page:  44  Number of p_tags....  48
processing page:  45  Number of p_tags....  47
processing page:  46  Number of p_tags....  41
processing page:  47  Number of p_tags....  44
processing page:  48  Number of p_tags....  44
processing page:  49  Number of p_tags....  46
processing page:  50  Number of p_tags....  47
processing page:  51  Number of p_tags....  21
processing page:  52  Number of p_tags....  40
processing page:  53  Number of p_tags....  18
processing page:  54  Number of p_tags....  10
processing page:  55  Number of p_tags....  4
processing page:  56  Number of p_tags....  39
processing page:  57  Number of p_tags....  92
processing page:  58  Number of p_tags....  162
processing page:  59  Number of p_tags....  150
processing page:  60  Number of p_tags....  21
processing page:  61  Number of p_tags....  33
processing page:  62  Number of p_tags....  7
processing blocks in page:  1
processing blocks in page:  2
processing blocks in page:  3
processing blocks in page:  4
processing blocks in page:  5
processing blocks in page:  6
processing blocks in page:  7
processing blocks in page:  8
processing blocks in page:  9
processing blocks in page:  10
processing blocks in page:  11
processing blocks in page:  12
processing blocks in page:  12
processing blocks in page:  13
processing blocks in page:  14
processing blocks in page:  15
processing blocks in page:  16
processing blocks in page:  17
processing blocks in page:  18
processing blocks in page:  19
processing blocks in page:  19
processing blocks in page:  20
processing blocks in page:  20
processing blocks in page:  21
processing blocks in page:  22
processing blocks in page:  23
processing blocks in page:  24
processing blocks in page:  25
processing blocks in page:  26
processing blocks in page:  26
processing blocks in page:  27
processing blocks in page:  28
processing blocks in page:  29
processing blocks in page:  30
processing blocks in page:  30
processing blocks in page:  31
processing blocks in page:  32
processing blocks in page:  33
processing blocks in page:  34
processing blocks in page:  35
processing blocks in page:  36
processing blocks in page:  37
processing blocks in page:  38
processing blocks in page:  39
processing blocks in page:  40
processing blocks in page:  41
processing blocks in page:  42
processing blocks in page:  43
processing blocks in page:  44
processing blocks in page:  45
processing blocks in page:  46
processing blocks in page:  47
processing blocks in page:  48
processing blocks in page:  49
processing blocks in page:  50
processing blocks in page:  51
processing blocks in page:  53
processing blocks in page:  52
processing blocks in page:  54
processing blocks in page:  55
processing blocks in page:  56
processing blocks in page:  57
processing blocks in page:  58
processing blocks in page:  59
processing blocks in page:  60
processing blocks in page:  61
processing blocks in page:  62
error uploading file, stacktrace:  error uploading file, stacktrace: Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
    indent.indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
    indent, level_stack, indent_reason = get_level(class_name)
                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
    parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
                      ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''
172.17.0.1 - - [03/Apr/2024 07:54:14] "POST /api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes HTTP/1.1" 500 -
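
The `KeyError: ''` in the trace suggests that a level entry whose `list_type` is an empty string is being used as a key into `list_indents`. The snippet below is only a sketch of that failure mode and of one possible guard. The contents of `list_indents` and `level_stack` are made up for illustration, and both helper functions are hypothetical, not the project's actual code.

```python
# Hypothetical stand-ins for the structures named in the traceback.
list_indents = {"bullet": {"parent_list_idx": 0}}
level_stack = [{"list_type": "bullet"}, {"list_type": ""}]


def parent_idx_strict(entry):
    # Same lookup shape as the failing line: raises KeyError for unknown list types.
    return list_indents[entry["list_type"]]["parent_list_idx"]


def parent_idx_guarded(entry, default=None):
    # Guarded variant: fall back to a default when the list type is not registered.
    info = list_indents.get(entry["list_type"])
    return default if info is None else info["parent_list_idx"]


try:
    parent_idx_strict(level_stack[1])
except KeyError as exc:
    print("KeyError:", repr(exc.args[0]))

print(parent_idx_guarded(level_stack[1]))
```

Whether returning a default is the right behavior for the indent parser is an open question. The sketch only shows that an empty `list_type` is enough to trigger this exact exception.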

I am unable to share the PDF here due to NDA reasons but I can answer any questions regarding the PDF pages if you have any.

If I understand the trace correctly, page 62 is causing the error, so here's a screenshot of the PDF's page 62:

[image: screenshot of page 62]

kaulshashank commented 6 months ago

Does the server have a soft-fail option for pages? Some pages may not be required and could be skipped if they cannot be parsed. It would be nice to have this feature.
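
To make the request concrete, here is a rough sketch of what a soft-fail mode could look like. Everything in it is hypothetical: `parse_pages_soft` and `demo_parse_page` are illustrative names, not functions in nlm-ingestor. Note also that in the trace above the exception appears to be raised in a document-level indentation step after all pages have been processed, so per-page skipping alone might not cover this case.

```python
def parse_pages_soft(pages, parse_page):
    """Parse each page independently, recording failures instead of aborting.

    `parse_page` is any callable that takes one page and returns its parsed form.
    """
    parsed, skipped = [], []
    for page_idx, page in enumerate(pages):
        try:
            parsed.append((page_idx, parse_page(page)))
        except Exception as exc:
            # Keep the page index and reason so the caller can report skipped pages.
            skipped.append((page_idx, f"{type(exc).__name__}: {exc}"))
    return parsed, skipped


def demo_parse_page(page):
    # Toy parser used only for this example: empty pages are "unparseable".
    if not page:
        raise ValueError("empty page")
    return page.upper()


parsed, skipped = parse_pages_soft(["intro", "", "appendix"], demo_parse_page)
print(parsed)
print(skipped)
```

Returning the skipped pages alongside the parsed ones would let the API respond successfully while still telling the client which pages were dropped and why.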

thomasBourdin commented 4 months ago

I have the same error on my side.