When posting my PDF to the server, I receive the following error in logs:
Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
self.parse(pages)
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
self.organize_and_indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
indent.indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
indent, level_stack, indent_reason = get_level(class_name)
^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': True, 'parse_pages': (), 'apply_ocr': True}
processing page: 0 Number of p_tags.... 5
processing page: 1 Number of p_tags.... 17
processing page: 2 Number of p_tags.... 199
processing page: 3 Number of p_tags.... 184
processing page: 4 Number of p_tags.... 14
splitting line: 700+ 300+
processing page: 5 Number of p_tags.... 28
processing page: 6 Number of p_tags.... 43
processing page: 7 Number of p_tags.... 46
processing page: 8 Number of p_tags.... 33
processing page: 9 Number of p_tags.... 15
processing page: 10 Number of p_tags.... 24
processing page: 11 Number of p_tags.... 68
processing page: 12 Number of p_tags.... 28
processing page: 13 Number of p_tags.... 40
processing page: 14 Number of p_tags.... 42
processing page: 15 Number of p_tags.... 47
processing page: 16 Number of p_tags.... 47
processing page: 17 Number of p_tags.... 59
processing page: 18 Number of p_tags.... 39
processing page: 19 Number of p_tags.... 42
processing page: 20 Number of p_tags.... 49
processing page: 21 Number of p_tags.... 49
processing page: 22 Number of p_tags.... 55
processing page: 23 Number of p_tags.... 34
processing page: 24 Number of p_tags.... 20
processing page: 25 Number of p_tags.... 42
processing page: 26 Number of p_tags.... 49
processing page: 27 Number of p_tags.... 40
processing page: 28 Number of p_tags.... 47
processing page: 29 Number of p_tags.... 34
processing page: 30 Number of p_tags.... 21
processing page: 31 Number of p_tags.... 42
processing page: 32 Number of p_tags.... 47
processing page: 33 Number of p_tags.... 48
processing page: 34 Number of p_tags.... 20
processing page: 35 Number of p_tags.... 34
processing page: 36 Number of p_tags.... 16
processing page: 37 Number of p_tags.... 31
processing page: 38 Number of p_tags.... 34
processing page: 39 Number of p_tags.... 35
processing page: 40 Number of p_tags.... 46
processing page: 41 Number of p_tags.... 50
processing page: 42 Number of p_tags.... 45
processing page: 43 Number of p_tags.... 39
processing page: 44 Number of p_tags.... 48
processing page: 45 Number of p_tags.... 47
processing page: 46 Number of p_tags.... 41
processing page: 47 Number of p_tags.... 44
processing page: 48 Number of p_tags.... 44
processing page: 49 Number of p_tags.... 46
processing page: 50 Number of p_tags.... 47
processing page: 51 Number of p_tags.... 21
processing page: 52 Number of p_tags.... 40
processing page: 53 Number of p_tags.... 18
processing page: 54 Number of p_tags.... 10
processing page: 55 Number of p_tags.... 4
processing page: 56 Number of p_tags.... 39
processing page: 57 Number of p_tags.... 92
processing page: 58 Number of p_tags.... 162
processing page: 59 Number of p_tags.... 150
processing page: 60 Number of p_tags.... 21
processing page: 61 Number of p_tags.... 33
processing page: 62 Number of p_tags.... 7
processing blocks in page: 1
processing blocks in page: 2
processing blocks in page: 3
processing blocks in page: 4
processing blocks in page: 5
processing blocks in page: 6
processing blocks in page: 7
processing blocks in page: 8
processing blocks in page: 9
processing blocks in page: 10
processing blocks in page: 11
processing blocks in page: 12
processing blocks in page: 12
processing blocks in page: 13
processing blocks in page: 14
processing blocks in page: 15
processing blocks in page: 16
processing blocks in page: 17
processing blocks in page: 18
processing blocks in page: 19
processing blocks in page: 19
processing blocks in page: 20
processing blocks in page: 20
processing blocks in page: 21
processing blocks in page: 22
processing blocks in page: 23
processing blocks in page: 24
processing blocks in page: 25
processing blocks in page: 26
processing blocks in page: 26
processing blocks in page: 27
processing blocks in page: 28
processing blocks in page: 29
processing blocks in page: 30
processing blocks in page: 30
processing blocks in page: 31
processing blocks in page: 32
processing blocks in page: 33
processing blocks in page: 34
processing blocks in page: 35
processing blocks in page: 36
processing blocks in page: 37
processing blocks in page: 38
processing blocks in page: 39
processing blocks in page: 40
processing blocks in page: 41
processing blocks in page: 42
processing blocks in page: 43
processing blocks in page: 44
processing blocks in page: 45
processing blocks in page: 46
processing blocks in page: 47
processing blocks in page: 48
processing blocks in page: 49
processing blocks in page: 50
processing blocks in page: 51
processing blocks in page: 53
processing blocks in page: 52
processing blocks in page: 54
processing blocks in page: 55
processing blocks in page: 56
processing blocks in page: 57
processing blocks in page: 58
processing blocks in page: 59
processing blocks in page: 60
processing blocks in page: 61
processing blocks in page: 62
error uploading file, stacktrace: error uploading file, stacktrace: Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
self.parse(pages)
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
self.organize_and_indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
indent.indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
indent, level_stack, indent_reason = get_level(class_name)
^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''
Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
self.parse(pages)
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
self.organize_and_indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
indent.indent_blocks()
File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
indent, level_stack, indent_reason = get_level(class_name)
^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''
172.17.0.1 - - [03/Apr/2024 07:54:14] "POST /api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes HTTP/1.1" 500 -
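Reading the last frame, `l["list_type"]` appears to be an empty string that is not a key of `list_indents`, which would explain `KeyError: ''`. Below is a minimal sketch of a guarded lookup. It assumes `list_indents` maps list-type names to dicts holding a `parent_list_idx` entry; the helper name and the data shapes are my guesses from the trace, not the library's actual code.

```python
from typing import Optional


def safe_parent_list_idx(list_indents: dict, level: dict) -> Optional[int]:
    """Return the parent list index for `level`, or None when its list type is unknown.

    Hypothetical helper: `list_indents` and `level` mirror the names in the
    traceback, but their real structure inside indent_parser.py is assumed.
    """
    # An empty or missing list_type falls through to None instead of raising KeyError.
    entry = list_indents.get(level.get("list_type", ""))
    if entry is None:
        return None
    return entry.get("parent_list_idx")


if __name__ == "__main__":
    indents = {"bullet": {"parent_list_idx": 2}}
    print(safe_parent_list_idx(indents, {"list_type": "bullet"}))
    print(safe_parent_list_idx(indents, {"list_type": ""}))
```

Whether returning `None` is the right fallback depends on how `get_level` uses the value afterwards, which I can't tell from the trace alone.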
I can't share the PDF here because of an NDA, but I can answer any questions about its pages.
If I understand the trace correctly, page 62 is causing the error, so here's a screenshot of the PDF's page 62:
Does the server have a soft-fail option for pages? Some pages may not be required and could be skipped when they are unparseable. It would be nice to have this feature.
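To illustrate what I mean by soft fail, here is a rough sketch, not based on the server's code. `parse_with_soft_fail` and the `parse_page` callback are hypothetical names; the idea is just to catch the exception per page, record it, and continue with the rest.

```python
from typing import Any, Callable, Iterable, List, Tuple


def parse_with_soft_fail(
    pages: Iterable[Any],
    parse_page: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[int, str]]]:
    """Parse every page, skipping the ones that raise.

    Returns (parsed_results, skipped), where skipped holds
    (page_index, error_description) for each page that failed.
    """
    parsed: List[Any] = []
    skipped: List[Tuple[int, str]] = []
    for idx, page in enumerate(pages):
        try:
            parsed.append(parse_page(page))
        except Exception as exc:  # soft fail: record the error and move on
            skipped.append((idx, repr(exc)))
    return parsed, skipped


if __name__ == "__main__":
    def demo_parse(page: str) -> str:
        # Stand-in parser that fails on one page, like the KeyError above.
        if page == "bad":
            raise KeyError("")
        return page.upper()

    results, failures = parse_with_soft_fail(["a", "bad", "c"], demo_parse)
    print(results, failures)
```

In the trace the failure happens in `indent_blocks`, which seems to run after all pages have been read, so a real implementation might have to skip at the block level rather than the page level.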
I am running the development server using docker on my local machine. The API URL I'm using is: