nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

Bug: Markdown parsing error #83

Open jamesvillarrubia opened 3 months ago


The following markdown:

A horizontal rule follows.

***

also

size  material      color
----  ------------  ------------
9     leather       brown
10    hemp canvas   natural
11    glass         transparent

produces the following error:

parser-1    | error uploading file, stacktrace: Traceback (most recent call last):
parser-1    |   File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
parser-1    |     return_dict, _ = ingestor_api.ingest_document(
parser-1    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
parser-1    |   File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 41, in ingest_document
parser-1    |     ingestor = markdown_parser.MarkdownDocument(doc_location)
parser-1    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
parser-1    |   File "/app/nlm_ingestor/file_parser/markdown_parser.py", line 163, in __init__
parser-1    |     self.blocks, self.html_str = parse_markdown_to_blocks(markdown_text)
parser-1    |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
parser-1    |   File "/app/nlm_ingestor/file_parser/markdown_parser.py", line 37, in parse_markdown_to_blocks
parser-1    |     cur_blocks = {
parser-1    |                  ^
parser-1    | KeyError: 'thematic_break'
parser-1    | Traceback (most recent call last):
parser-1    |   File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
parser-1    |     return_dict, _ = ingestor_api.ingest_document(
parser-1    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
parser-1    |   File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 41, in ingest_document
parser-1    |     ingestor = markdown_parser.MarkdownDocument(doc_location)
parser-1    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
parser-1    |   File "/app/nlm_ingestor/file_parser/markdown_parser.py", line 163, in __init__
parser-1    |     self.blocks, self.html_str = parse_markdown_to_blocks(markdown_text)
parser-1    |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
parser-1    |   File "/app/nlm_ingestor/file_parser/markdown_parser.py", line 37, in parse_markdown_to_blocks
parser-1    |     cur_blocks = {
parser-1    |                  ^
parser-1    | KeyError: 'thematic_break'
parser-1    | 192.168.65.1 - - [07/Aug/2024 00:49:13] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 500 -
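
The traceback points at a dictionary lookup keyed by token type, and the key `thematic_break` (the horizontal rule `***`) is missing from it. The snippet below is only a minimal sketch of that failure mode and one possible way around it. It does not reproduce the actual code in `markdown_parser.py`: the token list, `HANDLERS`, `to_blocks_strict`, and `to_blocks_lenient` are all hypothetical names made up for illustration, and only the `thematic_break` type name comes from the error above.

```python
# Hypothetical token stream, shaped like the output a Markdown tokenizer might
# produce. Only the "thematic_break" type name is taken from the KeyError in
# the traceback; everything else is an assumption for illustration.
TOKENS = [
    {"type": "paragraph", "text": "A horizontal rule follows."},
    {"type": "thematic_break"},
    {"type": "paragraph", "text": "also"},
]

# Hypothetical handler table keyed by token type (not nlm-ingestor's real one).
HANDLERS = {
    "paragraph": lambda tok: [{"block_type": "para", "text": tok["text"]}],
}


def to_blocks_strict(tokens):
    """Direct indexing: raises KeyError for any token type without a handler."""
    blocks = []
    for tok in tokens:
        blocks.extend(HANDLERS[tok["type"]](tok))
    return blocks


def to_blocks_lenient(tokens):
    """Look handlers up with .get() and skip token types that have none."""
    blocks = []
    for tok in tokens:
        handler = HANDLERS.get(tok["type"])
        if handler is None:
            continue  # e.g. "thematic_break": carries no text, so skip it
        blocks.extend(handler(tok))
    return blocks
```

With these definitions, `to_blocks_strict(TOKENS)` fails with `KeyError: 'thematic_break'`, while `to_blocks_lenient(TOKENS)` returns the two paragraph blocks. Whether skipping unknown token types is the right behavior for the real parser (as opposed to adding an explicit handler for horizontal rules) is a design question for the maintainers.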