nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

Encoding error with non-ASCII character. #33

Open jamesvillarrubia opened 4 months ago

jamesvillarrubia commented 4 months ago

Parsing fails with what looks like an encoding error on '½'.

Happy to submit a PR if someone can point me in the right direction for this conversion.

nlm-ingestor-1  | testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
.....

processing blocks in page:  317
nlm-ingestor-1  | processing blocks in page:  318
nlm-ingestor-1  | ERROR:root:could not convert string to float: '½'
nlm-ingestor-1  | ERROR:root:could not convert string to float: '½'
nlm-ingestor-1  | ERROR:root:could not convert string to float: '½'
nlm-ingestor-1  | ERROR:__main__:error uploading file, stacktrace: Traceback (most recent call last):
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
nlm-ingestor-1  |     ingest_status, return_dict = ingestor_api.ingest_document(
nlm-ingestor-1  |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
nlm-ingestor-1  |     pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
nlm-ingestor-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
nlm-ingestor-1  |     blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
nlm-ingestor-1  |                                                                             ^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
nlm-ingestor-1  |     parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
nlm-ingestor-1  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
nlm-ingestor-1  |     self.parse(pages)
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 562, in parse
nlm-ingestor-1  |     self.json_dict = block_renderer.BlockRenderer(self).render_json()
nlm-ingestor-1  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 347, in render_json
nlm-ingestor-1  |     table_block["left"],
nlm-ingestor-1  |     ~~~~~~~~~~~^^^^^^^^
nlm-ingestor-1  | KeyError: 'left'
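
On the conversion itself: `could not convert string to float: '½'` is the message Python's built-in `float()` raises for this character, while the standard library's `unicodedata.numeric` maps '½' to 0.5. Below is a minimal sketch of a tolerant converter. `to_float` is a hypothetical helper, not a function in nlm-ingestor, and the log does not show which call site raises the error, so where it would plug in is an assumption.

```python
import unicodedata


def to_float(text, default=None):
    """Best-effort float conversion that also accepts characters like '½'.

    Hypothetical helper for illustration; not part of nlm-ingestor.
    """
    text = text.strip()
    try:
        # Fast path: ordinary numeric strings such as "3.25".
        return float(text)
    except ValueError:
        pass
    if not text:
        return default
    try:
        # unicodedata.numeric raises ValueError for non-numeric characters.
        last = unicodedata.numeric(text[-1])
        head = text[:-1].strip()
        if not head:
            # A single numeric character, e.g. "½" or "¾".
            return last
        if 0 < last < 1:
            # A mixed number, e.g. "1½": whole part plus trailing fraction.
            whole = float(head)
            return whole + last if whole >= 0 else whole - last
    except ValueError:
        pass
    # Anything else is reported as unparseable via the default value.
    return default
```

Returning a default instead of raising keeps one odd cell value from aborting the whole parse; whether a silent default is acceptable depends on how the value is used downstream.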

Ianpwest commented 3 months ago

I get the same failure on any Unicode character in the text. It would be nice if it logged a warning and continued instead of aborting.
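
A rough sketch of that warn-and-continue behaviour, assuming rendering can be done one block at a time. `render_leniently`, `render_one`, and the sample `box` function are hypothetical names for illustration, not nlm-ingestor APIs. Only the `"left"` key comes from the traceback in this issue; the `"width"` key is made up for the example.

```python
import logging

logger = logging.getLogger(__name__)


def render_leniently(blocks, render_one):
    """Render each block, logging and skipping the ones that fail.

    Returns the rendered results and the indices of skipped blocks.
    """
    rendered, skipped = [], []
    for index, block in enumerate(blocks):
        try:
            rendered.append(render_one(block))
        except (KeyError, ValueError) as exc:
            # A missing key (e.g. 'left') or an unparseable number no
            # longer aborts the whole document; it is reported instead.
            logger.warning("skipping block %d: %r", index, exc)
            skipped.append(index)
    return rendered, skipped


def box(block):
    # Hypothetical per-block renderer used only for this example.
    return (block["left"], float(block["width"]))


sample_blocks = [
    {"left": 10, "width": "20"},  # renders normally
    {"width": "20"},              # no 'left' key, raises KeyError
    {"left": 5, "width": "½"},    # float('½') raises ValueError
]
rendered, skipped = render_leniently(sample_blocks, box)
```

The skipped indices are returned so a caller can decide whether a partially rendered document is still usable.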

Ianpwest commented 3 months ago

I have also noticed that the same PDF with the Unicode characters parses successfully against the hosted endpoint:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"

but fails with the latest Docker image.