I've been playing around with LLMSherpa and the ingestor but am stuck on setting up the HTML parser.
I was able to send the request by adapting the LayoutPDFReader snippets for parsing a PDF file, as shown below:
import json
import os

import urllib3
from llmsherpa.readers import Document

# Assumed setup: a urllib3 connection pool, as LayoutPDFReader creates internally
api_connection = urllib3.PoolManager()


def parse_pdf(pdf_file):
    # Upload the (file name, bytes, MIME type) tuple as a multipart form field
    parser_response = api_connection.request("POST", "http://localhost:5010/api/parseDocument?renderFormat=all", fields={'file': pdf_file})
    return parser_response


def read_html(path_or_url, contents=None):
    """
    Reads an HTML file from a local path
    Parameters
    ----------
    path_or_url: str
        path to the HTML file e.g. /home/user/myfile.html
    contents: bytes
        contents of the HTML file. If contents is given, the file at path_or_url is not read. This is useful when you already have the file contents in memory such as if you are using streamlit or flask.
    """
    file_name = os.path.basename(path_or_url)
    if contents is not None:
        file_data = contents
    else:
        with open(path_or_url, "rb") as f:
            file_data = f.read()
    # Send the file with an HTML MIME type so the ingestor picks its HTML path
    pdf_file = (file_name, file_data, 'text/html')
    parser_response = parse_pdf(pdf_file)
    response_json = json.loads(parser_response.data.decode("utf-8"))
    blocks = response_json['return_dict']['result']['blocks']
    return Document(blocks)
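Because the snippet above indexes straight into the JSON body, a failed request is easier to diagnose if the HTTP status is checked first. The following is only a minimal sketch: `extract_blocks` is a hypothetical helper (not part of LLMSherpa), and it assumes the response exposes `status` and `data` the way urllib3 responses do.

```python
import json


def extract_blocks(status: int, body: bytes) -> list:
    """Return the parsed blocks, or raise with the server's message on failure.

    Hypothetical helper for illustration; not part of LLMSherpa or nlm-ingestor.
    """
    text = body.decode("utf-8", errors="replace")
    if status != 200:
        # Surface the server-side failure instead of a confusing JSON or KeyError
        raise RuntimeError(f"ingestor returned HTTP {status}: {text[:500]}")
    payload = json.loads(text)
    return payload["return_dict"]["result"]["blocks"]
```

In `read_html` this could be called as `blocks = extract_blocks(parser_response.status, parser_response.data)`, so an HTTP 500 shows up as a clear client-side error.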
The server then fails with this error:
error uploading file, stacktrace: Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 47, in ingest_document
htmli = html_ingestor.HTMLIngestor(doc_location)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/html_ingestor.py", line 32, in __init__
self.json_dict = br.render_json()
^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 259, in render_json
block["box_style"][1],
~~~~~^^^^^^^^^^^^^
KeyError: 'box_style'
Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
ingest_status, return_dict = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 47, in ingest_document
htmli = html_ingestor.HTMLIngestor(doc_location)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/html_ingestor.py", line 32, in __init__
self.json_dict = br.render_json()
^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 259, in render_json
block["box_style"][1],
~~~~~^^^^^^^^^^^^^
KeyError: 'box_style'
172.17.0.1 - - [26/Jan/2024 19:13:34] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 500
My sample HTML file is copied and pasted from the page source of https://github.com/nlmatics/nlm-ingestor. Any thoughts? Is the HTML parser only meant for the HTML that Tika generates from DOCX and PPTX files? Thanks!