nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

How to use HTML parser? #6

Closed by ghost 6 months ago

ghost commented 6 months ago

I've been playing around with LLMSherpa and the ingestor but am stuck on setting up the HTML parser.

I was able to send the request by adapting the LayoutPDFReader code for parsing a PDF file, as shown below:

import json
import os

import urllib3
from llmsherpa.readers import Document

# Assumed to be a urllib3.PoolManager, as in LayoutPDFReader
api_connection = urllib3.PoolManager()

def parse_html(html_file):
    # html_file is a (file name, data, content type) tuple for the multipart upload
    parser_response = api_connection.request("POST", "http://localhost:5010/api/parseDocument?renderFormat=all", fields={'file': html_file})
    return parser_response

def read_html(path, contents=None):
    """
    Reads an HTML file from a local path

    Parameters
    ----------
    path: str
        path to the HTML file e.g. /home/user/myfile.html
    contents: bytes
        contents of the HTML file. If contents is given, the file at path is not read and path only supplies the file name. This is useful when you already have the file contents in memory such as if you are using streamlit or flask.
    """
    file_name = os.path.basename(path)
    if contents is None:
        with open(path, "rb") as f:
            contents = f.read()
    html_file = (file_name, contents, 'text/html')
    parser_response = parse_html(html_file)
    response_json = json.loads(parser_response.data.decode("utf-8"))
    blocks = response_json['return_dict']['result']['blocks']
    return Document(blocks)
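
The snippet above decodes the response without looking at the HTTP status, so a server-side failure only shows up later as a JSON or key error on the client. A minimal sketch of checking the status first is below. The helper name `blocks_from_response` is made up for illustration, and the shape of the server's error body is an assumption, so the message just echoes the first bytes of whatever came back.

```python
import json


def blocks_from_response(status, body):
    """Return the parsed blocks, or raise if the server reported an error.

    `status` is the HTTP status code and `body` the raw response bytes,
    e.g. `parser_response.status` and `parser_response.data` from urllib3.
    """
    if status != 200:
        # The error body format is not known here, so only a short prefix is shown.
        raise RuntimeError(f"parseDocument failed with HTTP {status}: {body[:200]!r}")
    payload = json.loads(body.decode("utf-8"))
    return payload["return_dict"]["result"]["blocks"]
```

With this in place, the last lines of the reader function could become `blocks = blocks_from_response(parser_response.status, parser_response.data)` followed by `return Document(blocks)`.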

The server then fails with this error:

 error uploading file, stacktrace: Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 47, in ingest_document
    htmli = html_ingestor.HTMLIngestor(doc_location)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/html_ingestor.py", line 32, in __init__
    self.json_dict = br.render_json()
                     ^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 259, in render_json
    block["box_style"][1],
    ~~~~~^^^^^^^^^^^^^
KeyError: 'box_style'
172.17.0.1 - - [26/Jan/2024 19:13:34] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 500
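
For readers unfamiliar with this failure mode: the traceback ends at a direct `block["box_style"]` lookup, which raises `KeyError` whenever a block has no such key. The sketch below reproduces that with made-up block dicts and shows one defensive alternative. The dict contents and the helper `box_style_item` are hypothetical; the real block structure in `block_renderer.py` and the actual fix shipped in 0.1.5 are not shown in this thread.

```python
# Hypothetical blocks: one carries layout info, one (as from plain HTML) does not.
block_with_style = {"block_text": "Hello", "box_style": [10, 20, 30, 40]}
block_without_style = {"block_text": "Hello"}


def box_style_item(block, index, default=None):
    """Return block["box_style"][index], or `default` when the key is absent."""
    box_style = block.get("box_style")
    if box_style is None:
        return default
    return box_style[index]


def direct_lookup_fails(block):
    """Return True if the direct lookup from the traceback raises KeyError."""
    try:
        block["box_style"][1]
    except KeyError:
        return True
    return False
```

Whether a default value is the right behavior depends on what the renderer does with the result; this only illustrates why the direct lookup fails on blocks without layout information.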

My sample HTML file is copied from the page source of https://github.com/nlmatics/nlm-ingestor. Any thoughts? Is the HTML parser only meant for the HTML that Tika generates from DOCX and PPTX files? Thanks!

ansukla commented 6 months ago

This is a bug in the code. Thanks for reporting - let me take a look.

ansukla commented 6 months ago

This is resolved. 0.1.5 is building right now, please pull in about 10 mins.

ghost commented 6 months ago

That fixed it, thanks for the quick response!