nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

UnicodeEncodeError when trying to save as HTML #7

Open RadekOnCrypto opened 6 months ago

RadekOnCrypto commented 6 months ago

I am using the Docker image and the simple code below to parse the test.pdf file (An overlooked danger of ketogenic diets: Making the case that ketone bodies induce vascular damage by the same mechanisms as glucose):

from llmsherpa.readers import LayoutPDFReader
from rich import print

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all"
pdf_url = "./test.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

# It works
print(doc.to_text())

# It breaks with `UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 966: character maps to <undefined>`
with open("./test.html", "w") as f:
    f.write(doc.to_html())

Unfortunately, I am getting an error:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 966: character maps to <undefined>

Surprisingly, doc.to_text() works normally.

I am on Windows 11, Python 3.12.1. I am attaching the test.pdf.
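
A possible workaround, sketched under the assumption that the failure comes from open() falling back to the Windows locale encoding (typically cp1252, a 'charmap' codec) rather than from llmsherpa itself: pass encoding="utf-8" when opening the output file. The snippet below does not call the parser; the html string is a hypothetical stand-in for doc.to_html() that contains the same character, U+FB02 (the "fl" ligature), and the temporary output path is only for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for doc.to_html(); the real output contained U+FB02.
html = "<p>in\ufb02ammation</p>"

# cp1252 (a common Windows default) has no mapping for U+FB02,
# so encoding with it raises UnicodeEncodeError.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# An explicit encoding makes the write independent of the OS locale.
out_path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(html)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

In the original script this would amount to changing the call to open("./test.html", "w", encoding="utf-8"). If the assumption holds, it would also explain why the doc.to_text() line works: there the text is printed to the console and never passes through a file opened with the locale encoding.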