from llmsherpa.readers import LayoutPDFReader
from rich import print
llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all"
pdf_url = "./test.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
# It works
print(doc.to_text())
# It breaks with `UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 966: character maps to <undefined>`
with open("./test.html", "w") as f:
    f.write(doc.to_html())
Unfortunately, I am getting an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 966: character maps to <undefined>
but surprisingly, the doc.to_text() works normally.
I am on Windows 11, Python 3.12.1. I am attaching the test.pdf.
For context, I am using the Docker image and the simple code above to parse the test.pdf file (An overlooked danger of ketogenic diets: Making the case that ketone bodies induce vascular damage by the same mechanisms as glucose).
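
A possible workaround, sketched under the assumption that the failure comes from `open()` falling back to the Windows locale encoding (the `'charmap'` codec, typically cp1252) rather than from llmsherpa itself. The character in the traceback, `'\ufb02'`, is the "fl" ligature, which cp1252 cannot represent. That would also explain why printing `doc.to_text()` works: console output presumably does not go through that codec. The names `save_html` and `cp1252_can_encode` are hypothetical helpers written for illustration; they are not part of llmsherpa.

```python
LIGATURE = "\ufb02"  # the "fl" ligature named in the traceback


def cp1252_can_encode(text: str) -> bool:
    """Return True if every character in `text` exists in cp1252."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def save_html(path: str, html: str) -> None:
    """Write `html` to `path` as UTF-8, ignoring the platform default encoding."""
    # Passing encoding explicitly avoids the locale-dependent default
    # that open() uses in text mode.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```

With these in place, `save_html("./test.html", doc.to_html())` would replace the failing `with open(...)` block. Setting the environment variable `PYTHONUTF8=1` (Python's UTF-8 mode) should have a similar effect without changing the code, if that is preferable.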