nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Bug in API function: Repeated content in HTML when trying to convert PDF to HTML. #56

Open PrashantK-DS opened 7 months ago

PrashantK-DS commented 7 months ago

First of all, thank you for the great work you have done on converting PDFs to well-tagged HTML pages. Many thanks for this contribution.

The issue I am facing is that pages are repeated in the output when converting a PDF to HTML. To reproduce the issue, use the following code.

Use this file to get the code with its indentation intact: pdf_to_html_llmsherpa.txt

Actual code --

from llmsherpa.readers import LayoutPDFReader


def convert_pdf_to_html(pdf_file, output_html):
    try:
        llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
        pdf_reader = LayoutPDFReader(llmsherpa_api_url)
        doc = pdf_reader.read_pdf(pdf_file)
        print(doc.to_html())

        # Write the generated HTML to a file
        with open(output_html, 'w', encoding='utf-8') as html_file:
            html_file.write(f'{doc.to_html()}')

        print(f"Conversion successful. HTML file saved to {output_html}")
    except Exception as e:
        print(f"Error during conversion: {e}")


pdf_file_path = 'pdf_upload/AbanPearlPteLtd310322.pdf'
output_html_path = 'pdf_upload/AbanPearlPteLtd310322_modified_2.html'

convert_pdf_to_html(pdf_file_path, output_html_path)

AbanPearlPteLtd310322.pdf
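
To make the repetition easier to confirm, here is a rough sketch of a helper that counts text blocks appearing more than once in an HTML string. It uses only the Python standard library. The names `TextBlockCollector`, `find_repeated_blocks`, and `min_length` are hypothetical, not part of llmsherpa, and the sketch assumes the generated HTML uses ordinary tags such as `p`, `li`, `h1` to `h6`, `td`, and `th`.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlockCollector(HTMLParser):
    """Collects the text found inside common block-level tags."""

    # Assumption: these are the tags worth checking; adjust as needed.
    BLOCK_TAGS = {"p", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._inside = False

    def _flush(self):
        # Collapse whitespace so formatting differences do not hide repeats.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()
            self._inside = True

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


def find_repeated_blocks(html, min_length=40):
    """Return {text: count} for blocks of at least min_length characters seen more than once."""
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(b for b in collector.blocks if len(b) >= min_length)
    return {text: n for text, n in counts.items() if n > 1}
```

Calling `find_repeated_blocks(doc.to_html())` on the `doc` object from the code above should list the duplicated passages together with how often each one occurs. The `min_length` cutoff is only there to skip short strings such as page numbers or table cells that repeat legitimately.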