Open PrashantK-DS opened 7 months ago
First of all, thank you for the great work on converting PDFs into well-tagged HTML pages. Many thanks for this contribution.
The issue I am facing is that I get repeated pages when converting a PDF to HTML. To reproduce the issue, use the following code.
Use this file to get code with indentations. pdf_to_html_llmsherpa.txt
Actual code --
from llmsherpa.readers import LayoutPDFReader


def convert_pdf_to_html(pdf_file, output_html):
    try:
        llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
        pdf_reader = LayoutPDFReader(llmsherpa_api_url)
        doc = pdf_reader.read_pdf(pdf_file)
        print(doc.to_html())

        # Write to an HTML file
        with open(output_html, 'w', encoding='utf-8') as html_file:
            html_file.write(f'{doc.to_html()}')
        print(f"Conversion successful. HTML file saved to {output_html}")
    except Exception as e:
        print(f"Error during conversion: {e}")


pdf_file_path = 'pdf_upload/AbanPearlPteLtd310322.pdf'
output_html_path = 'pdf_upload/AbanPearlPteLtd310322_modified_2.html'

convert_pdf_to_html(pdf_file_path, output_html_path)
AbanPearlPteLtd310322.pdf
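To help confirm the repetition in the saved output, here is a minimal sketch that lists text blocks appearing more than once in the generated HTML. It uses only the Python standard library. The names `BlockTextCollector` and `find_repeated_blocks` are hypothetical helpers, not part of llmsherpa, and the sketch assumes the output uses ordinary block tags such as `p`, `li`, `td`, and `h1` to `h6`.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockTextCollector(HTMLParser):
    """Collect the text of each block-level element (hypothetical helper)."""

    # Assumption: the generated HTML uses these common block tags.
    BLOCK_TAGS = {"p", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None while outside a block element

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._buffer = []  # start (or restart) collecting text

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._buffer is not None:
            # Normalize whitespace so identical blocks compare equal.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append(text)
            self._buffer = None


def find_repeated_blocks(html, min_length=40):
    """Return {block_text: count} for blocks that occur more than once.

    Blocks shorter than min_length are ignored, since short strings
    (page numbers, labels) often repeat legitimately.
    """
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(b for b in collector.blocks if len(b) >= min_length)
    return {text: n for text, n in counts.items() if n > 1}
```

Passing the contents of the file written to `output_html_path` to `find_repeated_blocks` should return the duplicated passages together with their counts, which may make it easier to tell whether whole pages or only some sections are repeated. The `min_length` threshold of 40 characters is an arbitrary choice and can be adjusted.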