pqzx / html2docx

Convert html to docx
MIT License

Converting block level elements like <div> into paragraphs #44

Open Pikamander2 opened 2 years ago

Pikamander2 commented 2 years ago

I did a quick test to see how the parser would handle different tags, but the results weren't great.

This code:

from docx import Document
from htmldocx import HtmlToDocx

document = Document()
new_parser = HtmlToDocx()

html = '<h1>Test file</h1><p>Test paragraph 1</p><p>Test paragraph 2</p><div>Test div 1</div><div><span>Test div 2</span></div>'

new_parser.add_html_to_document(html, document)

document.save('test1.docx')

Results in this document:

[image: screenshot of the resulting document]

The <p> tags were converted properly, but the <div> tags were treated as inline text rather than as paragraphs.

I'm guessing that most other block-level elements, like <section> and <main>, have the same issue.
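Until the parser handles these tags itself, one possible workaround is to rewrite the HTML before passing it in. The sketch below is not part of htmldocx; it uses only the standard library's html.parser, and the names DivToParagraph, divs_to_paragraphs, BLOCK_CONTAINERS and OTHER_BLOCKS are made up for illustration. It assumes that turning a container with no block-level children into a <p> is acceptable, and it leaves containers that wrap other blocks untouched.

```python
from html.parser import HTMLParser

# Generic block containers to rewrite (an assumption; extend as needed).
BLOCK_CONTAINERS = {"div", "section", "main", "article", "header", "footer", "aside"}
# Other block-level tags; a container holding one of these is left as-is.
OTHER_BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6",
                "ul", "ol", "li", "table", "pre", "blockquote"}


class DivToParagraph(HTMLParser):
    """Re-emits HTML, turning leaf block containers into <p> elements."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be echoed verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # one [index_in_out, has_block_child, tag] per open container

    def handle_starttag(self, tag, attrs):
        is_block = tag in BLOCK_CONTAINERS or tag in OTHER_BLOCKS
        if is_block and self.stack:
            self.stack[-1][1] = True  # innermost open container has a block child
        if tag in BLOCK_CONTAINERS:
            self.stack.append([len(self.out), False, tag])
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCK_CONTAINERS and self.stack and self.stack[-1][2] == tag:
            index, has_block_child, _ = self.stack.pop()
            if not has_block_child:
                # Swap the tag name for "p" but keep any attributes.
                start = self.out[index]
                self.out[index] = "<p" + start[len(tag) + 1:]
                self.out.append("</p>")
                return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def divs_to_paragraphs(html):
    """Return html with leaf <div>-like containers rewritten as <p>."""
    parser = DivToParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With this in place, the test above would call new_parser.add_html_to_document(divs_to_paragraphs(html), document) instead of passing html directly. Since the <p> tags in the test were converted properly, the rewritten divs should take the same path, though I haven't checked how the parser treats a <div> that still wraps other blocks.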