pqzx / html2docx

Convert html to docx
MIT License

Suggestion for HTML class to Word style mapping #35

Open djplaner opened 2 years ago

djplaner commented 2 years ago

I have HTML that uses a variety of classes. When converting the HTML to .docx, I needed to map those classes to specific Word styles, and I couldn't see an existing way to do it.

The code below demonstrates how my solution works. Happy to do a pull request, but I wanted to get some feedback on the approach and gauge interest/value before I did that. Thoughts?


from docx import Document

from htmldocx import HtmlToDocx

# example html using classes

html = """
<title>Dev exploration</title>

<p>
<span class="canvasFileLink">Hello</span>
</p>

<h1 class="canvasFile">Canvas File</h1>

<p class="canvasFile">COM31 Study Guide-Week 1.pdf</p>

<p class="intenseQuote">This should be a quote.</p>

"""

# Define the mapping from HTML class to Word style
# Each HTML tag has a dict keyed on HTML class where
# the value is the Word style name

STYLE_MAP = {
    "h1" : {
        "canvasFile" : 'Canvas File',
        "canvasSubHeader" : 'Canvas SubHeader',
        "canvasDiscussion" : 'Canvas Discussion',
        "canvasQuiz" : 'Canvas Quiz',
        "canvasAssignment" : 'Canvas Assignment',
        "canvasExternalTool" : 'Canvas External Tool',
        "canvasExternalUrl" : 'Canvas External Url',
    },
    "p" : {
        "embed": 'Embed',
        "hide" : "Hide",
        "canvasFileLink": 'Canvas File Link'
    },
    "span" : {
        "embed": 'Embed',
        "hide" : "Hide",
        "canvasFileLink": 'Canvas File Link'
    }
}

# Start with a blank Word doc that defines the Word styles
# named in STYLE_MAP above
document = Document('template.docx')

# create the parser and point to the style map
new_parser = HtmlToDocx()
new_parser.style_map = STYLE_MAP

new_parser.add_html_to_document(html, document)

document.save('sample.docx')

pqzx commented 2 years ago

Hi @djplaner , thanks for the suggestion.

I'm not sure I really understand how this works. Are Word styles built-ins in Word, or are they custom? In your example, how are the mappings in STYLE_MAP used? What result would we get in sample.docx?

I'm not sure how much interest there is in this, but happy for you to add it if there is value to you, as long as this feature is optional.

djplaner commented 2 years ago

Sorry for the delay in following up...work, life etc.

The Word styles - at least in my case - are custom styles. i.e. template.docx contains defined Word styles for Canvas File Link, Embed etc. These are meaningful in my context.
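
Because the mapping only works if the template really defines each named style, a quick sanity check can be useful. The following is a rough sketch and not part of htmldocx: `missing_styles` is a hypothetical helper, and the available style names are hard-coded here for illustration. With python-docx they could instead be collected from `document.styles`, where each style has a `name`.

```python
def missing_styles(style_map, available_style_names):
    """Return the Word style names used in style_map that are not available.

    style_map has the same shape as STYLE_MAP:
    tag name -> {HTML class -> Word style name}.
    """
    # Collect every Word style name referenced anywhere in the map
    wanted = {
        style_name
        for classes in style_map.values()
        for style_name in classes.values()
    }
    return sorted(wanted - set(available_style_names))


# Hypothetical data: the template defines 'Embed' but not 'Hide'
example_map = {"p": {"embed": "Embed", "hide": "Hide"}}
print(missing_styles(example_map, ["Normal", "Embed"]))  # → ['Hide']
```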

The HTML being converted to Word contains elements that carry the matching classes. The STYLE_MAP dict maps specific HTML tag and class combinations to specific Word styles.

The keys are based on CSS selectors, e.g. a <span class="embed"> will be mapped to a Word style called Embed. An <h1 class="canvasAssignment"> gets mapped to the Word style Canvas Assignment.

Implementation - definitely optional

There's a new function, checkStyleMap, that is called at the relevant places and checks the provided dict. If there's a match for the current tag and class, it returns the corresponding style, which is then applied. Otherwise, nothing happens.

Hence, if you choose not to define a style map, everything proceeds as normal.
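
To make the lookup concrete, here is a minimal sketch of the idea. It is not the actual checkStyleMap code: `check_style_map` is a hypothetical stand-in, and its handling of several space-separated classes (first match wins) is an assumption.

```python
from typing import Optional

# Small map in the same shape as STYLE_MAP:
# tag name -> {HTML class -> Word style name}
STYLE_MAP = {
    "h1": {"canvasAssignment": "Canvas Assignment"},
    "span": {"embed": "Embed"},
}


def check_style_map(style_map, tag, class_attr) -> Optional[str]:
    """Return the Word style mapped to this tag and class, or None."""
    # No style map, or no class attribute: nothing to apply
    if not style_map or not class_attr:
        return None
    classes_for_tag = style_map.get(tag, {})
    # Assumption: an element may carry several classes; first match wins
    for css_class in class_attr.split():
        if css_class in classes_for_tag:
            return classes_for_tag[css_class]
    return None


print(check_style_map(STYLE_MAP, "span", "embed"))           # → Embed
print(check_style_map(STYLE_MAP, "h1", "canvasAssignment"))  # → Canvas Assignment
print(check_style_map(STYLE_MAP, "p", "embed"))              # → None
print(check_style_map(None, "span", "embed"))                # → None
```

Returning None for "no match" and for "no map defined" is what keeps the feature optional: the caller only overrides the default style when a name comes back.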