mwilliamson / mammoth.js

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Feature request: Add classes to text blocks instead of discarding style information entirely #57

Open mdorazio opened 9 years ago

mdorazio commented 9 years ago

I know this goes somewhat against the philosophy of mammoth, but in my testing with users I've had many complaints about the loss of font color, typeface, and size when importing documents. It seems the way to address this while preserving compliant HTML would be to add classes for these attributes that correspond to the underlying Word markup.

For example, this block of XML from Word describes red, 16-point text in the standard typeface (w:sz is measured in half-points, so a value of 32 means 16pt).

<w:p w:rsidR="00A0306C" w:rsidRDefault="001545AB">
    <w:pPr>
        <w:rPr>
            <w:color w:val="FF0000"/>
            <w:sz w:val="32"/>
            <w:szCs w:val="32"/>
        </w:rPr>
    </w:pPr>
    <w:r>
        <w:rPr>
            <w:color w:val="FF0000"/>
            <w:sz w:val="32"/>
            <w:szCs w:val="32"/>
        </w:rPr>
        <w:t>Size 16 red!</w:t>
    </w:r>
</w:p>

It would be great if these properties could be converted to classes, perhaps by concatenating each property name with its value, like <p class="sz32 colorFF0000">Size 16 red!</p>.

Then it's up to the developer/designer to decide how to handle the classes in the HTML (if at all).
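
To make the naming scheme concrete, here is a minimal sketch of the concatenation step. The helper name `classNamesFromRunProperties` is hypothetical and is not part of mammoth; the property names and values are taken from the `<w:rPr>` element in the example above.

```javascript
// Hypothetical helper (not a mammoth API): builds a class attribute value by
// concatenating each run property name with its value.
function classNamesFromRunProperties(properties) {
    return Object.keys(properties)
        .map(function(name) {
            return name + properties[name];
        })
        .join(" ");
}

// Property names and values copied from the <w:rPr> element in the example.
var className = classNamesFromRunProperties({sz: "32", color: "FF0000"});

// prints <p class="sz32 colorFF0000">Size 16 red!</p>
console.log('<p class="' + className + '">Size 16 red!</p>');
```

A stylesheet could then target these classes however the designer likes, for example by mapping `.sz32` to a 16pt font size, or ignore them entirely.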

I could attempt this myself if someone could point me in the right direction, but unfortunately large segments of mammoth's codebase are beyond my current level of understanding.

mwilliamson commented 9 years ago

The best way of dealing with this in Mammoth is to use document transforms, which allow you to add styles to paragraphs and runs based on their properties. I think the only thing stopping you from doing so is that the docx reader ignores some of the properties you want to use, such as colour and size, so taking a look at how document.xml is parsed is probably a good place to start.
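
As a rough illustration of that approach, a run transform might look something like the sketch below. It assumes the run element carries a `fontSize` property holding the raw `w:sz` value, which is exactly the kind of property the docx reader would first have to be changed to read, so treat the property name and its units as assumptions. The transform is a plain function, so it can be tried on a hand-built object without loading mammoth.

```javascript
// ASSUMPTION: `fontSize` is a hypothetical run property holding the raw
// w:sz value (half-points). The docx reader would need to populate it.
function transformRun(run) {
    if (run.type === "run" && run.fontSize && !run.styleId) {
        var name = "sz" + run.fontSize;
        // Return a copy with a synthetic style so a style map can match it.
        return Object.assign({}, run, {styleId: name, styleName: name});
    }
    // Leave runs that already have a style, or no size, untouched.
    return run;
}

// Hand-built stand-in for the element a transform would receive.
var exampleRun = {type: "run", children: [], fontSize: 32};

// prints sz32
console.log(transformRun(exampleRun).styleName);
```

In mammoth itself this would presumably be passed as `transformDocument: mammoth.transforms.run(transformRun)`, together with a style map entry along the lines of `r[style-name='sz32'] => span.sz32`. That wiring is not tested here.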

satchelspencer commented 8 years ago

One potential solution is to expose the XML properties that the docx reader currently ignores, so that they can be used to apply custom styles as needed. This opens up a lot of possibilities, such as auto-detecting style types in documents whose XML contains no style names. See pull request #75.
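
To illustrate the idea (this is not how mammoth or the pull request implements it), the sketch below collects the raw `w:val` properties of a run and uses the font size to guess whether the text is a heading. The function `readRunProperties`, the regex-based parsing, and the 16pt threshold are all assumptions made for the example; real code would use a proper XML parser.

```javascript
// Hypothetical helper: collects every <w:name w:val="..."/> child of the
// first <w:rPr> element into a plain object. Regex parsing is only adequate
// for a small, well-formed snippet like the one below.
function readRunProperties(xml) {
    var properties = {};
    var rPr = /<w:rPr>([\s\S]*?)<\/w:rPr>/.exec(xml);
    if (rPr === null) {
        return properties;
    }
    var pattern = /<w:(\w+)\s+w:val="([^"]*)"\s*\/>/g;
    var match;
    while ((match = pattern.exec(rPr[1])) !== null) {
        properties[match[1]] = match[2];
    }
    return properties;
}

// The run from the example earlier in this thread, on one line.
var runXml = '<w:r><w:rPr><w:color w:val="FF0000"/><w:sz w:val="32"/>' +
    '<w:szCs w:val="32"/></w:rPr><w:t>Size 16 red!</w:t></w:r>';
var runProperties = readRunProperties(runXml);

// w:sz is in half-points; the 16pt heading threshold is an arbitrary choice.
var looksLikeHeading = Number(runProperties.sz) / 2 >= 16;

// prints true
console.log(looksLikeHeading);
```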

baleeds commented 5 years ago

@mwilliamson Hi Michael, do you think you will work on including size and color in the docx reader? I'd love to be able to convert headings based on font size.