plutext / docx4j-ImportXHTML

Converts XHTML to OpenXML WordML (docx) using docx4j
A lot of additional blanks will be generated #59

Open mueller-jens opened 4 years ago

mueller-jens commented 4 years ago

I am trying to convert an HTML file to docx using this library. When I do, every blank in the template is carried over as a blank in the document. I used a template like this:

       String html="    <html><body><b>Type:</b> <span style='font-size: 10.0pt; font-family: \"Arial\", \"sans-serif\"'>TEXT</span>\n" + 
            "            <br/>\n" + 
            "            <span style='font-size: 10.0pt; font-family: \"Arial\", \"sans-serif\"'>\n" + 
            "               <b> another text: </b><span>10.0</span>\n" + 
            "            </span></body></html>";

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

        XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);

        wordMLPackage.getMainDocumentPart().getContent().addAll( 
            XHTMLImporter.convert( html, null) );
        String docx = XmlUtils.marshaltoString(wordMLPackage
                .getMainDocumentPart().getJaxbElement(), true, true);

        FileOutputStream outputStream = new FileOutputStream("C:/jmu/tmp/generated.docx");
        Save saver = new Save(wordMLPackage); 
        saver.save(outputStream);

And the result looks like:

Type: TEXT             
                              another text: 10.0             

expected:

Type: TEXT
another text: 10.0

achimmihca commented 3 years ago

I had the same issue. It also occurs when using an img tag.

Looking into the generated docx, I found that the attribute xml:space="preserve" on the text elements seems to be the reason. This attribute is added in XHTMLImporterImpl.java.

I would argue for removing this hardcoded "preserve" or making it configurable, because whitespace in XML and HTML is ignored in most cases. If one really wants a space in an unusual place, one could use a non-breaking space.

Anyway, my workaround is to remove the "preserve" attribute from the generated content:

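    // Needs: import org.docx4j.wml.ContentAccessor; import org.docx4j.wml.Text;
    // Walks the content tree and clears space="preserve" on every Text element found.
    private static void removeSpacePreserveRecursive(Object obj)
    {
        if (obj instanceof Text)
        {
            var text = (Text) obj;
            if ("preserve".equals(text.getSpace()))
            {
                // null drops the attribute, so Word falls back to default whitespace handling
                text.setSpace(null);
            }
        }
        else if (obj instanceof ContentAccessor)
        {
            // Containers (body, paragraphs, runs, ...) expose their children via getContent()
            ContentAccessor contentAccessor = (ContentAccessor) obj;
            for (Object child : contentAccessor.getContent())
            {
                removeSpacePreserveRecursive(child);
            }
        }
    }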

You can call this method, for example, on wordMLPackage.getMainDocumentPart().getJaxbElement().getBody(), after the converted content has been added and before the document is saved.
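
Since whitespace in HTML is ignored in most cases, as noted above, a complementary approach might be to collapse the whitespace in the input string before handing it to the importer. The following is only a minimal sketch of that idea: the class and method names (HtmlWhitespace, collapse) are made up for illustration and are not part of docx4j, and it assumes the markup contains no whitespace-sensitive content such as pre elements. Single spaces between tags remain, so this would reduce the extra blanks rather than necessarily remove all of them.

```java
// Hypothetical helper, not part of docx4j or docx4j-ImportXHTML.
final class HtmlWhitespace {
    private HtmlWhitespace() {}

    /** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
    static String collapse(String html) {
        // Assumption: no <pre> blocks or other content where whitespace matters.
        return html.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Simplified version of the template from the original report.
        String html = "    <html><body><b>Type:</b> <span>TEXT</span>\n"
                + "            <br/>\n"
                + "            <span>\n"
                + "               <b> another text: </b><span>10.0</span>\n"
                + "            </span></body></html>";
        System.out.println(collapse(html));
    }
}
```

The collapsed string could then be passed to XHTMLImporter.convert(...) in place of the original html variable.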