A lot of additional blanks will be generated

plutext / docx4j-ImportXHTML

Converts XHTML to OpenXML WordML (docx) using docx4j


    String html = " <html><body>Type: TEXT\n"
            + " \n"
            + " \n"
            + " another text: 10.0\n"
            + " </body></html>";

    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
    XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporter.convert(html, null));

    String docx = XmlUtils.marshaltoString(
            wordMLPackage.getMainDocumentPart().getJaxbElement(), true, true);

    FileOutputStream outputStream = new FileOutputStream("C:/jmu/tmp/generated.docx");
    Save saver = new Save(wordMLPackage);
    saver.save(outputStream);

I had the same issue. It also occurs when using an img tag.

Looking into the generated docx, I found that the attribute xml:space="preserve" on the text elements seems to be the cause. This attribute is added in XHTMLImporterImpl.java.

I would argue for removing this hardcoded "preserve" or making it configurable, because whitespace in XML and HTML is ignored in most cases. Anyone who really wants a space in an unusual place could use a non-breaking space.

Anyway, my workaround is to remove the "preserve" attribute from the generated content:

    private static void removeSpacePreserveRecursive(Object obj)
    {
        if (obj instanceof Text)
        {
            var text = (Text) obj;
            if ("preserve".equals(text.getSpace()))
            {
                text.setSpace(null);
            }
        }
        else if (obj instanceof ContentAccessor)
        {
            ContentAccessor contentAccessor = (ContentAccessor) obj;
            for (Object child : contentAccessor.getContent())
            {
                removeSpacePreserveRecursive(child);
            }
        }
    }

You can call this method, for example, on wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().
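
Since whitespace runs are insignificant in most HTML, as argued above, another option might be to collapse them in the input string before it is handed to XHTMLImporter.convert. The following is only a minimal sketch of that idea using plain Java: the class WhitespaceCollapser and its collapse method are hypothetical names, not part of docx4j, and I have not verified that this removes the extra blanks in the generated docx. It also assumes the input contains no pre elements or attribute values where whitespace matters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j or docx4j-ImportXHTML.
class WhitespaceCollapser
{
    // One or more whitespace characters (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every whitespace run with a single space and trims both ends.
    // Assumption: no <pre> content or whitespace-sensitive attribute values.
    static String collapse(String xhtml)
    {
        return WHITESPACE_RUN.matcher(xhtml).replaceAll(" ").trim();
    }

    public static void main(String[] args)
    {
        String html = " <html><body>Type: TEXT\n"
                + " \n"
                + " \n"
                + " another text: 10.0\n"
                + " </body></html>";
        System.out.println(collapse(html));
    }
}
```

The collapsed string would then be passed to convert in place of the original html. Unlike the removeSpacePreserveRecursive workaround, this changes the input rather than the generated content, so it does not touch the "preserve" attribute itself.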

A lot of additional blanks will be generated #59