plutext / docx4j-ImportXHTML

Converts XHTML to OpenXML WordML (docx) using docx4j

Unable to convert html to docx file #56

Open devasakshay opened 5 years ago

devasakshay commented 5 years ago

I converted docx files to HTML, and when I try to convert the HTML back to docx, the following error is shown:

The entity "nbsp" was referenced, but not declared.. Stacktrace follows:
org.xml.sax.SAXParseException; lineNumber: 62; columnNumber: 182; The entity "nbsp" was referenced, but not declared.
    at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:485)
    at org.docx4j.org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:190)
    at org.docx4j.org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:71)
    at org.docx4j.org.xhtmlrenderer.swing.NaiveUserAgent.getXMLResource(NaiveUserAgent.java:212)
    at org.docx4j.org.xhtmlrenderer.docx.DocxRenderer.loadDocument(DocxRenderer.java:183)
    at org.docx4j.convert.in.xhtml.XHTMLImporterImpl.convert(XHTMLImporterImpl.java:492)
    at com.mdt.v1.TranslationJobsController$_convertHTMLToDocx1_closure4$$ERLLK27M.doCall(TranslationJobsController.groovy:1141)
    at com.mdt.v1.TranslationJobsController$$ERLLK27M.convertHTMLToDocx1(TranslationJobsController.groovy:1125)
    at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:189)
    at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
    at grails.plugin.springsecurity.web.filter.GrailsAnonymousAuthenticationFilter.doFilter(GrailsAnonymousAuthenticationFilter.java:53)
    at com.mdt.util.security.CustomRestAuthenticationFilter.doFilter(CustomRestAuthenticationFilter.groovy:120)
    at grails.plugin.springsecurity.web.authentication.logout.MutableLogoutFilter.doFilter(MutableLogoutFilter.java:62)
    at grails.plugin.springsecurity.rest.RestLogoutFilter.doFilter(RestLogoutFilter.groovy:80)
    at grails.plugin.springsecurity.web.SecurityRequestHolderFilter.doFilter(SecurityRequestHolderFilter.java:59)
    at com.planetj.servlet.filter.compression.CompressingFilter.doFilter(CompressingFilter.java:270)
    at com.brandseye.cors.CorsFilter.doFilter(CorsFilter.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

1.docx

plutext commented 5 years ago

See https://www.docx4java.org/forums/xhtml-import-f28/problems-while-converting-html-entities-t1197.html

devasakshay commented 5 years ago

I tried the following code:

  String unescape = org.jsoup.parser.Parser.unescapeEntities(htmlContent, true);

Now the error for &nbsp; is gone, but my text content contains a bare '&' after unescaping. Here is the new stack trace:

The entity name must immediately follow the '&' in the entity reference.
javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 28; columnNumber: 274; The entity name must immediately follow the '&' in the entity reference.
    at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:502)
    at org.docx4j.org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:190)
    at org.docx4j.org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:75)
    at org.docx4j.convert.in.xhtml.XHTMLImporterImpl.convert(XHTMLImporterImpl.java:669)
    at com.mdt.v1.TranslationJobsController$_convertHTMLToDocx_closure4$$ERLLTwqG.doCall(TranslationJobsController.groovy:1128)
    at com.mdt.v1.TranslationJobsController$$ERLLTwqG.convertHTMLToDocx(TranslationJobsController.groovy:1103)
    at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:189)
    at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
    at grails.plugin.springsecurity.web.filter.GrailsAnonymousAuthenticationFilter.doFilter(GrailsAnonymousAuthenticationFilter.java:53)
    at com.mdt.util.security.CustomRestAuthenticationFilter.doFilter(CustomRestAuthenticationFilter.groovy:120)
    at grails.plugin.springsecurity.web.authentication.logout.MutableLogoutFilter.doFilter(MutableLogoutFilter.java:62)
    at grails.plugin.springsecurity.rest.RestLogoutFilter.doFilter(RestLogoutFilter.groovy:80)
    at grails.plugin.springsecurity.web.SecurityRequestHolderFilter.doFilter(SecurityRequestHolderFilter.java:59)
    at com.planetj.servlet.filter.compression.CompressingFilter.doFilter(CompressingFilter.java:270)
    at com.brandseye.cors.CorsFilter.doFilter(CorsFilter.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
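
A likely explanation for this second error: unescapeEntities decodes every entity, so an &amp; in the text becomes a bare '&', which is not well-formed XML. A narrower workaround is to replace only &nbsp; with the numeric character reference &#160;, which XML accepts without any declaration, and leave &amp; alone. The sketch below uses only the JDK's XML parser as a stand-in for the parser docx4j uses internally; the class and method names are made up for illustration and are not part of docx4j or jsoup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of docx4j or jsoup.
final class NbspEntityFix {

    /** Replaces only the HTML-specific entity; XML escapes such as &amp; stay untouched. */
    static String nbspToNumericReference(String xhtml) {
        return xhtml.replace("&nbsp;", "&#160;");
    }

    /** Returns true if the string parses as well-formed XML with the JDK's default parser. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. undeclared entity or bare '&'
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        String original = "<p>A&nbsp;&amp;&nbsp;B</p>";
        System.out.println(isWellFormed(original));                         // undeclared nbsp
        System.out.println(isWellFormed("<p>A & B</p>"));                   // bare ampersand
        System.out.println(isWellFormed(nbspToNumericReference(original))); // numeric reference
    }
}
```

Other HTML-only entities (for example &copy;) would need the same treatment; this sketch handles &nbsp; only.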

achimmihca commented 3 years ago

For me it worked with the Unicode non-breaking space character (U+00A0). You can copy it from Wikipedia, for example.

Thus, you might be able to replace &nbsp; in your XHTML string with this character. MS Word may then render it correctly.
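
A minimal sketch of that replacement, assuming the XHTML is held in a Java String. The class and method names are invented for illustration; the JDK parser is used only to check that the result still parses and that other escapes such as &amp; survive. Writing the character as the escape \u00A0 in source avoids copy-and-paste mistakes, and encoding the bytes as UTF-8 keeps it intact if the XHTML is later passed on as a stream.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of docx4j.
final class NbspLiteral {

    /** Unicode NO-BREAK SPACE, the character that &nbsp; stands for in HTML. */
    static final char NBSP = '\u00A0';

    /** Replaces the undeclared entity with the literal character. */
    static String nbspToLiteral(String xhtml) {
        return xhtml.replace("&nbsp;", String.valueOf(NBSP));
    }

    /** Parses UTF-8 encoded XML and returns the text content of the root element. */
    static String parsedText(String xhtml) {
        try {
            byte[] utf8 = xhtml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf8));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String fixed = nbspToLiteral("<p>a&nbsp;b &amp; c</p>");
        // The parsed text keeps the non-breaking space and decodes &amp; to '&'.
        System.out.println(parsedText(fixed).indexOf(NBSP));
    }
}
```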