`w2a` uses LibreOffice to export Word to HTML, change to use LibreOffice XSLT directly

ronaldtse commented 5 years ago

I've extracted out LibreOffice's Word related XSLTs here:

https://github.com/metanorma/ooo-word-xslt

This task is to utilize these xslt files to directly transform Word -> HTML, instead of needing to install LibreOffice.

ronaldtse commented 5 years ago

This is for @w00lf . Thanks!

ronaldtse commented 5 years ago

@w00lf could you please help make this work? Thanks!

w00lf commented 5 years ago

> @w00lf could you please help make this work? Thanks!

Sure, let me look more closely at what can be used here.

ronaldtse commented 4 years ago

@w00lf I think you can use Nokogiri to run XSLT (which runs libxslt underneath) to transform Word to HTML

e.g. http://craftingruby.com/posts/2014/01/14/transforming-xml-in-ruby-with-xslt-and-nokogiri.html

opoudjis commented 4 years ago

Yes, and if you want an example of our existing stack doing this, see the use of XSLT in the html2doc gem: https://github.com/metanorma/html2doc/blob/master/lib/html2doc/math.rb

opoudjis commented 4 years ago

I've unassigned myself, but of course please let me know how this goes.

ronaldtse commented 4 years ago

Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.

w00lf commented 4 years ago

> Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.

Hi there, @ronaldtse. I tried to use the LibreOffice XSLTs (https://github.com/metanorma/ooo-word-xslt) with Nokogiri. Unfortunately, I cannot get them to work with the test docx document. For example, ./wordml2ooo/wordml2ooo_text.xsl produces plain text from the document, without any formatting:

<?xml version="1.0"?>
HelloH20i=1n&#x3B2;2i

As I understand it, the main entry point for these XSLTs is the wordml2ooo/wordml2ooo.xsl file, as it includes all the other stylesheets, but for me it just produces an empty file containing only the XML declaration:

<?xml version="1.0"?>

Why did you choose these particular XSLT files? Is there any documentation for their structure? Maybe the input docx XML files need to be linked properly before being transformed with these stylesheets?

This is the code I am using for the transformation:

require 'nokogiri'
# Parse the main document part extracted from the .docx archive
document = Nokogiri::XML(File.read('./word/document.xml'))
# Compile the entry-point stylesheet and apply it
template = Nokogiri::XSLT(File.read('./wordml2ooo/wordml2ooo.xsl'))
transformed_document = template.transform(document)
File.write('output.html', transformed_document.to_s)

ronaldtse commented 4 years ago

@w00lf the source of the XSLT files is given at: https://github.com/metanorma/ooo-word-xslt#history

Unfortunately there doesn't seem to be any documentation on how to use the XSLTs.

By searching the source repo (https://github.com/LibreOffice/core/search?q=wordml2ooo&unscoped_q=wordml2ooo), the only place it is used is here: https://github.com/LibreOffice/core/blob/330df37c7e2af0564bcd2de1f171bed4befcc074/filter/source/config/fragments/filters/MS_Word_2003_XML.xcu#L22

The code points to XMLOasisImporter and XMLOasisExporter, which are the components used to import/export OOO.

A search of XMLOasisImporter provides this: https://github.com/search?p=4&q=org%3ALibreOffice+XMLOasisImporter&type=Code

A Google search indicates that it is a class available for developers to use. The class doesn't seem to do anything special in this process except run XSLTs.

opoudjis commented 4 years ago

@ronaldtse,

BZZZZT

Those are the wrong XSLTs. They are the transforms between Microsoft OOXML and Open Office.

It looks like what you want is at: https://github.com/LibreOffice/core/tree/master/filter/source/xslt/odf2xhtml/export/xhtml

But it also looks like that only converts from OpenOffice to XHTML, so in fact you need two stages:

Microsoft OOXML > OpenOffice XML > XHTML

So you still need to get the first stage working with wordml2ooo.xsl

ronaldtse commented 4 years ago

@opoudjis yes that's what I was thinking this morning.

Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.

w00lf commented 4 years ago

Hi there. After some testing I was able to determine why these particular XSLT files do not work for the test docx I was using. I noticed that the entry XSLT file https://github.com/metanorma/ooo-word-xslt/blob/master/wordml2ooo/wordml2ooo.xsl#L37 uses w:wordDocument as the entry point for a document. There is no such tag in the docx file word/document.xml. The RTF file (docx extension) uses w:document as its root tag.

I checked random tags from the test document, and it turns out that the wordml2ooo XSLT does not cover a number of them. For example, there are no transform rules for the tags sSubSup, w:document, oMathPara, ctrlPr. I searched for the w:wordDocument signature, and this is its description: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats. It seems that these particular XSLTs are for other types of documents, not docx (RTF) itself.

I searched the core LibreOffice repo for mentions of RTF tags and found this file: https://github.com/LibreOffice/core/blob/1c5465ef1158ebf0f3f64e3343c2ed610024e5a8/writerfilter/source/rtftok/rtfcontrolwords.cxx. This file has all the tags from the test docx file, and it seems that this is the file LibreOffice uses to convert RTF files, with no XSLT files involved at all. Here is another file: https://github.com/LibreOffice/core/blob/93eeaf0ad902214fb6b4205606b24046a458ee45/starmath/source/rtfexport.cxx.

So, obviously, we cannot use that file on its own, and we will need to find another way to do this if we don't want to use LibreOffice anymore. I can look for other gems that can work with docx. What do you think?

ronaldtse commented 4 years ago

Interestingly, I think you're right! It seems that the WordML in these XSLT files is the 2003 version of WordprocessingML, not the 2007 version.

So with some Googling I found these XSLTs that are docx2foo where foo is the format:

Maybe @Intelligent2013 is more familiar with working with XSLT? Probably docx2json will provide a very good entry?

Intelligent2013 commented 4 years ago

@ronaldtse Microsoft Word supports these xml formats (Save As command, I use Word 2010 for example):

.docx - a multi-component .zip file whose 'main entry point' is the file 'word\document.xml'. The root tag is <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main". . .
.xml (XML document Word 2003) - a single XML file with the root tag <w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml". Another name is 'WordML'.
.xml (XML document Word) - a single XML file with the root tag <pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage" . . . This file contains several components (rels, themes, styles, fonttable, document, etc.) similar to the .docx zip file, but not compressed into a .zip.

The main formats of LibreOffice are:

.odt (ODF document), Open Document Format - a multi-component .zip file whose 'main entry point' is ./content.xml, with the root tag <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" . . . This format is 'similar' to .docx (multi-component + zip).
.fodt (Flat XML ODF Text Document) - a single XML file with the root tag <office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0". . . This file contains several components similar to the .odt zip file, but not compressed into a .zip. This format is 'similar' to the MS Word .xml (with the <pkg:package root).

Regarding xslt:

wordml2ooo.xsl - the XSLT that converts an .xml file (with root tag w:wordDocument, i.e. XML document Word 2003) into .fodt
ooo2wordml.xsl - the XSLT that converts a .fodt file into an .xml file (with root tag w:wordDocument, i.e. XML document Word 2003)

If you need to convert .docx into HTML, then you need to:

unpack the .docx zip file into some folder
convert ./word/document.xml (WordprocessingML) into HTML using some XSLT (maybe https://github.com/ottoville/DOCX2HTML.XSL; I haven't worked with it). Please note that a .docx can embed Excel spreadsheets, drawings, etc., each with its own XML format (the full Office Open XML specification (ECMA-376 standard) is huge, about 5000 pages).

opoudjis commented 4 years ago

> (the full Office Open XML specification (ECMA-376 standard) is huge, about 5000 pages)

https://sebsauvage.net/wiki/doku.php?id=word_document_generation :

Possible Solutions: Generate .docx files (Afterall, that's XML, isn't it ?) BANNED. I don't have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !

opoudjis commented 4 years ago

I'm worried about where this is going: if the XSLTs available online do a worse job of converting DOCX to clean HTML with complete coverage, then this approach of externalised XHTML has to be rejected. Phrases like "So far following features are supported" in the https://github.com/ottoville/DOCX2HTML.XSL readme, or code targeting the much simpler Markdown format, do not inspire confidence. At all.

So with any of these XSLTs, we will need to ensure that they generate all the markup that we want to see in Asciidoctor. That includes footnotes, mathematics, images, bookmarks, and so on.

> Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.

... Obviously with a 1500 pp spec, and with a conversion already in place in LibreOffice (and presumably elsewhere) such a brand new XSLT from scratch is not a good use of anybody's time.

w00lf commented 4 years ago

@ronaldtse @opoudjis I have tested our options with the unzipped test docx file; here are some results:

https://github.com/ottoville/DOCX2HTML.XSL - currently requires XSLT 2.0 support, and the latest version of Nokogiri supports only 1.0 and 1.1 syntax:

2.5.3 :002 > template = Nokogiri::XSLT(File.read('/Users/mitaraskin/Work/Personal/Metanorma/DOCX2HTML.XSL/docx2html.xsl'))
XPath error : Invalid expression
max(($r2,$g2,$b2))
    ^
XPath error : Invalid expression
min(($r2,$g2,$b2))
    ^
.....
RuntimeError (compilation error: element stylesheet)
xsl:version: only 1.1 features are supported

So if we want to use this XSL, we will still need an external dependency with XSLT 2.0 support.

https://github.com/chrahunt/docx - does not support images at all,
https://github.com/kaleguy/docx2json/blob/master/wordtoxml.xsl - does not support images either.

ronaldtse commented 4 years ago

@w00lf there's a newer gem https://github.com/openxml/openxml-docx that seems to have some basics implemented. Could you give it a try to see what support it has?

w00lf commented 4 years ago

> @w00lf there's a newer gem https://github.com/openxml/openxml-docx that seems to have some basics implemented. Could you give it a try to see what support it has?

I have inspected it a little. It has some support for image embedding: https://github.com/openxml/openxml-docx/blob/fc093111eb6b0640d0b34901de6d39ba3907df3d/examples/image-embedding, but the code itself is focused on docx creation, and it will require some work to use it for parsing docx documents, if that is even possible.

ronaldtse commented 4 years ago

I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.

w00lf commented 4 years ago

> I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.

There is no support for Saxon in MRI Ruby, only in JRuby.

ronaldtse commented 4 years ago

@w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.
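A sketch of what such a Java process could look like from MRI Ruby, using only the standard library. The `-s:`, `-xsl:` and `-o:` options are Saxon's command-line arguments; the jar path is a placeholder that depends on where Saxon HE is installed, and `saxon_command` and `run_saxon` are hypothetical helper names.

```ruby
require 'open3'

# Build the argument list for Saxon's command-line transformer.
def saxon_command(jar_path, source_path, stylesheet_path, output_path)
  [
    'java', '-jar', jar_path,
    "-s:#{source_path}",        # source XML document
    "-xsl:#{stylesheet_path}",  # stylesheet (XSLT 2.0 is fine here)
    "-o:#{output_path}"         # where Saxon writes the result
  ]
end

# Run Saxon and raise if the Java process reports a failure.
def run_saxon(jar_path, source_path, stylesheet_path, output_path)
  command = saxon_command(jar_path, source_path, stylesheet_path, output_path)
  _stdout, stderr, status = Open3.capture3(*command)
  raise "Saxon failed: #{stderr}" unless status.success?

  output_path
end
```

Passing the command as an array avoids shell quoting problems with paths that contain spaces.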

opoudjis commented 4 years ago

XSLT 2.0 is the Devil's own proprietary monopoly, and anyone who codes in XSLT 2.0 deserves bastinado'ing. And the fact that it is the Devil's own monopoly demonstrates what a niche fail XSLT has become. (Couldn't have happened to a more deserving spec.)

And yes indeed, XSLT 2.0 commits you to Java. If that's not an indictment, I don't know what is.

w00lf commented 4 years ago

> @w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.

Is that even OK to do? I thought the initial intent was to move away from third-party dependencies (LibreOffice)?

w00lf commented 4 years ago

@opoudjis @ronaldtse what's our plan here next?

opoudjis commented 4 years ago

@ronaldtse Strongly suggest this ticket be closed

ronaldtse commented 4 years ago

I'm not convinced that an out-of-band Saxon process does harm.

> Is that even OK to do? I thought the initial intent was to move away from third-party dependencies (LibreOffice)?

Yes. We wanted to get away from LibreOffice, not all third-party dependencies.
