Open ronaldtse opened 5 years ago
This is for @w00lf . Thanks!
@w00lf could you please help make this work? Thanks!
Sure, let me look more closely at what can be used here.
@w00lf I think you can use Nokogiri to run XSLT (which runs libxslt
underneath) to transform Word to HTML
e.g. http://craftingruby.com/posts/2014/01/14/transforming-xml-in-ruby-with-xslt-and-nokogiri.html
Yes, and if you want an example of our existing stack doing this, see the use of XSLT in the html2doc gem: https://github.com/metanorma/html2doc/blob/master/lib/html2doc/math.rb
I've unassigned myself, but of course please let me know how this goes.
Thanks for the tips @opoudjis . @w00lf let us know if you run into any issues.
Hi there, @ronaldtse. I tried to use the LibreOffice XSLT (https://github.com/metanorma/ooo-word-xslt) with Nokogiri. Unfortunately, I cannot make it work with the test docx document. For example, ./wordml2ooo/wordml2ooo_text.xsl produces plain text from the document without any formatting:
<?xml version="1.0"?>
HelloH20i=1nβ2i
As I understand it, the main entry point for this XSLT is the wordml2ooo/wordml2ooo.xsl file, as it includes all the other stylesheets, but for me it just produces a blank file containing only the XML declaration:
<?xml version="1.0"?>
Why did you choose these particular XSLT files? Is there any documentation for their structure? Maybe the input docx XML files need to be linked properly before transforming with these stylesheets?
This is the code I am using to transform:
require 'nokogiri'
# Parse the main document part from the unzipped docx.
document = Nokogiri::XML(File.read('./word/document.xml'))
# Compile the entry-point stylesheet and apply it.
template = Nokogiri::XSLT(File.read('./wordml2ooo/wordml2ooo.xsl'))
transformed_document = template.transform(document)
File.write('output.html', transformed_document.to_s)
@w00lf the source of the XSLT files is given at: https://github.com/metanorma/ooo-word-xslt#history
Unfortunately there doesn't seem to be any documentation on how to use the XSLTs.
By searching the source repo (https://github.com/LibreOffice/core/search?q=wordml2ooo&unscoped_q=wordml2ooo), the only place it is used is here: https://github.com/LibreOffice/core/blob/330df37c7e2af0564bcd2de1f171bed4befcc074/filter/source/config/fragments/filters/MS_Word_2003_XML.xcu#L22
The code points to XMLOasisImporter and XMLOasisExporter, which is the software used to import/export OOO.
A search of XMLOasisImporter provides this: https://github.com/search?p=4&q=org%3ALibreOffice+XMLOasisImporter&type=Code
A Google search for it indicates that it is a class available for developers to use. It doesn't seem like this class does anything special for this process except run XSLTs.
@ronaldtse,
BZZZZT
Those are the wrong XSLTs. They are the transforms between Microsoft OOXML and Open Office.
It looks like what you want is at: https://github.com/LibreOffice/core/tree/master/filter/source/xslt/odf2xhtml/export/xhtml
But it also looks like that only converts from OpenOffice to XHTML, so in fact you need two stages:
Microsoft OOXML > OpenOffice XML > XHTML
So you still need to get the first stage working with wordml2ooo.xsl
@opoudjis yes that's what I was thinking this morning.
Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.
Hi there. After some testing I was able to determine why these particular XSLT files do not work for the test docx I was using. I noticed that the entry XSLT file https://github.com/metanorma/ooo-word-xslt/blob/master/wordml2ooo/wordml2ooo.xsl#L37 uses w:wordDocument as the entry point for a document. There is no such tag in the docx file word/document.xml. The RTF file (docx extension) uses w:document as its root tag. I checked random tags from the test document, and it turns out that the wordml2ooo XSLT is missing a number of them; for example, there are no transform rules for the tags sSubSup, w:document, oMathPara, and ctrlPr. I searched for the w:wordDocument signature, and this is its description: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats. It seems that these particular XSLTs are for other types of documents, not docx (RTF) itself. I searched the core LibreOffice repo for mentions of RTF tags and found this file: https://github.com/LibreOffice/core/blob/1c5465ef1158ebf0f3f64e3343c2ed610024e5a8/writerfilter/source/rtftok/rtfcontrolwords.cxx. This file has all the tags from the test docx file, and it seems that this is the file LibreOffice uses to convert RTF files; there are no XSLT files for them at all. Here is another file: https://github.com/LibreOffice/core/blob/93eeaf0ad902214fb6b4205606b24046a458ee45/starmath/source/rtfexport.cxx. So, obviously, we cannot use that file separately, and we will need to find another way to do this if we don't want to use LibreOffice anymore. I can look for other gems that can work with docx; what do you think?
Interestingly, I think you're right! It seems that the WordML handled by these XSLT files is the WordProcessingML of the 2003 version, not the 2007 version.
So with some Googling I found these XSLTs that are docx2foo, where foo is the format:
Maybe @Intelligent2013 is more familiar with working with XSLT? Perhaps docx2json would provide a very good entry point?
@ronaldtse Microsoft Word supports these XML formats (Save As command, I use Word 2010 for example):
The main formats of LibreOffice are:
Regarding xslt:
If you need to convert .docx into html, then you need:
(the full Office Open XML specification (the ECMA-376 standard) is huge, about 5000 pages)
https://sebsauvage.net/wiki/doku.php?id=word_document_generation :
Possible Solutions: Generate .docx files (Afterall, that's XML, isn't it ?) BANNED. I don't have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !
I'm worried about where this is going: if the available XSLTs online do a worse job of converting DOCX to a clean HTML with complete coverage, then this approach of externalised XHTML has to be rejected. Phrases like "So far following features are supported" in the https://github.com/ottoville/DOCX2HTML.XSL readme, or code targeting the much simpler Markdown format, do not inspire confidence. At all.
So with any of these XSLTs, we will need to ensure that they generate all the markup that we want to see in Asciidoctor. That includes footnotes, mathematics, images, bookmarks, and so on.
Given that there's a transform from OOO to XHTML, I wonder if we can even create an OOO to AsciiDoc XSLT from it directly? That might work even better.
... Obviously with a 1500 pp spec, and with a conversion already in place in LibreOffice (and presumably elsewhere) such a brand new XSLT from scratch is not a good use of anybody's time.
@ronaldtse @opoudjis I have tested our options with the unzipped test docx file. Here are some results:
https://github.com/ottoville/DOCX2HTML.XSL - currently requires XSLT 2.0 support, and the latest version of Nokogiri supports only 1.0 and 1.1 syntax:
2.5.3 :002 > template = Nokogiri::XSLT(File.read('/Users/mitaraskin/Work/Personal/Metanorma/DOCX2HTML.XSL/docx2html.xsl'))
XPath error : Invalid expression
max(($r2,$g2,$b2))
^
XPath error : Invalid expression
min(($r2,$g2,$b2))
^
.....
RuntimeError (compilation error: element stylesheet)
xsl:version: only 1.1 features are supported
So if you want to use this XSL, we will still need an external dependency with XSLT 2.0 support.
https://github.com/chrahunt/docx - does not support images at all.
https://github.com/kaleguy/docx2json/blob/master/wordtoxml.xsl - does not support images either.
@w00lf there's a newer gem https://github.com/openxml/openxml-docx that seems to have some basics implemented. Could you give it a try to see what support it has?
I have inspected it a little bit. It has some support for image embedding: https://github.com/openxml/openxml-docx/blob/fc093111eb6b0640d0b34901de6d39ba3907df3d/examples/image-embedding, but the code itself is focused on docx creation and will require some work in order to use it for parsing docx documents, if that is even possible.
I see. @w00lf I think we can also try https://github.com/ottoville/DOCX2HTML.XSL with Saxon HE, which supports XSLT 2.0.
There is no support for Saxon in MRI Ruby, only in JRuby.
@w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.
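Running Saxon as a separate Java process from MRI Ruby could look roughly like the sketch below. It assumes a Saxon-HE jar has been downloaded locally and that `java` is on the PATH. The jar path and helper names are hypothetical, while `-s:`, `-xsl:` and `-o:` are Saxon's command-line options for source, stylesheet and output.

```ruby
require 'open3'

# Build the argument list for Saxon's command-line transformer.
# `saxon_jar` is a hypothetical path to a locally downloaded Saxon-HE jar.
def saxon_command(saxon_jar:, source:, stylesheet:, output:)
  ['java', '-jar', saxon_jar, "-s:#{source}", "-xsl:#{stylesheet}", "-o:#{output}"]
end

# Run the transformation in a child process and return the output path.
# Raises if the Java process exits with a non-zero status.
def run_saxon(**paths)
  _stdout, stderr, status = Open3.capture3(*saxon_command(**paths))
  raise "Saxon failed: #{stderr}" unless status.success?

  paths[:output]
end

# Example invocation, left commented out because it needs Java and the jar:
# run_saxon(saxon_jar: 'vendor/saxon-he.jar',
#           source: 'word/document.xml',
#           stylesheet: 'DOCX2HTML.XSL/docx2html.xsl',
#           output: 'output.html')
```

Passing the arguments as an array avoids shell interpolation of the file paths.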
XSLT 2.0 is the Devil's own proprietary monopoly, and anyone who codes in XSLT 2.0 deserves bastinado'ing. And the fact that it is the Devil's own monopoly demonstrates what a niche fail XSLT has become. (Couldn't have happened to a more deserving spec.)
And yes indeed, XSLT 2.0 commits you to Java. If that's not an indictment, I don't know what is.
@w00lf right, if we need to use XSLT 2.0 we need to run an out-of-band Java process.
Is that even OK to do? I thought the initial intent was to get away from third-party dependencies (LibreOffice)?
@opoudjis @ronaldtse what's our plan here next?
@ronaldtse Strongly suggest this ticket be closed
I'm not convinced that an out-of-band Saxon process does harm.
Is that even OK to do? I thought the initial intent was to get away from third-party dependencies (LibreOffice)?
Yes. We wanted to get away from libreoffice, not all third-party dependencies.
I've extracted out LibreOffice's Word related XSLTs here:
This task is to use these XSLT files to directly transform Word -> HTML, instead of needing to install LibreOffice.