metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma
MIT License

Ability to convert Word into Coradoc (and to adoc) #115

Open ronaldtse opened 2 months ago

ronaldtse commented 2 months ago

Could use the docx gem.

hmdne commented 2 months ago

We already kinda support this with w2a: https://github.com/metanorma/coradoc/blob/main/exe/w2a

This could be amended and an API exposed. Related issues: #100, #64

ronaldtse commented 2 months ago

w2a worked poorly because it was actually a Word => HTML => AsciiDoc conversion, which means we lose plenty of information.

We now have a document to process from Word directly to AsciiDoc. Let's use this opportunity to make Coradoc work.

hmdne commented 1 month ago

@ronaldtse I have yet to evaluate that in full; as of now I have no idea how much data is lost. My assumption, based on experience from a long time ago, is that Microsoft Word at least made sure that the generated HTML file is still editable.

But another experience of mine was having to choose a format for postprocessing: either ODT or DOCX. DOCX felt mostly like a dump of Microsoft Word's memory: very verbose and hard to work with. By comparison, ODT felt much more like HTML; it was a well-designed format. Even if we use a docx gem, I feel like we would have to heavily amend it for this task - for now, it looks like it only supports the most basic elements.

Below is the result of a basic test: a simple test document created in LibreOffice:

image

I have extracted the exact fragment that corresponds to the document structure, as that's what we're most interested in. This is ODT:

image

And this is DOCX:

image image

To compare, below is the entire document converted to HTML:

image

Of course, this document has been generated carefully. I have, for instance, used the correct button in LibreOffice to generate a heading. I assume that won't always be the case.

My proposal would be to:

  1. Attempt work on #87 using w2a, the current solution.
  2. If we can proceed with this issue without many problems, let's keep it as-is and scrap the idea of handling the DOCX format directly.

ronaldtse commented 2 weeks ago

Thank you for the investigation. Unfortunately, most users use DOCX, not ODT, and we definitely cannot get people to migrate from DOCX to ODT.

If we do DOCX, we should do it right, instead of using ODT or the XSLT stylesheet (as to convert to HTML), because there are semantic losses.

I actually believe that lutaml-model will make parsing and working with a DOCX much easier than using the docx gem itself.

hmdne commented 2 weeks ago

> I actually believe that lutaml-model will make parsing and working with a DOCX much easier than using the docx gem itself.

As I understand it, by writing our own rules for handling DOCX.

> Unfortunately, most users use DOCX, not ODT, and we definitely cannot get people to migrate from DOCX to ODT.

But, for the implementation, we could possibly keep LibreOffice, just to use it to convert from DOCX to ODT (instead of converting to HTML, as we do now), therefore supporting both formats. Since both formats are, from what I know, semantically interchangeable, this wouldn't hamper the task, but would make the implementation simpler.
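A minimal sketch of that DOCX => ODT step, assuming LibreOffice's `soffice` binary is on the PATH. The helper names (`docx_to_odt_command`, `odt_output_path`, `convert_docx_to_odt`) are hypothetical and not part of Coradoc; only the `--headless`, `--convert-to` and `--outdir` options are LibreOffice's own.

```ruby
require "open3"

# Hypothetical helper: builds the LibreOffice command line for DOCX => ODT.
# Assumes the "soffice" executable is available on the PATH.
def docx_to_odt_command(docx_path, outdir)
  ["soffice", "--headless", "--convert-to", "odt", "--outdir", outdir, docx_path]
end

# Hypothetical helper: where LibreOffice is expected to write the result
# (same base name as the input, with an .odt extension, inside outdir).
def odt_output_path(docx_path, outdir)
  File.join(outdir, "#{File.basename(docx_path, '.*')}.odt")
end

# Runs the conversion and returns the path of the generated ODT file.
def convert_docx_to_odt(docx_path, outdir)
  _out, err, status = Open3.capture3(*docx_to_odt_command(docx_path, outdir))
  raise "LibreOffice conversion failed: #{err}" unless status.success?

  odt_output_path(docx_path, outdir)
end
```

The command is passed as an argument array rather than a single string, so paths with spaces need no shell quoting.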

> If we do DOCX, we should do it right, instead of using ODT or the XSLT stylesheet (as to convert to HTML), because there are semantic losses.

And that could become an alternative to using LibreOffice for that conversion in the future.

opoudjis commented 1 week ago

@hmdne You are currently going through the same process I went through 6 years ago. See the readme on

https://github.com/strogonoff/reverse_asciidoctor

(Since my original reverse_adoc readme appears to have been memory-holed.)

reverse_adoc, which you are now reimplementing, had decided to use ODT HTML rather than DOCX HTML, precisely because its HTML was much neater and closer to the semantics.

Ronald does not want to pursue this approach, and he does not want the dependence on LibreOffice. He wants to implement this directly from the object model, with a serialiser currently under development.

opoudjis commented 1 week ago

> Even if we use a docx gem, I feel like we would have to heavily amend it for this task - for now, it looks like it only supports the most basic elements.

See https://github.com/metanorma/html2doc/wiki/Why-not-docx%3F, authored around the same time. I did my own survey of Node-based authoring tools for my day job last year; they had more features than what I found in Ruby in 2018, but most features are still only available if you pay for them, and from what I have seen, I have little confidence that they will ever cope with the complexity of stuff Metanorma expects in a Word document. Even colouring table cells turned out to be surprisingly difficult with Node tools.

The direction this is clearly heading towards is using a full Word SDK: Word-authoring gems, which assume nothing more complex than an image, simply won't cut it, and neither will naive serialisers solve the semantic complexities of OOXML. And that means fully understanding the OOXML spec, if you're going to use a Word SDK.

Sebastian Sauvage was right when he wrote, 10 years ago,

> BANNED. I don’t have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !

(You will find, when you scrutinise real OOXML in documents, how right he is. And how underdocumented Word formatting actually is: there is no documentation of their Word CSS in MHT at all, you can only update it through trial and error.)

hmdne commented 6 hours ago

@opoudjis That's the current approach - we take DOCX, convert it to HTML using LibreOffice, and then run it through the existing HTML pipeline. We are not reimplementing reverse_adoc - for the most part, we have split the existing code into two parts: one converts HTML to a Coradoc tree, the other converts that tree to AsciiDoc.

What I propose is to use ODT directly, not ODT HTML. Unlike ODT HTML, ODT itself preserves the semantics, and it is still much closer to HTML in terms of readability than DOCX is. I assume this should be a fairly straightforward task, at least to get parity with what we currently have with ODT HTML.
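To illustrate the idea, here is a rough sketch of reading headings and paragraphs straight from an ODT `content.xml` and emitting AsciiDoc, in the same two stages (source to tree, tree to AsciiDoc). It assumes `content.xml` has already been extracted from the ODT archive, uses REXML rather than anything from Coradoc, and the `Heading`/`Paragraph` structs are hypothetical stand-ins for the real Coradoc tree. Mapping ODT outline level 1 to an AsciiDoc `==` section is also an assumption.

```ruby
require "rexml/document"

# Namespace of the ODF text elements (text:h, text:p, text:span, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical intermediate nodes; not Coradoc's actual classes.
Heading = Struct.new(:level, :text)
Paragraph = Struct.new(:text)

# Collects the text of an element, including nested inline elements.
def plain_text(element)
  element.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element then plain_text(child)
    else ""
    end
  end.join
end

# Stage 1: ODT content.xml => list of nodes, in document order.
def odt_to_tree(content_xml)
  nodes = []
  REXML::Document.new(content_xml).each_recursive do |el|
    next unless el.namespace == TEXT_NS

    case el.name
    when "h"
      level = (el.attributes["text:outline-level"] || "1").to_i
      nodes << Heading.new(level, plain_text(el))
    when "p"
      text = plain_text(el)
      nodes << Paragraph.new(text) unless text.strip.empty?
    end
  end
  nodes
end

# Stage 2: nodes => AsciiDoc. Assumes outline level 1 becomes "==".
def tree_to_asciidoc(nodes)
  blocks = nodes.map do |node|
    case node
    when Heading then "#{'=' * (node.level + 1)} #{node.text}"
    when Paragraph then node.text
    end
  end
  blocks.join("\n\n") + "\n"
end
```

Lists, tables and inline formatting are deliberately ignored here; the point is only that the heading level is available as an explicit attribute (`text:outline-level`) and does not have to be guessed from styling.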