metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma
MIT License

Parse Word using a Word gem #82

Open ronaldtse opened 5 years ago

ronaldtse commented 5 years ago

This gem should always produce correct output. That means using a Word gem that parses the content according to the document structure, rather than relying on LibreOffice to do an XHTML conversion.

That also makes life much easier, since it avoids a huge dependency.

skalee commented 5 years ago

@ronaldtse Pandoc also reads *.docx files (see https://pandoc.org/MANUAL.html), so perhaps we can do XHTML conversion in some other way.

Apart from that, there are a few somewhat promising gems, but I doubt they are as mature as Pandoc:

BTW, is *.docx enough, or must the older MS Word formats be recognized as well?

ronaldtse commented 5 years ago

@skalee Pandoc is generally poor at preserving semantics and is overzealous in other ways. We only need to handle docx for now.

In fact, we are looking for very specific tag conversions. I will share the BasicDoc document with you shortly, which describes the underlying document model of Metanorma. I.e., Word only needs to be converted into that document model directly; anything not supported by that model should be discarded.
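To make the "convert only what the model supports, discard the rest" idea concrete, here is a rough sketch. It is not an existing gem's API: `convert_body`, `convert_paragraph`, and the hash-shaped nodes are hypothetical names, and it assumes the `word/document.xml` part has already been extracted from the .docx archive and that headings use Word's default English style IDs (`Heading1`, `Heading2`, ...). It uses REXML only to stay dependency-free.

```ruby
require "rexml/document"

# WordprocessingML main namespace (the "w:" prefix in word/document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = { "w" => W_NS }.freeze

# Join the text of every run (<w:r><w:t>) inside one paragraph.
def paragraph_text(para)
  REXML::XPath.match(para, ".//w:t", NS).map { |t| t.text.to_s }.join
end

# Map one <w:p> to a (hypothetical) model node, based on its paragraph style.
def convert_paragraph(para)
  style = REXML::XPath.first(para, "w:pPr/w:pStyle", NS)
  # Assumption: the document uses the "w" prefix for the style attribute.
  style_name = style ? style.attributes["w:val"].to_s : ""
  text = paragraph_text(para)

  if (m = style_name.match(/\AHeading(\d)\z/))
    { type: :heading, level: m[1].to_i, text: text }
  else
    { type: :paragraph, text: text }
  end
end

# Walk the body in document order; anything the sketch does not know
# (tables, section properties, ...) is silently discarded.
def convert_body(xml)
  doc = REXML::Document.new(xml)
  body = REXML::XPath.first(doc, "/w:document/w:body", NS)
  return [] unless body

  nodes = []
  body.elements.each do |el|
    next unless el.namespace == W_NS && el.name == "p"
    nodes << convert_paragraph(el)
  end
  nodes
end

SAMPLE_XML = <<~XML
  <w:document xmlns:w="#{W_NS}">
    <w:body>
      <w:p>
        <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
        <w:r><w:t>Scope</w:t></w:r>
      </w:p>
      <w:p>
        <w:r><w:t>This document </w:t></w:r>
        <w:r><w:t>defines things.</w:t></w:r>
      </w:p>
      <w:sectPr/>
    </w:body>
  </w:document>
XML

convert_body(SAMPLE_XML).each { |node| p node }
```

The point of the whitelist shape is that unsupported Word constructs never reach the output, instead of being approximated the way a generic XHTML conversion would.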

Now that we’re discussing this, we will want a BasicDoc gem that models the document as a class, and then a to-adoc method for every node type to serialize the tree out.

It should be very doable with your experience :wink:
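The class-per-node idea with a to-adoc serializer on each node could look roughly like this. This is only a sketch: the `BasicDocSketch` module and its `Document`, `Section`, `Paragraph`, and `Text` classes are hypothetical and are not taken from the BasicDoc model itself, which defines its own node types.

```ruby
# Hypothetical node classes; each one knows how to serialize itself.
module BasicDocSketch
  # Inline text, optionally bold and/or italic.
  class Text
    def initialize(content, bold: false, italic: false)
      @content = content
      @bold = bold
      @italic = italic
    end

    def to_adoc
      out = @content
      out = "_#{out}_" if @italic
      out = "*#{out}*" if @bold
      out
    end
  end

  # A paragraph is a sequence of inline nodes.
  class Paragraph
    def initialize(*inlines)
      @inlines = inlines
    end

    def to_adoc
      @inlines.map(&:to_adoc).join
    end
  end

  # A section has a title, a nesting level, and block-level children.
  class Section
    def initialize(title, level: 1, children: [])
      @title = title
      @level = level
      @children = children
    end

    def to_adoc
      marker = "=" * (@level + 1) # level 1 section => "=="
      ["#{marker} #{@title}", *@children.map(&:to_adoc)].join("\n\n")
    end
  end

  # The root node: document title plus top-level blocks.
  class Document
    def initialize(title, children: [])
      @title = title
      @children = children
    end

    def to_adoc
      ["= #{@title}", *@children.map(&:to_adoc)].join("\n\n") + "\n"
    end
  end
end

doc = BasicDocSketch::Document.new(
  "Sample",
  children: [
    BasicDocSketch::Section.new(
      "Scope",
      level: 1,
      children: [
        BasicDocSketch::Paragraph.new(
          BasicDocSketch::Text.new("This is "),
          BasicDocSketch::Text.new("important", bold: true),
          BasicDocSketch::Text.new(".")
        )
      ]
    )
  ]
)

puts doc.to_adoc
```

Because every node only serializes itself and delegates to its children, adding a new node type means adding one class with one `to_adoc` method, without touching the others.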

ronaldtse commented 5 years ago

@skalee here's the document model for BasicDoc: https://github.com/CalConnect/csd-lightweight-doc