mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Feasibility: Convert to Documentation XML Formats (DITA, Custom) #106

Closed bai-yi-bai closed 2 years ago

bai-yi-bai commented 3 years ago

I hope this post kicks off a discussion.

What is the feasibility of extending mammoth to export to XML formats?

I write and edit a lot of documentation in my professional life. The organizations I've worked for use DITA, DocBook, or other XML-based documentation standards as part of their authoring flow. Usually, the content is not natively authored in WYSIWYG XML editors, such as XMetaL, FrameMaker, or Oxygen, but is instead originally created in Word. Many technical writers are not savvy programmers and end up manually converting documentation by copying and pasting text from Word to their XML editor. A simple conversion tool would greatly benefit them.

Let's survey the field of existing Word to XML conversion tools.

Mammoth

Mammoth was explained to me as having two parts: a parser and a converter. The parser does a great job; the converter is where the focus needs to be. I examined conversion.py and found that the HTML tags used by Mammoth are hard-coded. The conversion also seems to be done in clearly delineated steps. I see a few problems with creating XML output:

- How would HTML and XML schemas be stored?
- How could a schema be expanded with a custom one?
- Some schemas nest elements inside other elements. How could a schema be created to ensure that the XML validates?

I cannot conceive of a way to map one XML DTD to another XML DTD. My brain struggles to define a dictionary key and value pair schema which would allow conversion.
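As a rough sketch of one possible answer to the storage and nesting questions: the mapping could be a plain dictionary from Word style name to a path of element names, where the last name holds the text and the earlier names are wrappers shared by consecutive paragraphs. Everything here (`STYLE_MAP`, `to_xml`, the style names, the DITA-like element names) is an assumption for illustration and not part of Mammoth. It only produces well-formed XML and does not check the result against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: Word style name -> path of XML element names.
# The last name is the element that holds the text; earlier names are
# wrappers that consecutive paragraphs with the same path share.
STYLE_MAP = {
    "Heading 1": ["title"],
    "Normal": ["p"],
    "List Bullet": ["ul", "li"],
}


def to_xml(paragraphs, style_map=STYLE_MAP, root_tag="body"):
    """Build an element tree from (style_name, text) pairs."""
    root = ET.Element(root_tag)
    open_wrappers = []  # wrapper elements currently open, outermost first
    open_path = []      # tag names of those wrappers
    for style_name, text in paragraphs:
        path = style_map.get(style_name, ["p"])  # unknown styles fall back to <p>
        wrappers, leaf = path[:-1], path[-1]
        if wrappers != open_path:
            # The wrapper path changed: start a fresh wrapper chain under root.
            open_wrappers, open_path = [], list(wrappers)
            parent = root
            for tag in wrappers:
                parent = ET.SubElement(parent, tag)
                open_wrappers.append(parent)
        parent = open_wrappers[-1] if open_wrappers else root
        ET.SubElement(parent, leaf).text = text
    return root


doc = [
    ("Heading 1", "Install"),
    ("List Bullet", "Download"),
    ("List Bullet", "Run setup"),
    ("Normal", "Done."),
]
xml_text = ET.tostring(to_xml(doc), encoding="unicode")
print(xml_text)
```

A custom schema would then be a second dictionary merged over the first, for example `{**STYLE_MAP, **custom_map}`. Validation against a DTD would still need a separate step with a validating XML library.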

Thank you for reading. I think it's really funny that I am using Markdown on a Microsoft-owned tool (GitHub) to discuss how to convert its 20-year-old docx format to other formats. The real solution, it seems, is to teach all my stakeholders how to write in AsciiDoc or Markdown and just convert that to our company's stylesheet/letterhead, but that is easier said than done.

mwilliamson commented 3 years ago

I think converting to other formats is outside the scope of Mammoth, and I'm not sure that there's really that much to share when converting from the parsed document to different formats.

It sounds like the parser is the part that would be useful to you. I haven't really looked around at other docx parsers recently, but my suggestion would be to use the parsed document from Mammoth or another docx library directly, and try writing something that outputs the XML you need from that parsed document, rather than trying to adjust Mammoth's conversion code. Mammoth doesn't make any stability guarantees about the parser, since it's an internal implementation detail (although it's pretty stable in practice), so you'd probably want to pin to a specific version if you use Mammoth's parser.
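To make that suggestion concrete, here is a minimal sketch of walking a parsed document and writing XML from it. The `Document`, `Paragraph`, and `Run` classes below are hypothetical stand-ins and not Mammoth's actual parser classes; the real node types and attribute names would need to be checked against whichever pinned version is used. The output element names (`body`, `p`, `b`) are also just assumptions.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


# Hypothetical stand-ins for the node types a docx parser might return.
@dataclass
class Run:
    text: str
    bold: bool = False


@dataclass
class Paragraph:
    runs: list = field(default_factory=list)


@dataclass
class Document:
    paragraphs: list = field(default_factory=list)


def append_text(parent, text):
    """Append plain text to parent, after its last child if it has one."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def write_paragraph(parent, paragraph):
    """Emit one <p>, wrapping bold runs in <b> and leaving the rest as text."""
    p = ET.SubElement(parent, "p")
    for run in paragraph.runs:
        if run.bold:
            ET.SubElement(p, "b").text = run.text
        else:
            append_text(p, run.text)
    return p


def write_document(document, root_tag="body"):
    """Walk the parsed document and build the output element tree."""
    root = ET.Element(root_tag)
    for paragraph in document.paragraphs:
        write_paragraph(root, paragraph)
    return root


parsed = Document([
    Paragraph([Run("Press "), Run("Enter", bold=True), Run(" to continue.")]),
])
output = ET.tostring(write_document(parsed), encoding="unicode")
print(output)
```

Keeping the writer separate from the parser's node types like this means only the small adapter that reads the parsed document would need to change if the parser's internals shift between versions.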

mwilliamson commented 2 years ago

I'm closing since I'm not sure there's anything in particular that Mammoth can do to help here.