mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Support for equations #17

Open: sunu opened this issue 8 years ago

sunu commented 8 years ago

Currently, equations are ignored with the following warning: An unrecognised element was ignored: {http://schemas.openxmlformats.org/officeDocument/2006/math}oMath
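For anyone who wants to check whether a document is affected, here is a minimal sketch that filters conversion messages for the oMath warning quoted above. It assumes that the result of mammoth.convert_to_html exposes .value and .messages, and that each message has a .message string. The helper names equation_warnings and convert_and_report, and the FakeMessage stand-in, are hypothetical and only there for illustration.

```python
from collections import namedtuple

# Namespaced tag name exactly as it appears in the warning text.
OMATH_TAG = "{http://schemas.openxmlformats.org/officeDocument/2006/math}oMath"


def equation_warnings(messages):
    """Return the text of every message that mentions the OMML oMath element."""
    return [m.message for m in messages if OMATH_TAG in m.message]


def convert_and_report(docx_path):
    """Convert a .docx file and return (html, warnings about ignored equations)."""
    import mammoth  # imported lazily so the helper above runs without mammoth

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    # Assumption: result.messages is a list of objects with a .message attribute.
    return result.value, equation_warnings(result.messages)


# Stand-in for mammoth's message objects, used only for this demonstration.
FakeMessage = namedtuple("FakeMessage", ["type", "message"])
demo_messages = [
    FakeMessage("warning", "An unrecognised element was ignored: " + OMATH_TAG),
    FakeMessage("warning", "Some unrelated warning"),
]
print(equation_warnings(demo_messages))
```

A non-empty list from equation_warnings means at least one equation was dropped from the HTML output.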

Is it possible to add support for equations?

Thanks for all your work. :)

mwilliamson commented 8 years ago

I haven't really looked into adding support for equations. I suspect, however, that it would be quite a lot of work, and take more time than I currently have spare, unless there happens to be another library that already handles this.

Would you mind providing a small example document that I can take a look at in case I find the time?

sunu commented 8 years ago

Sure. I'm attaching a file that contains only an equation, which I hope will make things easier to handle. If you need a bigger file to look at, let me know. equation.docx

The problem, as far as I understand, is that there is no native way to present OMML in HTML. We have to convert it to either MathML or LaTeX and then use an external JavaScript library such as MathJax to render it properly in the browser.

There are some libraries like https://github.com/xiilei/dwml to help with the conversion.
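To give a feel for what the OMML to MathML step involves, here is a toy sketch using only the Python standard library. It is not how dwml or any other library does it; it covers just plain runs, fractions, superscripts and subscripts, and the function names (omml_to_mathml, append_run, append_children) are made up for this example. Everything else in OMML (matrices, radicals, n-ary operators, and so on) is simply walked through without special handling.

```python
import re
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def m(name):
    """Qualified OMML tag name in ElementTree's {namespace}local form."""
    return "{%s}%s" % (M_NS, name)


# OMML element -> (MathML element, OMML children that hold its arguments).
STRUCTURES = {
    m("f"): ("mfrac", ("num", "den")),
    m("sSup"): ("msup", ("e", "sup")),
    m("sSub"): ("msub", ("e", "sub")),
}


def append_run(run, out):
    """Split the text of an m:r run into numbers, letters and other symbols."""
    text = "".join(t.text or "" for t in run.findall(m("t")))
    for token in re.findall(r"\d+(?:\.\d+)?|[^\W\d_]|\S", text):
        if token[0].isdigit():
            tag = "mn"
        elif token.isalpha():
            tag = "mi"
        else:
            tag = "mo"
        ET.SubElement(out, tag).text = token


def append_children(parent, out):
    """Convert each child of an OMML node and append the result to out."""
    for child in parent:
        if child.tag == m("r"):
            append_run(child, out)
        elif child.tag in STRUCTURES:
            mathml_tag, parts = STRUCTURES[child.tag]
            element = ET.SubElement(out, mathml_tag)
            for part in parts:
                row = ET.SubElement(element, "mrow")
                part_node = child.find(m(part))
                if part_node is not None:
                    append_children(part_node, row)
        else:
            # Unsupported or property element: just descend into it.
            append_children(child, out)


def omml_to_mathml(omml_xml):
    """Convert one m:oMath element (as an XML string) to a MathML string."""
    root = ET.fromstring(omml_xml)
    math = ET.Element("math", {"xmlns": MATHML_NS})
    append_children(root, math)
    return ET.tostring(math, encoding="unicode")


# x squared plus one half, written as OMML.
SAMPLE = (
    '<m:oMath xmlns:m="%s">'
    "<m:sSup><m:e><m:r><m:t>x</m:t></m:r></m:e>"
    "<m:sup><m:r><m:t>2</m:t></m:r></m:sup></m:sSup>"
    "<m:r><m:t>+</m:t></m:r>"
    "<m:f><m:num><m:r><m:t>1</m:t></m:r></m:num>"
    "<m:den><m:r><m:t>2</m:t></m:r></m:den></m:f>"
    "</m:oMath>"
) % M_NS

mathml = omml_to_mathml(SAMPLE)
print(mathml)
```

The resulting math element can be dropped into an HTML page, with MathJax loaded for browsers that do not render MathML themselves.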

Another way of representing equations would be to convert them into images, but I'm not sure how that can be done.

Let me know what you think. I can also help get a PR ready for this if we can make a concrete plan for the implementation.

Thanks again :)

GitBruno commented 8 years ago

Images are no good in my opinion, as they lose semantics. MathML seems to be the best fit for HTML. Maybe look at https://github.com/jgm/texmath to do the heavy lifting? This library can go from OMML (Office Math Markup Language, used in Microsoft Office) to MathML.

sulazix commented 6 years ago

Hello, I can confirm that image conversion is not an ideal solution (for accessibility reasons). MathML is currently the format recommended by the W3C and WAI for equation markup in HTML. I also know that a lot of people use MathType for typing equations in Word, so compatibility with that software would be really great :-) Has anyone made progress on implementing this feature?

vikasvisking commented 5 years ago

Any progress on this feature?

GitBruno commented 5 years ago

Just found out about KaTeX; it might be a good alternative to MathML.

ildarakhmetov commented 5 years ago

A really important feature, would love to see it in python-mammoth.

zlqm commented 4 years ago

There is a trick to convert an equation inside a docx file into LaTeX.

Since each equation is stored as an oMath element inside word/document.xml, we can extract it, transform it into LaTeX, and then put it back as normal text.

Here is a demo
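For readers who cannot reach the demo, the extract-and-replace step described above could look roughly like the following. This is a separate sketch, not the linked demo: the function names are hypothetical, and omml_to_text is only a placeholder that concatenates the equation's text, where a real OMML-to-LaTeX converter would be plugged in. The in-memory archive at the end contains only word/document.xml, which is enough to exercise the code but is not a complete docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
ET.register_namespace("w", W_NS)
ET.register_namespace("m", M_NS)

# Both inline equations and display-equation wrappers get replaced.
MATH_TAGS = {"{%s}oMath" % M_NS, "{%s}oMathPara" % M_NS}


def omml_to_text(math_element):
    """Placeholder converter: joins the m:t text instead of producing LaTeX."""
    return "".join(t.text or "" for t in math_element.iter("{%s}t" % M_NS))


def replace_equations(document_xml, convert=omml_to_text):
    """Swap every math element in document.xml for a plain w:r text run."""
    root = ET.fromstring(document_xml)
    for parent in list(root.iter()):  # snapshot, because the tree is mutated
        for index, child in enumerate(list(parent)):
            if child.tag in MATH_TAGS:
                run = ET.Element("{%s}r" % W_NS)
                text = ET.SubElement(run, "{%s}t" % W_NS)
                text.text = "$" + convert(child) + "$"
                run.tail = child.tail
                parent[index] = run
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def convert_docx(docx_bytes):
    """Return a copy of the archive with equations in document.xml replaced."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = replace_equations(data)
            dst.writestr(item, data)
    return out.getvalue()


# Minimal in-memory archive for demonstration purposes only.
DOCUMENT_XML = (
    '<w:document xmlns:w="%s" xmlns:m="%s"><w:body><w:p>'
    "<w:r><w:t>Sum: </w:t></w:r>"
    "<m:oMath><m:r><m:t>a+b=c</m:t></m:r></m:oMath>"
    "</w:p></w:body></w:document>"
) % (W_NS, M_NS)

source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)

converted = convert_docx(source.getvalue())
with zipfile.ZipFile(io.BytesIO(converted)) as archive:
    new_xml = archive.read("word/document.xml").decode("utf-8")
print(new_xml)
```

The converted bytes can then be passed to mammoth as a file-like object, and the equation comes through as ordinary text. One caveat: ElementTree rewrites namespace prefixes it has not been told about, so for real documents this is better suited to feeding a converter than to producing files that Word will reopen.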

RuiLiu0129 commented 4 years ago

Very helpful! Thanks a lot!

Flore-Acher commented 1 year ago

Hi @mwilliamson, is there any news, or are there any suggestions, on supporting math and chemistry formulas? How could we collaborate on this issue to achieve it?