mwilliamson / mammoth.js

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Supporting math #83

Open darobin opened 8 years ago

darobin commented 8 years ago

I don't know if you're planning to support math ever, but just in case I thought I'd give you a heads up that I have a JS converter for the Office math markup to MathML, so it could just be reused in Mammoth: omml2mathml.

I don't have time to dig into the Mammoth code to integrate this (in case you're interested) but I'm happy to help if someone takes it on.

mwilliamson commented 8 years ago

Thanks! To the best of my recollection, I don't think anybody's asked for support, so not really a priority.

As a couple of notes should anybody want to take a look:

hubgit commented 7 years ago

I had a quick look at this but couldn't work out how to convert the Mammoth XML element to a DOM node (and back).

hubgit commented 7 years ago

There's also an issue for MathJax support of OMML as an input format, though I'm not sure how it would be represented in the HTML.

hubgit commented 7 years ago

There's a related request for equation support in python-mammoth.

darobin commented 7 years ago

@hubgit It would be pretty straightforward to walk the returned HTMLElement and produce an equivalent Element, if that's what's needed. I'm not sure what Mammoth would best use. Alternatively it would be possible to add support for omml2mathml for XML output, but that would either require binding it to a specific XML DOM implementation, or it would need to be configurable with implementations that in turn would need to be interoperable enough to be usable (which isn't guaranteed at all in the JS world).

For MathJax it should be completely straightforward to implement an input jax by using omml2mathml and then piping that to the MathML input jax, but that might feel a bit indirect. I am not super familiar with MathJax's internals; it might be easy to port omml2mathml to address the internal API directly.

As for Python, I guess the best option is a direct port. That shouldn't take more than a day's work. (DOMs in Python tend to be quite painful, but it's not like our usage is highly complex.)

jmealo commented 7 years ago

Any news on this? I may take a stab at it.

hubgit commented 7 years ago

@jmealo Please do, I haven't tried yet.

MCTaylor17 commented 6 years ago

For the record, it would be quite useful for our purposes (education) if Mammoth could convert equations to MathML or LaTeX.

vikasvisking commented 5 years ago

Any update on this? Is it supported now?

jkorff commented 5 years ago

It would help if you could implement an option that lets Mammoth mark the position of any elements it doesn't understand yet, such as equations. This would allow us to trace insertion points in the converted HTML and do the conversion externally.

I've just completed that for OMML - converting it via a PowerShell script to MathML. All that I need now for a successful fully automated solution is to know where which equation sits within the HTML. Presently Mammoth is not capable of giving that clue.

How about:

var options = {
    markUnknownElementsWithClass: "unknown"
};

which could give me <p class="unknown"></p> in the HTML. We can then easily operate on this element or filter it out.
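
To make the proposal concrete, here is a minimal sketch of the post-processing step it would enable. Note that markUnknownElementsWithClass is only a suggestion in this comment, not an existing Mammoth option, and fillPlaceholders is a hypothetical helper. The sketch assumes the placeholders appear in document order and are emitted exactly as <p class="unknown"></p>.

```javascript
// Hypothetical helper: replaces the n-th placeholder with the n-th
// externally converted equation (for example MathML from another tool).
// Assumes placeholders are emitted exactly as <p class="unknown"></p>.
function fillPlaceholders(html, equations, className = "unknown") {
    const placeholder = `<p class="${className}"></p>`;
    let index = 0;
    // split + reduce avoids regex escaping and "$" handling in replacements
    return html.split(placeholder).reduce((out, part) => {
        // Keep the placeholder when there are more placeholders than equations
        const replacement = index < equations.length ? equations[index] : placeholder;
        index += 1;
        return out + replacement + part;
    });
}

const converted = fillPlaceholders(
    '<p>Before</p><p class="unknown"></p><p>After</p>',
    ["<math><mi>x</mi></math>"]
);
console.log(converted);
```

Leaving surplus placeholders in place (rather than dropping them) keeps a mismatch between the two tools visible in the output.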

lkkkeith commented 4 years ago

I had a quick look at this but couldn't work out how to convert the Mammoth XML element to a DOM node (and back).

Hi. I am currently working for a publisher to generate HTML from docx files that contain a lot of math equations. I managed to parse the "element" to MathML using omml2mathml, but elementResult() is returning null. Any idea how the HTML is generated from Mammoth?

I added the below code to the object xmlElementReaders in lib/docx/body-reader.js

`"m:oMath": function (element) {

        function getXML(node) {

            let attrString = "";

            if (node.value) {
                return node.value;
            }

            if (!node.children) {
                return "";
            };

            // Add attributes to the node if necessary
            let attrs = node.attributes;

            // Add only if attributes is not an empty object
            if (attrs && Object.entries(attrs).length){
                Object.entries(attrs).forEach( ([key, value]) =>{
                    attrString += ` ${key}="${value}"`
                })
            }

            if (node.children.length > 0){

                let content = ""
                // Declare the loop variable to avoid creating an implicit global
                for (const child of node.children) {
                    content += getXML(child)
                }

                return `<${node.name}${attrString}>${content}</${node.name}>`;
            }
            else{
                return `<${node.name}${attrString}/>`;
            }

        }

        // Parse the element to XML string
        let xml = getXML(element);

        // Add namespace to change it to OMML for parsing
        xml = 
        `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`+
        `<w:document xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">`+
        `${xml}`+
        `</w:document>`

        // Parse the XML document to DOM
       var DOMParser = require('xmldom').DOMParser;
        var xmlDoc = new DOMParser().parseFromString(xml, 'text/xml');

        // Parse the XML Doc with OMML to MathML
        let mathmlElement = omml2mathml(xmlDoc);

        // Add proper namespace to display in browser
        mathmlElement.removeAttribute("display")
        mathmlElement.setAttribute("xmlns", "http://www.w3.org/1998/Math/MathML")

        return elementResult(mathmlElement);
    },`
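
A side note on the serialization step in the snippet above: it does not escape text or attribute values, so content such as "a < b" would produce XML that fails to parse. Below is a minimal sketch of the same idea with escaping. It assumes nodes have the shape used in the snippet (name, attributes and children for elements, value for text); serializeNode and escapeXml are illustrative names, not part of Mammoth.

```javascript
// Escapes the characters that are special in XML text and attribute values.
function escapeXml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Serializes an element tree (assumed shape: { name, attributes, children }
// for elements and { value } for text nodes) to an XML string.
function serializeNode(node) {
    if (!node.name) {
        // Text node
        return escapeXml(node.value || "");
    }
    const attrs = Object.entries(node.attributes || {})
        .map(([key, value]) => ` ${key}="${escapeXml(value)}"`)
        .join("");
    const children = node.children || [];
    if (children.length === 0) {
        return `<${node.name}${attrs}/>`;
    }
    const content = children.map(serializeNode).join("");
    return `<${node.name}${attrs}>${content}</${node.name}>`;
}

const sample = {
    name: "m:r",
    attributes: {},
    children: [{ name: "m:t", attributes: {}, children: [{ value: "a<b" }] }]
};
console.log(serializeNode(sample));
```
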
liyongleihf2006 commented 4 years ago

@lkkkeith You gave me inspiration, and I implemented it based on your idea. Let me share my implementation.

First, three libraries need to be installed:

npm install --save mathjax omml2mathml xmldom

Then require them in node_modules\mammoth\lib\docx\body-reader.js:

var omml2mathml =  require('omml2mathml');
var xmldom = require('xmldom');
require('mathjax/es5/mml-svg')

Then change your code above to the following:

"oMath": function (element) {
    var om = transform(element)
    function transform(data) {
        var el = _transform(data)
        el.setAttribute('xmlns:wpc', 'http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas')
        el.setAttribute('xmlns:mo', 'http://schemas.microsoft.com/office/mac/office/2008/main')
        el.setAttribute('xmlns:mc', 'http://schemas.openxmlformats.org/markup-compatibility/2006')
        el.setAttribute('xmlns:mv', 'urn:schemas-microsoft-com:mac:vml')
        el.setAttribute('xmlns:o', 'urn:schemas-microsoft-com:office:office')
        el.setAttribute('xmlns:r', 'http://schemas.openxmlformats.org/officeDocument/2006/relationships')
        el.setAttribute('xmlns:m', 'http://schemas.openxmlformats.org/officeDocument/2006/math')
        el.setAttribute('xmlns:v', 'urn:schemas-microsoft-com:vml')
        el.setAttribute('xmlns:wp14', 'http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing')
        el.setAttribute('xmlns:wp', 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing')
        el.setAttribute('xmlns:w10', 'urn:schemas-microsoft-com:office:word')
        el.setAttribute('xmlns:w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')
        el.setAttribute('xmlns:w14', 'http://schemas.microsoft.com/office/word/2010/wordml')
        el.setAttribute('xmlns:w15', 'http://schemas.microsoft.com/office/word/2012/wordml')
        el.setAttribute('xmlns:wpg', 'http://schemas.microsoft.com/office/word/2010/wordprocessingGroup')
        el.setAttribute('xmlns:wpi', 'http://schemas.microsoft.com/office/word/2010/wordprocessingInk')
        el.setAttribute('xmlns:wne', 'http://schemas.microsoft.com/office/word/2006/wordml')
        el.setAttribute('xmlns:wps', 'http://schemas.microsoft.com/office/word/2010/wordprocessingShape')
        return el
        function _transform(data) {
            var name = data.name
            if (!name) {
            return document.createTextNode(data.value)
            }
            var tagName = name.replace(/\{.*\}/, 'm:')
            var el = document.createElement(tagName)
            var children = data.children
            if (children) {
            children.forEach(element => {
                el.appendChild(_transform(element))
            });
            }
            return el
        }
    }
    var doc = new xmldom.DOMParser().parseFromString(om.outerHTML)
    var math = omml2mathml(doc)
    var abc = MathJax.mathml2svg(math.outerHTML)
    abc = abc.outerHTML
    var svg = abc.match(/\<svg.*\<\/svg\>/)[0]
    var math2 = abc.match(/\<math.*\<\/math\>/)[0].replace(/\sdisplay=".+?"/, '').replace(/\</g, "«").replace(/\>/g, "»").replace(/"/g, "¨")
    var style = ''
    svg.replace(/style="(.+?)"\>/, function (match, $1) {
        style = $1
    })
    svg = "data:image/svg+xml,"+window.encodeURIComponent(svg)
    var img = `<img align="middle" class="Wirisformula" src="${svg}" data-mathml="${math2}" alt="1 half" role="math" style="${style}">`
    //
    //  Then update the document to include the adjusted CSS for the
    //    content of the new equation.
    //
    MathJax.startup.document.clear();
    MathJax.startup.document.updateDocument();
    return elementResult(new documents.Text(img));
},
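
One detail of the handler above: name.replace(/\{.*\}/, 'm:') rewrites every namespace to the m: prefix, including elements from other namespaces that may be nested inside an equation. A minimal sketch of a map-based alternative follows, assuming element names arrive in the expanded "{namespace}local" form that the regex implies. toQualifiedName and the PREFIXES table are illustrative, with the two URIs taken from the setAttribute calls above.

```javascript
// Namespace URIs copied from the setAttribute calls above; extend as needed.
const PREFIXES = {
    "http://schemas.openxmlformats.org/officeDocument/2006/math": "m",
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main": "w"
};

// Converts "{uri}local" to "prefix:local", preserving the case of the
// local name. Names that are already prefixed, or that use a namespace
// missing from PREFIXES, are returned unchanged.
function toQualifiedName(name) {
    const match = /^\{([^}]*)\}(.+)$/.exec(name);
    if (!match) {
        return name;
    }
    const prefix = PREFIXES[match[1]];
    return prefix ? `${prefix}:${match[2]}` : name;
}

console.log(toQualifiedName("{http://schemas.openxmlformats.org/officeDocument/2006/math}sSub"));
console.log(toQualifiedName("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rPr"));
```
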

The readXmlElement function needs to be changed to the following:

function readXmlElement(element) {
    if (element.type === "element") {
        var handler = xmlElementReaders[element.name];
        if (handler) {
            return handler(element);
        }else if(/math/.test(element.name)){
            return xmlElementReaders.oMath(element);
        } else if (!Object.prototype.hasOwnProperty.call(ignoreElements, element.name)) {
            var message = warning("An unrecognised element was ignored: " + element.name);
            return emptyResultWithMessages([message]);
        }
    }
    return emptyResult();
}
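
Note that the /math/ test above matches any element whose name merely contains the substring "math". A stricter check is sketched below, under the assumption that math element names arrive either with an m: prefix or in the expanded "{namespace}local" form (both forms appear in this thread). isOfficeMathElement is an illustrative helper, not part of Mammoth.

```javascript
const MATH_NAMESPACE = "http://schemas.openxmlformats.org/officeDocument/2006/math";

// True for names such as "m:oMath" or "{<math namespace>}oMath".
// Assumes the "m" prefix, when present, is bound to the Office math namespace.
function isOfficeMathElement(name) {
    return name.startsWith("m:") || name.startsWith(`{${MATH_NAMESPACE}}`);
}

console.log(isOfficeMathElement(`{${MATH_NAMESPACE}}oMathPara`)); // true
console.log(isOfficeMathElement("{http://example.com/aftermath}note")); // false
```

In readXmlElement, the fallback branch could then call isOfficeMathElement(element.name) instead of /math/.test(element.name).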

In the final result, the escaped "&lt;" and "&gt;" around the generated img tags need to be replaced with "<" and ">":

mammoth.convertToHtml({ arrayBuffer: arrayBuffer }).then(function (resultObject) {
  var html = resultObject.value.replace(/&lt;(img[^]+)&gt;/g, '<$1>')
  console.log(html)
})

If the formula was inserted from WPS, the situation is different. A formula inserted in WPS is stored as a picture in x-wmf format, which browsers do not support, so it needs to be converted to PNG in an additional final step. You need the UDOC library first. Then include it:

<script src="./UDOC.js"></script>
<script src="./FromWMF.js"></script>
<script src="./ToContext2D.js"></script>

Finally, the conversion call needs to be changed to the following:

mammoth.convertToHtml({ arrayBuffer: arrayBuffer }).then(function (resultObject) {
  var html = resultObject.value.replace(/&lt;(img[^]+)&gt;/g, '<$1>')
  var newHtml = html.replace(/<img[^\\>]*?"(data:image\/x-wmf;.*?)"[^\\>]*?\/>/g, function (match, $1) {
      return transformWMF($1)
    })
console.log(newHtml)
})
function transformWMF(src) {
  var base64 = src.replace(/.*;base64,/, '')
  var rawData = window.atob(base64)
  const outputArray = new Uint8Array(rawData.length);
  for (let i = rawData.length - 1; i >= 0; --i) {
    outputArray[i] = rawData.charCodeAt(i);
  }
  var pNum = 0;  // number of the page, that you want to render
  var scale = 1;  // the scale of the document
  var wrt = new ToContext2D(pNum, scale);
  FromWMF.Parse(outputArray, wrt);
  var canvas = wrt.canvas
  var { width, height } = canvas
  var ctx = canvas.getContext('2d');
  var { data } = ctx.getImageData(0, 0, width, height)
  var len = data.length
  var row_len = width * 4
  var col_len = height
  var arr = []
  for (var i = 0; i < col_len; i++) {
    var per_arr = data.slice(i * row_len, (i + 1) * row_len)
    arr.push(per_arr)
  }
  var canvas2 = document.createElement('canvas');
  canvas2.width = width
  canvas2.height = height
  var ctx2 = canvas2.getContext('2d');
  var imageData = ctx2.createImageData(width, height)
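  var n = row_len * col_len
  var arr2 = new Uint8ClampedArray(n)
  var len = arr.length
  // Copy the rows in reverse order to flip the image vertically.
  // The target row is (len - 1 - i); using (len - i) would shift the image
  // down by one row and write the last row out of bounds.
  for (var i = len - 1; i >= 0; i--) {
    var curr_row = arr[i]
    for (var j = 0; j < curr_row.length; j++) {
      arr2[(len - 1 - i) * row_len + j] = curr_row[j]
    }
  }
  var imageData = new ImageData(arr2, width, height)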
  ctx2.putImageData(imageData, 0, 0)
  var dataurl = canvas2.toDataURL()
  var img = new Image()
  img.src = dataurl
  img.width = width
  img.height = height
  return img.outerHTML
}
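
The row-reversal part of transformWMF can also be written as a small standalone function. This is a minimal sketch, assuming the pixel data is a typed array of RGBA bytes such as the one returned by getImageData; flipVertically is an illustrative name.

```javascript
// Returns a new RGBA buffer with the pixel rows in reverse order.
function flipVertically(data, width, height) {
    const rowLength = width * 4; // 4 bytes (R, G, B, A) per pixel
    const flipped = new Uint8ClampedArray(data.length);
    for (let row = 0; row < height; row++) {
        // subarray creates a view, so no intermediate copies are made
        const source = data.subarray(row * rowLength, (row + 1) * rowLength);
        flipped.set(source, (height - 1 - row) * rowLength);
    }
    return flipped;
}

// A 1x2 image: the two pixels swap places.
const pixels = Uint8ClampedArray.of(1, 2, 3, 4, 5, 6, 7, 8);
console.log(Array.from(flipVertically(pixels, 1, 2))); // [5, 6, 7, 8, 1, 2, 3, 4]
```
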

@darobin @mwilliamson thank you very much

lkkkeith commented 4 years ago

@liyongleihf2006 Amazing!

By the way, do you have any issues parsing equations with alignment? The result looks like this: [image: parse_issue]

I think the is ignored in the library omml2mathml

brockfanning commented 3 years ago

@mwilliamson Is there any mechanism for funding of a particular feature? If so, I think I could arrange for support for this feature.

brockfanning commented 3 years ago

I've been giving this a try and have had partial success. The code already posted in this thread has been extremely helpful. I have a working solution for my own needs, but it would not be suitable for this library yet. In my case, I only want the conversion to MathML -- I did not need the conversion to an image, as described above. I've got some code that is working, but which unfortunately results in escaped MathML (i.e., with &lt; instead of <), which is the reason it is not suitable for this library yet. My application code needs to unescape the MathML after the conversion.

But in case it is helpful, here are the changes I'm using: https://github.com/mwilliamson/mammoth.js/compare/master...brockfanning:math-support

And here is the application code that unescapes the MathML afterwards:

let html = getMyMammothProducedHtml()
let regexp = /&lt;math(.*?)&lt;\/math&gt;/g
let matches = html.matchAll(regexp)
for (const match of matches) {
  let escaped = match[0]
  let unescaped = escaped.replace(/&lt;/g, '<').replace(/&gt;/g, '>')
  html = html.replace(escaped, unescaped)
}

The reason that my MathML output is escaped is that I'm using the text type. If anyone has ideas on how to avoid that, I would be happy to try something out and report back.
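
For anyone using the same workaround, the unescaping can also be done in a single pass with a replacement function. A function is used because a replacement string would have sequences such as "$&" interpreted specially by String.prototype.replace. This is a minimal sketch, assuming the MathML was escaped with the three standard entities (&amp;lt;, &amp;gt; and &amp;amp;); unescapeMathML is an illustrative name.

```javascript
// Unescapes only the escaped <math>...</math> fragments and leaves
// the rest of the HTML untouched.
function unescapeMathML(html) {
    return html.replace(/&lt;math[\s\S]*?&lt;\/math&gt;/g, (escaped) =>
        escaped
            .replace(/&lt;/g, "<")
            .replace(/&gt;/g, ">")
            // "&amp;" goes last so that "&amp;lt;" becomes "&lt;", not "<"
            .replace(/&amp;/g, "&")
    );
}

const escapedHtml = "<p>&lt;math&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;/math&gt; and 1 &lt; 2</p>";
console.log(unescapeMathML(escapedHtml));
```

Using [\s\S] instead of . lets the pattern match equations that span several lines.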

Inouyasha commented 2 years ago

I found that the previous method has a problem: subscript characters in equations are not shown. When I checked the code, I found that the HTML DOM lowercases tag names, for example m:sSub becomes m:ssub, and omml2mathml can't match the lowercased names. To solve this, I replaced the HTML DOM with an XML DOM, and the result looks good to me. The previous discussion was really helpful, so I'm sharing my code to save others' time.

By the way, I use the SVG directly rather than the img, since that makes it easy to change the text color. But remember that it must be transformed into safe HTML.

'm:oMathPara': function (element) {
      var xmlDOM = document.implementation.createDocument(null, null);
      var om = transform(element, xmlDOM);
      function transform(data) {
        var el = _transform(data);
        el.setAttribute('xmlns:wpc', 'http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas');
        el.setAttribute('xmlns:mo', 'http://schemas.microsoft.com/office/mac/office/2008/main');
        el.setAttribute('xmlns:mc', 'http://schemas.openxmlformats.org/markup-compatibility/2006');
        el.setAttribute('xmlns:mv', 'urn:schemas-microsoft-com:mac:vml');
        el.setAttribute('xmlns:o', 'urn:schemas-microsoft-com:office:office');
        el.setAttribute('xmlns:r', 'http://schemas.openxmlformats.org/officeDocument/2006/relationships');
        el.setAttribute('xmlns:m', 'http://schemas.openxmlformats.org/officeDocument/2006/math');
        el.setAttribute('xmlns:v', 'urn:schemas-microsoft-com:vml');
        el.setAttribute('xmlns:wp14', 'http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing');
        el.setAttribute('xmlns:wp', 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing');
        el.setAttribute('xmlns:w10', 'urn:schemas-microsoft-com:office:word');
        el.setAttribute('xmlns:w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
        el.setAttribute('xmlns:w14', 'http://schemas.microsoft.com/office/word/2010/wordml');
        el.setAttribute('xmlns:w15', 'http://schemas.microsoft.com/office/word/2012/wordml');
        el.setAttribute('xmlns:wpg', 'http://schemas.microsoft.com/office/word/2010/wordprocessingGroup');
        el.setAttribute('xmlns:wpi', 'http://schemas.microsoft.com/office/word/2010/wordprocessingInk');
        el.setAttribute('xmlns:wne', 'http://schemas.microsoft.com/office/word/2006/wordml');
        el.setAttribute('xmlns:wps', 'http://schemas.microsoft.com/office/word/2010/wordprocessingShape');
        return el;
        function _transform(data) {
          var name = data.name;
          if (!name) {
            return xmlDOM.createTextNode(data.value);
          }
          var tagName = name.replace(/\{.*\}/, 'm:');
          var el = xmlDOM.createElement(tagName);
          var children = data.children;
          if (children) {
            children.forEach(element => {
              el.appendChild(_transform(element));
            });
          }
          return el;
        }
      }
arnavmehta7 commented 2 years ago

Hey, can you tell me whether the same is available in Python?

congnguyentk commented 2 months ago

Hi all. I am using Mammoth for math, but when using vector math formulas, I get an error. Please help. Thanks.
