Open darobin opened 8 years ago
Thanks! To the best of my recollection, I don't think anybody's asked for support, so not really a priority.
As a couple of notes should anybody want to take a look:
I had a quick look at this but couldn't work out how to convert the Mammoth XML element to a DOM node (and back).
There's also an issue for MathJax support of OMML as an input format, though I'm not sure how it would be represented in the HTML.
There's a related request for equation support in python-mammoth.
@hubgit It would be pretty straightforward to walk the returned HTMLElement and produce an equivalent Element, if that's what's needed. I'm not sure what Mammoth would best use. Alternatively it would be possible to add support for omml2mathml for XML output, but that would either require binding it to a specific XML DOM implementation, or it would need to be configurable with implementations that in turn would need to be interoperable enough to be usable (which isn't guaranteed at all in the JS world).
For MathJax it should be completely straightforward to implement an input jax by using omml2mathml and then piping that to the MathML input jax, but that might feel a bit indirect. I am not super familiar with MathJax's internals; it might be easy to port omml2mathml to address the internal API directly.
As for Python, I guess the best option is a direct port. That shouldn't take more than a day's work. (DOMs in Python tend to be quite painful, but it's not like our usage is highly complex.)
Any news on this? I may take a stab at it.
@jmealo Please do, I haven't tried yet.
For the record, it would be quite useful for our purposes (education) if Mammoth could convert equations to MathML or LaTeX.
Any update on this?
Is this supported now?
It would help if you could implement an option that lets Mammoth mark the position of any elements it doesn't understand yet, such as equations. This would allow us to trace insertion points in the converted HTML and do the conversion externally.
I've just completed that for OMML - converting it via a PowerShell script to MathML. All that I need now for a successful fully automated solution is to know where which equation sits within the HTML. Presently Mammoth is not capable of giving that clue.
How about:
var options = {
markUnknownElementsWithClass: "unknown"
};
which could give me in the HTML
<p class="unknown"></p>
We can then easily operate on this element or filter it out.
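To make the proposal concrete, here is a rough sketch of the external post-processing step it would enable. It assumes the hypothetical `markUnknownElementsWithClass` option proposed above (as far as this thread indicates, Mammoth has no such option) and that placeholders appear in document order; `fillPlaceholders` and its parameters are illustrative names only.

```javascript
// Hypothetical: assumes each skipped element was emitted as <p class="unknown"></p>,
// in document order. The option producing these placeholders is only a proposal.
function fillPlaceholders(html, replacements, className = "unknown") {
    // className is assumed to be a plain identifier, so it is not regex-escaped here.
    const pattern = new RegExp(`<p class="${className}"></p>`, "g");
    let index = 0;
    return html.replace(pattern, (placeholder) => {
        const replacement = replacements[index];
        index += 1;
        // Leave the placeholder in place if there is no matching replacement.
        return replacement === undefined ? placeholder : replacement;
    });
}

const converted = fillPlaceholders(
    '<p>Before</p><p class="unknown"></p><p>After</p>',
    ["<math><mi>x</mi></math>"]
);
console.log(converted);
```

The replacements array would come from whatever external OMML converter is used, one entry per equation in the order they occur in the document.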
> I had a quick look at this but couldn't work out how to convert the Mammoth XML element to a DOM node (and back).
Hi. I am currently working for a publisher to generate HTML from docx files that contain a lot of math equations. I managed to convert the "element" to MathML using omml2mathml, but elementResult() is returning null. Any idea how the HTML is generated from Mammoth?
I added the below code to the object xmlElementReaders in lib/docx/body-reader.js
`"m:oMath": function (element) {
function getXML(node) {
let attrString = "";
if (node.value) {
return node.value;
}
if (!node.children) {
return "";
};
// Add attributes to the node if necessary
let attrs = node.attributes;
// Add only if attributes in not an empty object
if (attrs && Object.entries(attrs).length){
Object.entries(attrs).forEach( ([key, value]) =>{
attrString += ` ${key}="${value}"`
})
}
if (node.children.length > 0){
let content = ""
for (const child of node.children) {
content += getXML(child)
}
return `<${node.name}${attrString}>${content}</${node.name}>`;
}
else{
return `<${node.name}${attrString}/>`;
}
}
// Parse the element to XML string
let xml = getXML(element);
// Add namespace to change it to OMML for parsing
xml =
`<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`+
`<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">`+
`${xml}`+
`</w:document>`
// Parse the XML document to DOM
var DOMParser = require('xmldom').DOMParser;
var xmlDoc = new DOMParser().parseFromString(xml, 'text/xml');
// Parse the XML Doc with OMML to MathML
let mathmlElement = omml2mathml(xmlDoc);
// Add proper namespace to display in browser
mathmlElement.removeAttribute("display")
mathmlElement.setAttribute("xmlns", "http://www.w3.org/1998/Math/MathML")
return elementResult(mathmlElement);
},`
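One thing the getXML helper above does not do is escape text and attribute values, so an equation containing `<` or `&` would produce XML that fails to parse. Below is a minimal sketch of the same serialization with escaping added. The node shape (`name`, `attributes`, `children`, and `value` for text) is assumed from the snippet above rather than taken from documented Mammoth API, and `escapeXml` and `serializeNode` are illustrative names.

```javascript
// Escape characters that are not allowed verbatim in XML text or attribute values.
function escapeXml(value) {
    return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Assumed node shape (inferred from the snippet above):
//   element: { name, attributes, children }   text: { value }
function serializeNode(node) {
    if (typeof node.value === "string") {
        return escapeXml(node.value);
    }
    const attributes = Object.entries(node.attributes || {})
        .map(([key, value]) => ` ${key}="${escapeXml(value)}"`)
        .join("");
    const children = (node.children || []).map(serializeNode).join("");
    return children.length > 0
        ? `<${node.name}${attributes}>${children}</${node.name}>`
        : `<${node.name}${attributes}/>`;
}

const sample = {
    name: "m:r",
    attributes: {},
    children: [
        { name: "m:t", attributes: { "xml:space": "preserve" }, children: [{ value: "a < b" }] }
    ]
};
console.log(serializeNode(sample));
// prints: <m:r><m:t xml:space="preserve">a &lt; b</m:t></m:r>
```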
@lkkkeith Your idea inspired me, and I implemented it along those lines. Let me share my approach.
Three libraries need to be installed first:
npm install --save mathjax omml2mathml xmldom
Then require them in node_modules\mammoth\lib\docx\body-reader.js:
var omml2mathml = require('omml2mathml');
var xmldom = require('xmldom');
require('mathjax/es5/mml-svg')
Then change your code above to:
"oMath": function (element) {
var om = transform(element)
function transform(data) {
var el = _transform(data)
el.setAttribute('xmlns:wpc', 'http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas')
el.setAttribute('xmlns:mo', 'http://schemas.microsoft.com/office/mac/office/2008/main')
el.setAttribute('xmlns:mc', 'http://schemas.openxmlformats.org/markup-compatibility/2006')
el.setAttribute('xmlns:mv', 'urn:schemas-microsoft-com:mac:vml')
el.setAttribute('xmlns:o', 'urn:schemas-microsoft-com:office:office')
el.setAttribute('xmlns:r', 'http://schemas.openxmlformats.org/officeDocument/2006/relationships')
el.setAttribute('xmlns:m', 'http://schemas.openxmlformats.org/officeDocument/2006/math')
el.setAttribute('xmlns:v', 'urn:schemas-microsoft-com:vml')
el.setAttribute('xmlns:wp14', 'http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing')
el.setAttribute('xmlns:wp', 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing')
el.setAttribute('xmlns:w10', 'urn:schemas-microsoft-com:office:word')
el.setAttribute('xmlns:w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')
el.setAttribute('xmlns:w14', 'http://schemas.microsoft.com/office/word/2010/wordml')
el.setAttribute('xmlns:w15', 'http://schemas.microsoft.com/office/word/2012/wordml')
el.setAttribute('xmlns:wpg', 'http://schemas.microsoft.com/office/word/2010/wordprocessingGroup')
el.setAttribute('xmlns:wpi', 'http://schemas.microsoft.com/office/word/2010/wordprocessingInk')
el.setAttribute('xmlns:wne', 'http://schemas.microsoft.com/office/word/2006/wordml')
el.setAttribute('xmlns:wps', 'http://schemas.microsoft.com/office/word/2010/wordprocessingShape')
return el
function _transform(data) {
var name = data.name
if (!name) {
return document.createTextNode(data.value)
}
var tagName = name.replace(/\{.*\}/, 'm:')
var el = document.createElement(tagName)
var children = data.children
if (children) {
children.forEach(element => {
el.appendChild(_transform(element))
});
}
return el
}
}
var doc = new xmldom.DOMParser().parseFromString(om.outerHTML)
var math = omml2mathml(doc)
var abc = MathJax.mathml2svg(math.outerHTML)
abc = abc.outerHTML
var svg = abc.match(/\<svg.*\<\/svg\>/)[0]
var math2 = abc.match(/\<math.*\<\/math\>/)[0].replace(/\sdisplay=".+?"/, '').replace(/\</g, "«").replace(/\>/g, "»").replace(/"/g, "¨")
var style = ''
svg.replace(/style="(.+?)"\>/, function (match, $1) {
style = $1
})
svg = "data:image/svg+xml,"+window.encodeURIComponent(svg)
var img = `<img align="middle" class="Wirisformula" src="${svg}" data-mathml="${math2}" alt="1 half" role="math" style="${style}">`
//
// Then update the document to include the adjusted CSS for the
// content of the new equation.
//
MathJax.startup.document.clear();
MathJax.startup.document.updateDocument();
return elementResult(new documents.Text(img));
},
The readXmlElement function needs to be changed to the following:
function readXmlElement(element) {
if (element.type === "element") {
var handler = xmlElementReaders[element.name];
if (handler) {
return handler(element);
}else if(/math/.test(element.name)){
return xmlElementReaders.oMath(element);
} else if (!Object.prototype.hasOwnProperty.call(ignoreElements, element.name)) {
var message = warning("An unrecognised element was ignored: " + element.name);
return emptyResultWithMessages([message]);
}
}
return emptyResult();
}
In the final result, the generated img tags are escaped, so "&lt;" and "&gt;" need to be replaced with "<" and ">":
mammoth.convertToHtml({ arrayBuffer: arrayBuffer }).then(function (resultObject) {
var html = resultObject.value.replace(/&lt;(img[^]+?)&gt;/g, '<$1>')
console.log(html)
})
If the formula was inserted from WPS, the situation is different. WPS inserts formulas as pictures in x-wmf format, which browsers do not support, so they need to be converted to PNG after the previous conversion. You need the UDOC library first; then include it:
<script src="./UDOC.js"></script>
<script src="./FromWMF.js"></script>
<script src="./ToContext2D.js"></script>
Finally, change the conversion call to:
mammoth.convertToHtml({ arrayBuffer: arrayBuffer }).then(function (resultObject) {
var html = resultObject.value.replace(/&lt;(img[^]+?)&gt;/g, '<$1>')
var newHtml = html.replace(/<img[^\\>]*?"(data:image\/x-wmf;.*?)"[^\\>]*?\/>/g, function (match, $1) {
return transformWMF($1)
})
console.log(newHtml)
})
function transformWMF(src) {
var base64 = src.replace(/.*;base64,/, '')
var rawData = window.atob(base64)
const outputArray = new Uint8Array(rawData.length);
for (let i = rawData.length - 1; i >= 0; --i) {
outputArray[i] = rawData.charCodeAt(i);
}
var pNum = 0; // number of the page, that you want to render
var scale = 1; // the scale of the document
var wrt = new ToContext2D(pNum, scale);
FromWMF.Parse(outputArray, wrt);
var canvas = wrt.canvas
var { width, height } = canvas
var ctx = canvas.getContext('2d');
var { data } = ctx.getImageData(0, 0, width, height)
var len = data.length
var row_len = width * 4
var col_len = height
var arr = []
for (var i = 0; i < col_len; i++) {
var per_arr = data.slice(i * row_len, (i + 1) * row_len)
arr.push(per_arr)
}
var canvas2 = document.createElement('canvas');
canvas2.width = width
canvas2.height = height
var ctx2 = canvas2.getContext('2d');
var imageData = ctx2.createImageData(width, height)
var n = row_len * col_len
var arr2 = new Uint8ClampedArray(n)
var curr_row = 0;
var len = arr.length
for (var i = len - 1; i >= 0; i--) {
var curr_row = arr[i]
for (var j = 0; j < curr_row.length; j++) {
arr2[(len - 1 - i) * row_len + j] = curr_row[j]
}
}
var imageData = new ImageData(arr2, width, height)
ctx2.putImageData(imageData, 0, 0)
var dataurl = canvas2.toDataURL()
var img = new Image()
img.src = dataurl
img.width = width
img.height = height
return img.outerHTML
}
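The row-reversal loops in transformWMF are easy to get wrong by one row, so it may help to isolate that step as a small pure function. This is only a sketch, assuming the pixel data is RGBA (4 bytes per pixel) in a typed array as returned by getImageData; `flipRows` is an illustrative name, not part of UDOC or Mammoth.

```javascript
// Reverse the row order of an RGBA pixel buffer (4 bytes per pixel).
function flipRows(data, width, height) {
    const rowLength = width * 4;
    const flipped = new Uint8ClampedArray(data.length);
    for (let row = 0; row < height; row++) {
        const source = data.subarray(row * rowLength, (row + 1) * rowLength);
        // Row 0 goes to the last row; row height - 1 goes to row 0.
        flipped.set(source, (height - 1 - row) * rowLength);
    }
    return flipped;
}

// A 1x2 image: the two pixels swap places.
const pixels = new Uint8ClampedArray([1, 2, 3, 4, 5, 6, 7, 8]);
console.log(Array.from(flipRows(pixels, 1, 2)));
// prints: [ 5, 6, 7, 8, 1, 2, 3, 4 ]
```

The result could then be wrapped with `new ImageData(flipped, width, height)` in the browser, as the code above does with arr2.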
@darobin @mwilliamson thank you very much
@liyongleihf2006 Amazing!
By the way, do you have any issues parsing equations with alignment? The result is like this:
I think the
@mwilliamson Is there any mechanism for funding of a particular feature? If so, I think I could arrange for support for this feature.
I've been giving this a try and have had partial success. The code already posted in this thread has been extremely helpful. I have a working solution for my own needs, but it would not be suitable for this library yet. In my case, I only want the conversion to MathML -- I did not need the conversion to an image, as described above. I've got some code that is working, but which unfortunately results in escaped MathML (i.e., with &lt; instead of <), which is the reason it is not suitable for this library yet. My application code needs to unescape the MathML after the conversion.
But in case it is helpful, here are the changes I'm using: https://github.com/mwilliamson/mammoth.js/compare/master...brockfanning:math-support
And here is the application code that unescapes the MathML afterwards:
let html = getMyMammothProducedHtml()
// Match each escaped <math>...</math> span
let regexp = /&lt;math(.*?)&lt;\/math&gt;/g
let matches = html.matchAll(regexp)
for (const match of matches) {
  let escaped = match[0]
  let unescaped = escaped.replace(/&lt;/g, '<').replace(/&gt;/g, '>')
  html = html.replace(escaped, unescaped)
}
The reason that my MathML output is escaped is that I'm using the text type. If anyone has ideas on how to avoid that, I would be happy to try something out and report back.
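As a possible refinement of the unescaping loop above, the same thing can be done in a single pass with a replacer callback, which also handles MathML spanning multiple lines and decodes `&amp;`. This is a sketch under the assumption that the text output escapes `&`, `<` and `>` (so a literal `&lt;` inside the MathML arrives as `&amp;lt;`); `unescapeMathml` is an illustrative name.

```javascript
// Assumption: the text writer escaped "&", "<" and ">" (so "&" arrives as "&amp;").
function unescapeMathml(html) {
    return html.replace(/&lt;math[\s\S]*?&lt;\/math&gt;/g, (escaped) =>
        escaped
            .replace(/&lt;/g, "<")
            .replace(/&gt;/g, ">")
            // Decode "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
            .replace(/&amp;/g, "&")
    );
}

const input =
    "<p>a &lt; b</p>" +
    "<p>&lt;math&gt;&lt;mo&gt;&amp;lt;&lt;/mo&gt;&lt;/math&gt;</p>";
console.log(unescapeMathml(input));
// prints: <p>a &lt; b</p><p><math><mo>&lt;</mo></math></p>
```

Escaped text outside the math spans is left untouched, which matters when the document itself contains a literal `<`.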
I found that the previous method has a problem: subscript characters in equations are not shown. When I checked the code, I found that the HTML DOM lowercases tag names, for example m:sSub becomes m:ssub, and omml2mathml cannot match the lowercased names. To solve this, I replaced the HTML DOM with an XML DOM, and the result looks good to me. The previous discussion was really helpful, so I am sharing my code to save others' time.
By the way, I use the SVG directly rather than the img, since that makes it easy to change the text color. But remember that it should be converted to safe HTML.
'm:oMathPara': function (element) {
var xmlDOM = document.implementation.createDocument(null, null);
var om = transform(element, xmlDOM);
function transform(data) {
var el = _transform(data);
el.setAttribute('xmlns:wpc', 'http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas');
el.setAttribute('xmlns:mo', 'http://schemas.microsoft.com/office/mac/office/2008/main');
el.setAttribute('xmlns:mc', 'http://schemas.openxmlformats.org/markup-compatibility/2006');
el.setAttribute('xmlns:mv', 'urn:schemas-microsoft-com:mac:vml');
el.setAttribute('xmlns:o', 'urn:schemas-microsoft-com:office:office');
el.setAttribute('xmlns:r', 'http://schemas.openxmlformats.org/officeDocument/2006/relationships');
el.setAttribute('xmlns:m', 'http://schemas.openxmlformats.org/officeDocument/2006/math');
el.setAttribute('xmlns:v', 'urn:schemas-microsoft-com:vml');
el.setAttribute('xmlns:wp14', 'http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing');
el.setAttribute('xmlns:wp', 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing');
el.setAttribute('xmlns:w10', 'urn:schemas-microsoft-com:office:word');
el.setAttribute('xmlns:w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
el.setAttribute('xmlns:w14', 'http://schemas.microsoft.com/office/word/2010/wordml');
el.setAttribute('xmlns:w15', 'http://schemas.microsoft.com/office/word/2012/wordml');
el.setAttribute('xmlns:wpg', 'http://schemas.microsoft.com/office/word/2010/wordprocessingGroup');
el.setAttribute('xmlns:wpi', 'http://schemas.microsoft.com/office/word/2010/wordprocessingInk');
el.setAttribute('xmlns:wne', 'http://schemas.microsoft.com/office/word/2006/wordml');
el.setAttribute('xmlns:wps', 'http://schemas.microsoft.com/office/word/2010/wordprocessingShape');
return el;
function _transform(data) {
var name = data.name;
if (!name) {
return xmlDOM.createTextNode(data.value);
}
var tagName = name.replace(/\{.*\}/, 'm:');
var el = xmlDOM.createElement(tagName);
var children = data.children;
if (children) {
children.forEach(element => {
el.appendChild(_transform(element));
});
}
return el;
}
}
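The `name.replace(/\{.*\}/, 'm:')` step above rewrites any namespace to the `m:` prefix. A slightly stricter sketch of that name mapping is shown below; it assumes, based only on the regex used in this thread, that unmapped names arrive in the form `{namespace-uri}localName`. `toPrefixedName` and `MATH_NAMESPACE` are illustrative names.

```javascript
const MATH_NAMESPACE = "http://schemas.openxmlformats.org/officeDocument/2006/math";

// Assumption: names outside the known prefixes look like "{namespace-uri}localName".
function toPrefixedName(name, prefixes = { [MATH_NAMESPACE]: "m" }) {
    const match = /^\{([^}]*)\}(.+)$/.exec(name);
    if (!match) {
        return name; // already prefixed, or no namespace at all
    }
    const prefix = prefixes[match[1]];
    // Case is preserved: "sSub" must not become "ssub".
    return prefix ? `${prefix}:${match[2]}` : match[2];
}

console.log(toPrefixedName(`{${MATH_NAMESPACE}}sSub`)); // prints: m:sSub
console.log(toPrefixedName("w:p"));                     // prints: w:p
```

Passing the result to an XML document's createElement, as the code above does, keeps the mixed-case names that omml2mathml matches against.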
Hey, can you tell me whether the same is available in Python?
Hi all. I am using Mammoth for math, but when using vector math formulas, I get an error. Please help. Thanks.
I don't know if you're planning to support math ever, but just in case I thought I'd give you a heads up that I have a JS converter for the Office math markup to MathML, so it could just be reused in Mammoth: omml2mathml.
I don't have time to dig into the Mammoth code to integrate this (in case you're interested) but I'm happy to help if someone takes it on.