Closed paulcarroty closed 3 years ago
node-html-parser fails to query a docx document, which uses namespaced tags like <w:t>hello</w:t>
const fs = require("fs");
const JSZip = require("jszip");
const { parse } = require("node-html-parser");

const docxPath = process.argv[2];

async function main() {
  // A .docx file is a zip archive; the document body is in word/document.xml
  const data = fs.readFileSync(docxPath);
  const zip = await JSZip.loadAsync(data);
  const xml = await zip.files["word/document.xml"].async("text");
  const doc = parse(xml);
  // console.dir(doc.querySelectorAll('w:t')); // Error: unmatched pseudo-class :t
  console.dir(doc.querySelectorAll('w\\:t')); // == [] (empty result)
}

main();
alternatives: xml2js, ...
I believe the exception is thrown because we cannot select a node whose tag name contains a colon (:), not because we cannot parse it.
Yes, sorry ... it's a parser bug in https://github.com/fb55/css-what/issues/512
Not a parser bug, but CSS requires the colon to be escaped here.
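A minimal sketch of the escaping rule: in a CSS selector an unescaped colon starts a pseudo-class, so a tag name such as w:t has to be written as w\:t, and inside a JavaScript string literal that backslash must itself be doubled. The helper name escapeTagName is hypothetical and only handles the colon, not other special characters.

```javascript
// Hypothetical helper (not part of node-html-parser or css-what):
// escape colons so a namespaced tag name can be used as a CSS type selector.
function escapeTagName(tagName) {
  return tagName.replace(/:/g, "\\:");
}

const selector = escapeTagName("w:t");
console.log(selector); // prints w\:t
console.log(selector === "w\\:t"); // prints true
```

The resulting string is what would be passed to querySelectorAll in the sample code above.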
> CSS requires the colon to be escaped here
aah, thanks!
Fixed my sample code. Now css-what gives
{ rules: [ { type: 'tag', name: 'w:t', namespace: null } ] }
and querySelectorAll returns an empty array ...
The new problem seems to be in node-html-parser: only a few XML tags are parsed, and the rest is parsed as a TextNode, including </w:body></w:document></documentfragmentcontainer>
Sample input docx, generated by LibreOffice Writer
It also won't parse <link> elements in XML. It's not returning the value. I'm digging through the code now, but I'm guessing it is written to assume that <link> elements never have innerText, because in ordinary HTML they do not. I'm going to have to find another parser because this is a requirement for my project, but I would love to use your faster library if/when it supports XML. Thanks for the great work!
> It also won't parse <link> elements in XML. It's not returning the value.
I had a look into this. In the HTML5 spec, link is a void element (meaning it cannot have content or a closing tag). Because we don't have a mode for the XML spec, this unfortunately can't be addressed, as it's beyond the scope of the library.
> only a few xml tags are parsed, and the rest is parsed as a TextNode
@milahu I actually think you've run into the same issue as this: https://github.com/taoqf/node-html-parser/issues/156
I believe it matched w:pStyle as style and treated it as a block-text element. I'll try to get a fix out for this quickly.
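The suspected mismatch can be illustrated with a hypothetical sketch. This is not node-html-parser's actual code; the element list and both regular expressions are assumptions, chosen only to show how a loose, case-insensitive match on block-text element names could treat w:pStyle as style, while an exact comparison would not.

```javascript
// Hypothetical illustration only; not taken from node-html-parser.
// Assumed list of block-text element names.
const blockTextElements = ["script", "noscript", "style", "pre"];

// Loose check: the tag name merely has to END with a block-text name.
const loose = new RegExp(`(${blockTextElements.join("|")})$`, "i");

// Strict check: the whole tag name has to equal a block-text name.
const strict = new RegExp(`^(${blockTextElements.join("|")})$`, "i");

// Print each tag with the result of the loose and the strict check.
for (const tag of ["style", "w:pStyle", "w:t"]) {
  console.log(tag, loose.test(tag), strict.test(tag));
}
```

Under these assumptions, w:pStyle passes the loose check but fails the strict one, which is the kind of false positive the comment describes.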
A temporary workaround is to use the following config:
{ blockTextElements: { script: true, noscript: true } }
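A minimal sketch of how this workaround might be applied, assuming the options object is passed as the second argument to parse. The wrapper parseWithWorkaround and its injected parse parameter are hypothetical, used only so the snippet stays self-contained.

```javascript
// Workaround options from the comment above. Per the diagnosis in this
// thread, leaving out `style` should stop <w:pStyle> from being treated
// as a block-text element (an assumption, not verified here).
const workaroundOptions = {
  blockTextElements: { script: true, noscript: true },
};

// Hypothetical wrapper: `parse` is expected to be the function exported
// by node-html-parser, e.g. const { parse } = require("node-html-parser");
function parseWithWorkaround(parse, xml) {
  return parse(xml, workaroundOptions);
}
```

With the sample code from the top of the thread, this would be called as parseWithWorkaround(parse, xml) in place of parse(xml).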
I'm going to go ahead and close this issue for housekeeping, but you can track the applicable bug here:
Worth mentioning in the README. Tested on an Atom feed: working fine.