taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

xml support? #124

Closed paulcarroty closed 3 years ago

paulcarroty commented 3 years ago

Worth mentioning in the README. Tested on an Atom feed and it works fine.

milahu commented 3 years ago

fails to query docx, which uses namespaced tags like <w:t>hello</w:t>

const fs = require("fs");
const JSZip = require("jszip");
const { parse } = require('node-html-parser');

// Path to the .docx file, passed as the first CLI argument
const docxPath = process.argv[2];

async function main() {
  // A .docx file is a zip archive; the document body is word/document.xml
  const data = fs.readFileSync(docxPath);
  const zip = await JSZip.loadAsync(data);
  const xml = await zip.files["word/document.xml"].async("text");
  const doc = parse(xml);

  // Unescaped colon is read as the start of a pseudo-class:
  //console.dir(doc.querySelectorAll('w:t')); // Error: unmatched pseudo-class :t

  // Escaped colon is accepted by the selector parser, but nothing matches:
  console.dir(doc.querySelectorAll('w\\:t')); // == [] (empty result)
}

main();

alternatives: xml2js, ...

taoqf commented 3 years ago

I believe the exception is thrown because we cannot select a node whose tag name contains a colon (:), not because we cannot parse it.

milahu commented 3 years ago

yes, sorry ... it's a parser bug in https://github.com/fb55/css-what/issues/512

fb55 commented 3 years ago

Not a parser bug, but CSS requires the colon to be escaped here.
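To illustrate the escaping rule: in a CSS selector a bare colon starts a pseudo-class, so a tag name such as w:t has to be written as w\:t, and inside a JavaScript string literal the backslash itself must be doubled. A minimal sketch, assuming a hypothetical helper named escapeTagName (it is not part of css-what or node-html-parser):

```javascript
// Hypothetical helper: escape every colon in a namespaced tag name
// so the name can be used as a CSS type selector.
function escapeTagName(tagName) {
  return tagName.replace(/:/g, '\\:');
}

const selector = escapeTagName('w:t');
console.log(selector); // prints w\:t
```

The resulting string is what would be passed to querySelectorAll; writing the literal 'w\\:t' by hand is equivalent.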

milahu commented 3 years ago

CSS requires the colon to be escaped here

aah, thanks!

fixed my sample code, now css-what gives

{ rules: [ { type: 'tag', name: 'w:t', namespace: null } ] }

and querySelectorAll returns an empty array ...

new problem seems to be in node-html-parser: only a few xml tags are parsed, and the rest is parsed as a TextNode including </w:body></w:document></documentfragmentcontainer>

sample input docx, generated by libreoffice writer

```xml
hello
```

the textnode starts at `` `` are missing

bamadesigner commented 3 years ago

It also won't parse <link> elements in XML. It's not returning the value. I'm digging through the code now but I'm guessing the code is written to assume <link>s would never have innerText because in usual HTML they do not. I'm going to have to find another parser because this is a requirement for my project. But would love to use your faster library if/when it supports XML. Thanks for the great work!

nonara commented 3 years ago

It also won't parse <link> elements in XML. It's not returning the value.

I had a look into this. In the HTML5 spec, link is a void element (meaning it has no closing tag and cannot have content). Because we don't have an XML mode, this unfortunately can't be addressed, as it's beyond the scope of the library.
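To make the void-element point concrete, here is a toy model (not node-html-parser's actual code) of a stack-based parser. The function toyParse and its voidTags parameter are made up for this illustration; it only understands bare opening tags, closing tags, and text. With link in the void set, the text lands next to the link node instead of inside it; with an empty void set, as an XML mode would use, the text becomes the link's child.

```javascript
// Toy tokenizer for illustration only: handles <name>, </name> and text,
// with no attributes, comments, or self-closing syntax.
function toyParse(markup, voidTags) {
  const root = { tag: '#root', children: [] };
  const stack = [root];
  const tokenRe = /<(\/?)([\w:-]+)>|([^<]+)/g;
  let m;
  while ((m = tokenRe.exec(markup)) !== null) {
    const [, closing, name, text] = m;
    const parent = stack[stack.length - 1];
    if (text !== undefined) {
      parent.children.push({ text });
    } else if (closing) {
      // Stray closing tags (such as </link> after a void <link>) are ignored.
      if (stack.length > 1 && parent.tag === name) stack.pop();
    } else {
      const node = { tag: name, children: [] };
      parent.children.push(node);
      // A void element never becomes the parent of what follows it.
      if (!voidTags.has(name)) stack.push(node);
    }
  }
  return root;
}

const feed = '<item><link>https://example.com/</link></item>';

// HTML-like rules: link is void, so the URL becomes a sibling text node.
const htmlLike = toyParse(feed, new Set(['link']));
// XML-like rules: no void elements, so the URL is the link's own text.
const xmlLike = toyParse(feed, new Set());

console.log(htmlLike.children[0].children[0].children.length); // 0
console.log(xmlLike.children[0].children[0].children[0].text); // https://example.com/
```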

only a few xml tags are parsed, and the rest is parsed as a TextNode

@milahu I actually think you've run into the same issue as this: https://github.com/taoqf/node-html-parser/issues/156

I believe it matched w:pStyle as style and treated it as a block-text element. I'll try to get a fix out for this quickly.

A temporary workaround is to use the following config:

{ blockTextElements: { script: true, noscript: true } }
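For anyone unsure where that config goes: it is passed as the second argument to parse. A minimal sketch, assuming a hypothetical wrapper named parseXmlWithWorkaround that receives the library's parse function as a parameter, so the snippet itself has no dependencies:

```javascript
// Options from the workaround above: only script and noscript are listed
// as block-text elements; style is deliberately left out.
const workaroundOptions = {
  blockTextElements: { script: true, noscript: true },
};

// Hypothetical wrapper; `parse` is expected to be node-html-parser's parse.
function parseXmlWithWorkaround(parse, xml) {
  return parse(xml, workaroundOptions);
}

// Usage (requires node-html-parser to be installed):
//   const { parse } = require('node-html-parser');
//   const doc = parseXmlWithWorkaround(parse, xml);
//   console.dir(doc.querySelectorAll('w\\:t'));
```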

I'm going to go ahead and close this issue for housekeeping, but you can track the applicable bug here: