Open papirosko opened 11 months ago
Actually, it finished parsing in 7 minutes on my MacBook Pro with an i7. Is that considered correct?
As a workaround I use this (I mostly need data only from ):
const root = parse(html, {
  parseNoneClosedTags: false,
  fixNestedATags: false,
  blockTextElements: {
    'div': true,
    'p': true,
    'pre': true
  }
});
const title = option(root.querySelector('title'))
  .map(x => x.text)
  .filter(x => !!x && x.trim().length > 0);
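If only the title is needed, one possible shortcut is to shrink the input before it ever reaches the parser. The sketch below is an assumption on my side, not something node-html-parser provides: `sliceHead` is a hypothetical helper that keeps everything up to the first closing head tag and drops the rest of the page.

```typescript
// Hypothetical helper (not part of node-html-parser): returns the document
// up to and including the first closing </head> tag, or the whole input
// when no such tag is found.
function sliceHead(html: string): string {
  const match = /<\/head\s*>/i.exec(html);
  return match ? html.slice(0, match.index + match[0].length) : html;
}
```

The result could then be passed to parse in place of the full html. Note that a literal closing head tag inside a script string would cut the document early, so this only suits pages where the head is well formed.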
I'm sorry, I could not find any clue about your use case. I could not even find a title element. I don't have a MacBook either. But I parsed the file you uploaded and it finished parsing immediately.
Using:
...
"dependencies": {
  "axios": "^1.6.2",
  "node-html-parser": "^6.1.10"
}
...
This is how to reproduce it (using default options in parse):
import parse from 'node-html-parser';
import axios from 'axios';

async function runImpl() {
  const url = 'https://www.a1supplements.com/';
  const resp = await axios.get(url);
  const html = resp.data;
  const start = Date.now();
  parse(html);
  const duration = Date.now() - start;
  console.log(`Parsing took: ${duration.toLocaleString()}ms, document size: ${html.length.toLocaleString()} chars`);
}

runImpl().then(() => console.log('done'));
result:
Parsing took: 383,073ms, document size: 4,705,196 chars
done
Using custom options in parse:
...
const start = Date.now();
parse(html, {
  parseNoneClosedTags: false,
  fixNestedATags: false,
  blockTextElements: {
    'div': true,
    'p': true,
    'pre': true,
    script: true,
    noscript: true,
    style: true,
  }
});
const duration = Date.now() - start;
...
results:
Parsing took: 284ms, document size: 4,705,196 chars
done
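To compare the two runs side by side without copying the timing code, a small wrapper along these lines might help. It is only a sketch: `timeParse`, `formatTiming`, `ParseFn` and `Timing` are hypothetical names, not part of node-html-parser, and the wrapper simply times a single call of whatever parser-like function it is given.

```typescript
// Hypothetical helpers (not part of node-html-parser): time one call of any
// parser-like function so different option sets can be compared.
type ParseFn<O> = (html: string, options?: O) => unknown;

interface Timing {
  label: string;
  ms: number;
  chars: number;
}

function timeParse<O>(label: string, parseFn: ParseFn<O>, html: string, options?: O): Timing {
  const start = Date.now();
  parseFn(html, options); // the parsed result is discarded; only the duration matters
  return { label, ms: Date.now() - start, chars: html.length };
}

// Formats a timing in the same shape as the log lines in this thread.
function formatTiming(t: Timing): string {
  return `${t.label}: ${t.ms.toLocaleString('en-US')}ms, document size: ${t.chars.toLocaleString('en-US')} chars`;
}
```

With parse imported as in the reproduction, something like `console.log(formatTiming(timeParse('default', parse, html)))` followed by the same call with the custom options object as the fourth argument should print one line per configuration. A single run is a rough measurement, so repeating it a few times would give a steadier picture.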
I believe that blockTextElements -> div generally fixes the issue.
I don't think we should block div elements. Maybe the HTML is broken; the option parseNoneClosedTags: true will speed it up.
parseNoneClosedTags: true
Not sure if it was related, but I had a big HTML page and adding this fixed it.
Curious what this means, @taoqf. I couldn't find any documentation or issues around it.
Simple code hangs, causing kube to kill the pod:
I use:
host: https://www.a1supplements.com/ html size: 4620497 symbols
I have these from node inspect:
The contents (in case the website will be updated):
a1supplements.com.html.txt