mgdm / htmlq

Like jq, but for HTML.
MIT License
7.09k stars, 111 forks

Does htmlq support very large XML? #30

Closed she3o closed 2 years ago

she3o commented 2 years ago

Hi

I downloaded htmlq to process a large XML database (1.4GB, link) before data analysis.

When I run cat 'full database' | htmlq 'drug', the command runs for 10 seconds before htmlq runs out of memory.

Is that behaviour expected or is this a memory bug?

ralyodio commented 2 years ago

I'm wondering the same. I need to parse XML as well as HTML.

mgdm commented 2 years ago

I don't really expect this tool to behave all that well with XML input of that size. It specifically uses an HTML5 parser to implement all of the rules in that spec, which have diverged quite a bit from those of XML. Also, the CSS selector syntax is pretty closely tied to HTML rather than XML. I'd recommend using something like XMLStarlet for that purpose (it's what I use in those cases).
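
For anyone who lands here with a similar file: a parser that builds the whole document tree in memory will likely struggle with a 1.4 GB input, whatever tool sits in front of it. One common workaround is to stream the file and handle one element at a time. Below is a minimal sketch using only Python's standard library. It is not part of htmlq or XMLStarlet; `iter_elements` is a hypothetical helper name, and the sketch assumes the matching elements (here `drug`) are not nested inside one another.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_elements(source, tag):
    """Yield every element whose local name is `tag`, one at a time.

    `source` is a filename or a binary file object. Assumption: matching
    elements are not nested inside each other, so everything parsed so far
    can be discarded after each match.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == tag:
            yield elem
            # Drop everything attached to the root so far; this keeps memory
            # use roughly proportional to one element, not the whole file.
            root.clear()


# Small in-memory demonstration; a real run would pass a file path or
# sys.stdin.buffer instead of this sample document.
sample = io.BytesIO(
    b"<db xmlns='http://example.com/ns'>"
    b"<drug><name>A</name></drug>"
    b"<other/>"
    b"<drug><name>B</name></drug>"
    b"</db>"
)
for drug in iter_elements(sample, "drug"):
    print(drug[0].text)
```

Each yielded element has to be used before the loop advances, because it is cleared as soon as the generator resumes. Matching on the local name is a deliberate simplification so that a default namespace on the root element does not get in the way.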