kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

Parsed result as an HTML tree #23

Closed. whalebot-helmsman closed 3 years ago

whalebot-helmsman commented 3 years ago

This is less an issue than a question. What is the reason for using lxml.etree as the output structure instead of lxml.html? For example, clean_html is one function, off the top of my head, that isn't supported by etree trees.

kovidgoyal commented 3 years ago

lxml.html and lxml.etree both use the same underlying data structure, the Element class. The trees generated by both are the same, so you can use clean_html on etree-based trees as well; just be sure not to namespace the elements.

whalebot-helmsman commented 3 years ago

Sorry, it isn't working

python -c 'from html5_parser import parse;from lxml.html.clean import clean_html;h=parse("<html></html>");clean_html(h)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "src/lxml/html/clean.py", line 558, in lxml.html.clean.Cleaner.clean_html
  File "src/lxml/html/clean.py", line 305, in lxml.html.clean.Cleaner.__call__
AttributeError: 'lxml.etree._Element' object has no attribute 'rewrite_links'

kovidgoyal commented 3 years ago

Ah well, lxml.html must be using a subclass of Element in that case. In which case you won't be able to get fast parsing with it anyway, so there is not much point. html5-parser gets its speed by doing the tree construction in C, not Python. You can use the treebuilder argument to parse to build various different tree types instead, and I suppose one could add the lxml.html one there, but it's not something I care about personally. Patches are welcome.
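
For anyone landing here with the same traceback, one possible workaround is to serialise the etree tree and re-parse it with lxml.html, so that the nodes become HtmlElement instances with methods such as rewrite_links. This is only a sketch: to_html_tree is a hypothetical helper name, not part of html5-parser or lxml, and the plain tree below is built with etree.fromstring as a stand-in for the result of html5_parser.parse. Re-parsing goes through lxml's own HTML parser, so it gives up the speed advantage described above.

```python
from lxml import etree, html


def to_html_tree(root):
    """Hypothetical helper: re-parse an lxml.etree tree with lxml.html.

    Slow path: the tree is serialised to a string and parsed again,
    so every node comes back as an lxml.html.HtmlElement.
    """
    markup = etree.tostring(root, encoding="unicode", method="html")
    return html.fromstring(markup)


# Stand-in for the tree returned by html5_parser.parse(): plain etree elements.
plain = etree.fromstring("<html><body><a href='/x'>link</a></body></html>")
converted = to_html_tree(plain)

# Plain etree elements lack the lxml.html helper methods; HtmlElement has them.
print(hasattr(plain, "rewrite_links"))
print(hasattr(converted, "rewrite_links"))
```

Whether the extra serialise-and-parse step is acceptable depends on how large the documents are; for small pages it is probably fine, for bulk parsing it likely defeats the purpose of using html5-parser.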