Open davaya opened 4 years ago
https://html5lib.readthedocs.io/en/latest/ has docs, though it doesn't answer the why. https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.html5parser.HTMLParser specifically mentions the namespaceHTMLElements argument.
The why is quite simple: it's what the HTML spec requires (almost all elements parsed from HTML are placed in the HTML namespace). You can see that browsers do this by evaluating document.documentElement.namespaceURI on this page.
Thanks - creating a parser object explicitly with the proper args gives the desired results. New users reading the overview might benefit from some rationale (pros and cons) for choosing a particular tree type, and a note pointing out that namespacing is one of the options that can be specified.
It could be argued that new users might not be aware of namespacing and should not get it by default, while those who do need it would know enough to opt in.
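For readers who land here with the same question, a minimal sketch of the difference, assuming the default etree tree builder. The sample markup is made up for illustration; html5lib.parse and the namespaceHTMLElements argument are the ones mentioned in the linked docs.

```python
import html5lib

doc = "<h1>Moonlight</h1>"  # illustrative input, not from the docs

# Default behaviour: elements are placed in the XHTML namespace,
# so ElementTree reports tags in '{namespace}name' form.
namespaced = html5lib.parse(doc)
print(namespaced.tag)

# Opting out: build the parser explicitly and disable namespacing.
parser = html5lib.HTMLParser(namespaceHTMLElements=False)
plain = parser.parse(doc)
print(plain.tag)
print(plain.find(".//h1").text)
```

With namespacing disabled, plain paths such as './/h1' match; with the default, the same path finds nothing because the real tag name includes the namespace.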
I think a basic usage example would be helpful.
Example: parse HTML, replace the innerHTML of all <a> links with "super foo", and then write it out again.
Up to now the docs are just about parsing.
Please add an example of how to process the parsed data.
Something like this would be nice to have in the docs:
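from xml.etree import ElementTree
from html5lib import HTMLParser

# Disable namespacing so tags are plain names such as 'h1'.
parser = HTMLParser(namespaceHTMLElements=False)
tree = parser.parse('''
foo
<h1>Moonlight</h1>
bar''')
# Replace the text of every <h1> element.
for e in tree.findall('.//h1'):
    e.text = 'Sunshine'
# Serialize the modified tree.
print(ElementTree.tostring(tree))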
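The link-rewriting task requested earlier in the thread could look roughly like this. It is a sketch, not an official recipe: the input markup is invented, and it assumes the default etree tree builder with namespacing turned off. ElementTree has no innerHTML, so the closest equivalent is to remove an element's children and set its text.

```python
from xml.etree import ElementTree
from html5lib import HTMLParser

parser = HTMLParser(namespaceHTMLElements=False)
# Illustrative input only.
tree = parser.parse('<p>See <a href="/x">the <b>old</b> text</a> here.</p>')

for link in tree.iter('a'):
    # Remove child elements (and their tail text), then set new text.
    for child in list(link):
        link.remove(child)
    link.text = 'super foo'

# Write the modified tree back out as HTML text.
html = ElementTree.tostring(tree, encoding='unicode', method='html')
print(html)
```

Text that follows the closing </a> tag is stored as the link's tail, so it is left untouched by this loop.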
Is there any tutorial documentation for this package? Something that would answer questions such as the following, which come up when parsing an HTML file:
Why does the result look like this?
A naive user would expect tag to return the literal value contained in the HTML element, not the tag prefixed with a qualifier of some sort. It would be helpful to have a document that explains why the prefix has been injected and how to configure the library to return unadorned tags.
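To make the prefix concrete, here is a small sketch using only the standard library. The XHTML string is a stand-in for a namespaced tree of the kind described above, and local_name is a hypothetical helper, not part of html5lib or ElementTree. It shows that the qualifier is the element's namespace in ElementTree's '{namespace}name' notation, and how such a tree can be queried without disabling namespacing.

```python
from xml.etree import ElementTree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for a namespaced tree; built here from an XHTML string.
root = ElementTree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><h1>Moonlight</h1></body></html>'
)
print(root.tag)  # the tag carries the namespace as a '{...}' qualifier

def local_name(tag):
    """Hypothetical helper: drop a leading '{namespace}' qualifier, if any."""
    return tag.rpartition("}")[2]

# Map a prefix of our choosing to the namespace when searching.
headings = root.findall(".//h:h1", {"h": XHTML_NS})
print([local_name(h.tag) for h in headings])
```

The prefix 'h' in the search path is arbitrary; only the namespace URI it maps to has to match.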