html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

User documentation? #489

Open davaya opened 4 years ago

davaya commented 4 years ago

Is there any tutorial documentation for this package? Something that would answer questions such as this one, which comes up when parsing an HTML file:

import html5lib

with open(fname, 'rb') as f:
    doc = html5lib.parse(f)
# doc is the <html> element, so doc[0] is <head>
for e in doc[0]:
    print(e.tag)

Why does the result look like this?

{http://www.w3.org/1999/xhtml}meta
{http://www.w3.org/1999/xhtml}title
{http://www.w3.org/1999/xhtml}link

A naive user would expect tag to return the literal value contained in the HTML element, not the tag prefixed with a qualifier of some sort. It would be helpful to have a document that explains why the prefix has been injected and how to configure the library to return unadorned tags.

gsnedders commented 4 years ago

https://html5lib.readthedocs.io/en/latest/ has docs, though it doesn't answer the why. https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.html5parser.HTMLParser specifically mentions the namespaceHTMLElements arg.

The why is quite simple: it's what the HTML spec says (almost all elements created when parsing HTML are placed in the HTML namespace); you can see browsers do this by evaluating document.documentElement.namespaceURI on this page.
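For readers unfamiliar with the `{namespace}tag` form: it is how ElementTree writes a namespaced element name. The sketch below uses only the standard library and a hand-written XHTML string as a stand-in for html5lib's output, so it illustrates the notation rather than html5lib itself. The `local_name` helper is hypothetical, added here for illustration.

```python
from xml.etree import ElementTree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small XHTML document whose elements live in the HTML namespace.
root = ElementTree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "</html>"
)


def local_name(tag):
    """Return the tag without its '{namespace}' prefix (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


# The tag carries the namespace URI in braces.
print(root.tag)
print(local_name(root.tag))

# Queries can name the namespace through a prefix mapping instead.
title = root.find(".//h:title", {"h": XHTML_NS})
print(title.text)
```

The prefix `h` in the query is arbitrary; only the URI it maps to matters.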

davaya commented 4 years ago

Thanks - creating a parser object explicitly with the proper args gives the desired results. New users reading the overview might benefit from some rationale (pros and cons) for choosing a particular tree type, and a note pointing out that namespacing is one of the options that can be specified.
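A minimal sketch of that approach, assuming the default etree tree builder and the `namespaceHTMLElements` argument mentioned above. The sample markup is made up for illustration.

```python
import html5lib

html = (
    "<!DOCTYPE html><html><head>"
    '<meta charset="utf-8"><title>Demo</title>'
    '<link rel="stylesheet" href="style.css">'
    "</head><body><p>Hi</p></body></html>"
)

# Default behaviour: elements are placed in the HTML namespace.
namespaced = html5lib.parse(html)

# An explicitly created parser with namespacing switched off.
parser = html5lib.HTMLParser(namespaceHTMLElements=False)
plain = parser.parse(html)

# Both results are the <html> element; index 0 is <head>.
print([e.tag for e in namespaced[0]])
print([e.tag for e in plain[0]])
```

With namespacing off, plain paths such as `.//title` should work in `find` and `findall` without a namespace mapping.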

It could be argued that new users might not be aware of namespacing and should not get it by default, while those who do need it would know enough to opt in.

guettli commented 3 years ago

I think a basic usage example would be helpful.

Example: parse HTML, replace the innerHTML of all <a> links with "super foo", and then write it out again.

Up to now, the docs only cover parsing.

Please add an example of how to process the parsed data.

guettli commented 3 years ago

Something like this would be nice to have in the docs:

from xml.etree import ElementTree
from html5lib import HTMLParser

parser = HTMLParser(namespaceHTMLElements=False)

tree = parser.parse('''
  foo
  <h1>Moonlight</h1>
  bar''')

for e in tree.findall('.//h1'):
    e.text = 'Sunshine'

print(ElementTree.tostring(tree))

Source: https://stackoverflow.com/questions/68313619/how-to-replace-the-innerhtml-of-all-h1-tags-with-html5lib
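
The earlier request in this thread (replace the content of every <a> link, then write the document out again) could be sketched along the same lines. This version keeps the default namespacing and writes HTML back out with `html5lib.serialize` instead of `ElementTree.tostring`. The sample markup and the `XHTML` constant are assumptions made for illustration.

```python
import html5lib

# Namespace prefix that html5lib adds to element tags by default.
XHTML = "{http://www.w3.org/1999/xhtml}"

source = (
    "<!DOCTYPE html><title>Links</title>"
    '<p>See <a href="/x">old <b>text</b></a> and <a href="/y">more</a>.</p>'
)

doc = html5lib.parse(source)

# Collect the links first so the tree is not modified while iterating over it.
for link in list(doc.iter(XHTML + "a")):
    # Drop child elements, then set the text: roughly what assigning innerHTML does.
    for child in list(link):
        link.remove(child)
    link.text = "super foo"

# Serialize the modified etree back to an HTML string.
result = html5lib.serialize(doc, tree="etree")
print(result)
```

With its default options the serializer may omit optional tags such as <html> and <body>, so the output is not expected to match the input byte for byte.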