kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

Fragment parsing #15

xmo-odoo closed this issue 6 years ago

xmo-odoo commented 6 years ago

The documentation only mentions a single method, but doesn't seem to say anything about fragments.

Is there a way to parse fragments (as non-document) with html5parser?

kovidgoyal commented 6 years ago

Fragments parse fine already as far as I know.

xmo-odoo commented 6 years ago

@kovidgoyal it (logically) parses the fragment as an entire document:

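# Imports assumed; the original snippet did not show them.
from html5_parser import parse
from lxml import html

fragment = b'a    \n<b>foo</b><span>bar</span>\n    \n'

p = parse(fragment)
print(p)
print(html.tostring(p))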

results in

<Element html at 0x1089bc888>
b'<html><head></head><body>a    \n<b>foo</b><span>bar</span>\n    \n</body></html>'

which is not necessarily convenient, especially when the incoming fragment might be an entire document: we have to disambiguate between a true fragment (and keep only the body's content, I guess) and a whole document passed in as a fragment. I guess checking whether the <head> is completely empty might do the trick, though.
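
The empty-<head> check suggested above could be sketched roughly as follows. The helper name looks_like_fragment is hypothetical, and the check is only a heuristic: a full document that happens to have an empty or missing <head> would be misclassified as a fragment. It assumes un-namespaced tag names on the parsed tree.

```python
def looks_like_fragment(root):
    """Heuristic (hypothetical helper): True if <head> carries no content.

    `root` is the <html> element returned by the parser. Only the
    ElementTree-style API shared by lxml and xml.etree is used.
    """
    head = root.find("head")
    if head is None:
        return True
    has_children = len(head) > 0                  # e.g. <title>, <meta>
    has_text = bool((head.text or "").strip())    # stray non-whitespace text
    return not (has_children or has_text)
```

If this returns True, the caller would keep only the children of <body>; otherwise it would treat the tree as a whole document.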

kovidgoyal commented 6 years ago

When would you possibly want to parse something as a fragment or a document? Either you want fragments or you want documents; the two are incompatible. If you want your parsing to result in fragments, simply add a "<div>" to the start of the string, parse it, and return the first div from the parse tree. If you want your parsing to result in documents, you don't need to do anything.
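
A minimal sketch of the "<div>" wrapping approach described above. The helper name parse_fragment and its parser argument are hypothetical and not part of html5-parser; the parser is injectable only so the helper can be tried without the C extension installed. It assumes the default un-namespaced tree, and it assumes the fragment contains no stray "</div>" that would close the wrapper early.

```python
def parse_fragment(fragment, parser=None):
    """Hypothetical helper: parse an HTML fragment, return the wrapper <div>.

    `parser` is any callable mapping markup to a root <html> element.
    It defaults to html5_parser.parse, imported lazily.
    """
    if parser is None:
        from html5_parser import parse as parser  # third-party package
    # Prepend the wrapper; an HTML5 parser closes it implicitly at end of input.
    opener = b"<div>" if isinstance(fragment, bytes) else "<div>"
    root = parser(opener + fragment)
    # The wrapper is the first element the parser places inside <body>.
    body = root.find("body")
    return body[0]
```

The fragment's leading text ends up in the returned element's .text, and its top-level elements are that element's children.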