gryznar closed this issue 1 month ago
This library does not read HTML, but XML. Your input is not valid XML.
The entity mapping you specify tells the decoder to interpret named entities for characters such as ä
the way HTML5 does, which is quite common in XML as well. It does not make the decoder accept HTML syntax.
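
To illustrate the difference outside this library, here is a minimal sketch using only the Python standard library. The input string and the helper names (`SNIPPET`, `is_well_formed_xml`, `html_start_tags`) are made up for this example and are not taken from the issue. It only shows that a strict XML parser rejects an unclosed tag while an HTML parser tolerates it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: <br> is never closed, which HTML allows but XML does not.
SNIPPET = "<div><p>one<br>two</p></div>"


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.start_tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(text: str) -> list[str]:
    """Parse the text as HTML and return the start tags in document order."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print(is_well_formed_xml(SNIPPET))   # strict XML parsing fails on <br>
    print(html_start_tags(SNIPPET))      # HTML parsing still sees all tags
```

Writing the tag as `<br/>` makes the same snippet well-formed XML, so the strict parser accepts it.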
OK, thank you for the answer. My previous experience comes from Python's lxml,
which handles HTML. That is the source of my confusion here. The obvious choice for my use case would be the html
lib, but in this specific case efficiency is a crucial factor for me, so I thought that treating the input as XML would greatly speed up parsing.
Consider the following HTML:
Passing it to:
results in:
IMHO unclosed tags should be handled.