renggli / dart-xml

Lightweight library for parsing, traversing, and transforming XML in Dart.
http://pub.dartlang.org/packages/xml
MIT License

Unable to parse valid HTML document with unclosed tags #186

Closed: gryznar closed this issue 1 month ago

gryznar commented 1 month ago

Consider the following HTML:

<!DOCTYPE html>
<html lang="en">
   <head>
      <title>foo</title>
      <meta name="foo" content="bar">
   </head>
</html>

Passing it to:

XmlDocument.parse(html, entityMapping: XmlDefaultEntityMapping.html5());

results in:

Unhandled exception:
XmlTagException: Expected </meta>, but found </head> at 6:4
#0      XmlAnnotator.annotate (package:xml/src/xml_events/annotations/annotator.dart:92:15)
#1      XmlEventIterator.moveNext (package:xml/src/xml_events/iterator.dart:32:20)
#2      Iterable.forEach (dart:core/iterable.dart:347:23)
#3      XmlNodeDecoder.convertIterable (package:xml/src/xml_events/converters/node_decoder.dart:53:12)
#4      new XmlDocument.parse (package:xml/src/xml/nodes/document.dart:34:47)
#5      CPythonNextChangelog.getActualData (package:notifier/src/source/cpython/python_next_changelog.dart:73:17)
<asynchronous suspension>
#6      AbstractSource.getChanges (package:notifier/src/source/abstract.dart:45:24)
<asynchronous suspension>
#7      _RemoteRunner._run (dart:isolate:1092:18)
<asynchronous suspension>

IMHO, unclosed tags such as <meta> should be handled.

renggli commented 1 month ago

This library does not read HTML, but XML. Your input is not valid XML.

The entity mapping you specify only tells the decoder to interpret entity references such as &auml; as HTML5 does, which is quite common in XML as well. It does not make the parser accept HTML syntax.
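
To illustrate the point above, here is a minimal sketch (not part of the original reply) that parses the same document once the <meta> element is written as an empty-element tag, which makes it well-formed XML. The constant `wellFormedInput` and the helper `metaContent` are hypothetical names introduced for this example, and it assumes a recent version of the xml package.

```dart
import 'package:xml/xml.dart';

// The document from the report, with <meta> written as an empty-element
// tag so that every start tag is closed.
const wellFormedInput = '''<!DOCTYPE html>
<html lang="en">
   <head>
      <title>foo</title>
      <meta name="foo" content="bar"/>
   </head>
</html>''';

// Hypothetical helper: returns the content attribute of the first <meta>
// element, or null if the parser rejects the input as not well-formed.
String? metaContent(String input) {
  try {
    final document = XmlDocument.parse(
      input,
      entityMapping: XmlDefaultEntityMapping.html5(),
    );
    return document.findAllElements('meta').first.getAttribute('content');
  } on XmlException {
    return null;
  }
}

void main() {
  // Well-formed input: the attribute value is read normally.
  print(metaContent(wellFormedInput));
  // Reverting to the unclosed <meta> reproduces the failure from the report,
  // which this helper turns into null.
  print(metaContent(wellFormedInput.replaceFirst('"bar"/>', '"bar">')));
}
```

This only helps when you control the markup or know it is XHTML-style. Arbitrary HTML from the web will generally not satisfy XML's well-formedness rules.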

gryznar commented 1 month ago

OK, thank you for the answer. My previous experience comes from Python's lxml, which handles HTML, and that is the source of my confusion here. The obvious choice for my use case would be the html library, but in this specific case efficiency is a crucial factor for me, so I thought that treating the document as XML would greatly speed up parsing.
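
For reference, a minimal sketch of the html-library route mentioned above, assuming the `html` package from pub is the library meant. The constant `htmlInput` and the helpers `htmlTitle` and `htmlMetaContent` are hypothetical names for this example. No performance claim is implied; the relative speed of the two parsers would need to be measured on the actual input.

```dart
import 'package:html/parser.dart' show parse;

// The original document from the report, with the unclosed <meta> tag.
const htmlInput = '''<!DOCTYPE html>
<html lang="en">
   <head>
      <title>foo</title>
      <meta name="foo" content="bar">
   </head>
</html>''';

// Hypothetical helper: text of the <title> element, or null if absent.
String? htmlTitle(String input) => parse(input).querySelector('title')?.text;

// Hypothetical helper: content attribute of the first <meta> element,
// or null if there is no such element or attribute.
String? htmlMetaContent(String input) =>
    parse(input).querySelector('meta')?.attributes['content'];

void main() {
  // An HTML parser treats <meta> as a void element, so no closing tag
  // is required and the document parses without an exception.
  print(htmlTitle(htmlInput));
  print(htmlMetaContent(htmlInput));
}
```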