Open GoogleCodeExporter opened 8 years ago
Original comment by ed.summers
on 11 Apr 2011 at 4:24
Original comment by ed.summers
on 11 Apr 2011 at 4:25
After some poking around, it looks like HTML with a DOCTYPE is successfully
parsed by minidom (but the entity is stripped out). When the HTML lacks a
DOCTYPE, minidom parsing fails with:
xml.parsers.expat.ExpatError: undefined entity: line 4, column 60
and the document is then parsed by html5lib (where the entity is successfully
converted to UTF-8). So that explains why the entities are stripped out of RDFa
HTML that has a DOCTYPE and converted to UTF-8 when there is no DOCTYPE.
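A minimal sketch for reproducing the minidom half of this outside rdflib. The two sample documents and the `try_minidom` helper are made up for illustration and are not taken from the rdflib code or the attached test file. The html5lib fallback is left out so the snippet only needs the standard library.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample documents; both use an HTML entity that XML itself
# does not predefine.
NO_DOCTYPE = "<html><body><p>caf&eacute;</p></body></html>"
WITH_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" '
    '"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n' + NO_DOCTYPE
)


def try_minidom(markup):
    """Return (True, serialized root) on success, (False, error text) on failure."""
    try:
        doc = minidom.parseString(markup)
    except ExpatError as err:
        return False, str(err)
    return True, doc.documentElement.toxml()


if __name__ == "__main__":
    for label, markup in (("no DOCTYPE", NO_DOCTYPE), ("with DOCTYPE", WITH_DOCTYPE)):
        ok, detail = try_minidom(markup)
        print(label, "->", "parsed" if ok else "failed", "|", detail)
```

The case without a DOCTYPE should report the same "undefined entity" error as above. For the DOCTYPE case, compare the serialized output with the input to see whether the entity survived.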
Now to figure out what the fix should be, if any...
Original comment by ed.summers
on 15 Apr 2011 at 1:18
Wondering if maybe we just hand off all html parsing to html5lib and make it an
rdflib dependency?
Original comment by ed.summers
on 15 Apr 2011 at 1:28
Looks like xml.dom.minidom has no support for entities. From the docs:
"The following interfaces have no implementation in xml.dom.minidom ... Entity"
http://docs.python.org/library/xml.dom.minidom.html
Original comment by ed.summers
on 15 Apr 2011 at 1:47
Original issue reported on code.google.com by
ed.summers
on 11 Apr 2011 at 4:23