walidazizi / rdflib

Automatically exported from code.google.com/p/rdflib

html entities being stripped when parsing rdfa with DOCTYPE #167

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I am not 100% certain whether this is the expected behavior, but I've noticed 
that rdflib's RDFa parser strips HTML entities when processing XHTML that 
carries the RDFa DOCTYPE, whereas the same markup without the DOCTYPE has the 
entity translated to UTF-8.

See the attached test: entities.py ...

Original issue reported on code.google.com by ed.summers on 11 Apr 2011 at 4:23

GoogleCodeExporter commented 8 years ago

Original comment by ed.summers on 11 Apr 2011 at 4:24

GoogleCodeExporter commented 8 years ago

Original comment by ed.summers on 11 Apr 2011 at 4:25

GoogleCodeExporter commented 8 years ago
After some poking around, it looks like HTML with a DOCTYPE is successfully 
parsed by minidom (but the entity is stripped out). When the HTML lacks a 
DOCTYPE, minidom parsing fails with: 

    xml.parsers.expat.ExpatError: undefined entity: line 4, column 60

and the document is then parsed by html5lib (where the entity is successfully 
converted to UTF-8). So that explains why the entities are stripped out of RDFa 
HTML with the DOCTYPE and converted to UTF-8 when there is no DOCTYPE.
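
The two minidom code paths described above can be reproduced outside rdflib with a small script. This is only a sketch: the sample markup and the `try_minidom` helper are made up for illustration, and it exercises `xml.dom.minidom` directly rather than rdflib's RDFa parser or its html5lib fallback.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Sample markup invented for this sketch; not taken from entities.py.
RDFA_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" '
    '"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">'
)
BODY = "<html><body><p>caf&eacute;</p></body></html>"


def try_minidom(markup):
    """Return (parsed_ok, detail): the <p> text on success, the error text on failure."""
    try:
        doc = minidom.parseString(markup)
    except ExpatError as err:
        return False, str(err)
    paragraph = doc.getElementsByTagName("p")[0]
    text = "".join(
        node.data for node in paragraph.childNodes
        if node.nodeType == node.TEXT_NODE
    )
    return True, text


# No DOCTYPE: expat has no declaration for &eacute; and rejects the document.
print(try_minidom(BODY))

# With the DOCTYPE: the thread reports that parsing succeeds but the entity is
# dropped, so compare the text printed here against the original "caf&eacute;".
print(try_minidom(RDFA_DOCTYPE + BODY))
```

The first call should report the same "undefined entity" error quoted above; the second is the case to inspect for the silently missing character.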

Now to figure out what the fix should be, if any...

Original comment by ed.summers on 15 Apr 2011 at 1:18

GoogleCodeExporter commented 8 years ago
Wondering if maybe we just hand off all html parsing to html5lib and make it an 
rdflib dependency?

Original comment by ed.summers on 15 Apr 2011 at 1:28

GoogleCodeExporter commented 8 years ago
Looks like xml.dom.minidom has no support for entities:

    The following interfaces have no implementation in xml.dom.minidom...Entity
    http://docs.python.org/library/xml.dom.minidom.html
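
Since minidom will not resolve named entities itself, one possible workaround (not something proposed in this thread, just a sketch) is to rewrite named HTML entities as numeric character references before parsing, which any XML parser understands without a DTD. The `numeric_entities` helper is hypothetical, and the regex is deliberately naive: it would also rewrite matches inside comments or CDATA sections.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# Entities that XML itself predefines; leave them alone so markup stays well-formed.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(markup):
    """Replace known HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or XML-native: keep unchanged
        return "&#%d;" % name2codepoint[name]
    return NAMED_ENTITY.sub(replace, markup)


doc = minidom.parseString(numeric_entities("<p>caf&eacute;</p>"))
text = doc.documentElement.firstChild.data
print(text)
```

Whether a preprocessing step like this is preferable to handing all HTML to html5lib is a separate design question; the sketch only shows that the entity survives once minidom no longer has to look it up.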

Original comment by ed.summers on 15 Apr 2011 at 1:47