semarglproject / semargl

Highly performant, lightweight framework for linked data processing. Supports RDFa, JSON-LD, RDF/XML and plain text formats, runs on Android and GAE, provides integration with Jena, Sesame and Clerezza.

RDFa 1.0 xmlns namespace not parsed ? #49

Open tfrancart opened 7 years ago

tfrancart commented 7 years ago
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:eli="http://data.europa.eu/eli/ontology#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="XHTML+RDFa 1.0" lang="fr">
    <head>
        <title>xxx</title>
        <meta property="eli:passed_by" content="Foo" />
    </head>

    <body>
    </body>
</html>

Parsed with the following code:

Model model = ModelFactory.createDefaultModel();            
StreamProcessor streamProcessor = new StreamProcessor(RdfaParser.connect(JenaSink.connect(model)));

nu.validator.htmlparser.sax.HtmlParser reader = new nu.validator.htmlparser.sax.HtmlParser(XmlViolationPolicy.ALTER_INFOSET);
streamProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, reader);

streamProcessor.process(htmlPage.openStream(), htmlPage.toString());
return model;

Returns:

<file:/home/thomas/temp/test.html>
        <eli:passed_by>  "Foo"@fr .

Note that the prefix "eli" is not resolved. Are prefix declarations using xmlns supported? Setting .setProperty(RdfaParser.RDFA_VERSION_PROPERTY, RDFa.VERSION_10) doesn't change the result.

Is there anything I could do in the code to parse the above HTML without changing it? If not, does anyone see which modifications need to be made to the XHTML above?

Thanks a lot!

tfrancart commented 7 years ago

Actually, I think the problem is in nu.validator.htmlparser.sax.HtmlParser, which does not pass on the SAX events corresponding to the xmlns: declarations. The situation is a bit confusing because HTML, strictly speaking and as far as I can see, does not allow xmlns declarations other than the html namespace. So I don't know what should happen when an alternate DTD is declared, as in this case.
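
One way to check this hypothesis is to record which prefix mappings a given SAX reader actually reports. The sketch below is only an illustration and uses the JDK's namespace-aware XML parser, not the validator.nu HtmlParser. The class name PrefixMappingDemo and the method collectPrefixes are hypothetical names introduced here. The DOCTYPE is left out of the sample input on purpose, so that the JDK parser does not try to fetch the DTD over the network.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records every prefix mapping a SAX reader reports.
public class PrefixMappingDemo {

    static Map<String, String> collectPrefixes(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without namespace awareness the parser does not report prefix mappings.
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Map<String, String> prefixes = new LinkedHashMap<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                // The default namespace is reported with an empty prefix.
                prefixes.put(prefix, uri);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return prefixes;
    }

    public static void main(String[] args) throws Exception {
        // Reduced version of the document from this issue, without the DOCTYPE.
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\""
            + " xmlns:eli=\"http://data.europa.eu/eli/ontology#\">"
            + "<head><title>xxx</title>"
            + "<meta property=\"eli:passed_by\" content=\"Foo\" />"
            + "</head><body></body></html>";

        collectPrefixes(xhtml).forEach(
            (prefix, uri) -> System.out.println("'" + prefix + "' -> " + uri));
    }
}
```

Installing the same handler on the HtmlParser instance should show whether it reports any mapping for "eli". If it reports none, the RDFa parser downstream never sees the declaration, which would match the unresolved prefix in the output above.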

kaefer3000 commented 7 years ago

The same happens when preprocessing the HTML using TagSoup as suggested in #37. TagSoup removes the xmlns declarations.

bipika commented 5 years ago

Hello, I am getting an error at the "JenaSink.connect(model)" call. The error says: "The method connect(com.hp.hpl.jena.rdf.model.Model) in the type JenaSink is not applicable for the arguments (org.apache.jena.rdf.model.Model)". Please help me with this problem.