semarglproject / semargl

Highly performant, lightweight framework for linked data processing. Supports RDFa, JSON-LD, RDF/XML and plain text formats, runs on Android and GAE, provides integration with Jena, Sesame and Clerezza.

Allow to specify XMLReader for SesameRDFaParser #14

Closed lbihanic closed 11 years ago

lbihanic commented 11 years ago

An attempt to parse RDFa from http://semarglproject.org fails when using SesameRDFaParser: Xerces reports many parse errors (an unclosed meta tag, a missing definition for the g namespace prefix, unescaped & characters in href attributes). This is consistent with the output from http://demo.semarglproject.org/process?uri=http://semarglproject.org. Yet pointing the form at http://semarglproject.org/demo-rdfa.html to the same URL returns some RDFa data.

Is there a specific configuration to avoid using Xerces as the parser for HTML? Maybe TagSoup would ignore these errors?
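The three error classes listed above can be reproduced with the JDK's built-in strict XML parser alone. The following is a minimal sketch, not the code path SesameRDFaParser actually takes: the class name `StrictXmlCheck`, the helper `firstError`, the sample markup strings and the element name `g:example` are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: feeds markup to the JVM's default (strict) SAX parser.
class StrictXmlCheck {

    /** Returns null if the markup is well-formed XML, otherwise the parser's error message. */
    static String firstError(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness is needed for unbound prefixes to be reported.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so malformed input ends up here.
            return String.valueOf(e.getMessage());
        } catch (Exception e) {
            throw new IllegalStateException("Unexpected parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><head><meta charset=\"utf-8\"/></head></html>", // well-formed
            "<html><head><meta charset=\"utf-8\"></head></html>",  // unclosed meta tag
            "<a href=\"/p?x=1&y=2\">link</a>",                     // unescaped & in href
            "<html><g:example/></html>"                            // undeclared g prefix
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + firstError(sample));
        }
    }
}
```

An HTML-tolerant reader such as NekoHTML or TagSoup is designed to repair input like this instead of rejecting it, which is presumably why the demo form still returns data for the same page.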

levkhomich commented 11 years ago

Thank you for your report!

The demo RDFa endpoint uses the NekoHTML parser to process HTML pages in the following manner:

// SAXParser here is NekoHTML's org.cyberneko.html.parsers.SAXParser
XMLReader htmlReader = new SAXParser();
htmlReader.setFeature("http://cyberneko.org/html/features/override-namespaces", false);
htmlReader.setFeature("http://cyberneko.org/html/features/balance-tags", true);
htmlReader.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-tags", true);
htmlReader.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

htmlProcessor = new StreamProcessor(RdfaParser.connect(TurtleSerializer.connect(charOutputSink)));
htmlProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, htmlReader);

So you can use the Semargl API directly to create an RDFa processor (just change TurtleSerializer to SesameSink), or try using System.setProperty("org.xml.sax.driver", yourXmlReaderName).

Anyway, I've fixed the project page's markup (levkhomich/semargl@f04142af24eac50fed6eab86ea415490b2e07eb2).

lbihanic commented 11 years ago

Hi,

Thanks for your prompt answer!

On 29/01/13 12:43, Lev Khomich wrote:

Demo RDFa endpoint uses NekoHTML parser to process HTML pages in following manner

XMLReader htmlReader = new SAXParser();
htmlReader.setFeature("http://cyberneko.org/html/features/override-namespaces", false);
htmlReader.setFeature("http://cyberneko.org/html/features/balance-tags", true);
htmlReader.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-tags", true);
htmlReader.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

htmlProcessor = new StreamProcessor(RdfaParser.connect(TurtleSerializer.connect(charOutputSink)));
htmlProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, htmlReader);

So, you can use Semargl API directly to create RDFa processor (just change TurtleSerializer to SesameSink), or try using System.setProperty("org.xml.sax.driver", yourXmlReaderName).

Unfortunately, our application relies heavily on the Sesame API, and I would like to handle parsing regular RDF files (RDF/XML, Turtle...) and extracting RDFa from HTML files in the same way. Hence, I need a Sesame RDFParser object at some point.

I thought it would be easy to fix, but currently SesameRDFaParser is final and does not allow passing a SAXParser object or properties to the streamProcessor.

Would it be possible to make SesameRDFaParser non-final and streamProcessor protected? That way I could create a subclass that sets the XML_READER_PROPERTY property in its constructor after the call to super().
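The requested change can be sketched with stand-in classes. None of the types below are Semargl or Sesame classes: `DemoProcessor`, `DemoRdfaParser`, `LenientRdfaParser` and the property key are hypothetical names that only mirror the shape of the proposal (a non-final parser with a protected processor field).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a stream processor that accepts named properties.
class DemoProcessor {
    static final String XML_READER_PROPERTY = "demo.xml-reader"; // made-up key

    private final Map<String, Object> properties = new HashMap<>();

    void setProperty(String name, Object value) {
        properties.put(name, value);
    }

    Object getProperty(String name) {
        return properties.get(name);
    }
}

// Hypothetical stand-in for the parser: not final, and the field is protected.
class DemoRdfaParser {
    protected final DemoProcessor streamProcessor = new DemoProcessor();
}

// The subclass the proposal would make possible: it configures the
// inherited processor right after the superclass constructor has run.
class LenientRdfaParser extends DemoRdfaParser {
    LenientRdfaParser(Object htmlReader) {
        super();
        streamProcessor.setProperty(DemoProcessor.XML_READER_PROPERTY, htmlReader);
    }

    Object configuredReader() {
        return streamProcessor.getProperty(DemoProcessor.XML_READER_PROPERTY);
    }

    public static void main(String[] args) {
        Object reader = new Object(); // placeholder for an HTML-tolerant XMLReader
        LenientRdfaParser parser = new LenientRdfaParser(reader);
        System.out.println(parser.configuredReader() == reader);
    }
}
```

If the parser class stays final or the field stays private, the subclass above cannot be written, which is the limitation described in this comment.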

I can't force "org.xml.sax.driver" because this is a server-side application: many threads can be running at any time, each allocating XML parsers. The same issue applies to NekoHTML, which uses Xerces. Our application currently runs fine with the JVM default XML parser (a Xerces clone by Sun), so I'm reluctant to switch to Xerces. I'll probably try TagSoup rather than twiddling the JAXP properties.
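The concern about "org.xml.sax.driver" comes from system properties being shared by the whole JVM rather than scoped to a thread. A minimal sketch of that behaviour, using only the JDK; the class `GlobalDriverProperty`, its method and the reader class name are hypothetical:

```java
// Hypothetical demo: a system property set by one thread is visible to all others.
class GlobalDriverProperty {

    /** Sets org.xml.sax.driver in the calling thread and reads it back from another thread. */
    static String seenByOtherThread(String driverClassName) {
        final String key = "org.xml.sax.driver";
        final String previous = System.getProperty(key);
        System.setProperty(key, driverClassName);
        try {
            final String[] seen = new String[1];
            Thread other = new Thread(() -> seen[0] = System.getProperty(key));
            other.start();
            other.join(); // join() makes the other thread's write visible here
            return seen[0];
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Restore the previous value so the demo leaves no global side effect.
            if (previous == null) {
                System.clearProperty(key);
            } else {
                System.setProperty(key, previous);
            }
        }
    }

    public static void main(String[] args) {
        // "com.example.HypotheticalReader" is a made-up class name.
        System.out.println(seenByOtherThread("com.example.HypotheticalReader"));
    }
}
```

Any code in the same JVM that picks its XMLReader from this property would therefore be affected by the change, not just the RDFa extraction.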

Anyway I've fixed project's page markup (levkhomich/semargl@f04142a https://github.com/levkhomich/semargl/commit/f04142af24eac50fed6eab86ea415490b2e07eb2).

Closing the meta tag won't be enough. The tag on line 85 with undeclared "g" prefix is a problem as well as the unescaped "&" characters in the