Allow to specify XMLReader for SesameRDFaParser

lbihanic commented 11 years ago

An attempt to parse RDFa from http://semarglproject.org with SesameRDFaParser fails: Xerces reports many parse errors (an unclosed meta tag, a missing definition for the g namespace prefix, unescaped & characters in an href). This is consistent with the output of http://demo.semarglproject.org/process?uri=http://semarglproject.org. Yet pointing the form at http://semarglproject.org/demo-rdfa.html to the same URL does return some RDFa data.

Is there a specific configuration to avoid using Xerces as the parser for HTML? Maybe TagSoup would ignore these errors?
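For reference, this kind of failure can be reproduced without Semargl at all. The sketch below is only an illustration, assuming the JVM's default JAXP parser and two made-up markup snippets that mimic the reported problems; StrictParseDemo and firstFatalError are hypothetical names, not part of Semargl or Sesame.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParseDemo {

    /** Returns null if the markup is well-formed XML, otherwise the parser's error message. */
    static String firstFatalError(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // DefaultHandler ignores content events and rethrows fatal errors.
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return String.valueOf(e.getMessage());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical snippets, not copied from the actual page.
        String unclosedMeta = "<html><head><meta charset=\"utf-8\"></head></html>";
        String rawAmpersand = "<p><a href=\"http://example.org/?a=1&b=2\">x</a></p>";
        String wellFormed = "<html><head><meta charset=\"utf-8\"/></head></html>";

        System.out.println("unclosed meta: " + firstFatalError(unclosedMeta));
        System.out.println("raw ampersand: " + firstFatalError(rawAmpersand));
        System.out.println("well-formed:   " + firstFatalError(wellFormed));
    }
}
```

A strict XML parser stops at the first well-formedness error, which is why a tolerant HTML parser is needed for pages like this one.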

levkhomich commented 11 years ago

Thank you for your report!

The demo RDFa endpoint uses the NekoHTML parser to process HTML pages, as follows:

// SAXParser here is NekoHTML's org.cyberneko.html.parsers.SAXParser
XMLReader htmlReader = new SAXParser();
htmlReader.setFeature("http://cyberneko.org/html/features/override-namespaces", false);
htmlReader.setFeature("http://cyberneko.org/html/features/balance-tags", true);
htmlReader.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-tags", true);
htmlReader.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

// Pipeline: NekoHTML reader -> RDFa parser -> Turtle serializer -> output sink
htmlProcessor = new StreamProcessor(RdfaParser.connect(TurtleSerializer.connect(charOutputSink)));
htmlProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, htmlReader);

So you can use the Semargl API directly to create an RDFa processor (just replace TurtleSerializer with SesameSink), or try System.setProperty("org.xml.sax.driver", yourXmlReaderName).

Anyway, I've fixed the project page's markup (levkhomich/semargl@f04142af24eac50fed6eab86ea415490b2e07eb2).

lbihanic commented 11 years ago

Hi,

Thanks for your prompt answer!

On 29/01/13 12:43, Lev Khomich wrote:

> Demo RDFa endpoint uses NekoHTML parser to process HTML pages in following manner
>
> XMLReader htmlReader = new SAXParser();
> htmlReader.setFeature("http://cyberneko.org/html/features/override-namespaces", false);
> htmlReader.setFeature("http://cyberneko.org/html/features/balance-tags", true);
> htmlReader.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-tags", true);
> htmlReader.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
>
> htmlProcessor = new StreamProcessor(RdfaParser.connect(TurtleSerializer.connect(charOutputSink)));
> htmlProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, htmlReader);
>
> So, you can use Semargl API directly to create RDFa processor (just change TurtleSerializer to SesameSink), or try using System.setProperty("org.xml.sax.driver", yourXmlReaderName).

Unfortunately, our application relies heavily on the Sesame API, and I would like to handle parsing regular RDF files (RDF/XML, Turtle...) and extracting RDFa from HTML files in the same way. Hence, I need a Sesame RDFParser object at some point.

I thought it would be easy to fix, but currently SesameRDFaParser is final and does not allow passing a SAXParser object or properties to the streamProcessor.

Would it be possible to make SesameRDFaParser non-final and streamProcessor protected? That way I could create a subclass that sets the XML_READER_PROPERTY property in its constructor after the call to super().
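An alternative to subclassing is to let the caller hand the XMLReader to the parser's constructor. The sketch below only illustrates that pattern with JDK classes; ReaderInjectionSketch is a hypothetical stand-in and does not reflect the actual SesameRDFaParser code.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class ReaderInjectionSketch {

    private final XMLReader reader;

    /** Default behaviour: fall back to the JVM's JAXP parser. */
    public ReaderInjectionSketch() {
        this(defaultReader());
    }

    /** Caller-supplied reader, for example a tolerant HTML parser. */
    public ReaderInjectionSketch(XMLReader reader) {
        if (reader == null) {
            throw new IllegalArgumentException("reader must not be null");
        }
        this.reader = reader;
    }

    public XMLReader getReader() {
        return reader;
    }

    private static XMLReader defaultReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no default XMLReader available", e);
        }
    }
}
```

With this shape the class can stay final and its fields private, while each instance still gets its own reader and no JVM-wide setting is touched.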

I can't set "org.xml.sax.driver" because this is a server-side application: many threads can be running at any time, each allocating XML parsers. NekoHTML raises the same concern because it depends on Xerces; our application currently runs fine on the JVM's default XML parser (Sun's Xerces clone), so I'm reluctant to switch to Xerces. I'll probably try TagSoup instead, rather than twiddling the JAXP properties.
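To make the threading concern concrete: system properties are shared by the whole JVM, so a driver set for one request is seen by every other thread. A minimal sketch, assuming a made-up driver class name (org.example.HypotheticalHtmlReader does not exist) and a hypothetical helper class:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GlobalDriverPropertyDemo {

    static final String DRIVER_KEY = "org.xml.sax.driver";

    /** Sets the driver property on the calling thread, then reads it from another thread. */
    static String seenByOtherThread(String driverClassName) {
        String previous = System.getProperty(DRIVER_KEY);
        System.setProperty(DRIVER_KEY, driverClassName);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> read = () -> System.getProperty(DRIVER_KEY);
            return pool.submit(read).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            // Restore the previous state so the demo has no lasting side effect.
            if (previous == null) {
                System.clearProperty(DRIVER_KEY);
            } else {
                System.setProperty(DRIVER_KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        // The class name is a placeholder; nothing is loaded, the property is only read back.
        System.out.println(seenByOtherThread("org.example.HypotheticalHtmlReader"));
    }
}
```

The worker thread reads back the value set by the caller, so any code in the server that looks up the default SAX driver would pick up the HTML parser too.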

> Anyway I've fixed project's page markup (levkhomich/semargl@f04142a, https://github.com/levkhomich/semargl/commit/f04142af24eac50fed6eab86ea415490b2e07eb2).

Closing the meta tag won't be enough. The tag on line 85 with the undeclared "g" prefix is a problem, as are the unescaped "&" characters in the source attribute on line 86.

Regards,

Laurent

This e-mail and the documents attached are confidential and intended solely for the addressee; it may also be privileged. If you receive this e-mail in error, please notify the sender immediately and destroy it. As its integrity cannot be secured on the Internet, the Atos group liability cannot be triggered for the message content. Although the sender endeavors to maintain a computer virus-free network, the sender does not warrant that this transmission is virus-free and will not be liable for any damages resulting from any virus transmitted.

levkhomich commented 11 years ago

The needed constructor was added to SesameRDFaParser (levkhomich/semargl@660a60497475543efc89551556e4f6d33414365e). You can check it using the 0.5-SNAPSHOT version. It should appear in the repository shortly.

lbihanic commented 11 years ago

On 29/01/13 17:41, Lev Khomich wrote:

> Needed constructor was added to SesameRDFaParser (levkhomich/semargl@660a604, https://github.com/levkhomich/semargl/commit/660a60497475543efc89551556e4f6d33414365e). You can check it using 0.5-SNAPSHOT version. It should appear in repository shortly.

Thanks!

One last question: I can't clone your GitHub repository:

$ git clone -v https://github.com/levkhomich/semargl.git
Cloning into 'semargl'...
error: Could not resolve host: (nil); nodename nor servname provided, or not known while accessing https://github.com/levkhomich/semargl.git/info/refs
fatal: HTTP request failed

Any idea where this error might be coming from?

Regards,

Laurent

levkhomich commented 11 years ago

It's probably a proxy configuration problem (system or .gitconfig).

scor commented 11 years ago

Laurent: I have no problem running git clone -v https://github.com/levkhomich/semargl.git. Have you tried cloning other GitHub repos to see whether the problem lies with GitHub or with your local settings?

lbihanic commented 11 years ago

On 29/01/13 19:25, Lev Khomich wrote:

> Probably it's a proxy (system or .gitconfig) configuration problem.

Yes, that was it: corporate proxy :-(

The git clone works now. Next, I'm getting an error from the Maven build:

Resource /Users/lbihanic/work/datalift/tools/semargl/core/checkstyle/LICENSE_HEADER not found in file system, classpath or URL: no protocol: /Users/lbihanic/work/datalift/tools/semargl/core/checkstyle/LICENSE_HEADER

Is there a specific Maven option required for building Semargl? Or maybe my Maven version (2.2.1) is too old?

I got it working by adding a symbolic link to the root checkstyle directory in every module, but that's probably not the way it is intended to work!

Laurent

levkhomich commented 11 years ago

Today I tried to build the project on 3 different systems (Linux and Windows) and could not reproduce this error. It is probably specific to macOS (or to the shell). I will keep this in mind. Thank you for participating!

lbihanic commented 11 years ago

Hi,

On 29/01/13 17:41, Lev Khomich wrote:

> Needed constructor was added to SesameRDFaParser (levkhomich/semargl@660a604, https://github.com/levkhomich/semargl/commit/660a60497475543efc89551556e4f6d33414365e). You can check it using 0.5-SNAPSHOT version. It should appear in repository shortly.

A quick mail to let you know I successfully integrated Semargl with TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) for parsing ill-formed (X)HTML thanks to this new constructor. It's as simple as: parser = new SesameRDFaParser(new Parser());

Thanks again,

Laurent

semarglproject / semargl

Allow to specify XMLReader for SesameRDFaParser #14