Closed lbihanic closed 11 years ago
Thank you for your report!
The demo RDFa endpoint uses the NekoHTML parser to process HTML pages, configured as follows:
XMLReader htmlReader = new SAXParser();
htmlReader.setFeature("http://cyberneko.org/html/features/override-namespaces", false);
htmlReader.setFeature("http://cyberneko.org/html/features/balance-tags", true);
htmlReader.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-tags", true);
htmlReader.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
htmlProcessor = new StreamProcessor(RdfaParser.connect(TurtleSerializer.connect(charOutputSink)));
htmlProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, htmlReader);
So you can use the Semargl API directly to create an RDFa processor (just change TurtleSerializer to SesameSink), or try using System.setProperty("org.xml.sax.driver", yourXmlReaderName).
Anyway, I've fixed the project page's markup (levkhomich/semargl@f04142af24eac50fed6eab86ea415490b2e07eb2).
Hi,
Thanks for your prompt answer!
On 29/01/13 12:43, Lev Khomich wrote:
The demo RDFa endpoint uses the NekoHTML parser to process HTML pages, configured as follows:
XMLReader htmlReader = new SAXParser();
htmlReader.setFeature("http://cyberneko.org/html/features/override-namespaces", false);
htmlReader.setFeature("http://cyberneko.org/html/features/balance-tags", true);
htmlReader.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-tags", true);
htmlReader.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
htmlProcessor = new StreamProcessor(RdfaParser.connect(TurtleSerializer.connect(charOutputSink)));
htmlProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, htmlReader);
So you can use the Semargl API directly to create an RDFa processor (just change TurtleSerializer to SesameSink), or try using System.setProperty("org.xml.sax.driver", yourXmlReaderName).
Unfortunately, our application relies heavily on the Sesame API, and I would like to handle parsing regular RDF files (RDF/XML, Turtle...) and extracting RDFa from HTML files in the same way. Hence, I need a Sesame RDFParser object at some point.
I thought it would be easy to fix, but SesameRDFaParser is currently final and does not allow passing a SAXParser object or properties to the streamProcessor.
Would it be possible to make SesameRDFaParser non-final and streamProcessor protected? That way I could create a subclass that sets the XML_READER_PROPERTY property in its constructor after the call to super().
I can't force "org.xml.sax.driver" because this is a server-side application: many threads can be running at any time, each allocating XML parsers, and a system property would affect all of them. The same goes for NekoHTML, which uses Xerces. Our application currently runs fine with the JVM default XML parser (a Xerces clone by Sun), so I'm reluctant to switch to Xerces. I'll probably try TagSoup instead (rather than twiddling the JAXP properties).
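To illustrate the concern about "org.xml.sax.driver", here is a minimal sketch using only the JDK. It shows that a value set through System.setProperty on one thread is visible from any other thread in the same JVM, which is why it is a poor fit for a multi-threaded server. The class name GlobalDriverProperty and the reader name com.example.HypotheticalHtmlReader are made up for the example and do not refer to real Semargl or NekoHTML classes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class GlobalDriverProperty {
    static final String KEY = "org.xml.sax.driver";

    /**
     * Sets the system property on the calling thread, then reads it back
     * from a different thread and returns what that thread observed.
     */
    static String readFromOtherThread(String driverClassName) {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, driverClassName);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The worker thread never set the property itself.
            Future<String> seen = pool.submit(() -> System.getProperty(KEY));
            return seen.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            // Restore the previous state so the sketch has no lasting effect.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical reader class name, used only as a marker value.
        System.out.println(readFromOtherThread("com.example.HypotheticalHtmlReader"));
    }
}
```

Passing the XMLReader to the one parser instance that needs it avoids this JVM-wide side effect.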
Anyway, I've fixed the project page's markup (levkhomich/semargl@f04142a https://github.com/levkhomich/semargl/commit/f04142af24eac50fed6eab86ea415490b2e07eb2).
Closing the meta tag won't be enough. The
Regards,
Laurent
This e-mail and the documents attached are confidential and intended solely for the addressee; it may also be privileged. If you receive this e-mail in error, please notify the sender immediately and destroy it. As its integrity cannot be secured on the Internet, the Atos group liability cannot be triggered for the message content. Although the sender endeavors to maintain a computer virus-free network, the sender does not warrant that this transmission is virus-free and will not be liable for any damages resulting from any virus transmitted.
The needed constructor was added to SesameRDFaParser (levkhomich/semargl@660a60497475543efc89551556e4f6d33414365e). You can check it using the 0.5-SNAPSHOT version. It should appear in the repository shortly.
On 29/01/13 17:41, Lev Khomich wrote:
The needed constructor was added to SesameRDFaParser (levkhomich/semargl@660a604 https://github.com/levkhomich/semargl/commit/660a60497475543efc89551556e4f6d33414365e). You can check it using the 0.5-SNAPSHOT version. It should appear in the repository shortly.
Thanks!
One last question: I can't clone your GitHub repository:
$ git clone -v https://github.com/levkhomich/semargl.git
Cloning into 'semargl'...
error: Could not resolve host: (nil); nodename nor servname provided, or not known while accessing https://github.com/levkhomich/semargl.git/info/refs
fatal: HTTP request failed
Any idea where this error might be coming from?
Regards,
Laurent
Probably it's a proxy (system or .gitconfig) configuration problem.
Laurent: I have no problem running git clone -v https://github.com/levkhomich/semargl.git. Have you tried cloning other GitHub repos to see whether the problem is with GitHub or with your local settings?
On 29/01/13 19:25, Lev Khomich wrote:
Probably it's a proxy (system or .gitconfig) configuration problem.
Yes, that was it: corporate proxy :-(
Got the git clone OK. Now I'm having an error on the Maven build:
Resource /Users/lbihanic/work/datalift/tools/semargl/core/checkstyle/LICENSE_HEADER not found in file system, classpath or URL: no protocol: /Users/lbihanic/work/datalift/tools/semargl/core/checkstyle/LICENSE_HEADER
Is there a specific Maven option required for building Semargl? Or maybe my Maven version (2.2.1) is too old?
I got it working by adding a symbolic link to the root checkstyle directory in every module, but that's probably not how it is intended to work!
Laurent
Today I tried to build the project on 3 different systems (Linux and Windows) and could not reproduce this error. It's probably specific to macOS (or to the shell). I will keep this in mind. Thank you for your participation!
Hi,
On 29/01/13 17:41, Lev Khomich wrote:
The needed constructor was added to SesameRDFaParser (levkhomich/semargl@660a604 https://github.com/levkhomich/semargl/commit/660a60497475543efc89551556e4f6d33414365e). You can check it using the 0.5-SNAPSHOT version. It should appear in the repository shortly.
A quick mail to let you know that, thanks to this new constructor, I successfully integrated Semargl with TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) for parsing ill-formed (X)HTML. It's as simple as: parser = new SesameRDFaParser(new Parser());
Thanks again,
Laurent
An attempt to parse RDFa from http://semarglproject.org fails when using the SesameRDFaParser: Xerces reports many parse errors (unclosed meta tag, missing g namespace prefix definition, unescaped & characters in href). This is consistent with the output from http://demo.semarglproject.org/process?uri=http://semarglproject.org. Yet directing the form in http://semarglproject.org/demo-rdfa.html to the same URL returns some RDFa data.
Is there a specific configuration to avoid using Xerces as the parser for HTML? Maybe TagSoup would ignore these errors?
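The kind of failure described in this report can be reproduced with the JDK alone. The sketch below feeds small markup fragments to the JVM's default SAX parser and reports whether each one is accepted. The class name StrictXmlCheck and the sample fragments are made up for illustration; they are not taken from the semarglproject.org page, and the sketch does not involve Semargl or Sesame.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlCheck {

    /** Returns true if the JVM's default XML parser accepts the markup. */
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors, so ill-formed input
            // surfaces here as a SAXException.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(markup)),
                    new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // I/O or parser configuration problems are not markup errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            // Well-formed XHTML-style fragment.
            "<html><head><meta charset=\"utf-8\"/></head></html>",
            // Unclosed meta tag, legal in HTML but not in XML.
            "<html><head><meta charset=\"utf-8\"></head></html>",
            // Unescaped ampersand inside an href attribute.
            "<a href=\"http://example.org/?a=1&b=2\">link</a>",
            // Element using a namespace prefix that is never declared.
            "<div><g:plusone></g:plusone></div>"
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

An HTML-tolerant XMLReader such as NekoHTML or TagSoup is designed to repair input like this before passing SAX events on, which is the approach discussed in the replies.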