semarglproject / semargl

Highly performant, lightweight framework for linked data processing. Supports RDFa, JSON-LD, RDF/XML and plain text formats, runs on Android and GAE, provides integration with Jena, Sesame and Clerezza.

Can't parse html of http://rdfa.info/ (plus relative URL nit) #50

Open bblfish opened 7 years ago

bblfish commented 7 years ago

This is easy to reproduce using the Ammonite shell:

$ curl http://rdfa.info/ > rdfa.info.txt

For future reference, I placed a copy of this file here: rdfa.info.txt

$ amm
@ import $ivy.`org.semarglproject:semargl-rdfa:0.7`
@ import $ivy.`org.semarglproject:semargl-sesame:0.7`
@ import org.openrdf.rio._
@ val rdfa = read("rdfa.info.txt")
@ def rd = new java.io.StringReader(rdfa)
@ Rio.parse(rd,"",RDFFormat.RDFA)

Btw, that throws an exception because the document contains relative URIs and no base was given, which one could argue about. But let us continue...
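
To illustrate why the empty base matters: a relative reference only becomes a usable IRI once it is resolved against an absolute base. This is a minimal sketch using `java.net.URI` from the JDK, not semargl's own resolution code. The relative path `docs/` is a made-up example, not something taken from the page.

```scala
import java.net.URI

// Hypothetical relative reference, for illustration only.
val relative = new URI("docs/")

// The base one would pass for this page.
val base = new URI("http://rdfa.info/")

// Resolving against an absolute base yields an absolute URI.
val resolved = base.resolve(relative)

println(relative.isAbsolute)
println(resolved)
```

With an empty base string there is nothing absolute to resolve against, so a parser has to either reject the relative reference or emit a relative IRI.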

If I try the method recommended on the Sesame repository, having specified a base, I get the following exception.

@ Rio.parse(rd,"http://rdfa.info/",RDFFormat.RDFA)
[Fatal Error] :51:5: The element type "link" must be terminated by the matching end-tag "</link>".
org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 51; columnNumber: 5; The element type "link" must be terminated by the matching end-tag "</link>".
  org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
  org.openrdf.rio.Rio.parse(Rio.java:425)
  org.openrdf.rio.Rio.parse(Rio.java:323)
  ammonite.$sess.cmd16$.<init>(cmd16.sc:1)
  ammonite.$sess.cmd16$.<clinit>(cmd16.sc)
org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 51; columnNumber: 5; The element type "link" must be terminated by the matching end-tag "</link>".

which seems to indicate that an XML parser is used there rather than an HTML parser.
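
The same kind of failure can be reproduced without semargl. This is a minimal sketch that assumes only the JDK's default JAXP SAX parser, which may or may not be what semargl uses by default. An unclosed `<link>`, which is legal in HTML, is rejected when read as XML, while the self-closed form parses. The tiny document and the helper name `parseAsXml` are made up for the example.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

// HTML allows void elements such as <link> without a closing tag; XML does not.
val html =
  """<html><head><link rel="stylesheet" href="style.css"></head><body></body></html>"""

// The same document with the link element self-closed, i.e. well-formed XML.
val xhtml = html.replace("""href="style.css">""", """href="style.css"/>""")

// Hypothetical helper: run the document through a plain XML SAX parser.
def parseAsXml(doc: String): Try[Unit] = Try {
  val parser = SAXParserFactory.newInstance().newSAXParser()
  parser.parse(new InputSource(new StringReader(doc)), new DefaultHandler)
}

val htmlResult  = parseAsXml(html)   // fails with a SAXParseException
val xhtmlResult = parseAsXml(xhtml)  // succeeds
```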

bblfish commented 7 years ago

To see whether I could work around this with other parsers, I tried the following.

First, in order to work at the lower level required, I wrote myself an extension class that lets me construct a parser myself but still use it easily.

implicit class SesameParserExt(val parser: org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser) extends AnyVal {
    def parse(rdfa: String, base: String) = {
       import org.openrdf.model.impl.{LinkedHashModel,ValueFactoryImpl}
       val model = new LinkedHashModel()
       val collector = new org.openrdf.rio.helpers.ContextStatementCollector(model,ValueFactoryImpl.getInstance())
       parser.setRDFHandler(collector)
       parser.parse(new java.io.StringReader(rdfa),base)
       model
    }
}

Then, to make it easier to create new parsers and set preferences on them, I added a small factory function:

import org.semarglproject.sesame.rdf.rdfa._
def RdfaParser(setup: SesameRDFaParser => Unit): SesameRDFaParser = {
   val p = new SesameRDFaParser()
   setup(p)
   p
}

Then I tried a couple of libraries for converting HTML to XML.

First TagSoup, which has not changed in the past 5 years.

import scala.util.Try
import $ivy.`org.ccil.cowan.tagsoup:tagsoup:1.2.1`
val tagsoupParser =  org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(null)
val attemptTS = Try{
   RdfaParser(_.setXmlReader(tagsoupParser.getXMLReader())).parse(rdfa,"http://rdfa.info/")
 }

This actually works.

import scala.collection.JavaConverters._
val triples = attemptTS.get.iterator().asScala.toList
browse(triples) 

The last line gives the following:

List(
  (http://rdfa.info/, doap:name, "RDFa"@en) [null],
  (http://rdfa.info/, doap:shortdesc, "The Resource Description Framework in Attributes"@en) [null],
  (http://rdfa.info/, doap:homepage, http://rdfa.info/) [null],
  (http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],
  (http://rdfa.info/, doap:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        "@en) [null],
  (http://rdfa.info/, dc:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        "@en) [null]
)

There seem to be 6 triples in there. The RDFa distiller found the following 5:

@base <http://rdfa.info/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<> dcterms:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places, 
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        """@en;
   doap:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places, 
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        """@en;
   doap:homepage <>;
   doap:name "RDFa"@en;
   doap:shortdesc "The Resource Description Framework in Attributes"@en .

So it looks like semargl found one extra statement, namely

(http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],

which is fine with me.
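
The comparison can be checked mechanically with a set difference over the predicates of the two outputs listed above. This is only a sketch: it compares predicates rather than full triples, and it assumes that the `dc:description` printed by Sesame and the `dcterms:description` printed by the distiller denote the same IRI, which the outputs above do not confirm.

```scala
// Predicates as printed in the semargl output above.
val semarglPredicates = Set(
  "doap:name",
  "doap:shortdesc",
  "doap:homepage",
  "http://www.w3.org/1999/xhtml/vocab#stylesheet",
  "doap:description",
  "dc:description"
)

// Predicates as printed in the RDFa distiller output above.
val distillerPredicates = Set(
  "dcterms:description",
  "doap:description",
  "doap:homepage",
  "doap:name",
  "doap:shortdesc"
)

// ASSUMPTION: both prefixes refer to the same vocabulary term.
val normalize = Map("dc:description" -> "dcterms:description")

val onlyInSemargl =
  semarglPredicates.map(p => normalize.getOrElse(p, p)) -- distillerPredicates

println(onlyInSemargl)
```

Under that assumption the only predicate left over is the `stylesheet` one.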

bblfish commented 7 years ago

I don't seem to have the same luck with the NekoHTML parser:

import $ivy.`net.sourceforge.nekohtml:nekohtml:1.9.22`
val nekoParser = new org.cyberneko.html.parsers.SAXParser()
val attempt = Try{
  RdfaParser(_.setXmlReader(nekoParser)).parse(rdfa,"http://rdfa.info/")
}

which captures a NullPointerException:

attempt.get
java.lang.NullPointerException
  org.semarglproject.sesame.core.sink.SesameSink.convertNonLiteral(SesameSink.java:78)
  org.semarglproject.sesame.core.sink.SesameSink.addPlainLiteral(SesameSink.java:94)

bblfish commented 7 years ago

But if I set the RDFa version to 1.1 then it works.

import org.openrdf.rio.helpers.RDFaVersion
val attemptNK2 = Try(RdfaParser{p=>
       p.setXmlReader(nekoParser);
       p.setRdfaCompatibility(RDFaVersion.RDFA_1_1)
    }.parse(rdfa,"http://rdfa.info/"))

And we get 6 statements again.

Should the parser not determine the version itself? How is one meant to know from the outside which version to use?
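
For what it is worth, here is a rough sketch of how a caller could guess the version from the outside before configuring the parser. It is a heuristic of my own, not something semargl or Sesame provides. `guessRdfaVersion` is a hypothetical helper that looks for the `vocab` and `prefix` attributes introduced in RDFa 1.1, an explicit version string, or the HTML5 doctype. It inspects raw text rather than parsed markup, so it can be fooled.

```scala
// Hypothetical helper: guess the RDFa version from the start of a document.
// Returns "1.1", "1.0" or "unknown". Heuristic only.
def guessRdfaVersion(doc: String): String = {
  val head = doc.take(4096).toLowerCase
  // vocab and prefix are attributes that only exist in RDFa 1.1.
  val has11Attrs = """\s(vocab|prefix)\s*=""".r.findFirstIn(head).isDefined
  if (has11Attrs || head.contains("rdfa 1.1") || head.contains("<!doctype html>")) "1.1"
  else if (head.contains("rdfa 1.0")) "1.0"
  else "unknown"
}

val html5Doc  = """<!DOCTYPE html><html prefix="doap: http://usefulinc.com/ns/doap#">"""
val xhtml10   = """<html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0">"""
val plainHtml = "<html>"

println(guessRdfaVersion(html5Doc))
println(guessRdfaVersion(xhtml10))
println(guessRdfaVersion(plainHtml))
```

The result could then be mapped to the matching `RDFaVersion` constant before calling `setRdfaCompatibility`, but it would be nicer if the parser did this itself.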