ruby-rdf / rdf-rdfa

Ruby RDFa reader/writer for RDF.rb.
http://ruby-rdf.github.com/rdf-rdfa
The Unlicense

HTML5 + RDFa support? #19

Closed: stain closed this issue 8 years ago

stain commented 8 years ago

I'm trying to extract and validate the RDFa from http://stain.github.io/bridgedb-vocabulary/ which uses HTML5.

/home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:492:in `add_error': Syntax errors: (RDF::ReaderError)
[#<Nokogiri::XML::SyntaxError: Tag nav invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>]
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:316:in `block in initialize'

Is there a way to tell Nokogiri to use HTML5 support so that tags such as <nav> and <section> are accepted?

Also on http://rdf.greggkellogg.net/distiller?uri=http://stain.github.io/bridgedb-vocabulary/ I get these errors:

Warnings

http://stain.github.io/bridgedb-vocabulary/html/head/link: Term stylesheet is not defined
http://stain.github.io/bridgedb-vocabulary/html/head/link: Term stylesheet is not defined

Errors

http://stain.github.io/bridgedb-vocabulary/:
Tag nav invalid
Tag section invalid

gkellogg commented 8 years ago

The best I can probably do is open an issue on Nokogiri. A small stand-alone example would be useful. A workaround might be to filter out known Nokogiri problems.
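The filtering workaround mentioned above could look roughly like the following. This is only a sketch: `filter_html5_tag_errors`, `HTML5_ELEMENTS` and `INVALID_TAG` are hypothetical names, not part of rdf-rdfa or Nokogiri, and the element list is illustrative and incomplete. It works on plain message strings so that it does not depend on any particular Nokogiri version; in practice the messages would presumably come from something like `doc.errors.map(&:to_s)`.

```ruby
# Hypothetical helper, not part of rdf-rdfa or Nokogiri.
# Element names introduced by HTML5 that an HTML4-era parser may reject.
# The list is illustrative and deliberately incomplete.
HTML5_ELEMENTS = %w[
  article aside audio canvas figcaption figure footer header
  main mark nav section time video
].freeze

# Matches messages such as "Tag nav invalid" and captures the tag name.
# Left unanchored so that a location prefix in front of the message still matches.
INVALID_TAG = /\bTag (\S+) invalid\b/

# Returns the messages that are NOT "Tag X invalid" complaints about a
# known HTML5 element; everything else is kept as a real error.
def filter_html5_tag_errors(messages)
  messages.reject do |message|
    match = INVALID_TAG.match(message.to_s)
    match && HTML5_ELEMENTS.include?(match[1].downcase)
  end
end

errors = [
  "Tag nav invalid",
  "Tag section invalid",
  "Tag blink invalid",
  "Unexpected end tag : p"
]
remaining = filter_html5_tag_errors(errors)
puts remaining
```

With this approach the `nav` and `section` complaints from the report are dropped, while an unknown tag such as `blink` and unrelated syntax errors are still reported.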

stain commented 8 years ago

Shorter example:

<!DOCTYPE html>
<html lang="en" prefix="ex: http://example.com/ ">
<head>
  <meta charset="utf-8">
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
</head>
<body>
  <!-- Typical Bootstrap use of HTML5 tag <nav> -->
  <nav class="navbar navbar-inverse navbar-fixed-top">
    <div class="container">
      <div class="navbar-header">
        <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
          <span class="sr-only">Toggle navigation</span>
          <span class="icon-bar"></span>
        </button>
        Ontology
      </div>
      <div id="navbar" class="collapse navbar-collapse">
        <ul class="nav navbar-nav">
          <li class="active"><a href="#">Home</a></li>
          <li><a href="#_classes">Classes</a></li>
        </ul>
      </div>
    </div>
  </nav>

<!-- Bootstrap use of "container" div and role="main" -->
<div class="container" role="main" style="margin-top: 3em">

<!-- START RDFa bit -->
  <div about="http://example.com/">
    <div rev="skos:inScheme">
      <section id="_classes">
        <h2>Classes</h2>
        <div id="Example" about="http://example.com/Example" typeof="owl:Class">
          <h3 property="rdfs:label">Example label</h3>
        </div>
      </section>
    </div>
  </div>
<!-- END RDFa bit -->

</div>

</body>
</html>

Triples are extracted as expected (whether the role="main" triple should be emitted is debatable):

stain@biggie:~/Desktop$ rdf --input-format html5 serialize rdfa-issue-19.html 
<http://example.com/Example> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://example.com/Example> <http://www.w3.org/2004/02/skos/core#inScheme> <http://example.com/> .
<http://example.com/Example> <http://www.w3.org/2000/01/rdf-schema#label> "Example label"@en .
_:g26763080 <http://www.w3.org/1999/xhtml/vocab#role> <http://www.w3.org/1999/xhtml/vocab#main> .

But the same file fails with --validate:

stain@biggie:~/Desktop$ rdf --validate --input-format html5 serialize rdfa-issue-19.html 
/home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:492:in `add_error': Syntax errors: (RDF::ReaderError)
[#<Nokogiri::XML::SyntaxError: Tag nav invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>]
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:316:in `block in initialize'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:207:in `instance_eval'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:207:in `initialize'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:277:in `initialize'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:148:in `new'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:148:in `block in open'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/util/file.rb:346:in `open_file'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:136:in `open'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:196:in `block in parse'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:195:in `each'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:195:in `parse'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:67:in `block in <class:CLI>'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:169:in `call'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:169:in `exec_command'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/bin/rdf:18:in `<top (required)>'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/rdf:23:in `load'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/rdf:23:in `<main>'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/ruby_executable_hooks:15:in `eval'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/ruby_executable_hooks:15:in `<main>'

Perhaps HTML5 is not detected? I get text/html as the type used. Can HTML5 be forced?

gkellogg commented 8 years ago

Thanks for the example, I'll look at this shortly. I believe the processing mode can be specified using an option to the reader. You should also be able to reproduce by calling Nokogiri::HTML.parse directly.

gkellogg commented 8 years ago

It would be great to use Nokogumbo, which uses Google's Gumbo HTML parser, but it's not quite ready. See #20.