ruby-rdf / rdf-rdfa

Ruby RDFa reader/writer for RDF.rb.
The Unlicense

HTML5 + RDFa support ? #19

Closed: stain closed this issue 8 years ago

stain commented 8 years ago

I'm trying to extract and validate the RDFa from a page that uses HTML5.

/home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:492:in `add_error': Syntax errors: (RDF::ReaderError)
[#<Nokogiri::XML::SyntaxError: Tag nav invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>]
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:316:in `block in initialize'

Is there a way to tell Nokogiri to use HTML5 support so that <tag> is supported?

Also on I get these errors:

Warnings:
  Term stylesheet is not defined
  Term stylesheet is not defined

Errors:
  Tag nav invalid
  Tag section invalid

gkellogg commented 8 years ago

Best I can probably do is create an issue on nokogiri. A small stand-alone example would be useful. A workaround might be to filter known nokogiri problems.
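The "filter known nokogiri problems" workaround mentioned above could look roughly like the sketch below. It is not code from rdf-rdfa or Nokogiri: the names `HTML5_ELEMENTS`, `KNOWN_TAG_ERROR`, and `filter_known_html5_errors` are hypothetical, the element list is deliberately incomplete, and the sample messages are illustrative strings. It only assumes that the error text ends in the "Tag nav invalid" form shown in the trace.

```ruby
# Hypothetical helper; not part of rdf-rdfa or Nokogiri.
# Element names introduced by HTML5 that an HTML4-era parser may reject.
# Illustrative and deliberately incomplete.
HTML5_ELEMENTS = %w[article aside footer header main nav section].freeze

# Matches text ending in "Tag <name> invalid", as in the trace above.
# Not anchored at the start, so a location prefix does not prevent a match.
KNOWN_TAG_ERROR = /Tag (?<name>\w+) invalid\z/

# Returns the messages that are NOT "Tag X invalid" for a listed HTML5 element.
def filter_known_html5_errors(messages)
  messages.reject do |message|
    match = KNOWN_TAG_ERROR.match(message.to_s.strip)
    match && HTML5_ELEMENTS.include?(match[:name].downcase)
  end
end

# Sample strings for illustration only.
messages = [
  "Tag nav invalid",
  "Tag section invalid",
  "Tag blink invalid",
  "Unexpected end tag : p"
]

puts filter_known_html5_errors(messages)
```

The trade-off is that a filter like this also hides the same message for documents that really are meant to be HTML4, so it would presumably only apply when the host language is HTML5.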

stain commented 8 years ago

Shorter example:

<!DOCTYPE html>
<html lang="en" prefix="ex: ">
  <meta charset="utf-8">
  <link rel="stylesheet" href="">
  <!-- Typical Bootstrap use of HTML5 tag <nav> -->
  <nav class="navbar navbar-inverse navbar-fixed-top">
    <div class="container">
      <div class="navbar-header">
        <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
          <span class="sr-only">Toggle navigation</span>
          <span class="icon-bar"></span>
        </button>
      </div>
      <div id="navbar" class="collapse navbar-collapse">
        <ul class="nav navbar-nav">
          <li class="active"><a href="#">Home</a></li>
          <li><a href="#_classes">Classes</a></li>
        </ul>
      </div>
    </div>
  </nav>

  <!-- Bootstrap use of "container" div and role="main" -->
  <div class="container" role="main" style="margin-top: 3em">

    <!-- START RDFa bit -->
    <div about="">
      <div rev="skos:inScheme">
        <section id="_classes">
          <div id="Example" about="" typeof="owl:Class">
            <h3 property="rdfs:label">Example label</h3>
          </div>
        </section>
      </div>
    </div>
    <!-- END RDFa bit -->

  </div>
</html>



Triples are extracted as expected (it is debatable whether the role="main" triple should be included):

stain@biggie:~/Desktop$ rdf --input-format html5 serialize rdfa-issue-19.html 
<> <> <> .
<> <> <> .
<> <> "Example label"@en .
_:g26763080 <> <> .

But it fails with --validate:

stain@biggie:~/Desktop$ rdf --validate --input-format html5 serialize rdfa-issue-19.html 
/home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:492:in `add_error': Syntax errors: (RDF::ReaderError)
[#<Nokogiri::XML::SyntaxError: Tag nav invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>]
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:316:in `block in initialize'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `instance_eval'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `initialize'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:277:in `initialize'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `new'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `block in open'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `open_file'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `open'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `block in parse'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `each'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `parse'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `block in <class:CLI>'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `call'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `exec_command'
    from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf- `<top (required)>'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/rdf:23:in `load'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/rdf:23:in `<main>'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/ruby_executable_hooks:15:in `eval'
    from /home/stain/.rvm/gems/ruby-2.1.7/bin/ruby_executable_hooks:15:in `<main>'

Perhaps HTML5 is not detected? I get text/html as the type used. Can HTML5 be forced?

gkellogg commented 8 years ago

Thanks for the example, I'll look at this shortly. I believe the processing mode can be specified using an option to the reader. You should also be able to reproduce by calling Nokogiri::HTML.parse directly.

gkellogg commented 8 years ago

It would be great to use Nokogumbo, which uses Google's Gumbo HTML parser, but it's not quite ready. See #20.