Closed: stain closed this issue 8 years ago
The best I can probably do is open an issue on Nokogiri. A small stand-alone example would be useful. A workaround might be to filter out known Nokogiri problems.
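The filtering workaround could look roughly like the sketch below. It is only an illustration, not how rdf-rdfa or Nokogiri behave: the helper name filter_known_html5_errors and the HTML5_ELEMENTS list are hypothetical, the list is deliberately incomplete, and the code assumes the parser reports unknown elements with messages ending in "Tag <name> invalid", as in the errors quoted later in this thread.

```ruby
# Hypothetical helper (not part of rdf-rdfa or Nokogiri): drop parser errors
# that only complain about valid HTML5 elements unknown to an HTML4 parser.

# Illustrative, incomplete list of HTML5 element names.
HTML5_ELEMENTS = %w[
  article aside audio canvas details figcaption figure footer header
  main mark nav section summary time video
].freeze

# Assumed message shape: the text ends with "Tag <name> invalid".
TAG_INVALID = /Tag (\S+) invalid\z/

def filter_known_html5_errors(errors)
  errors.reject do |error|
    # Accept strings or error objects; compare on their string form.
    match = TAG_INVALID.match(error.to_s.strip)
    match && HTML5_ELEMENTS.include?(match[1].downcase)
  end
end

errors = ["Tag nav invalid", "Tag section invalid", "Tag bogus invalid"]
puts filter_known_html5_errors(errors).inspect
```

Anything left after filtering (here, the made-up "bogus" tag) would still be treated as a genuine syntax error, so validation stays meaningful for real problems.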
Shorter example:
<!DOCTYPE html>
<html lang="en" prefix="ex: http://example.com/ ">
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
</head>
<body>
<!-- Typical Bootstrap use of HTML5 tag <nav> -->
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
</button>
Ontology
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="active"><a href="#">Home</a></li>
<li><a href="#_classes">Classes</a></li>
</ul>
</div>
</div>
</nav>
<!-- Bootstrap use of "container" div and role="main" -->
<div class="container" role="main" style="margin-top: 3em">
<!-- START RDFa bit -->
<div about="http://example.com/">
<div rev="skos:inScheme">
<section id="_classes">
<h2>Classes</h2>
<div id="Example" about="http://example.com/Example" typeof="owl:Class">
<h3 property="rdfs:label">Example label</h3>
</div>
</section>
</div>
</div>
<!-- END RDFa bit -->
</div>
</body>
</html>
Triples are extracted as expected (it is debatable whether role: main should be in or out):
stain@biggie:~/Desktop$ rdf --input-format html5 serialize rdfa-issue-19.html
<http://example.com/Example> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://example.com/Example> <http://www.w3.org/2004/02/skos/core#inScheme> <http://example.com/> .
<http://example.com/Example> <http://www.w3.org/2000/01/rdf-schema#label> "Example label"@en .
_:g26763080 <http://www.w3.org/1999/xhtml/vocab#role> <http://www.w3.org/1999/xhtml/vocab#main> .
But it fails with --validate:
stain@biggie:~/Desktop$ rdf --validate --input-format html5 serialize rdfa-issue-19.html
/home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:492:in `add_error': Syntax errors: (RDF::ReaderError)
[#<Nokogiri::XML::SyntaxError: Tag nav invalid>, #<Nokogiri::XML::SyntaxError: Tag section invalid>]
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:316:in `block in initialize'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:207:in `instance_eval'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:207:in `initialize'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-rdfa-1.1.6/lib/rdf/rdfa/reader.rb:277:in `initialize'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:148:in `new'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:148:in `block in open'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/util/file.rb:346:in `open_file'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/reader.rb:136:in `open'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:196:in `block in parse'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:195:in `each'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:195:in `parse'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:67:in `block in <class:CLI>'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:169:in `call'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/lib/rdf/cli.rb:169:in `exec_command'
from /home/stain/.rvm/gems/ruby-2.1.7/gems/rdf-1.1.16.1/bin/rdf:18:in `<top (required)>'
from /home/stain/.rvm/gems/ruby-2.1.7/bin/rdf:23:in `load'
from /home/stain/.rvm/gems/ruby-2.1.7/bin/rdf:23:in `<main>'
from /home/stain/.rvm/gems/ruby-2.1.7/bin/ruby_executable_hooks:15:in `eval'
from /home/stain/.rvm/gems/ruby-2.1.7/bin/ruby_executable_hooks:15:in `<main>'
Perhaps HTML5 is not detected? I get text/html as the type used. Can HTML5 be forced?
Thanks for the example, I'll look at this shortly. I believe the processing mode can be specified using an option to the reader. You should also be able to reproduce by calling Nokogiri::HTML.parse directly.
I'm trying to extract and validate the RDFa from http://stain.github.io/bridgedb-vocabulary/ which uses HTML5.
Is there a way to tell Nokogiri to use HTML5 support so that <tag> is supported?
Also, on http://rdf.greggkellogg.net/distiller?uri=http://stain.github.io/bridgedb-vocabulary/ I get these errors: