ruby-rdf / rdf-rdfa

Ruby RDFa reader/writer for RDF.rb.
http://ruby-rdf.github.com/rdf-rdfa
The Unlicense

Unrecognized HTML 5 elements raise invalid tag errors #21

Closed by csarven 7 years ago

csarven commented 8 years ago

Tested at http://rdf.greggkellogg.net/distiller

<data about="" property="http://example.org/foo" data="4">6</data>

fails with: Errors found during processing <>: Tag data invalid

The same error occurs for some other elements, e.g. source and track.

Reproduce using form input or URL: http://csarven.ca/this-paper-is-a-demo

gkellogg commented 8 years ago

Unfortunately, Nokogiri doesn't recognize HTML5 elements; I've added a filter for the data, source and track elements, in addition to others that were reported.

Released in 1.99.1 and 2.0.0.beta2.
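
A minimal sketch of what such a filter could look like, assuming the parser's errors are available as plain message strings in the "Tag x invalid" form quoted above. This is not the gem's actual code: the `HTML5_ELEMENTS` list and the `filter_tag_errors` helper are hypothetical names used only for illustration.

```ruby
require 'set'

# Hypothetical allow-list of element names that should not be reported.
# The list actually used by the gem may differ.
HTML5_ELEMENTS = Set.new(%w[data source track]).freeze

# Drop "Tag <name> invalid" messages whose tag name is on the allow-list;
# every other message is passed through unchanged.
def filter_tag_errors(messages, allowed = HTML5_ELEMENTS)
  messages.reject do |msg|
    match = msg.match(/\ATag (\S+) invalid\z/)
    match && allowed.include?(match[1])
  end
end

# Illustrative input only.
errors = ['Tag data invalid', 'Tag svg invalid', 'some other parser error']
puts filter_tag_errors(errors)
```

The drawback of this approach is that every new element name (svg, math, and so on) has to be added to the list by hand.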

csarven commented 8 years ago

Some more (inline SVG) you might want to consider adding:

Errors found during processing

<>: Tag svg invalid

Tag g invalid

Tag line invalid

Tag circle invalid

... I suppose anything under svg will throw an error.

csarven commented 7 years ago

<https://dokie.li/acm-sigproc-sp>: Tag math invalid
Tag mi invalid
Tag mrow invalid
Tag mo invalid
Tag mn invalid
Tag mtable invalid
Tag mtr invalid
Tag mtd invalid
Tag munder invalid
Tag mstyle invalid
Tag munderover invalid
Tag msub invalid
Tag msubsup invalid
Tag msup invalid
Tag mfrac invalid

I acknowledge that it is not particularly fun to catch these one by one. Would it be simpler to omit unrecognised tags?

gkellogg commented 7 years ago

Yeah, this is a pain. Probably what we need to do is to use the Nokogumbo gem when we're dealing with HTML5. I spent some time on this at one point, but got stuck.

The only other alternative is to enumerate all known good element names, or not generate any warnings for bad element names.

Where do these terms come from?

Note that this is only a problem when validating.
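
The second alternative mentioned above (not reporting unknown element names at all) could be sketched as follows. It assumes the errors are plain strings in the "Tag x invalid" form seen in this thread; `UNKNOWN_TAG` and `drop_unknown_tag_errors` are hypothetical names, not part of rdf-rdfa or Nokogiri.

```ruby
# Matches the "Tag <name> invalid" messages quoted in this thread.
UNKNOWN_TAG = /\ATag \S+ invalid\z/

# Remove every unknown-tag message, regardless of the element name,
# and keep all other messages.
def drop_unknown_tag_errors(messages)
  messages.grep_v(UNKNOWN_TAG)
end

# Illustrative input only.
errors = ['Tag math invalid', 'Tag mrow invalid', 'some other parser error']
puts drop_unknown_tag_errors(errors)
```

This avoids maintaining a list of names, at the cost of also hiding genuinely misspelled element names.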

csarven commented 7 years ago

I'm primarily using http://rdf.greggkellogg.net/distiller to double check things on my HTML5 Polyglot documents. Nothing particularly fancy going on. Just running into some elements here and there and pinging here :) I presume https://github.com/ruby-rdf/rdf-rdfa/commit/8d469497777fd3a4b63e274b229b79b8bb44c9ee didn't yet make its way up there because I'm still getting the same errors. Thanks for the update.

gkellogg commented 7 years ago

Not yet, I'll update the distiller (and linter) shortly and report back on this issue.

gkellogg commented 7 years ago

Okay, updated now.

csarven commented 7 years ago

Looks good to me :+1:

csarven commented 7 years ago

Is this a regression? I'm seeing a bunch of "Tag x invalid" errors, e.g. for http://csarven.ca/linked-data-notifications (Input: RDFa).

gkellogg commented 7 years ago

@csarven No. The gem now uses Nokogumbo to parse HTML5, and on my machine the document parses without error. It does, however, generate errors on rdf.greggkellogg.net/distiller, which suggests that the build environment there isn't picking up the Nokogumbo gem and is falling back to Nokogiri, which reports these tags as errors.

Note, however, that validating the doc at validator.w3.org does show markup errors. The Linter also shows a number of warnings, but not due to markup errors.

I'll investigate why the distiller isn't running properly.
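
The fallback described above follows a common Ruby pattern: try to load the preferred library and use the next one if it is not available. The sketch below illustrates that pattern only and is not the gem's actual loading code; `first_loadable` is a hypothetical helper, while `nokogumbo` and `nokogiri` are the real gem names.

```ruby
# Return the first library name that can be required, or nil if none can.
def first_loadable(candidates)
  candidates.find do |name|
    begin
      require name
      true
    rescue LoadError
      # Not installed (or not loadable) in this environment; try the next one.
      false
    end
  end
end

# Which of these loads depends entirely on the environment.
parser = first_loadable(%w[nokogumbo nokogiri])
puts "HTML parser library in use: #{parser.inspect}"
```

Printing the selected library at startup is one way to notice that a deployment has silently fallen back to the less capable parser.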

gkellogg commented 7 years ago

There is some issue with Nokogumbo on Heroku which seems to be interfering with this. It runs fine when installed on my Mac. I mentioned it in https://github.com/rubys/nokogumbo/issues/25#issuecomment-305652420

gkellogg commented 7 years ago

The fix was to downgrade nokogumbo. It's running better now. There are still some errors, which are likely legitimate HTML5 content model errors:

Errors found during processing
<http://csarven.ca/this-paper-is-a-demo>: @284:41: That tag isn't allowed here  Currently open tags: html, body, main, article, div, section, div, section, div, audio..
</source>
^
@287:41: That tag isn't allowed here  Currently open tags: html, body, main, article, div, section, div, section, div, audio..
</track>
^
@305:41: That tag isn't allowed here  Currently open tags: html, body, main, article, div, section, div, section, div, video..
</source>
^
@308:41: That tag isn't allowed here  Currently open tags: html, body, main, article, div, section, div, section, div, video..
</track>

csarven commented 7 years ago

Looks good, thank you!

source and track should not have end tags. Fixed.

Shouldn't the RDFa parser look the other way for the content model errors? Perhaps a warning is more suitable?

gkellogg commented 7 years ago

Errors are passed through from Nokogumbo when you set the validate option. The idea is that errors found in the markup may affect your output. There's no reasonable way to filter these out, but validation is optional.