oslc-op / oslc-specs

OSLC OP specifications and notes
https://open-services.net/specifications/
ShapeChecker cannot process dcterms references #320

Open jamsden opened 4 years ago

jamsden commented 4 years ago

I've submitted a pull request with oslc_am and core changes (https://github.com/oslc-op/oslc-specs/pull/317) that is failing with the errors below. See: https://app.circleci.com/pipelines/github/oslc-op/oslc-specs/309/workflows/a473724f-3155-447c-b668-ff4c061bb2ef/jobs/305

#!/bin/bash -eo pipefail

cd tools/ShapeChecker && ./check-cm.sh
[main] WARN org.apache.jena.riot - [line: 1, col: 7 ] {W104} Unqualified typed nodes are not allowed. Type treated as a relative URI.
[main] WARN org.apache.jena.riot - [line: 1, col: 7 ] {W136} Relative URIs are not permitted in RDF: specifically <html>
[main] WARN org.apache.jena.riot - [line: 2, col: 7 ] {W104} Unqualified property elements are not allowed. Treated as a relative URI.
[main] WARN org.apache.jena.riot - [line: 2, col: 7 ] {W136} Relative URIs are not permitted in RDF: specifically <head>
[main] WARN org.apache.jena.riot - [line: 2, col: 14] {W104} Unqualified typed nodes are not allowed. Type treated as a relative URI.
[main] WARN org.apache.jena.riot - [line: 2, col: 14] {W136} Relative URIs are not permitted in RDF: specifically <title>
[main] ERROR org.apache.jena.riot - [line: 2, col: 36] {E202} Expecting XML start or end element(s). String data "308 Permanent Redirect" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.

This looks like an attempt to read HTML source as RDF source.

check-cm.sh:

build/install/ShapeChecker/bin/ShapeChecker \
  -x http://open-services.net/ns/core ${comment# See https://github.com/oslc-op/oslc-specs/issues/40} \
  -x http://open-services.net/ns/cm ${comment# See https://github.com/oslc-op/oslc-specs/issues/40} \
  -v ../../specs/core/vocab/core-vocab.ttl \
  -v ../../specs/cm/change-mgt-vocab.ttl \
  -s ../../specs/cm/change-mgt-shapes.ttl

These .ttl files look ok. I see core doesn't have @base, but change-mgt-vocab.ttl does, and it's old. Should these be removed?

jamsden commented 4 years ago

I did some more investigating, with debug, and found that the errors are associated with dcterms references when checking resource shapes.

The typical error is:

Error on http://purl.org/dc/terms/source: The target resource cannot be fetched or parsed as RDF. (bad value org.apache.jena.riot.RiotException: [line: 2, col: 36] {E202} Expecting XML start or end element(s). String data "308 Permanent Redirect" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.)

This occurs when processing an oslc:Property with oslc:propertyDefinition dcterms:source. All the references to dcterms properties do this.

Using debug shows:

Parsing https://www.dublincore.org/2012/06/14/dcterms.rdf#source
[main] WARN org.apache.jena.riot - [line: 1, col: 7 ] {W104} Unqualified typed nodes are not allowed. Type treated as a relative URI.
[main] WARN org.apache.jena.riot - [line: 1, col: 7 ] {W136} Relative URIs are not permitted in RDF: specifically <html>
[main] WARN org.apache.jena.riot - [line: 2, col: 7 ] {W104} Unqualified property elements are not allowed. Treated as a relative URI.
[main] WARN org.apache.jena.riot - [line: 2, col: 7 ] {W136} Relative URIs are not permitted in RDF: specifically <head>
[main] WARN org.apache.jena.riot - [line: 2, col: 14] {W104} Unqualified typed nodes are not allowed. Type treated as a relative URI.
[main] WARN org.apache.jena.riot - [line: 2, col: 14] {W136} Relative URIs are not permitted in RDF: specifically <title>
[main] ERROR org.apache.jena.riot - [line: 2, col: 36] {E202} Expecting XML start or end element(s). String data "308 Permanent Redirect" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.

So it appears that http://purl.org/dc/terms/ is being redirected somehow to https://www.dublincore.org/2012/06/14/dcterms.rdf. That resource does exist, but accessing it with no Accept header returns HTML, while accessing it with Accept=text/turtle gives RDF.

Maybe Dublin Core has changed how they handle redirects and the Accept header?

ndjc commented 4 years ago

Yes, Dublin Core has strange redirects and non-standard handling of content negotiation. ShapeChecker already has a work-around in place for that, including doing the explicit redirection you mention. It appears that is not working for you. To avoid the issue, at least temporarily, add the command line option

`-x 'https?://purl.org/dc/terms.*'`

jamsden commented 4 years ago

That would need to be done in the CircleCI build in order for the pull request to pass its checks.

And that might miss some errors if the dcterms references are incorrect.

Is that -x above a regular expression?

ndjc commented 4 years ago

This issue has been fixed. Users of ShapeChecker should remove uses of the workaround using -x to suppress loading of Dublin Core.

berezovskyi commented 2 years ago

@ndjc I am hitting this problem again. I don't think the DCTerms vocab can be fetched in Turtle any more from PURL. Could you please point me to your fix? For now I am bringing back -x in some scripts.

ndjc commented 2 years ago

See this code in HttpHandler:

```
// Seems like Jena has a bug of ignoring the RDFParserBuilder Accept header,
// and Dublin Core uses an arcane set of redirects including 308, not handled by Apache by default,
// so we need to configure our HttpClient very carefully!
Header rdfHeader = new BasicHeader(HttpHeaders.ACCEPT, RDF_CONTENT_TYPES);
HttpClientBuilder builder = HttpClientBuilder
        .create()
        .setRedirectStrategy(redirect308())
        .setDefaultHeaders(Collections.singletonList(rdfHeader))
        .addInterceptorFirst((HttpRequestInterceptor) (request, context) ->
                request.addHeader(HttpHeaders.ACCEPT, RDF_CONTENT_TYPES));
```

and look at the redirect308() method.

Note that you can run with debug levels > 2 (-D -D -D) to get more info about the HTTP requests being sent and the responses returned.

berezovskyi commented 2 years ago

Thanks Nick! My plan is to intercept calls to the URIs listed on this page and fetch the Turtle from a completely different location: https://www.dublincore.org/schemas/rdfs/

Notably, we will fetch the Turtle representation for the `http://purl.org/dc/terms/` namespace from https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.ttl

berezovskyi commented 2 years ago

To expand on why: the original URI no longer supports conneg and seems to serve HTML no matter what, causing "RiotException: Triples not terminated by DOT". I will reply back here if I find a less intrusive workaround.

berezovskyi commented 2 years ago

Here is the response for posterity:

```html
<html>
  <head>
    <title>INetSim default HTML page</title>
  </head>
  <body>
    <p></p>
    <p align="center">This is the default HTML page for INetSim HTTP server fake mode.</p>
    <p align="center">This file is an HTML document.</p>
  </body>
</html>
```

Request export from Postman:

```sh
curl --location --request GET 'http://purl.org/dc/terms/' \
  --header 'Accept: text/turtle;q=1.0,application/rdf+xml;q=0.9,application/n-triples;q=0.8,application/ld+json;q=0.3'
```

berezovskyi commented 2 years ago

Suggestions from https://gitter.im/linkeddata/chat:

- use a repository of vocabularies that don't resolve: https://github.com/zazuko/rdf-vocabularies/tree/master/ontologies
- rely on Jena's mechanism to specify tables of such dereferencing alternatives: https://jena.apache.org/documentation/notes/stream-manager.html
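For anyone picking this up, the "intercept and fetch from a different location" plan discussed above could be sketched roughly as a prefix-to-location table. This is only an illustration in plain Java, not ShapeChecker's or Jena's actual code: the class name `VocabularyLocationMapper` and its `resolve` method are hypothetical, and the single table entry is the Dublin Core Turtle location quoted earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of ShapeChecker or Jena): maps a vocabulary
// namespace prefix to an alternate location that is known to serve RDF.
final class VocabularyLocationMapper {
    private final Map<String, String> prefixToLocation = new LinkedHashMap<>();

    VocabularyLocationMapper() {
        // The only entry taken from this thread: the dcterms namespace and
        // the Turtle file published on dublincore.org.
        prefixToLocation.put(
            "http://purl.org/dc/terms/",
            "https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.ttl");
    }

    // Returns the alternate location for a URI whose prefix is in the table,
    // or the URI unchanged when no entry applies.
    String resolve(String uri) {
        for (Map.Entry<String, String> entry : prefixToLocation.entrySet()) {
            if (uri.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        VocabularyLocationMapper mapper = new VocabularyLocationMapper();
        System.out.println(mapper.resolve("http://purl.org/dc/terms/source"));
        System.out.println(mapper.resolve("http://open-services.net/ns/core#"));
    }
}
```

A caller would pass each property URI through `resolve` before fetching, so that dcterms references are read from the Turtle file while every other vocabulary is still dereferenced normally. Jena's stream manager, linked in the suggestions above, is presumably the more complete way to get the same effect.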
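On the earlier question of whether the `-x` value is a regular expression: the thread never answers it, so the following is only a sketch of what the suggested pattern would exclude if it were treated as a Java regular expression that must match the whole URI. That matching behavior is an assumption, and `ExclusionPatternDemo` and `isExcluded` are hypothetical names, not ShapeChecker code.

```java
import java.util.regex.Pattern;

// Hypothetical demo: assumes the -x argument is a Java regex applied to the
// full URI. ShapeChecker's real matching rules are not confirmed in this thread.
final class ExclusionPatternDemo {
    // The pattern suggested in the thread. Note that the unescaped '.' matches
    // any character, so it is slightly broader than a literal "purl.org".
    static final Pattern DCTERMS = Pattern.compile("https?://purl.org/dc/terms.*");

    // True when the whole URI matches the exclusion pattern.
    static boolean isExcluded(String uri) {
        return DCTERMS.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://purl.org/dc/terms/source",
            "https://purl.org/dc/terms/",
            "http://purl.org/dc/elements/1.1/title",
            "http://open-services.net/ns/core"
        };
        for (String uri : uris) {
            System.out.println(uri + " excluded: " + isExcluded(uri));
        }
    }
}
```

Under that assumption, both the http and https forms of the dcterms namespace would be skipped, while other vocabularies would still be fetched and checked.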