Closed paregorios closed 2 years ago
Added by @ryanfb on Oct 25, 2013 04:44 PM It looks like rdflib (which looks to be what pleiades-rdf uses to serialize to turtle) expects URLs/URIs to be escaped before being passed in to the URIRef constructor: https://github.com/RDFLib/rdflib/issues/263
Added by @ryanfb on Oct 31, 2013 10:42 AM Diff of changes made to load/convert places in Jena rdfcat in one RDF dump snapshot: https://gist.github.com/ryanfb/18d7fa31d7561791ceb2
Added by @paregorios on Oct 31, 2013 10:49 AM data fix in place for: https://gist.github.com/rya[…]eiades-uri-escape-patch-L47
Added by @paregorios on Oct 31, 2013 10:53 AM data fix in place for: https://gist.github.com/rya[…]eiades-uri-escape-patch-L56
This problem has also been reported by @wouterbeek and Vincent Altenavan:
files
places-5.tll
andplaces-7.ttl
contain unencoded spaces in IRIs (which is not allowed), e.g.,:
<http://atlantides.org/capgrids/79#A2 > osgeo:extent <http://atlantides.org/capgrid/79#A2 -extent> .
<doi: 10.3368/lj.28.1.1>;
As @ryanfb pointed out above, it looks like the URIs need to have been urlencoded before being passed to URIRef()
in https://github.com/isawnyu/pleiades-rdf/blob/master/pleiades/rdf/common.py (and possibly elsewhere?).
priority:high because data corruption
@ryanfb I've looked back at the ttl spec and, unless I'm really clueless, I think the spec itself only requires that we're using UTF-8:
Literals and URIs may also contain escapes to encode surrounding syntax, non-printable characters and to encode Unicode characters by codepoint number (although they may also be given directly, encoded as UTF-8).
The syntax of Turtle is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [RFC3629].
Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]
The media type of Turtle is text/turtle (pre-registration media type application/x-turtle should be accepted). The content encoding of Turtle content is always UTF-8. Charset parameters on the mime type are required until such time as the text/ media type tree permits UTF-8 to be sent without a charset parameter.
Do you read this differently? I'm not opposed to escaping/encoding URIs as a convenience for downstream consumers that don't faithfully implement the spec, but I do need to assess priority of same in the light of limited resources. If, however, I'm not understanding something correctly, I'm prepared to keep the priority high.
Hi @paregorios,
My original complaint was about unescaped spaces appearing in some IRIs. E.g., the last character of the subject term and the character before the hyphen in the object term:
<http://atlantides.org/capgrids/79#A2 > osgeo:extent <http://atlantides.org/capgrid/79#A2 -extent> .
Notice that the main issue may not even be that the spaces are unescaped, but that they appear in the first place. Are they no simply the result of incorrect IRI creation?
I would be opposed to URI-encoding all non-ASCII Unicode characters that appear in IRIs, since this makes IRIs containing names from non-English languages (including ancient languages) far less readable, without resulting in a clear benefit.
Having reviewed the examples provided by @ryanfb and @wouterbeek I see that our problems are with URLs that are broken in some way, e.g.:
We probably need to do a few things here:
I've manually addressed the specific problems that were included in the gist from @ryanfb by checking in the reference changes he had submitted previously and by updating all references on the basis of our Zotero data.
As a consumer of Pleiades content in any serialization (html, json, rdf), I will never encounter malformed HTTP(s) URIs.
There are several contexts in which contributor users can enter URIs into data fields. We should validate to ensure:
The following contexts should have validation:
Looks like the URL validators used for these fields don't do much more than making certain there are no spaces in the URL. So, not sure where a space could be coming from.
I reviewed the .ttl files from pleiades downloads, and couldn't find URIs similar to those with errors (that is, no URis with space before closing bracket, or between name and -extent.
@cguardia There are still invalid IRIs in the latest version of the RDF download (I used http://atlantides.org/downloads/pleiades/rdf/pleiades-latest.tar.gz).
The first issue is that the caret character (^
) which appears unescaped multiple times in an IRI on line 60,524:
<http://www.persee.fr/web/revues/home/prescript/article/racf_0220-6617_1991_num_30_1_2657?luceneQuery=%28%2B%28content%3AAQUAE+title%3AAQUAE^2.0+fullContent%3AAQUAE^100.0+fullTitle%3AAQUAE^140.0+summary%3AAQUAE+authors%3AAQUAE^5.0+illustrations%3AAQUAE^4.0>
Please consider using an off-the-shelf solution that has been tested for years like uriparser library (https://uriparser.github.io/) in order to automatically perform such tests. Since this is such a widely used operation, every popular programming language has a good URI library that performs validity checks. (No need to re-invent the wheel.)
@wouterbeek I appreciate your efforts to help us sort this out and the investment of time that represents. What is unclear to us is which characters do need escaping and why. We have had previous discussion in this thread about requirements for same, concluding that there is no RDF/TTL requirement to do more than encode valid URIs using UTF-8. We believe we have addressed the specific issue you previously highlighted (trapped spaces in some URIs in the RDF).
Now you seem to be highlighting a set of issues related to RFC 3986 conformance. I am not a URI syntax expert; however, I have just looked through the RFC. I do not see the caret character listed as a reserved character. I recognize that I may not have detected provision(s) that might prohibit caret (and other characters) without specifically identifying them. That is why I make the following request. Can you please explain in more detail what settings/process you are using with the uriparser so that we can duplicate your findings in toto, verify that tool output is in fact conformant to the spec, and thereby ensure that we are producing quality data that meets standards, rather than just the idiosyncrasies of individual tool chains.
@paregorios No problem, URI/IRI syntax is (unfortunately) quite complex. I'll first dive into the syntax a bit and then give concrete examples on how to detect violations of the URI/IRI syntax.
The issues that I have observed seem to occur in the path component of IRIs. This is the part of the IRI that is separated by forward slashes. The parts in between the forward slashes are called 'segments'. They are component of characters that match the ipchar
rule. The relevant BNF rules from the RFC 3987 standard are as follows:
ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
^ Notice that the caret (^
) and the space are not part of the iunreserved
and sub-delims
rules, and are therefore not allowed. Another one is the pipe character (|
). (We can skip the rule for ucschar
since there are no issues with non-ASCII Unicode characters ATM.)
rapper
Rapper is a command-line tool for parsing most RDF serialization formats. It can be installed with apt install raptor2-utils
on Ubuntu. When I run it for the following Pleiades file I get an error:
$ rapper -i turtle places-3.ttl > /dev/null
rapper: Parsing URI file:///home/wouter/Downloads/pleiades-latest/places-3.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///home/wouter/Downloads/pleiades-latest/places-3.ttl:112694 - syntax error at '<'
rapper: Failed to parse file places-3.ttl turtle content
rapper: Parsing returned 109764 triples
serdi
Serdi is the command-line tool that comes with the Serd library. It can be installed with apt install serdi
on Ubuntu. When I run it for the same Pleiades file I get a similar error:
$ serdi -i turtle places-3.ttl > /dev/null
error: places-3.ttl:112694:87: invalid IRI character ` ' (escape %20)
liburiparser
In addition to command-line tools, it is possible to check the validity on a per-URI basis. The following function returns true
iff the entered string encodes a valid URI. (The example is in C++, but similar libraries exist for other languages as well.)
#include <uriparser/Uri.h>
auto is_valid(const std::string& input) const -> bool
{
UriUriA uri;
const char* errPos;
auto valid {uriParseSingleUriA(&uri, input.c_str(), &errPos) == URI_SUCCESS};
if (valid) uriFreeUriMembersA(&uri);
return valid;
}
I have just reviewed and verified that steps taken previously in response to this ticket have eliminated trailing spaces in URIs (see also https://github.com/isawnyu/pleiades-rdf/issues/6). I am now investigating current status of the other problems detailed by @wouterbeek above.
Our TTL files are still throwing errors when parsed with rapper
, for example:
rapper -i turtle data/rdf/places-1.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl:77912 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-1.ttl turtle content
rapper: Parsing returned 73585 triples
For this example, the data at line 77912 (reported in the rapper error) reads:
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
Indeed, unencoded curly braces can be seen in the URI, and these are unsafe URI characters. As discussed above, fully urlencoding all our URIs would render many helpfully readable URIs opaque. A better solution perhaps is to selectively encode only the unsafe characters: "{", "}", "|", "^", "~", "[", "]", and "`".
Accordingly I offer the following reformulation of the problem statement:
Description: Unsafe characters occur in URIs in Pleiades RDF/TTL exports.
Examples:
$ rapper -i turtle data/rdf/places-1.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl:77912 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-1.ttl turtle content
rapper: Parsing returned 73585 triples
$ sed -n "77912p" data/rdf/places-1.ttl
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-2.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-2.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-2.ttl:188460 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-2.ttl turtle content
rapper: Parsing returned 181597 triples
$ sed -n "188460p" data/rdf/places-2.ttl
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-3.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-3.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-3.ttl:93863 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-3.ttl turtle content
rapper: Parsing returned 90766 triples
$ sed -n "93863p" data/rdf/places-3.ttl
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>;
$ rapper -i turtle data/rdf/places-4.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-4.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-4.ttl:38258 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-4.ttl turtle content
rapper: Parsing returned 34621 triples
$ sed -n "38258p" data/rdf/places-4.ttl
cito:citesForInformation <http://atlantides.org/bibliography/c.html#Ciampoltrini-G.-1994.-"Aspetti-dell'insediamento-etrusco-nella-valle-del-Serchio:-il-V-sec.-a.C."-Studi-Etruschi-59.>,
$ rapper -i turtle data/rdf/places-5.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-5.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-5.ttl:89953 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-5.ttl turtle content
rapper: Parsing returned 84766 triples
$ sed -n "89953p" data/rdf/places-5.ttl
pleiades:hasLocation <https://pleiades.stoa.org/places/512650/OSM location of Aesopus Bridge>;
$ rapper -i turtle data/rdf/places-6.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-6.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-6.ttl:50477 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-6.ttl turtle content
rapper: Parsing returned 46367 triples
$ sed -n "50477p" data/rdf/places-6.ttl
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>;
$ rapper -i turtle data/rdf/places-7.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-7.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-7.ttl:42984 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-7.ttl turtle content
rapper: Parsing returned 40089 triples
$ sed -n "42984p" data/rdf/places-7.ttl
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-8.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-8.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-8.ttl:48163 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-8.ttl turtle content
rapper: Parsing returned 45263 triples
$ sed -n "48163p" data/rdf/places-8.ttl
<https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-9.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-9.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-9.ttl:97760 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-9.ttl turtle content
rapper: Parsing returned 97358 triples
$ sed -n "97760p" data/rdf/places-9.ttl
cito:citesForInformation <http://atlantides.org/bibliography/d.html#"de-Jorio"/Russo-1829.-No.-118;-Schultz>;
Recommended fix: Either of the following options seems plausible:
Test to verify fix: Regenerate all TTL exports and ensure rapper runs clean on all of them.
Problem examples (try the "turtle" serializations for these places):
@jessesnyder The problematic URLs in those two examples are (respectively):
https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}
http://atlantides.org/bibliography/c.html#Ciampoltrini-G.-1994.-”Aspetti-dell’insediamento-etrusco-nella-valle-del-Serchio:-il-V-sec.-a.C.“-Studi-Etruschi-59.
@paregorios Is it the case that in the following example, we'd want to encode only the {
and }
characters, and not the "پارسی"?
https://www.example.com/پارسی/product?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}
@jessesnyder yes.
I made a little RDF/TTL file with only the curly braces encoded:
@prefix cito: <http://purl.org/spar/cito/> .
<https://pleiades.stoa.org/places/116722325> cito:citesForInformation <https://www.example.com/پارسی/product?id=%7b5B092BCB-89F4-4EA0-94FF-811F304B3B27%7d> .
I ran it through rapper
:
rapper -i turtle ~/scratch/satan.ttl
rapper: Parsing URI file:///Users/foo/bar/scratch/satan.ttl with parser turtle
rapper: Serializing with serializer ntriples
<https://pleiades.stoa.org/places/116722325> <http://purl.org/spar/cito/citesForInformation> <https://www.example.com/\u067E\u0627\u0631\u0633\u06CC/product?id=%7b5B092BCB-89F4-4EA0-94FF-811F304B3B27%7d> .
rapper: Parsing returned 1 triple
If rapper
is happy, then I'm happy.
@paregorios A possible solution is deployed to the staging server.
@jessesnyder In trying to resolve the example URI {STAGING}/places/116722325/turtle
, I got the same traceback I'm seeing when trying to test #471, which I have documented here.
The two examples of individual files I’d given (403281 and 116722325) now test clean in rapper. I still need to review/test the full RDF/TTL export files.
@paregorios I don't know what URL you'd access the archive at, so I downloaded it from the server: pleiades-20220323.tar.gz
@jesssesnyder: could be the result of a tar packaging error or of a problem upstream with the generation script or the code it invokes:
We are getting close. All but one of the dump files test clean with rapper
. The remaining (hopefully last?!?!) problem can be found in places-8.ttl, line 48206:
cito:citesForInformation <http://atlantides.org/bibliography/r.html%23RE:Hippika-ore VIII>,
This reference URL occurs in Pleiades place 834285 (Hippika? Ore).
@jessesnyder I propose I fix this URI manually on staging and production (because it's a mess anyway and better can be quickly provided), then re-run the dumps on staging and see if we pass the rapper
test at last.
A couple of tabs are sneaking through, so we should add TAB to the list of characters that we selectively URL-encode.
Tabs now escaped also, and tested successfully with rapper
.
This is working perfectly on staging. Ready to deploy to production.
Individual serializations on production are good. Will keep this open pending testing of complete export files after next daily dump script runs (tomorrow morning) and I have time to check it.
This morning's bulk RDF export ran to completion and all place files parse error-free in rapper
. This is done.
@ryanfb reports:
working with pleiades RDF dumps I get some issues with URL's in the ttl files not being URL-encoded that winds up needing hand-fixing before parsing in some things…e.g. angle brackets, square brackets, percent signs, etc. not being %-escaped
https://gist.github.com/ryanfb/6796a5e2d77a4c1423b4 for specific instances
Steps to reproduce: current flow encountering them is using jena rdfcat to concatenate the places ttl and convert to rdfxm
Migrated from http://pleiades.stoa.org/docs/site-issue-tracker/60