isawnyu / pleiades-gazetteer

This repository provides a home for tickets and other planning documents for the Pleiades gazetteer of ancient places. Code is kept in multiple other repositories.
https://pleiades.stoa.org
11 stars 0 forks source link

ensure URIs in TTL export files are valid and well-formed: 3pts #9

Closed paregorios closed 2 years ago

paregorios commented 9 years ago

@ryanfb reports:

working with pleiades RDF dumps I get some issues with URL's in the ttl files not being URL-encoded that winds up needing hand-fixing before parsing in some things…e.g. angle brackets, square brackets, percent signs, etc. not being %-escaped

https://gist.github.com/ryanfb/6796a5e2d77a4c1423b4 for specific instances

Steps to reproduce: current flow encountering them is using jena rdfcat to concatenate the places ttl and convert to rdfxm

Migrated from http://pleiades.stoa.org/docs/site-issue-tracker/60

paregorios commented 9 years ago

Added by @ryanfb on Oct 25, 2013 04:44 PM It looks like rdflib (which looks to be what pleiades-rdf uses to serialize to turtle) expects URLs/URIs to be escaped before being passed in to the URIRef constructor: https://github.com/RDFLib/rdflib/issues/263

paregorios commented 9 years ago

Added by @ryanfb on Oct 31, 2013 10:42 AM Diff of changes made to load/convert places in Jena rdfcat in one RDF dump snapshot: https://gist.github.com/ryanfb/18d7fa31d7561791ceb2

paregorios commented 9 years ago

Added by @paregorios on Oct 31, 2013 10:49 AM data fix in place for: https://gist.github.com/rya[…]eiades-uri-escape-patch-L47

paregorios commented 9 years ago

Added by @paregorios on Oct 31, 2013 10:53 AM data fix in place for: https://gist.github.com/rya[…]eiades-uri-escape-patch-L56

paregorios commented 6 years ago

This problem has also been reported by @wouterbeek and Vincent Altenavan:

files places-5.tll and places-7.ttl contain unencoded spaces in IRIs (which is not allowed), e.g.,:

<http://atlantides.org/capgrids/79#A2 > osgeo:extent <http://atlantides.org/capgrid/79#A2 -extent> . <doi: 10.3368/lj.28.1.1>;

paregorios commented 6 years ago

As @ryanfb pointed out above, it looks like the URIs need to have been urlencoded before being passed to URIRef() in https://github.com/isawnyu/pleiades-rdf/blob/master/pleiades/rdf/common.py (and possibly elsewhere?).

paregorios commented 5 years ago

priority:high because data corruption

paregorios commented 5 years ago

@ryanfb I've looked back at the ttl spec and, unless I'm really clueless, I think the spec itself only requires that we're using UTF-8:

Literals and URIs may also contain escapes to encode surrounding syntax, non-printable characters and to encode Unicode characters by codepoint number (although they may also be given directly, encoded as UTF-8).

The syntax of Turtle is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [RFC3629].

Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]

The media type of Turtle is text/turtle (pre-registration media type application/x-turtle should be accepted). The content encoding of Turtle content is always UTF-8. Charset parameters on the mime type are required until such time as the text/ media type tree permits UTF-8 to be sent without a charset parameter.

Do you read this differently? I'm not opposed to escaping/encoding URIs as a convenience for downstream consumers that don't faithfully implement the spec, but I do need to assess priority of same in the light of limited resources. If, however, I'm not understanding something correctly, I'm prepared to keep the priority high.

wouterbeek commented 5 years ago

Hi @paregorios,

My original complaint was about unescaped spaces appearing in some IRIs. E.g., the last character of the subject term and the character before the hyphen in the object term:

<http://atlantides.org/capgrids/79#A2 > osgeo:extent <http://atlantides.org/capgrid/79#A2 -extent> .

Notice that the main issue may not even be that the spaces are unescaped, but that they appear in the first place. Are they no simply the result of incorrect IRI creation?

I would be opposed to URI-encoding all non-ASCII Unicode characters that appear in IRIs, since this makes IRIs containing names from non-English languages (including ancient languages) far less readable, without resulting in a clear benefit.

paregorios commented 5 years ago

Having reviewed the examples provided by @ryanfb and @wouterbeek I see that our problems are with URLs that are broken in some way, e.g.:

We probably need to do a few things here:

  1. Hunt down the specific problems identified and clean them up in the CMS
  2. Add some URL format validation either at input (preferred) or at RDF serialization
  3. Hunt down other existing, invalid values and fix them
paregorios commented 5 years ago

I've manually addressed the specific problems that were included in the gist from @ryanfb by checking in the reference changes he had submitted previously and by updating all references on the basis of our Zotero data.

paregorios commented 5 years ago

As a consumer of Pleiades content in any serialization (html, json, rdf), I will never encounter malformed HTTP(s) URIs.

There are several contexts in which contributor users can enter URIs into data fields. We should validate to ensure:

The following contexts should have validation:

cguardia commented 5 years ago

Looks like the URL validators used for these fields don't do much more than making certain there are no spaces in the URL. So, not sure where a space could be coming from.

cguardia commented 5 years ago

I reviewed the .ttl files from pleiades downloads, and couldn't find URIs similar to those with errors (that is, no URis with space before closing bracket, or between name and -extent.

wouterbeek commented 5 years ago

@cguardia There are still invalid IRIs in the latest version of the RDF download (I used http://atlantides.org/downloads/pleiades/rdf/pleiades-latest.tar.gz).

The first issue is that the caret character (^) which appears unescaped multiple times in an IRI on line 60,524:

<http://www.persee.fr/web/revues/home/prescript/article/racf_0220-6617_1991_num_30_1_2657?luceneQuery=%28%2B%28content%3AAQUAE+title%3AAQUAE^2.0+fullContent%3AAQUAE^100.0+fullTitle%3AAQUAE^140.0+summary%3AAQUAE+authors%3AAQUAE^5.0+illustrations%3AAQUAE^4.0>

Please consider using an off-the-shelf solution that has been tested for years like uriparser library (https://uriparser.github.io/) in order to automatically perform such tests. Since this is such a widely used operation, every popular programming language has a good URI library that performs validity checks. (No need to re-invent the wheel.)

paregorios commented 5 years ago

@wouterbeek I appreciate your efforts to help us sort this out and the investment of time that represents. What is unclear to us is which characters do need escaping and why. We have had previous discussion in this thread about requirements for same, concluding that there is no RDF/TTL requirement to do more than encode valid URIs using UTF-8. We believe we have addressed the specific issue you previously highlighted (trapped spaces in some URIs in the RDF).

Now you seem to be highlighting a set of issues related to RFC 3986 conformance. I am not a URI syntax expert; however, I have just looked through the RFC. I do not see the caret character listed as a reserved character. I recognize that I may not have detected provision(s) that might prohibit caret (and other characters) without specifically identifying them. That is why I make the following request. Can you please explain in more detail what settings/process you are using with the uriparser so that we can duplicate your findings in toto, verify that tool output is in fact conformant to the spec, and thereby ensure that we are producing quality data that meets standards, rather than just the idiosyncrasies of individual tool chains.

wouterbeek commented 5 years ago

@paregorios No problem, URI/IRI syntax is (unfortunately) quite complex. I'll first dive into the syntax a bit and then give concrete examples on how to detect violations of the URI/IRI syntax.

Syntax

The issues that I have observed seem to occur in the path component of IRIs. This is the part of the IRI that is separated by forward slashes. The parts in between the forward slashes are called 'segments'. They are component of characters that match the ipchar rule. The relevant BNF rules from the RFC 3987 standard are as follows:

ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@"
iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
pct-encoded    = "%" HEXDIG HEXDIG
sub-delims     = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

^ Notice that the caret (^) and the space are not part of the iunreserved and sub-delims rules, and are therefore not allowed. Another one is the pipe character (|). (We can skip the rule for ucschar since there are no issues with non-ASCII Unicode characters ATM.)

Example with rapper

Rapper is a command-line tool for parsing most RDF serialization formats. It can be installed with apt install raptor2-utils on Ubuntu. When I run it for the following Pleiades file I get an error:

$ rapper -i turtle places-3.ttl > /dev/null 
rapper: Parsing URI file:///home/wouter/Downloads/pleiades-latest/places-3.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///home/wouter/Downloads/pleiades-latest/places-3.ttl:112694 - syntax error at '<'
rapper: Failed to parse file places-3.ttl turtle content
rapper: Parsing returned 109764 triples

Example with serdi

Serdi is the command-line tool that comes with the Serd library. It can be installed with apt install serdi on Ubuntu. When I run it for the same Pleiades file I get a similar error:

$ serdi -i turtle places-3.ttl > /dev/null 
error: places-3.ttl:112694:87: invalid IRI character ` ' (escape %20)

Example with liburiparser

In addition to command-line tools, it is possible to check the validity on a per-URI basis. The following function returns true iff the entered string encodes a valid URI. (The example is in C++, but similar libraries exist for other languages as well.)

#include <uriparser/Uri.h>

auto is_valid(const std::string& input) const -> bool
{
  UriUriA uri;
  const char* errPos;
  auto valid {uriParseSingleUriA(&uri, input.c_str(), &errPos) == URI_SUCCESS};
  if (valid) uriFreeUriMembersA(&uri);
  return valid;
}
paregorios commented 2 years ago

I have just reviewed and verified that steps taken previously in response to this ticket have eliminated trailing spaces in URIs (see also https://github.com/isawnyu/pleiades-rdf/issues/6). I am now investigating current status of the other problems detailed by @wouterbeek above.

paregorios commented 2 years ago

Our TTL files are still throwing errors when parsed with rapper, for example:

rapper -i turtle data/rdf/places-1.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl:77912 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-1.ttl turtle content
rapper: Parsing returned 73585 triples

For this example, the data at line 77912 (reported in the rapper error) reads:

        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,

Indeed, unencoded curly braces can be seen in the URI, and these are unsafe URI characters. As discussed above, fully urlencoding all our URIs would render many helpfully readable URIs opaque. A better solution perhaps is to selectively encode only the unsafe characters: "{", "}", "|", "^", "~", "[", "]", and "`".

paregorios commented 2 years ago

Accordingly I offer the following reformulation of the problem statement:

Description: Unsafe characters occur in URIs in Pleiades RDF/TTL exports.

Examples:

$ rapper -i turtle data/rdf/places-1.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-1.ttl:77912 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-1.ttl turtle content
rapper: Parsing returned 73585 triples
$ sed -n "77912p" data/rdf/places-1.ttl 
        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-2.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-2.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-2.ttl:188460 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-2.ttl turtle content
rapper: Parsing returned 181597 triples
$ sed -n "188460p" data/rdf/places-2.ttl 
        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-3.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-3.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-3.ttl:93863 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-3.ttl turtle content
rapper: Parsing returned 90766 triples
$ sed -n "93863p" data/rdf/places-3.ttl 
        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>;
$ rapper -i turtle data/rdf/places-4.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-4.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-4.ttl:38258 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-4.ttl turtle content
rapper: Parsing returned 34621 triples
$ sed -n "38258p" data/rdf/places-4.ttl 
    cito:citesForInformation <http://atlantides.org/bibliography/c.html#Ciampoltrini-G.-1994.-"Aspetti-dell'insediamento-etrusco-nella-valle-del-Serchio:-il-V-sec.-a.C."-Studi-Etruschi-59.>,
$ rapper -i turtle data/rdf/places-5.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-5.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-5.ttl:89953 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-5.ttl turtle content
rapper: Parsing returned 84766 triples
$ sed -n "89953p" data/rdf/places-5.ttl 
    pleiades:hasLocation <https://pleiades.stoa.org/places/512650/OSM location of Aesopus Bridge>;
$ rapper -i turtle data/rdf/places-6.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-6.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-6.ttl:50477 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-6.ttl turtle content
rapper: Parsing returned 46367 triples
$ sed -n "50477p" data/rdf/places-6.ttl 
        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>;
$ rapper -i turtle data/rdf/places-7.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-7.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-7.ttl:42984 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-7.ttl turtle content
rapper: Parsing returned 40089 triples
$ sed -n "42984p" data/rdf/places-7.ttl 
        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-8.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-8.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-8.ttl:48163 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-8.ttl turtle content
rapper: Parsing returned 45263 triples
$ sed -n "48163p" data/rdf/places-8.ttl 
        <https://www.archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}>,
$ rapper -i turtle data/rdf/places-9.ttl > /dev/null
rapper: Parsing URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-9.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Error - URI file:///Users/paregorios/Documents/files/P/pleiades.datasets/data/rdf/places-9.ttl:97760 - syntax error at '<'
rapper: Failed to parse file data/rdf/places-9.ttl turtle content
rapper: Parsing returned 97358 triples
$ sed -n "97760p" data/rdf/places-9.ttl 
    cito:citesForInformation <http://atlantides.org/bibliography/d.html#"de-Jorio"/Russo-1829.-No.-118;-Schultz>;

Recommended fix: Either of the following options seems plausible:

Test to verify fix: Regenerate all TTL exports and ensure rapper runs clean on all of them.

paregorios commented 2 years ago

Problem examples (try the "turtle" serializations for these places):

paregorios commented 2 years ago

@jessesnyder The problematic URLs in those two examples are (respectively):

jessesnyder commented 2 years ago

@paregorios Is it the case that in the following example, we'd want to encode only the { and } characters, and not the "پارسی"?

https://www.example.com/پارسی/product?id={5B092BCB-89F4-4EA0-94FF-811F304B3B27}
paregorios commented 2 years ago

@jessesnyder yes.

I made a little RDF/TTL file with only the curly braces encoded:

@prefix cito: <http://purl.org/spar/cito/> .
<https://pleiades.stoa.org/places/116722325> cito:citesForInformation <https://www.example.com/پارسی/product?id=%7b5B092BCB-89F4-4EA0-94FF-811F304B3B27%7d> .

I ran it through rapper:

rapper -i turtle ~/scratch/satan.ttl 
rapper: Parsing URI file:///Users/foo/bar/scratch/satan.ttl with parser turtle
rapper: Serializing with serializer ntriples
<https://pleiades.stoa.org/places/116722325> <http://purl.org/spar/cito/citesForInformation> <https://www.example.com/\u067E\u0627\u0631\u0633\u06CC/product?id=%7b5B092BCB-89F4-4EA0-94FF-811F304B3B27%7d> .
rapper: Parsing returned 1 triple

If rapper is happy, then I'm happy.

jessesnyder commented 2 years ago

@paregorios A possible solution is deployed to the staging server.

paregorios commented 2 years ago

@jessesnyder In trying to resolve the example URI {STAGING}/places/116722325/turtle, I got the same traceback I'm seeing when trying to test #471, which I have documented here.

paregorios commented 2 years ago

The two examples of individual files I’d given (403281 and 116722325) now test clean in rapper. I still need to review/test the full RDF/TTL export files.

jessesnyder commented 2 years ago

@paregorios I don't know what URL you'd access the archive at, so I downloaded it from the server: pleiades-20220323.tar.gz

paregorios commented 2 years ago

@jesssesnyder: could be the result of a tar packaging error or of a problem upstream with the generation script or the code it invokes:

paregorios commented 2 years ago

We are getting close. All but one of the dump files test clean with rapper. The remaining (hopefully last?!?!) problem can be found in places-8.ttl, line 48206:

    cito:citesForInformation <http://atlantides.org/bibliography/r.html%23RE:Hippika-ore    VIII>,

This reference URL occurs in Pleiades place 834285 (Hippika? Ore).

@jessesnyder I propose I fix this URI manually on staging and production (because it's a mess anyway and better can be quickly provided), then re-run the dumps on staging and see if we pass the rapper test at last.

paregorios commented 2 years ago

A couple of tabs are sneaking through, so we should add TAB to the list of characters that we selectively URL-encode.

jessesnyder commented 2 years ago

Tabs now escaped also, and tested successfully with rapper.

paregorios commented 2 years ago

This is working perfectly on staging. Ready to deploy to production.

paregorios commented 2 years ago

Individual serializations on production are good. Will keep this open pending testing of complete export files after next daily dump script runs (tomorrow morning) and I have time to check it.

paregorios commented 2 years ago

This morning's bulk RDF export ran to completion and all place files parse error-free in rapper. This is done.