Use IRIs for resource identification

jakubklimek commented 2 years ago

Is there a reason IRIs (Internationalized Resource Identifiers) (RFC3987) are not used for resource identification even though they are part of the RDF 1.1 specification and, in fact, they are already widely used?

I understand that at the HTTP level, which may be the base of the Solid protocol, only URIs are used. However, I feel the relationship between URIs and IRIs should be addressed somehow. RFC3987 defines a mapping (percent-encoding) from IRIs to URIs and back. However, in RDF (e.g. in Turtle), an IRI and its percent-encoded URI counterpart are treated as two different IRIs, so issues could arise.
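A minimal sketch of that mismatch, using only Python's standard library. The helper name `iri_to_uri` and the example URL are assumptions made for illustration, and the function only approximates the RFC3987 mapping (it does not handle internationalized domain names):

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 octets (approximation of RFC3987)."""
    # URI delimiters and existing percent signs are left untouched.
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%~")

iri = "http://example.org/public/Germ\u0101nus/"  # \u0101 is the letter a with macron
uri = iri_to_uri(iri)

print(uri)         # the percent-encoded URI form
print(iri == uri)  # RDF compares identifiers as strings, so these are different IRIs
```

Both strings dereference to the same HTTP resource, yet an RDF store that compares them as strings treats them as two unrelated nodes.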

Maybe this was already discussed, but I cannot see any mention of IRIs in the specification.

damooo commented 2 years ago

Currently CSS also assumes identifiers are URIs. In the .acl resource for a resource, CSS uses the percent-encoded URI of that resource.

joachimvh commented 2 years ago

Currently CSS also assumes identifiers are URIs. In the .acl resource for a resource, CSS uses the percent-encoded URI of that resource.

There is an open issue on what to do with this at https://github.com/solid/web-access-control-spec/issues/93

kjetilk commented 2 years ago

I don't know if it is actually the concern, but IRIs expand the attack surface for homograph attacks quite significantly. It makes sense to tread carefully in this area.

damooo commented 2 years ago

I don't know if it is actually the concern, but IRIs expand the attack surface for homograph attacks quite significantly. It makes sense to tread carefully in this area.

That is a concern only for domain names, not for resource identifiers within a given domain. Even that case is not a technical vulnerability per se, but one of naive usage. This need not prevent us from embracing IRIs, which enable international users to name resources in their scripts.

elf-pavlik commented 2 years ago

which enable international users to name resources in their scripts

Human-readable labels should be part of the resource description. IRI preferably stays machine-readable, especially everything after the domain name. Most mobile browsers will not even show anything else to the user.

kjetilk commented 2 years ago

That is a concern only for domain names, not for resource identifiers within a given domain.

Is it? I agree that if there is only one storage per authority, or every storage has the same owner, it is likely a marginal problem, but that is not necessarily the case.

Even that case is not a technical vulnerability per se, but one of naive usage.

Yes, but as most attacks are not on technical vulnerabilities, I think it is an important concern nevertheless. The problem is that people tend to rely on a bunch of flawed heuristics, like looking at the URL, to decide on actions that have security implications. I see this in my own family members. We should ensure that they don't need to: they should be able to rely on a few security mechanisms that can be easily understood by lay people, but I believe we're not there yet. Until we have that, homograph attacks look pretty scary to me, though admittedly I do not have numbers to back that up.

This need not prevent us from embracing IRIs, which enable international users to name resources in their scripts.

I believe there is nothing in the technical infrastructure that prevents us from adopting IRIs. It is actually not clear to me why we haven't.

damooo commented 2 years ago

which enable international users to name resources in their scripts

Human-readable labels should be part of the resource description. IRI preferably stays machine-readable, especially everything after the domain name. Most mobile browsers will not even show anything else to the user.

IRIs are not about end-user-friendly labels. They make it possible to internationalize identifiers as well, which in some cases cannot be expressed in ergonomic URIs: for example, an identifier for an entry in the lexicon of a language, or for concepts of a certain culture, without the complexities of lossy translation or transliteration, or simply Unicode filenames. Of course we can use URIs with percent-encoding in these cases, but that brings the issues above and is not ergonomic.

damooo commented 2 years ago

@kjetilk

I believe there is nothing in the technical infrastructure that prevents us from adopting IRIs. It is actually not clear to me why we haven't.

There may actually be a few things Solid needs to specify if it is to support IRIs. IRI support will be non-trivial for the following reasons.

1. Solid is based on HTTP. HTTP doesn't support IRIs as identifiers.
2. RDF identifier matching is a strictly literal comparison of IRI strings.

Say that in an identifier we want a segment राम, whose percent-encoding is %E0%A4%B0%E0%A4%BE%E0%A4%AE. Two practices are prevalent in the community.

1. One is to use the IRI directly, like <example:राम>. DBpedia and many other datasets use this approach in their international datasets, e.g. <http://hi.dbpedia.org/resource/राम>.
2. The other is to use percent-encoded URIs as identifiers. lexvo.org and others use this approach; lexvo's identifiers, both for concept resources and for information resources, follow it.

Now the issue is that, for both approaches, if we want those identifiers to be HTTP identifiers and also dereferenceable (as in the case of Solid), the HTTP URI will be the same. RDF requires us to distinguish between <example:राम> and <example:%E0%A4%B0%E0%A4%BE%E0%A4%AE>, both of which are practices adopted by communities, but HTTP identifiers cannot carry that distinction when we dereference.

Thus, whether we want to PUT a representation at <example:path/to/राम> or at <example:path/to/%E0%A4%B0%E0%A4%BE%E0%A4%AE>, the Solid HTTP server gets a PUT request for <example:path/to/%E0%A4%B0%E0%A4%BE%E0%A4%AE> in both cases. We can't be sure of the intended IRI; the lost information has to be conveyed through other mechanisms. The same issue arises on GET.

Solid is an interesting case for IRIs, as it has to address both RDF and HTTP concerns for dereferenceable information resources.
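The collapse described above can be sketched as follows. The helper `request_target` and the `http://example.org/...` URLs are hypothetical stand-ins for what an HTTP client does before sending a request; real clients may differ in detail:

```python
from urllib.parse import quote, urlsplit

def request_target(identifier: str) -> str:
    """Return the path an HTTP client would put on the request line (simplified)."""
    path = urlsplit(identifier).path
    # Non-ASCII characters are percent-encoded; existing escapes are kept as they are.
    return quote(path, safe="/%:@!$&'()*+,;=")

direct = "http://example.org/path/to/राम"
encoded = "http://example.org/path/to/%E0%A4%B0%E0%A4%BE%E0%A4%AE"

print(direct == encoded)                                  # False: two distinct RDF identifiers
print(request_target(direct) == request_target(encoded))  # True: one HTTP request target
```

The server only ever sees the second form, so the distinction RDF makes between the two identifiers is gone by the time the request arrives.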

damooo commented 2 years ago

Continued from above...

There seem to be two straightforward behaviours to address this case.

1. Solid specifies that it only supports URIs for the information resources it handles, and that it takes URIs received through a request literally, without any decoding. Percent-encoded URIs will then be the only mechanism to represent Unicode names.
2. Solid specifies that it supports IRIs for the information resources it handles, and percent-decodes URIs received through an HTTP request to compute the intended IRI. If one wants the percent-encoded URI itself as the identifier, as in the lexvo.org case, one should percent-encode the already percent-encoded identifier. This has the advantage of supporting both cases, and supports the most prevalent case straightforwardly.

The 2nd behaviour seems better to me.
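A rough sketch of the two behaviours, assuming the server has already reconstructed the absolute request URI. The function names are hypothetical, and plain `unquote` is a simplification: it would also decode reserved characters such as `%2F`, which a real server would have to treat more carefully.

```python
from urllib.parse import unquote

def identifier_option_1(request_uri: str) -> str:
    """Option 1: take the received URI literally as the resource identifier."""
    return request_uri

def identifier_option_2(request_uri: str) -> str:
    """Option 2: percent-decode once to compute the intended IRI."""
    return unquote(request_uri)

received = "http://example.org/path/to/%E0%A4%B0%E0%A4%BE%E0%A4%AE"
# A client that wants the percent-encoded form itself as the identifier
# (the lexvo.org style) would encode the percent signs once more:
double_encoded = "http://example.org/path/to/%25E0%25A4%25B0%25E0%25A4%25BE%25E0%25A4%25AE"

print(identifier_option_1(received))        # stays percent-encoded
print(identifier_option_2(received))        # becomes the IRI ending in राम
print(identifier_option_2(double_encoded))  # becomes the percent-encoded identifier
```

Under option 2 a single decoding pass recovers either identifier, at the cost of requiring double encoding from clients that want the percent-encoded form.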

kjetilk commented 2 years ago

OK, thank you very much for your perspective, @damooo , it is very much appreciated. I believe we should address this in the future. It is very good that you already have concrete proposals to resolve this!

damooo commented 2 years ago

@kjetilk , @elf-pavlik

I believe we should address this in the future.

I am not sure the gravity of this issue is clearly understood. Solid doesn't specify whether or not to decode the HTTP URI when computing the identifier of a resource. This can cause Solid-based linked-data apps to behave differently across Solid servers and to fail at many basic tasks whenever a Unicode character enters an identifier.

I just tested with CSS, NSS and ESS, and this is indeed the case. They interpret the spec differently from one another, and even internally in different places. A few apps break with one server or another when a Unicode character enters an identifier. For a demo, create a container with a Unicode name in Inrupt's ESS, and then try to browse that folder in Inrupt's pod-browser app. It crashes.

Here is the PUT + GET request flow for a resource on both NSS and ESS, to demonstrate the technical issue.

Let's say there is already a container with id <http://example.org/public/>, and we want to create a sub-container with the Unicode segment Germānus (note the ā).

In NSS:

1. `PUT` `<http://example.org/public/Germānus/>` or `<http://example.org/public/Germ%C4%81nus/>` (both will be normalized to `<http://example.org/public/Germ%C4%81nus/>` by the HTTP client).

Request:

PUT /public/Germ%C4%81nus/ HTTP/1.1
Accept: text/turtle
Content-Type: text/turtle
Link: <http://www.w3.org/ns/ldp#BasicContainer>; rel="type"

Response:

HTTP/1.1 201 Created
X-Powered-By: solid-server/5.6.16
Content-Type: text/plain; charset=utf-8

2. `GET` on `<http://example.org/public/Germānus/>` or `<http://example.org/public/Germ%C4%81nus/>` after the above `PUT`

Request:

GET /public/Germ%C4%81nus/ HTTP/1.1

Response:

HTTP/1.1 200 OK
X-Powered-By: solid-server/5.6.16
Link: <.acl>; rel="acl", <.meta>; rel="describedBy", <http://www.w3.org/ns/ldp#Container>; rel="type", <http://www.w3.org/ns/ldp#BasicContainer>; rel="type"
Content-Type: text/turtle

@prefix : <#>.
@prefix dct: <http://purl.org/dc/terms/>.
@prefix ldp: <http://www.w3.org/ns/ldp#>.
@prefix stat: <http://www.w3.org/ns/posix/stat#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<http://example.org/public/Germ%C4%81nus/>
    a ldp:BasicContainer, ldp:Container;
    dct:modified "2021-11-17T02:47:27Z"^^xsd:dateTime;
    stat:mtime 1637117247.421;
    stat:size 4096.

Here we can see that NSS creates a resource with the literal identifier <http://example.org/public/Germ%C4%81nus/>. All RDF descriptions use that percent-encoded identifier.

In `ESS`:

1. `PUT` `<http://example.org/public/Germānus/>` or `<http://example.org/public/Germ%C4%81nus/>`

Request:

PUT /public/Germ%C4%81nus/ HTTP/1.1
Accept: text/turtle
Content-Type: text/turtle
Link: <http://www.w3.org/ns/ldp#BasicContainer>; rel="type"

Response:

HTTP/2 201 Created
content-location: https://pod.inrupt.com/damodara/public/Germ%C4%81nus/
link: <http://www.w3.org/ns/ldp#Resource>; rel="type"
link: <http://www.w3.org/ns/ldp#BasicContainer>; rel="type"
link: <http://www.w3.org/ns/ldp#RDFSource>; rel="type"
link: <http://www.w3.org/ns/ldp#Container>; rel="type"

Here it also sent a content-location header, indicating that the resource identifier is https://pod.inrupt.com/damodara/public/Germ%C4%81nus/ (the encoded form).

2. `GET` on `<http://example.org/public/Germānus/>` or `<http://example.org/public/Germ%C4%81nus/>` after the above `PUT`

Request:

GET /public/Germ%C4%81nus/ HTTP/1.1

Response:

HTTP/2 200 OK
content-type: text/turtle; charset=UTF-8
link: <http://www.w3.org/ns/ldp#Resource>; rel="type"
link: <http://www.w3.org/ns/ldp#BasicContainer>; rel="type"
link: <http://www.w3.org/ns/ldp#RDFSource>; rel="type"
link: <http://www.w3.org/ns/ldp#Container>; rel="type"

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ldp: <http://www.w3.org/ns/ldp#> .

<http://example.org/public/Germānus/>
        rdf:type  ldp:BasicContainer .

Here we can see that it uses the IRI <http://example.org/public/Germānus/> as the identifier of the created resource, contrary to the NSS behaviour of using the encoded URI for the same PUT request.

Thus different servers use different identifiers in RDF descriptions for a resource created at, and dereferenceable by, the same HTTP URI. Apps cannot know which one to look up as the subject afterwards if they want to read the RDF description of the resource, so they will fail on one server or another for any such RDF usage. The issue gets worse for further sub-containers, contained resources, the use of relative URIs, etc.

This is too fundamental an aspect; Solid must specify it at the very beginning, even if only to constrain identifiers to URIs. As identifiers get persisted in RDF documents, fixing this later will be impossible, because existing data would have to be modified.
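To make the failure mode concrete, here is a small sketch of what a client faces today. The subject sets stand in for parsed server responses, and `candidate_subjects` is a hypothetical workaround, not something any of the servers or the spec prescribe:

```python
from urllib.parse import unquote

def candidate_subjects(request_url: str) -> set:
    """Both identifier forms a server might use for the resource at request_url."""
    return {request_url, unquote(request_url)}

request_url = "http://example.org/public/Germ%C4%81nus/"

# Subjects as they appear in the Turtle returned by each server (from the traces above).
nss_subjects = {"http://example.org/public/Germ%C4%81nus/"}
ess_subjects = {"http://example.org/public/Germ\u0101nus/"}  # \u0101 is the letter a with macron

# Looking up the request URL as the subject works on one server only.
print(request_url in nss_subjects)  # True
print(request_url in ess_subjects)  # False

# A defensive client would have to try both forms on every lookup.
print(bool(candidate_subjects(request_url) & ess_subjects))  # True
```

Having to guess like this on every lookup is exactly the kind of per-server special-casing a clear rule in the spec would remove.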

damooo commented 2 years ago

Continued from above...

There seem to be two straightforward behaviours to address this case.

1. Solid specifies that it only supports URIs for the information resources it handles, and that it takes URIs received through a request **literally**, without any decoding. Percent-encoded URIs will then be the only mechanism to represent Unicode names.

2. Solid specifies that it supports IRIs for the information resources it handles, and percent-decodes URIs received through an HTTP request to compute the intended IRI. If one wants the percent-encoded URI itself as the identifier, as in the `lexvo.org` case, one should percent-encode the already percent-encoded identifier. This has the advantage of supporting both cases, and supports the most prevalent case straightforwardly.

The 2nd behaviour seems better to me.

However, the 2nd behaviour, percent-decoding URIs, complicates a few other things. Specifically, resolving relative identifiers, base identifiers, etc. will be a nightmare to specify. Many other issues may pop up from different standards and operations.

Thus Solid had better go with the first option and must specify that it only supports URIs as identifiers for the information resources it manages. It should also mandate the use of literal URIs as identifiers for those resources in RDF documents. ESS goes against this, as mentioned in the comment above; NSS and CSS are in line.

kjetilk commented 2 years ago

My apologies, @damooo , for not giving this issue the attention it deserves. My mental queue is full since we have a deadline for the current milestone today. So, please take the following as little more than thinking out loud:

On the backend, I suspect that IRIs would be prevalent, as RDF is defined in those terms. In the case of NSS, I found that it stores filenames on disc with UTF-8:

00000000   62 6C C3 A5  62 C3 A6 72  73 79 6C 74  65 74 C3 B8  79 0A                     bl..b..rsyltet..y.

("blåbærsyltetøy" = "blueberry jam" has all three non-ascii Norwegian characters in one word, and is thus my favorite word for looking into such problems :-) )
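For reference, the byte sequence in that dump can be reproduced with a couple of lines of Python. This only illustrates UTF-8 encoding of the word followed by a newline; it says nothing about how NSS itself writes filenames:

```python
# Escapes are used so the precomposed (NFC) characters are unambiguous.
word = "bl\u00e5b\u00e6rsyltet\u00f8y\n"  # "blåbærsyltetøy" plus a trailing newline

encoded = word.encode("utf-8")
print(encoded.hex(" ").upper())
# 62 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 0A
```

Each of the three Norwegian letters takes two bytes (C3 A5, C3 A6, C3 B8), which is why the dump shows dots in those positions.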

Thus, the entire problem seems to be in the upper layer of the server implementations. I don't have the bandwidth to understand the implications, but given that we could potentially have a SPARQL Endpoint towards the stored data, it doesn't seem quite attractive to me to only have percent-encoded URIs, but there is also the homograph attack problem...

However, since NSS is in line with option 1, and the short-term goal for 0.9 is to describe NSS behavior, that nevertheless seems like what we should do in the short term. Yet there seems to be potential for a more sophisticated approach in the longer term.

damooo commented 2 years ago

@kjetilk , thanks for response

On the backend, I suspect that IRIs would be prevalent, as RDF is defined in those terms. In the case of NSS, I found that it stores filenames on disc with UTF-8:

I want to clarify the technical issue raised. It is not about the implementation detail of which name the server uses to persist a representation on disk. It is about which identifier the server assigns and uses to refer to that resource in the same or other RDF documents that describe or point to it: in the container description, in the ACL resource, in metadata auxiliary resources, and in general anywhere else. That is very much part of their public API. The servers differ in their API at the fundamental layer of resource identification and naming, thus fragmenting the ecosystem (irreversibly in the long term). It is not just about implementation.

kjetilk commented 2 years ago

It is not about the implementation detail of which name the server uses to persist a representation on disk.

Certainly! I just wanted to clarify that those implementation details are not what is holding us back.

csarven commented 2 years ago

All, great issue and feedback.

The motivation and applicability of RFC3987 are clear.
The INTERNATIONAL-SPECS includes recommendations for resource identifiers, identifiers in documents, and (potential) visibility of identifiers to users.
RDF11-CONCEPTS describes the use of IRIs.

All Solid specifications (not only SOLID-PROTOCOL) should clarify their requirements and considerations pertaining to internationalization. There is already some work on this, including spec content and open issues - for reviews - so improvements are very welcome.

As per AWWW, the "situation" in Solid Protocol is that while the Interaction component necessitates the use of URIs, the Identification and Data Formats components, for the most part, necessitate the use of IRIs.

I suggest that we clarify the situations in which converting IRIs to URIs, and vice versa, could happen and where new recommendations or considerations may be necessary. If so, there needs to be coherent round-tripping IRI->URI->IRI.

Note: The table below is to document and discuss. It is NOT an exhaustive list of situations and the notes are not necessarily correct (or implementable). Implementations may want to experiment or provide feedback. Test authors should ignore this.

✔: existing requirement. ?: potential requirement.

| Situation | URI-to-IRI | IRI-to-URI |
|---|---|---|
| Sending HTTP request | x | `✔` RFC7230 client [1] |
| Process HTTP request identifier | x | `?` SOLID-PROTOCOL server [2] |
| Writing resource | x | `?` SOLID-PROTOCOL server [3] |
| Updating RDF (server) | x | `?` SOLID-PROTOCOL server [4] |
| Updating RDF (client) | x | `?` SOLID-PROTOCOL client [5] |
| Reading RDF | `?` SOLID-PROTOCOL server/client [6] | |
| Responding HTTP request | x | `✔` RFC7230 server [7] |

[1] Current transmission on the wire. [2] For canonical Identification. [3] Normalize to IRI. [4] Typical case is when updating containment statements. Servers should use the IRI form. [5] When a client requests to update a resource description, the server should not process the payload by converting the identifiers. Clients should use the IRI form when possible. [6] Do not convert. Read identifier-string as is. [7] Current transmission on the wire (e.g., Content-Location = absolute-URI / partial-URI)
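The round-tripping concern can be checked with a small sketch. The helpers `iri_to_uri`, `uri_to_iri` and `round_trips` are assumptions for illustration; plain `unquote` is a simplification of RFC3987's URI-to-IRI conversion, which leaves some escapes (e.g. `%20`) undecoded:

```python
from urllib.parse import quote, unquote

SAFE = ":/?#[]@!$&'()*+,;=~%"  # delimiters and existing escapes are not re-encoded

def iri_to_uri(iri: str) -> str:
    return quote(iri, safe=SAFE)

def uri_to_iri(uri: str) -> str:
    return unquote(uri)

def round_trips(iri: str) -> bool:
    """True if IRI -> URI -> IRI gives back the original identifier."""
    return uri_to_iri(iri_to_uri(iri)) == iri

print(round_trips("http://example.org/public/Germ\u0101nus/"))   # True
print(round_trips("http://example.org/public/Germ%C4%81nus/"))   # False
```

The second identifier, which already contains percent-escapes (the lexvo.org style discussed earlier), does not survive the round trip, so any conversion rule in the table has to say what happens to such identifiers.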

woutermont commented 1 year ago

@csarven, I propose to take this issue (and the related https://github.com/solid/specification/issues/22) up in the milestone for v0.11. As long as we specify Solid's handling of identifiers, there should be no problem, so it is important to do so.

I will draft a proposal based on the relevant specs and the idea above, to get the conversation going again.

csarven commented 1 year ago

Sure. The "idea above" being https://github.com/solid/specification/issues/347#issuecomment-1237167849 ?
