Server Description - Githubissues

csarven commented 2 years ago

Background: there are a wide-range of use cases that needs to get a hold of authoritative information about the server. To date, there is no uniform way for agents and applications to find all information about a server.

General use case: Find authoritative information about a server.

Use cases:

Guinan wants to read the description of the server and understand its purpose.
Janeway wants to find the server's policies to review its data privacy and pledge for resource persistence.
Uhura wants to check the server's communication options in order to use a preferred notification protocol.
Seven of Nine wants to determine a suitable location to keep astrometric data from the server's available storages.
Picard wants to find the server's contact information to request a diplomatic meeting about security audits.
Torres wants her devices to use server's URI templates when naming particular kinds of resources.

There are many others.. which may be low-priority or already have existing standards, but could be captured in this data nevertheless, e.g., sitemap, Web syndication, robots control.

General requirement: Discovery mechanism and data model to express authoritative information about the server.

Specific requirement:

Server advertises the location of the authoritative information.
Server generates the authoritative information.
Server controls the authoritative information.

Considerations:

Accommodating various use cases that needs to get a hold of authoritative information about the server.
Should the authoritative information about the server be publicly discoverable?
Should the authoritative information about the server be publicly readable or access controlled?
Can specific class of agents (e.g., owners) claim/update authoritative information about the server?
Should the authoritative information about the server be discoverable from root path, storage, any resource, and/or somewhere else?
Caching.

Related issues (sample):

Notes: The authoritative information about the server would be a specialisation of #authoritative-information in https://github.com/solid/specification/pull/352 .

csarven commented 2 years ago

Web Host Metadata: https://datatracker.ietf.org/doc/html/rfc6415 "describes a method for locating host metadata as well as information about individual resources controlled by the host." This may be limited re XML-based schema for the description, and the lrdd relation type for resource-specific information doesn't cover the server-wide authoritative information.
Responses to the OPTIONS method are not cacheable.
.well-known/solid requires: IANA hop; Solid server to 1) have storage at / and 2) update/control /.well-known/solid. The Solid Protocol has no requirement on Solid server or any Storage to be deployed at /, so well-known URI is probably a no go. Same as reusing /.well-known/host-meta (RFC 6415) besides the limits on structure and resource type.
Using the Link header with http://www.w3.org/ns/solid/terms#serverDescription (for example) relation type targeting an RDF document may be most suitable and aligned with current methods of discovery+description in the Solid Protocol and elsewhere.

ThisIsMissEm commented 2 years ago

.well-known/solid requires: IANA hop; Solid server to 1) have storage at / and 2) update/control /.well-known/solid. The Solid Protocol has no requirement on Solid server or any Storage to be deployed at /, so well-known URI is probably a no go. Same as reusing /.well-known/host-meta (RFC 6415) besides the limits on structure and resource type.

It might be worth noting that keycloak stores it's /.well-known path at a nested level, so there is precedent for that (even if it's not strictly correct / useful according to spec)

langsamu commented 2 years ago

It might be worth noting that keycloak stores it's /.well-known path at a nested level

So does Cognito. Not that I think it's nice.

acoburn commented 2 years ago

Cognito and KeyCloak are following the OpenID spec requirement. There, the .well-known path starts not (necessarily) at the root of the server, rather it is appended to the issuer URL.

The OpenID examples are somewhat orthogonal to this discussion, though it does point to the possibility that a specification could define a discovery mechanism such that a .well-known resource is not strictly based on the root of the server. That said, the link header approach is arguably much more in line with linked data (follow-your-nose) principles than a .well-known that is discoverable only via familiarity with an external specification document.

langsamu commented 2 years ago

SPARQL Service Description is interesting:

SPARQL services made available via the SPARQL Protocol should return a service description document at the service endpoint when dereferenced using the HTTP GET operation without any query parameter strings provided.

-- https://www.w3.org/TR/sparql11-service-description/#accessing

One solution to this is the following.

Redirect to a Service Description without parameters:

$ curl -i https://api.parliament.uk/sparql
HTTP/1.1 302 Found
Location: https://api.parliament.uk/sparql/description

Execute query with parameters:

$ curl https://api.parliament.uk/sparql?query=SELECT+*+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+1
s,p,o
http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/1999/02/22-rdf-syntax-ns#Property

Serve a UI to browsers:

$ curl -i https://api.parliament.uk/sparql -H "Accept: text/html"
HTTP/1.1 200 OK
Content-Type: text/html

<!DOCTYPE html> ...

ThisIsMissEm commented 2 years ago

Cognito and KeyCloak are following the OpenID spec requirement. There, the .well-known path starts not (necessarily) at the root of the server, rather it is appended to the issuer URL.

The OpenID examples are somewhat orthogonal to this discussion, though it does point to the possibility that a specification could define a discovery mechanism such that a .well-known resource is not strictly based on the root of the server. That said, the link header approach is arguably much more in line with linked data (follow-your-nose) principles than a .well-known that is discoverable only via familiarity with an external specification document.

Yeah, I'd be inclined to agree: a standard Link header for server-info is probably a better idea (though it does increase all response sizes, as I'd assume that header would be present on all responses (or at least all GET/HEAD/OPTIONS)

Though we could perhaps also make the .well-known for Solid be in turtle or another linked data format, instead of JSON or text.

kjetilk commented 2 years ago

I used to be a fan of OPTIONS * until @csarven pointed out it is not cachable... Hmpf, if anything should be cachable, this is information that would rarely change and where it is highly likely that server admins could look ahead and set a long max-age. :-(

I also think that .well-known is a bad design, exactly for the reason that @acoburn mentions, it requires out-of-band information, and we should have a mechanism that you only need one piece of spec knowledge (not a piece of knowledge from every spec), and everything else should be discoverable from there, like OPTIONS * could have been. I'm not sure that it is a show-stopper for .well-known that a Solid server doesn't cover /.well-known as it (alas) has become so prevalent that underlying servers, accommodates for it to be manageable by application servers (which a Solid server would be in that context).

I became wary of big default headers in my IoT days, as I saw too many instances of headers that provided little value but made big messages. LoRaWAN is an example of a protocol stack that back in that day became utterly unscalable because of this. So, yes, I wouldn't want that either, I'd prefer a solution where a small number of resources, preferably only one, needed this kind of information.

I'd like to reiterate the point I made in my only comment to notifications work, that we should be careful to name stuff that infringes on the authority of a storage (being roughly the same as a "pod", but I use the term storage, since that is what is specified) to control its URI space. This seems to turn into two distinct classes of things though: Those things that are within the URI space controlled by a storage and those that aren't. Those things that are within the space controlled by a storage should be discovered by interrogating the storage, but that leaves us with the requirement that storages must be really easy to discover, which they aren't now (#310). There should be a list of storages hosted by a server somewhere.

For the things that shouldn't be under the control of a storage, we also just require a list of those things, pointing to resources where they describe themselves in further detail.

Perhaps we could go back to OPTIONS *... The response to OPTIONS * could just be pointers to those two lists, one list of storages, and one list of non-storage-bound descriptions. We could probably live with that those two pointers being non-cachable. If the list is large enough, it really needs to be cachable, and so it would need to be a resource that can be GET.

For servers that host a single storage (or small number), we could probably live with that being non-cachable too, so in that case, it should be OK for OPTIONS * to list whatever it finds, so the semantics need to be clear on the case where either or both lists have been obtained by just OPTIONS * and when a client needs to GET more resources.

In principle, we could also let OPTIONS * just point to .well-known resources to satisfy both REST purists like myself and the more pragmatic people who like tend to look it up in a spec.

csarven commented 2 years ago

The Solid Protocol needs to be clear on whether all server instances can respond to OPTIONS *. I've created https://github.com/solid/specification/issues/356 to have a concrete response. My understanding of the current state of the specification and implementations is that a server may not be configured to respond to the asterisk-form, i.e., is the request ever reaching a Solid server? This is separate from whether the server has the capability.

csarven commented 2 years ago

I wonder if we should first focus on finding storage metadata - the use cases may even be more meaningful. We may still need server metadata, in order to, for example, to find available storages.

csarven commented 2 years ago

Mentioned in 2022-04-11 meeting and elsewhere: The Solid Protocol currently defines the notion of a storage resource but there is no notion of a server resource. Hence, the simplest path forward may indeed be focusing on the discovery of storage description/metadata.

After eliminating some of the options, the link relation approach may indeed be most suitable for discovery.

Enabling the discovery of the storage metadata from any resource (as per considerations) would require only one hop. Although the semantics of the relation needs to be clear, e.g., is it "storage metadata of a storage in which this resource is in" (and so introducing a new predicate) or can it alternatively be found via the existing describedby link relation type.

From the considerations, there is still the question of whether the storage metadata resource should be server-protected (read-only) and whether it is always included in an HTTP response (public-read). If server-protected/managed.. use cases where certain class of agents (e.g., owners) need to claim/update info will not be able to update the resource e.g., to change contact information, or policies. We need to weigh this out..

It'd be great to gather some feedback from server/client implementers on what use case they worked with, which option they've tried, what worked and didn't, what functionality was missing...

csarven commented 2 years ago

dokieli-storage-description

Continuing on issuecomment-1105226446, I'd like to share an implementation feedback and some additional thoughts:

As a user of an authoring/annotation tool (dokieli), the following use cases are of interest to me: description, policies, communication options, contact information.

In order to eat my own cooking, I've updated a local copy of NSS to include Link: </storage-description>; rel="http://www.w3.org/ns/solid/terms#storageDescription" in every HTTP response so that I can start focusing on one way to discover and use information about a storage in an application. /storage-description contains information about the storage, e.g., storage's name, description, owner, URI persistence policy, digital rights policies, notification channels.

To take an example, I will explain how some of the use cases are realised as part of dokieli's Save As feature. I will also provide some feedback on what functionality works or could work for both servers and an application category, and the minimal specification requirements necessary to make it so.

From a developer's perspective - speaking for myself - discovery of the storage description is not all that interesting. It didn't matter a whole lot if the information can be materialised and found in the storage resource, storage's description resource, or via a relation http://www.w3.org/ns/solid/terms#storageDescription (using solid:storageDescription here on end) from any resource to the storage's description resource (/storage-description). They are all equally possible and useful when all things are considered (with some caveats to each of course as previously discussed). Here, I've opted to exemplify with an independent resource (/storage-description) that can be discovered through any resource, including the storage resource itself. But the same information could just as well appear in the storage resource (self-descriptive is obvious) or its own description resource via the describedby link relation which happens to be the same resource: /storage-description. There are different discovery paths and costs.

What essentially needs to be described is our subject of interest, i.e., the storage (/). A simple, flexible, and reusable approach would be to maintain the root of the information at /. This way, the same data can be materialised anywhere, again, whether that's / or the /storage-description, or anywhere else for that matter. It is a graph at the end of the day.

It should come as a no surprise that an application (like dokieli) can have principles and opinions on how to get a hold of information besides what's written in some specification. They may simply fetch resources on a need to basis - whenever there is a user action. So, if the application already obtained a representation of /, it will try to see if the available information is sufficient for its needs. If not, it will look at other options, i.e., via the solid:storageDescription relation. It is of course still the case that there needs to be one sure way where a server can guarantee . But, if it didn't already obtain /, and it is currently looking at a resource like /foo, then it can simply go straight to /storage-description via solid:storageDescription, and even still, it could fetch the storage and get a hold of the same information all the same.

So, as far as some storage-centric information goes, / and /storage-description can have identical statements. To help applications, /storage-description can include foaf:primaryTopic to point at the storage resource and/or provide the subject's type with pim:Storage. In /, foaf:primaryTopic is not needed.

GET /storage-description
Content-Type: text/turtle

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pim: <http://www.w3.org/ns/pim/space#> .
@prefix solid: <http://www.w3.org/ns/solid/terms#> .

<> foaf:primaryTopic </> .

</>
  a pim:Storage .
  dcterms:title "okieli-dokieli" .
  solid:owner <https://csarven.ca/#i> .

I want to emphasise again that certainly there can be more advanced or interesting ways to model storage. This write-up is not dismissing those possibilities. This is only exemplifying a self-describing approach that a server can take, and that appears to be sufficient for an application (like dokieli).

Back to dokieli's Save As feature: In a nutshell, after user authenticates, dokieli fetches their WebID Profile Document and optionally WebID's preferences (via pim:preferencesFile). If pim:storage is found, it loads up the storage-browser feature with the value of pim:storage so that the user can navigate to a location where they want to create a copy of the resource. (Another storage location can be manually entered as well.) At this point, dokieli got a hold of storage's description at / and also found out about /storage-description from the solid:storageDescription link relation in the HTTP header of /. This initial subject of interest for Save As is / but the starting point could've been another resource. I note here that the same thing in the context of / could've been achieved through describedby.

dokieli now reveals "Storage details" where user can reveal more information about the storage. Details show the storage location, name, description, owner, URI persistence policy, digital rights policies (such as rules, actions, assigners), notification channels, and so forth.

Bonus use case: user wants to check whether the storage's digital rights policies are compatible with their own preferred policies. If there are any discrepancies, the user should be warned and given a chance to make a decision about available options.

The example demonstrated shows that dokieli is aware of the storage's offer with a possible permission of action to sell assets stored in that location and selling happens to be a prohibited action by the user.

Besides the point: In the example I've used the ODRL vocabulary. DPV could've been used when focusing processing of personal data. They categorically belong to the same policies use case.

Note that the way dokieli investigates the resources it comes across (based on user actions) to get a hold of useful information about the storage (either at the storage resource or storage description resource) is same as the information about a profile (either at the WebID Profile document or at a preferences resources). Again, discovery is about the primary subject of interest.

Information about storage's notification channels is for the most part of interest to the application as opposed to the user, and that can be accomplished through a property of the storage (having notify:notificationChannel). It is trivial to implement a switch in the UI to allow the user to subscribe to live updates (or go offline), but I didn't get to that. Some of the underlying code is already in place and I'll get that out soon to fulfill the implementation feedback for Notifications Protocol and some of its subscription types.

The functionality is out there live at a dokieli near you. Source: https://github.com/linkeddata/dokieli .

So there we have it. It is all sort of simple.

The UI demonstrated in dokieli is not presumed to be ideal in any way - I'll improve that as time allows. If you have strong opinions about that, PRs are welcome :) The main exercise was about the discovery of relevant information and putting that information to some use.

I squatted the name solid:storageDescription which seemed to be closest to the notion of Description Resource or "described by". I'll excuse myself from bikeshedding the name. As mentioned before, the semantics needs to be clear - I'm aware that "storage description of a storage in which this resource is in" while may not quite roll well off the tongue it is sensible enough.

I also squatted the name solid:preferredPolicy to refer to an instance of odrl:Policy. I am looking into if and how the notion of preferred policies (or rights, processing rules, purpose..) can be worked in ODRL and DPV ( https://github.com/w3c/odrl/issues/21 , https://github.com/w3c/dpv/issues/36 ). I simply re-purposed the existing policy model for the time being to express a policy preference. I think there is a huge potential here for Solid applications. If there are examples in the wild, please provide a link.

The requirement for the above would be along these lines for discovery:

The server MUST include the Link header with rel="http://www.w3.org/ns/solid/terms#storageDescription" targeting the URI of the storage description resource in the response of HTTP HEAD or GET requests.

Again, for the storage resource (/), that'd be equivalent to using the describedby link relation in the HTTP header.

A simple approach would dictate that the storage description resource should only be managed by the server. I can also see unprotected statements being modifiable by the owner, e.g., storage name, policies. It gets fuzzy very quick: I would argue that storage's notification channels should be protected by the server in that even the owner shouldn't be able to modify them. I do not wish to dwell on this further and so just say that the Protocol can non-normatively mention some of these considerations whether after the main set of requirements or later in #security-considerations or #privacy-considerations sections.

If the server makes certain storage description available at locations besides /storage-description and allows modifications, it'd be up to the server to ensure that it is working with the same source information, regardless of however that data makes its way into the representation.

For storage description:

Whatever the server or potentially the owner wants to, and some of which can be accomplished through existing or new vocabularies and specifications. solid:owner is a good candidate as we already require that in the HTTP header. The Solid Protocol should remain silent about description for the most part, and allow other (small but exciting) specifications to first work it out, showcase implementations, and if necessary, the Protocol can refer to them or incorporate them.

I am once again asking for implementation feedback :)

kjetilk commented 2 years ago

Thank you for a very extensive writeup, @csarven ! I previously only concluded you were far ahead of me and didn't read it thoroughly, but now I have, and I think this makes much sense. I cannot contribute concrete implementation experiences, but I have some other relevant experiences, that indicate this is a good direction to go in.

I have two lines of considerations:

We need to balance "header bloat" with the number of HTTP requests needed to perform a certain task. I think this design strikes that balance well, in that I would be concerned if all possible additions to the protocol would end up requiring new HTTP headers. With this design, there is a need to perform a GET request, yes, but the result of that could be cached and could satisfy a range of use cases, and so, it seems that single GET request is a good balance.
Furthermore, we need to balance the immediacy of HTTP headers with the expressiveness of RDF. Defining HTTP headers is a somewhat painful exercise, and my feeling is that we should move towards using RDF, so that we can take advantage of the flexibility and rich semantics of RDF for more use cases. This proposal does that too.

I concur with the suggestion that the requirement should be:

The server MUST include the Link header with rel="http://www.w3.org/ns/solid/terms#storageDescription" targeting the URI of the storage description resource in the response of HTTP HEAD or GET requests.

but I'd like to add to OPTIONS too.

For the discussion around whether this should be a server-managed resource, I see some attraction in that, since it makes it easier to define without having requirements around rejecting certain triples. It would be interesting to hear if anybody has experiences that indicate that it should be user-mutable.

I also believe that it should be required that the root container is identified in the storage description, so, e.g. the triple

   </> a pim:Storage .

must be there, so that users would not have to nest down the hierarchy (a potentially costly operation) to find the root container, it will never be more than two requests to find it.

edwardsph commented 1 year ago

I think the spec needs cleaning up in this area as the following statements don't sit well together:

Servers exposing the storage resource MUST advertise by including the HTTP Link header with rel="type" targeting http://www.w3.org/ns/pim/space#Storage when responding to storage’s request URI.

From my reading of this, a server does not have to expose the storage resource and furthermore an agent may not even have permissions to access it even if it was exposed. The link header declaring that a target is a pim:Storage is not therefore required.

Clients can discover a storage by making an HTTP GET request on the target URL to retrieve an RDF representation [RDF11-CONCEPTS], whose encoded RDF graph contains a relation of type http://www.w3.org/ns/pim/space#storage. The object of the relation is the storage (pim:Storage).

There is no requirement on a server to provide pim:storage in the RDF representation of a resource although to say a client can discover a storage implies that. Do we need a server requirement?

Servers MUST include the Link header with rel="http://www.w3.org/ns/solid/terms#storageDescription" targeting the URI of the storage description resource in the response of HTTP GET, HEAD and OPTIONS requests targeting a resource in a storage.

This implies all servers MUST expose the storage as later requirements mean the description will contain a triple identifying the storage. Doesn't that make the initial conditional clause in the first statement irrelevant. Can we simply say:

Servers MUST advertise the storage resource by including the HTTP Link header with rel="type" targeting http://www.w3.org/ns/pim/space#Storage when responding to storage’s request URI.

Lastly, aren't the following two statements now poor examples of how a client can identify the storage:

Clients can determine the storage of a resource by moving up the URI path hierarchy until the response includes a Link header with rel="type" targeting http://www.w3.org/ns/pim/space#Storage. Clients can discover a storage by making an HTTP GET request on the target URL to retrieve an RDF representation [RDF11-CONCEPTS], whose encoded RDF graph contains a relation of type http://www.w3.org/ns/pim/space#storage. The object of the relation is the storage (pim:Storage).

The first is true but inefficient. For the second, even a there were a requirement on the server to provide a pim:storage triple, this would be a strange way of finding the storage given the simplicity of the storage description method which MUST be provided. I would suggest deleting both statements.

edwardsph commented 1 year ago

I just realised I missed something. The following statement does not make declaring the storage mandatory at all so some of my argument may be invalid.

Servers MUST include statements about the storage as part of the storage description resource.

Storage description statements include the properties:

rdf:type A class whose URI is http://www.w3.org/ns/pim/space#Storage.

To summarize how I would read this:

the storage description MUST include some statements about the storage
such statements include </> rdf:type pim:Storage
it is possible to add other statements which are not defined in this spec
no statements are explicitly made mandatory so the description may include other statements and exclude </> rdf:type pim:Storage

Is this what was intended?

It is also not clear from this statement that the storage description is RDF - it could be a text document with statements in any form. Obviously it could be inferred from the one example of a statement that could be included but I imagine this needs specifying. Could the storage description be defined as an auxiliary resource on the storage (just with links to it on all resources) in which case that would constrain it to being an RDF document.

kjetilk commented 1 year ago

Yes, I agree that a fundamental discovery mechanism like this MUST be MUST.

csarven commented 1 year ago

To clarify discoverability and access controls of storage description resource in https://solidproject.org/ED/protocol , raised by @elf-pavlik in https://matrix.to/#/!QxZtVBYQfMeMTnespj:gitter.im/$ut5bwPyKmna0YvctMyB-rgsvpztIp5eXUwujfDaJ2_g

Q1: are there any requirement on access control to the storage description resource? Q2: MUST the link be included no matter what response code is being used, so will 401 responses also provide it?

kjetilk commented 1 year ago

Right, that's interesting! Did you add an index, or did you do the traversal? (my assumption is that nobody would implement it with traversal)

woutermont commented 10 months ago

[@RubenVerborgh:] The impact of this on a server implementation is that for every request targeting a resource, information about all parent resources higher up in the hierarchy will have to be queried to find the matching storage. Or an additional index of some sorts will have to be created internally that keeps track of all the storages.

[@RubenVerborgh:] New numerical evidence backing up the CSS performance degradation as a direct result of implementing storage descriptions.

@RubenVerborgh, can you elaborate on where the measured performance impact occurs? Was an approach with an index also tried?

woutermont commented 10 months ago

@RubenVerborgh am I missing some complex aspect of this?

The CSS makes abstraction of just about anything; particular examples include finding ACL and META resources, checking if a container is the root ... The relevant functions are all called regularly during the request-response loop. Each of those would be problematic for performance if the default implementation did not rest on conventions (adding .acl, comparing to the base url ...). I don't see why the same cannot be done for finding the storage(description), especially since a storage is in essence nothing more than a namespace. If the root container serves as a storage, that is the only one; if not, storages are typically created on a single hierarchical level (either as subdomains or as a first path segment); those two pieces of information should be enough to implement a constant time conventional implementation.

joachimvh commented 10 months ago

If the root container serves as a storage, that is the only one; if not, storages are typically created on a single hierarchical level (either as subdomains or as a first path segment); those two pieces of information should be enough to implement a constant time conventional implementation.

That is indeed how we currently solve the problem in the CSS: one of those 2 options can be configured to either assume roots are at the base URL or on pod level as CSS creates them. It solves the general case but prevents more complex situations should those be needed. E.g., someone who wants 3 roots on a server, on /a/, /b/c/ and /b/d/, which could by done by for example first creating the relevant metadata on disk for those resources and then starting a server on that folder. Question is of course if that is something that needs to be supported.

woutermont commented 10 months ago

Thanks for the confirmation, @joachimvh!

Re supporting more complex cases: (assuming that we do not want to support automated creation of storages on random paths) storage structure will always follow a pattern, which can most likely be expressed as a limited set of rules. Expressing that set in config before starting a server is i.m.o. not too much of a requirement.

hzbarcea commented 4 months ago

Removed Release 0.11.0 milestone per agreement at 2024-02-14 CG meeting.

solid / specification

Server Description #355