solid / specification

Solid Technical Reports
https://solidproject.org/TR/
MIT License

Specify container description #227

Open csarven opened 3 years ago

csarven commented 3 years ago

Background: to date, the Solid Protocol (including earlier drafts and issues) has only required server-managed containment statements in the representation of a container. Additional information about the contained resources, such as last modification, size, and resource type, was deemed optional or considered a best practice as part of the container representation. Examples in the wild show that some servers make this additional information available, while others do not support it. Some applications use the information if available, or work around the limitation to obtain it [Anecdotal Evidence].

General use case: Support navigation of the container and its contents.

Use cases:

Related UCs:

Scenarios to consider:

General requirement: Include descriptions about contained resources in container's description to further support navigation and application interaction.

Specific requirements:

  1. Any information (eg. human-readable label of resources) that may be client or server-managed.
  2. Server-managed (controlled) information (eg. last-modified, resource size, resource types for controlled interaction models)

Considerations:

Related issues:

Notes:

csarven commented 3 years ago

I find the use cases to include "basic" information about contained resources in the container description compelling. Applications can immediately provide simple functionality by keeping the number of requests/connections minimal. It'd be reasonable to require this level of support on container read operations from servers in order to enable "smart" enough applications to get off the ground without having to resort to more advanced mechanisms.

I would consider last modification and size to be "basic" information. Ditto a human-readable label, if available, and possibly the creator of the resource. While knowing whether a resource is a container (by reading the container description) is very useful, that information can be derived from the shared slash semantics, so it is not strictly necessary for the container description to include the resource types of contained resources.
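
As an illustration of the slash-semantics point, here is a minimal sketch of how a client might tell containers from other resources using only the URLs in the containment statements. The function name and the example URLs are assumptions made for this sketch; nothing here comes from a particular server.

```python
from urllib.parse import urlsplit


def is_container(resource_url: str) -> bool:
    """Return True when the URL path ends with a slash (shared slash semantics)."""
    return urlsplit(resource_url).path.endswith("/")


# Hypothetical targets of ldp:contains statements read from a container.
contained = [
    "https://alice.example/photos/2021/",
    "https://alice.example/photos/cat.png",
]
containers = [url for url in contained if is_container(url)]
```

This only answers "container or not"; it says nothing about media type, size, or modification time, which is where the rest of the discussion comes in.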

bourgeoa commented 3 years ago

Can you add a reference to an HTTP/1.1 server specification that covers the information to be made available on the server side?

acoburn commented 3 years ago

I would rephrase the question here to be something more like:

A client needs a mechanism for finding descriptions of contained resources to further support navigation and application interaction.

I disagree that container listing is the best way to do this. A query endpoint (e.g. triple pattern fragments) can achieve the very same end with (arguably) better scalability characteristics.

The basic problem with including this data in a container relates to authorization.

Consider, for example, a container with 100 child resources. A simple GET request to the container will require an access check at the container level. Then 100 subsequent checks would be needed for each child resource. What happens with 1,000 child resources? 10,000 child resources? This does not scale.
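
The arithmetic behind this argument can be sketched as follows. This is only a count of authorization checks under the assumption that each child costs exactly one check; it does not model the cost of a single check or any caching.

```python
def checks_for_listing(child_count: int, per_child: bool) -> int:
    """Number of authorization checks needed for one GET on a container."""
    container_check = 1
    # With per-child enforcement, every contained resource adds one more check.
    return container_check + (child_count if per_child else 0)


for child_count in (100, 1_000, 10_000):
    print(
        child_count,
        checks_for_listing(child_count, per_child=True),
        checks_for_listing(child_count, per_child=False),
    )
```

The per-child count grows linearly with the container size, while the container-only count stays at one.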

The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF.
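
A minimal sketch of the "predictable window" idea, using plain offset paging rather than TPF or LDP Paging. The function name, the `can_read` callback, and the resource names are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def list_page(
    children: List[str],
    can_read: Callable[[str], bool],
    offset: int,
    page_size: int,
) -> Tuple[List[str], Optional[int]]:
    """Return the children in one page that may be described, plus the next offset."""
    window = children[offset:offset + page_size]
    # At most page_size authorization checks happen here, whatever the container size.
    describable = [child for child in window if can_read(child)]
    next_offset = offset + page_size
    return describable, (next_offset if next_offset < len(children) else None)


children = [f"r{i}" for i in range(10)]
first_page, next_offset = list_page(children, lambda child: child != "r1", 0, 3)
```

The number of checks per request is bounded by `page_size`, which is the property the comment is after.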

csarven commented 3 years ago

best way to do this

For whom? I agree from a server's point of view, but it is not particularly attractive from an application's point of view. It is quite a burden for applications to fetch each resource to get a hold of what they need (along the lines of what's mentioned in the above use cases) in order to provide something usable.

I would consider having to collect the data through a query endpoint relatively more complex than getting it simply from the container representation. Moreover, servers are not required to provide a query endpoint - at the time of this writing - so the basic information wouldn't be consistently available to applications.

If your counter-argument/proposal is to address the use cases above by querying, we need to introduce a query mechanism as a hard requirement. (That would help meet quite a few other needs, but that's beside the point.)

This does not scale.

Generally agree but we need empirical data as mentioned. True that a container can theoretically hold an infinite number of resources (I think). Are applications - with the understanding of hierarchical organisation of Solid storage - organising data such that containers with many resources are common (in the wild)? If at all, how is resource organisation or management factored in?

Servers may want to limit the number of members a container can have to a number it is comfortable with. Implementation detail.

Agree on needing pagination as a way to control the cost of a request/response which would be an alternative to above - server fixing the max number of resources allowed per container. Implementation detail.

acoburn commented 3 years ago

It is quite a burden for applications to fetch each resource to get a hold of what they need

This is not what I am suggesting. I agree that such an interaction is a non-starter: there are way too many HTTP round-trips. A query endpoint allows a client to retrieve all the information it needs in a single request.

This does not scale.

Generally agree but we need empirical data as mentioned.

Here is empirical data for a system that implements the "check every child resource" approach: https://wiki.lyrasis.org/display/FF/Many+Members+Performance+Testing You can see response times in the 60 second range for 10K child resources.

namedgraph commented 3 years ago

Our definition of a container is this RDFS class called dh:Container.

As you can see, there's a related property dh:select that a container resource has. It points to a SPARQL SELECT query that the client can use to select the children resources of the container. Usually it's an entry point to further client-side query building that sets modifiers (LIMIT/OFFSET/ORDER BY), wraps into DESCRIBE etc.

So for example (prefixes missing):


<photos/> a dh:Container ;
  dh:select <queries/select-children/#this> .

<queries/select-children/#this> a sp:Select ;
  sp:text "SELECT ?child WHERE { { ?child sioc:has_parent ?this } UNION { ?child sioc:has_container ?this } }". # ?this is a magic variable which binds to the request URI
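
To illustrate the client-side query building described above, here is a rough sketch that binds `?this` and appends solution modifiers to the SELECT text. The helper name and the example URI are assumptions, the plain string replacement is deliberately naive, and the `sioc:` prefix declaration is still missing, as in the original example.

```python
SELECT_TEXT = (
    "SELECT ?child WHERE { { ?child sioc:has_parent ?this } "
    "UNION { ?child sioc:has_container ?this } }"
)


def page_query(select_text: str, this_uri: str, limit: int, offset: int,
               order_by: str = "?child") -> str:
    """Bind ?this to the request URI and append ORDER BY / LIMIT / OFFSET."""
    # Naive textual binding; a real client would use a SPARQL parser or builder.
    bound = select_text.replace("?this", f"<{this_uri}>")
    return f"{bound} ORDER BY {order_by} LIMIT {limit} OFFSET {offset}"


query = page_query(SELECT_TEXT, "https://example.org/photos/", limit=20, offset=40)
```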

jeff-zucker commented 3 years ago

Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents? For example:

* private resource in a private container
   * unauthorized user can not view anything about the private resource
* private resource in a public container
   * unauthorized user can view size/last-modified/etc. but not GET content of the private resource

This would mean that the server never has to do a mass check of the permissions on its contents but the user would still have the option to hide the server-managed information when that is their intention.

csarven commented 3 years ago

@acoburn

This is not what I am suggesting.

I know. I said that as the current solution to meet the needs. Querying, pagination or something else is currently not possible (=unspecified).

Thanks re Fedora data, that is useful. It is not easy (for me) to break it down as there are a number of different dimensions with varying values. The test with ~60s is perhaps on the higher end ("perhaps postgres needs caching configured?") - if you can provide more insight on this, that'd be useful. There is a can of worms here re caching of access policies.

Is there something along those lines available for Trellis?

csarven commented 3 years ago

@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read?

csarven commented 3 years ago

@jeff-zucker

Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents?

No because each resource (container or other) can have different access controls. System must not leak any information about contained resources when agent is unauthorized to read those resources - last modification, size etc. are indeed sensitive and should not be exposed. The most a read access on a container permits is the visibility of the containment statements (just references).

bblfish commented 3 years ago

@acoburn wrote

The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF.

The LDP group worked quite hard on a spec for paging. See: https://www.w3.org/TR/ldp-paging/

acoburn commented 3 years ago

@csarven

Re: Trellis, that code works as described by @jeff-zucker (authZ decisions are made based on container permissions, not based on access to the child resource). Trellis also does not include any information about the child resources, so it just sidesteps this issue. Consequently, container retrieval is measured in milliseconds.

For Fedora, there was a huge amount of work done related to this issue, and ultimately, many users began finding various work-arounds that just avoided using LDP containment, e.g.:

In my own experience, the Fedora server just got really, really slow once you had more than a thousand child resources in a single container. There were various attempts to resolve this, but those efforts never really went anywhere with that tech stack. I don't know where things stand these days, but it led to a lot of people abandoning the project.

Re: Query -- I see paging and query as two ways of describing a very similar feature, and they are both really useful.

namedgraph commented 3 years ago

@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read?

No ACL for children resources, no (yes for containers themselves). Client-side containers are just a UI for certain SPARQL queries, and we don't have ACL for plain SPARQL -- only for Linked Data resources. Once you have SPARQL access, you can pretty much see all the data, so it's a privilege to have.

NoelDeMartin commented 3 years ago

I recently noticed that ESS does not include the modified time because it's not part of the spec, and that makes apps unusable for large collections. So I'm very happy to see this :). I think my use-case has already been covered in previous comments, but I'll go over it briefly in case it's useful to see it from an app developer's perspective.

What I want to do in my app is reduce the quantity (and size) of network requests. Given that querying is not supported, the solution I've arrived at is caching everything in the client. This makes the first session slower, but makes subsequent sessions faster. It also improves the overall responsiveness of the app, because it doesn't have to make network requests for reading data. However, all of this depends on being able to read only the updates at the start of every session. So far, that's what I've been using the modified time for, and without it I can't think of a way to improve the application start up.
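
The start-up step described here could look roughly like the sketch below, assuming the container representation exposes a modification time per contained resource. The function names and the dictionary shapes are hypothetical and not taken from any app.

```python
from datetime import datetime, timezone
from typing import Dict, List


def resources_to_refetch(listing: Dict[str, datetime],
                         cache: Dict[str, datetime]) -> List[str]:
    """URLs that are new, or modified since the cached copy was stored."""
    return sorted(
        url for url, modified in listing.items()
        if url not in cache or cache[url] < modified
    )


def resources_to_evict(listing: Dict[str, datetime],
                       cache: Dict[str, datetime]) -> List[str]:
    """Cached URLs that no longer appear in the container listing."""
    return sorted(url for url in cache if url not in listing)


old = datetime(2021, 1, 1, tzinfo=timezone.utc)
new = datetime(2021, 2, 1, tzinfo=timezone.utc)
listing = {"/notes/a": old, "/notes/b": new, "/notes/c": new}
cache = {"/notes/a": old, "/notes/b": old, "/notes/d": old}
```

Without the modification times in the listing, the first function has nothing to compare against, which is the problem the comment describes.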

Something else that would be useful is knowing the types of resources included in the documents. For example, reading the type index I can find containers that include the types of resources I'm interested in. But that doesn't mean that a container doesn't have other types of resources, and I'd like to avoid reading documents that are not relevant to my app.

I understand that doing this can have an impact on server performance, so I don't have strong opinions as to how this information should be retrieved. I think it would make sense to return only containment triples by default, and use some mechanism like headers to indicate what other types of information is relevant.

Re:pagination, I suppose for really large amounts of data it would be necessary. With my current approach it's actually better to get everything in one request, given that I'll want to read all the documents that are relevant to my application (I was actually using globbing before it was deprecated). Pagination would be useful with query support - at that point I may be able to avoid caching everything - but given the current status this is the only viable solution I found.

gibsonf1 commented 3 years ago

For the TrinPod server case in authenticating what RDF data to include in a container request:

We use a fully hierarchical authentication scheme that at the lowest level is a single statement, so our server first retrieves all the information that a request would have without authentication, then does an auth check on each statement that the authenticated user has access to to generate the final response. The hierarchical nature of the auth check in combination with the cached acls presents virtually no resource hit on the server side.

On the Application side, in creating our Files app which we are finishing now, we are arriving at the idea that a single request to a container should present enough information for the user to intelligently decide what they want to do next, such as expand a child branch of that container. So we would be very happy to support any proposed standards about what to include as part of a container request to improve the UX. I think the paging issue that @acoburn brings up is also very important, so a standard around that would be great too.

At the moment, as standards aren't yet in place, for TrinPod we are including in a request to a container: all the child nodes of the container with ldp:contains, and then the ldp:contains of those child nodes, as well as the last event triples around the content in the requested container (such as any schema:UpdateAction around that content), all of course filtered by user access permissions.

csarven commented 3 years ago

https://www.w3.org/TR/ldp-paging/ https://www.w3.org/TR/activitystreams-core/#paging

Created issue for resource paging: https://github.com/solid/specification/issues/230

gibsonf1 commented 3 years ago

@csarven I vote to make those two specs part of the Solid standard - but I think also needed would be a recommendation for how many items to include in a given page

csarven commented 3 years ago

@gibsonf1 If paging is required, I can't see why more than one mechanism is needed. The number of items to include in a paged resource would either be a client preference included in the request, which a server may agree to, or simply the server's own choice (implementation detail).

bblfish commented 3 years ago

It would be worth having a comparison between both.

kjetilk commented 3 years ago

I'm catching up here, and I appreciate that this is a summarization of several different things, and so I don't think it serves to pose this as a single question.

What I'm seeing here are at least these problems:

  1. Augment the data in the container with data to enable apps to present a summary view to the user.
  2. Augment the containment triples with minimal metadata that clients are likely to find useful to perform well.
  3. Ensure that the above data isn't exposed without authorization.

The first case is essentially a generalization of the Data Browser behavior where it looks for index.ttl to augment the view. I believe that this should be solved by having a predicate (e.g. rdfs:seeAlso or a subproperty thereof) in the container representation that points towards a resource that the client should fetch to do so. The applications will have to deal with authz so that no user gets data they shouldn't get, but I think that is the best solution anyway, as in many cases it may be OK to show a title and a thumbnail, but nothing more. We shouldn't place too many restrictions on this from the spec side.

Number 2 is essentially what we have referred to elsewhere as a File Scan operation. We haven't set down what a File Scan operation is, but in the context of Solid it is pretty clear that a File Scan operation reads the contents of a container. It currently requires read privileges on the container, and that should be adequate for now.

It is very interesting to read that @gibsonf1 has an implementation that performs well when checking access control for a tree, but in the interest of having a spec that many can implement, at least in the initial versions, I think it is correct to assume that it is rather hard to achieve that performance, as @acoburn has experienced. Thus, at least initially, we should make sure that a File Scan operation can be done with read privileges on the container only. Anything beyond that is not a File Scan operation.

Then, the question becomes what information a File Scan operation can legitimately expose. I think the above discussion and @acoburn 's comment in #116 makes it very clear that at least the containment triples are a part of the container representation, if you need the hidden file case, then you need to make a child container and then have other permissions on that.

My opinion, at least right now, is that some other attributes, like mtime, type and size, could be part of the container representation in a File Scan operation. Again, if you need to protect those, make a container with different permissions.

There's also some precedent for this: Apache has a default index that exposes mtime and size.

In conclusion, number 2 above is the File Scan operation, which maps to a read operation on the container in Solid, which exposes containment triples, size, type and mtime as well as other server managed and client managed metadata.

But, there's more! ;-)

It could be argued that computing mtime and size is too heavy for most users, so we shouldn't provide them unless clients ask for them. For that, I suggest we look into defining and registering a Prefer header preference. With this, clients could for example request the container with a Prefer: return=full, which would give them the full representation, including the mtime, size and type. Effectively, this would make it optional for servers to support it, but that's OK.
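
A sketch of what such a request could look like from a client, built with Python's standard library and not sent anywhere. Note that `return=full` is the value floated in this comment and is not a registered preference (RFC 7240 registers `return=representation` and `return=minimal`); the function name and URL are likewise assumptions.

```python
from urllib.request import Request


def container_request(container_url: str, full: bool) -> Request:
    """Build (but do not send) a GET for a container representation."""
    headers = {"Accept": "text/turtle"}
    if full:
        # Hypothetical preference value proposed in the thread, not a registered one.
        headers["Prefer"] = "return=full"
    return Request(container_url, headers=headers, method="GET")


request = container_request("https://alice.example/photos/", full=True)
```

A server that does not recognise the preference can ignore it and return the containment triples only, which matches the "optional for servers" remark.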

bblfish commented 3 years ago

@RubenVerborgh I understand that but I think it is also going to be quite process heavy.

Could you not instead bundle all those issues together and then categorise those in different ways: ldp:contains as that closest to the core, then group the others into major application areas, and find out who supports them. It would be good if there were a document that would at least give some ideas as to which parts fit together, and how widespread they are, which servers have implemented them. Then one would know who to ask regarding their implementation experience.

jeff-zucker commented 3 years ago

I would like to put in a word for mime-type support. Example use case : The databrowser looks at the triples of an NSS container and finds all contained resources whose media type matches iana/image and if some are found, presents a slideshow button to view them all. That's not possible on CSS which does not add such data to the container. It's hard to imagine that divulging that foo.png has a media-type containing iana/image would give away sensitive information.

kjetilk commented 3 years ago

I have two concerns:

There could be some security or performance concerns around adding certain types of metadata, so leaving it entirely up to the server without public consultation like we do here could be problematic. Also, this kind of variability could cause interop problems. At present, the ecosystem is rather small, so I believe we can do it for now.

Secondly, my favorite field is query evaluation across Solid data, and I believe that Computer Science simply does not have the empirical or epistemological strength to create generalizable knowledge that will clearly guide us in this area. So, the fallback is then to add stuff and cross fingers ;-)

It just occurred to me that adding mtime will cause a cascade towards the root, as all containers up to the root will have to be updated with that mtime as a result. That's probably not behavior we'd like to encourage right now, so perhaps that wasn't such a great idea anyway.
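
The cascade can be pictured with a small sketch that lists the containers between a resource and the storage root, i.e. the ones that would need touching under the assumption that a change to a child's mtime counts as a change to each enclosing container's description. The function name and URL are hypothetical.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def ancestor_containers(resource_url: str) -> List[str]:
    """Containers from the immediate parent up to the root, nearest first."""
    parts = urlsplit(resource_url)
    segments = [segment for segment in parts.path.split("/") if segment]
    ancestors = []
    # Drop one trailing segment at a time until only the root path is left.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth]) + ("/" if depth else "")
        ancestors.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return ancestors


touched = ancestor_containers("https://alice.example/photos/2021/cat.png")
```

One write to a deeply nested resource would then imply one update per level of nesting.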

In the interest of progress, I think we can go with containment triples as the only requirement for now, but that we say that servers MAY include other metadata. Then, we add a section in the security concerns about the metadata, and we start opening other issues about metadata that should become MUSTs; I think that @jeff-zucker is right that media type is a good candidate for that.

We still have the issue of data augmentation to deal with here though, i.e. the old index.ttl mechanism, which I believe should be dealt with using the rdfs:seeAlso predicate.

bblfish commented 3 years ago

I have found a way in Java to get access to the metadata at the same efficiency as getting the file name listings in a directory: https://stackoverflow.com/questions/66699379/how-to-get-streams-of-file-attributes-from-the-filesystem/66713743#66713743 (I think. It would be worth testing this out just to make absolutely sure that the speed is equivalent.)

This is the data you can get access to: https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/nio/file/attribute/BasicFileAttributes.html
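
For readers who do not use Java, a rough Python analogue of reading basic attributes while iterating a directory is sketched below with `os.scandir`. This is not the Java approach from the linked answer, and whether the attributes come at the same cost as the names depends on the platform and would need measuring. The function name and dictionary keys are assumptions.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_directory(path: str) -> list:
    """Collect name, size, mtime and directory flag for each entry in a directory."""
    rows = []
    with os.scandir(path) as entries:
        for entry in entries:
            info = entry.stat()
            rows.append({
                "name": entry.name + ("/" if entry.is_dir() else ""),
                "size": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                "is_container": entry.is_dir(),
            })
    return sorted(rows, key=lambda row: row["name"])


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "w") as handle:
        handle.write("hello")
    os.mkdir(os.path.join(root, "sub"))
    listing = describe_directory(root)
```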

dmitrizagidulin commented 3 years ago

@jeff-zucker

I would like to put in a word for mime-type support.

+1, I think that mime-type would be incredibly useful. (So, specifically, containment + mime-type being the minimum mandated fields.)

kjetilk commented 3 years ago

We did discuss this further in the Solid Editors meeting today, but we have noted that we haven't yet reached a rough consensus.

Two things seem very clear though:

  1. We can't require a server to look up access controls for child resources as that would complicate servers substantially.
  2. There seems to be quite broad agreement that metadata can leak information and so create a security concern.

We found that these two observations, by themselves, do not require any changes to the current spec, but they also do little to address the original concern of this issue.

This does suggest to me that the container resource and the metadata need to have potentially different authorizations, so that it is up to a client with Control privileges to decide whether an agent gets to see the metadata or not.

This again implies that it must either be configurable where the metadata goes, or it would need to go into a separate auxiliary resource. It makes sense to leave some variability here. For a server that does not share the security concern, it is OK to have the metadata in the container description itself, but it must then be aware that it can't be influenced by the client. I therefore think we should look in the direction of having an augmentation resource #144 that has its own access control, as suggested in #306.

By default, metadata should go to such an augmentation resource, but it could be configurable to allow all or some to be present in the container.

That's my current opinion.

bblfish commented 3 years ago

This does suggest to me that the container resource and the metadata need to have potentially different authorizations, so that it is up to a client with Control privileges to decide whether an agent gets to see the metadata or not.

That sounds like a very good idea.

When a security problem appears for which there are use cases where it does not matter and indeed where information wants to be shared and also use cases where it does matter and information must be tightly controlled, make the settings configurable. Then one can also develop guidelines for different situations.

jeff-zucker commented 3 years ago

What I meant is that it isn't possible to get the information from reading the container in CSS, not that it is impossible in general.

bblfish commented 3 years ago

re efficiency optimization see my answer above. In Java I think one can get the following metadata as easily as the file listings. That should not be surprising, given that the OS will be storing the name of a file very close to where it has all that other information too. Also note I have seen it argued that solid state drives have completely transformed the relation between processor speed and disk speed to the point that disk speed is now faster than what processors can cope with. The point is optimization requirements are important, but they can also be evaluated empirically.

| Modifier and Type | Method | Description |
| --- | --- | --- |
| FileTime | creationTime() | Returns the creation time. |
| Object | fileKey() | Returns an object that uniquely identifies the given file, or null if a file key is not available. |
| boolean | isDirectory() | Tells whether the file is a directory. |
| boolean | isOther() | Tells whether the file is something other than a regular file, directory, or symbolic link. |
| boolean | isRegularFile() | Tells whether the file is a regular file with opaque content. |
| boolean | isSymbolicLink() | Tells whether the file is a symbolic link. |
| FileTime | lastAccessTime() | Returns the time of last access. |
| FileTime | lastModifiedTime() | Returns the time of last modification. |
| long | size() | Returns the size of the file (in bytes). |

bblfish commented 3 years ago

In Java I think one can get the following metadata as easily as the file listings.

Non-distributed filesystems are just one possible backend though; there are many others.

Perhaps we should list all of these, and work out what the properties that they can make available and at what cost. (I have just provided one data point, to argue that some metadata may not be that expensive on simple file systems that may be used in a very wide range of cases.)

Then one should consider which apps need such data, and why: what are their requirements? After all, it is only good apps that will make the Solid ecosystem grow.

One should not forget that one could later make optimisations such as using SPARQL to query a container and that if the need is there one can optimise with indexes...

kjetilk commented 3 years ago

OK, this is progress! In the interest of further progress, my main concern now is to identify the areas where we can say that we have rough consensus and where we need more work. If there are issues that we can open that can clearly encapsulate the open issues, that's great if we then can find rough consensus on this issue.

I think there is consensus that metadata is needed, but also that we need to have a mechanism that ensures that an agent with Control can control access to that metadata.

We can then open issues to discuss exactly what metadata that  {may, should, must} be available.

In this issue, we need to come to a consensus around the mechanisms for exposing that information. How about something like

A server MUST support an auxiliary resource type ContainerListMetadata (or something). The server SHOULD maintain a representation of metadata about child resources in either that auxiliary resource or the container representation, depending on configuration. When the auxiliary resource is readable to the client, the server MUST link to it from the container.
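
One way to picture the proposed behaviour is the sketch below: depending on configuration, child metadata lands either in the container body or in a separate auxiliary body, and the container only links to the auxiliary resource when the client may read it. `ContainerListMetadata` is the placeholder name from the proposal; the link relation URI, function names, and data shapes are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical link relation; no such relation is defined anywhere yet.
AUX_REL = "https://example.org/ns#containerListMetadata"


def place_metadata(contains: List[str], metadata: Dict[str, dict],
                   use_auxiliary: bool) -> Tuple[dict, Optional[dict]]:
    """Return (container body, auxiliary body or None) for the chosen configuration."""
    container_body = {"contains": list(contains)}
    if use_auxiliary:
        # Metadata is kept out of the container and can get its own access control.
        return container_body, dict(metadata)
    container_body["metadata"] = dict(metadata)
    return container_body, None


def link_header(aux_url: str, aux_readable: bool) -> Optional[str]:
    """Link header value for the container response, only if the client may read the aux resource."""
    if not aux_readable:
        return None
    return f'<{aux_url}>; rel="{AUX_REL}"'
```

For example, `place_metadata(["cat.png"], {"cat.png": {"size": 5}}, use_auxiliary=True)` leaves the container body with containment only and returns the metadata separately.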

In security concerns, we say something to the tune of

Servers should make sure that the child resource metadata does not expose information that an agent with Read access to the container must not be able to retrieve. The server should use a ContainerListMetadata auxiliary resource with different authorizations if required. It is encouraged to make it configurable to decide if metadata is represented in the container or the auxiliary resource.

Then, we need to define the aux type ContainerListMetadata with its own AC (which is a separate issue).

And then we need to define what metadata is useful, which can also be separate issues.

justinwb commented 3 years ago

It is not impossible, just slower; you can always HEAD every single item. I know that it is not practical etc., but we need to distinguish between "impossible" and "performance optimization". For instance, not listing children makes them undiscoverable and thus in some cases literally impossible to access.

☝️ this. We can agree that having metadata about a container listing is useful, but that shouldn't automatically mean that the server is responsible for maintaining it.

For example, if we look at the macOS operating system, most Mac users would be familiar with the .DS_Store file, if for no other reason than that you have to add it to your .gitignore all the time.

From Wikipedia:

The file .DS_Store is created in any directory (folder) accessed by the Finder application

It is not maintained by the operating system, but by the client application (Finder) the first time it accesses the folder, as an optimization.

So while I think it could be worthwhile to define an auxiliary resource that can be used to store container list metadata in an agreed upon schema, I'm not fully convinced the server should manage it.

acoburn commented 3 years ago

If the container resource and the container description are separated into two distinct resources with different URIs, would a user be prohibited from adding ldp:contains triples to the description resource?

kjetilk commented 3 years ago

@acoburn :

If the container resource and the container description are separated into two distinct resources with different URIs, would a user be prohibited from adding ldp:contains triples to the description resource?

Good question, the answer must certainly be yes.

kjetilk commented 3 years ago

@justinwb :

So while I think it could be worthwhile to define an auxiliary resource that can be used to store container list metadata in an agreed upon schema, I'm not fully convinced the server should manage it.

That's an interesting data point!

First of all, the client is free to add whatever data it wants to a resource it is authorized to (and subject to shape constraints), so that can already be done. We do not need to add anything for that.

What has convinced me about this is that, for Solid, there are many different apps that are likely to need the same data, and for very different purposes.

On one hand, this provides a way forward that we do not need to be ahead of: If we see data that a lot of apps are requiring, then we add it as a server requirement as a result of that observation.

On the other, it might also be the case that some apps cannot be realized before the data is already common, we may never know.

I therefore think that at the very least, we need a mechanism in place to allow container metadata to be added as a server requirement, we shouldn't need to have a long discussion when the requirement arises, that might make it too late.

csarven commented 2 years ago

@jeff-zucker

I would like to put in a word for mime-type support.

There is indeed utility but it competes with information exposure.

It's hard to imagine that divulging that foo.png has a media-type containing iana/image would give away sensitive information.

"foo.png" is a resource name allocated by server or requested by client -- that's fine. It is true that with the common extension one could interpret that as an image. However, it doesn't work the same way for names that are chosen to be opaque eg. /{uuid} -- and so no need to give away that it is an image.


@kjetilk

I think there is consensus that metadata is needed, but also that we need to have a mechanism that ensures that an agent with Control can control access to that metadata.

I don't see how that conclusion (an agent needing to control access to that metadata) is reached. That seems to further complicate the problem?


@justinwb @kjetilk

It is not impossible, just slower; you can always HEAD every single item. I know that it is not practical etc., but we need to distinguish between "impossible" and "performance optimization". For instance, not listing children makes them undiscoverable and thus in some cases literally impossible to access.
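As a rough illustration of the "HEAD every single item" fallback mentioned above, here is a minimal Python sketch. The helper names `http_head` and `describe_children` are hypothetical, and the set of headers collected is an assumption; a real client would also need authentication and error handling.

```python
from typing import Callable, Dict, Iterable, Optional
from urllib.request import Request, urlopen


def http_head(url: str) -> Dict[str, str]:
    """Issue one HEAD request and return the response headers, lower-cased."""
    with urlopen(Request(url, method="HEAD")) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def describe_children(
    child_urls: Iterable[str],
    head: Callable[[str], Dict[str, str]] = http_head,
) -> Dict[str, Dict[str, Optional[str]]]:
    """Collect basic metadata with one request per contained resource.

    This is the slow path: N children cost N round trips.
    """
    wanted = ("content-type", "content-length", "last-modified")
    described = {}
    for url in child_urls:
        headers = head(url)
        # Missing headers are recorded as None rather than guessed.
        described[url] = {name: headers.get(name) for name in wanted}
    return described
```

The `head` parameter is injectable only so the loop can be exercised without a network; the cost that matters for this discussion is the one request per child.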

☝ this. We can agree that having metadata about a container listing is useful, but that shouldn't automatically mean that the server is responsible for maintaining it.

We should not conflate server's claim of its own resources and information that an application maintains about resources.

Applications should place their trust on the server about the resources and not a third-party (i.e. another application).

AFAICT, .DS_Store is only used?/maintained by Finder or some alternative app? They all ultimately rely on what the filesystem provides.

Applications maintaining information about resources as well as content-level annotations can still co-exist.


From /TR/protocol's security considerations:

Servers are strongly discouraged from exposing information beyond the minimum amount necessary to enable a feature.

Minimum is the containment statements. Server-driven container description stating anything about the state of contained resource would be exposure without proper authorization.


@acoburn

If the container resource and the container description are separated into two distinct resources with different URIs,

I presume you mean that there is a container resource and then there is a containment-statements resource.

Noted in https://github.com/solid/specification/issues/227#issue-801334244 :

Instead of the container resource, the associated description resource of a container (ie. target of describedby) could include information about the contained resources. Doesn't violate best practice on self-describing documents per se but it is perhaps not the most intuitive place to look for additional information about the contained resources.

There could be a separate property for the ContainmentStatements resource but this may be diverging from what we have and LDPBC.

would a user be prohibited from adding ldp:contains triples to the description resource?

Right. To me this is similar to the way pagination works, in that a client can't, or has no obvious reason to, update paged resources that list the containments -- discussed in https://github.com/solid/specification/issues/230#issuecomment-774791386


We need to be clear on the core aspects of container description at least from the perspective of the server. Containment statements are taken to be part of the description -- what Protocol / LDP uses. We may need solid:Resource/Container (see also https://github.com/solid/specification/issues/194#issuecomment-694828342 ) if we want to say something different than LDP(B)C. If that full description includes additional information about contained resources, it is straightforward to use the Prefer header or the profile parameter to return a subset of the description which can be influenced by permissions. That way the default description could remain as the minimal contained resource. (I realise this is sketchy). If the description doesn't include anything beyond statements, Prefer/profile may not be the right way forward as that would try to generate a representation that is a superset.

jeff-zucker commented 2 years ago

If the user has no permission on the container, there is no problem with listing the media-type in the container - they won't see it anyway. If the user has permission on the resource, there is no problem listing the media-type in the container because they can easily get the information in a HEAD so nothing is being protected. The same reasoning applies if the resource was created with an informative extension using PUT or PATCH - the information is already there.

So, it seems to me there is only one case in which listing the media-type in the container presents a problem - when all of these conditions are met: 1) the person uses POST and/or omits an informative file extension 2) to create a private resource 3) in a public container and 4) does not want to divulge its media-type.
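To make the edge case concrete, the four conditions can be written as a single predicate. This is only a sketch of the reasoning above; the `Containee` type and its field names are hypothetical and do not come from any Solid specification.

```python
from dataclasses import dataclass


@dataclass
class Containee:
    """Hypothetical summary of one contained resource, for illustration only."""
    name_reveals_type: bool            # e.g. "foo.png" rather than an opaque /{uuid}
    publicly_readable: bool            # the resource itself is readable by the requester
    container_publicly_readable: bool  # the container listing is readable by the requester
    owner_wants_type_hidden: bool      # the owner does not want the media-type divulged


def listing_media_type_is_a_problem(c: Containee) -> bool:
    """True only when all four conditions from the comment hold at once."""
    return (
        not c.name_reveals_type
        and not c.publicly_readable
        and c.container_publicly_readable
        and c.owner_wants_type_hidden
    )
```

Flipping any single field makes the predicate false, which is the sense in which this is one narrow case.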

I (perhaps naively) think there ought to be a way to handle that edge case without penalizing every app that wants to know the media types of a container's contents.

@csarven could you point me somewhere that explains what you mean by "Prefer/profile"?

jeff-zucker commented 2 years ago

Thanks for keeping reminding me of the difference between a resource and a representation. I do understand the difference even if I sometimes confuse them when I write. You're right that media-type is not the relevant thing to know about RDF resources and containers. What makes sense to me is to list the media-type and also these types : ldp:RDFSource, ldp:NonRDFSource, ldp:BasicContainer, etc..

jeff-zucker commented 2 years ago

And I should also add that it is not media-type specifically that I am interested in. I am interested in triples that tell me what type of thing I am looking at. That might be media-type, it might be dc:format as ESS uses, it might be an ldp type, or other things.

kjetilk commented 2 years ago

@kjetilk

I think there is consensus that metadata is needed, but also that we need to have a mechanism that ensures that an agent with Control can control access to that metadata.

I don't see how that conclusion, an agent needing to control access to that metadata, is reached. That seems to further complicate the problem?

Perhaps it was awkwardly formulated. You do agree that, from this discussion, there appears to be consensus that we need to make sure different access controls can apply to the metadata than to the container representation, right? Naively, we can formulate that as: it must be possible to put that metadata in an aux resource, and that resource has its own ACL, right? So, saying that an agent with Control can control access to the metadata is just saying that without saying that it is an aux resource with an ACL. Those points are a solution to the problem; I tried to formulate it in more abstract terms, but that apparently failed to be sufficiently clear.

kjetilk commented 2 years ago

@csarven could you point me somewhere that explains what you mean by "Prefer/profile"?

I can do it, @jeff-zucker : :-) There's a pretty nice section about the use of Prefer headers in the LDP Spec. We could adopt that. There are also various other proposals, but short-term, this is the most important one.

jeff-zucker commented 2 years ago

@kjetilk - If I'm reading that correctly, it is up to the app developer to learn how to send prefer headers and to use them to request particular as yet undefined subsets of container information. @csarven mentioned "Prefer/profile" - how does this relate to a profile? How does either relate to the question of whether a GET on a container should return information on the type of contained resources (including but not limited to media-types)?

kjetilk commented 2 years ago

Ah, now I think I understand your concern, @jeff-zucker , good question. Let's see if I can provide some clarity (hard, because this isn't entirely clear to me either).

Basically, there are several mechanisms for providing a subset of the full representation of a resource, Triple Pattern Fragments, Prefer headers, a "profile" (see a comment further up by Ruben for one take on this).

The most immediate use for this mechanism is to list only the containment triples, since LDP has specified a mechanism for this. Then, one could argue (as I have elsewhere), that you could have an access mode that allows only that operation, because that is an operation that is common in access control systems, the "File Scan".
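For reference, the LDP mechanism is a `Prefer` request header carrying `return=representation` with `include` and/or `omit` hints that name LDP preference IRIs such as `ldp:PreferContainment` and `ldp:PreferMinimalContainer`. A small Python sketch of building such a header follows; the helper name `prefer_header` is hypothetical, and servers are free to ignore these hints.

```python
from typing import Sequence

LDP = "http://www.w3.org/ns/ldp#"
PREFER_CONTAINMENT = LDP + "PreferContainment"
PREFER_MEMBERSHIP = LDP + "PreferMembership"
PREFER_MINIMAL_CONTAINER = LDP + "PreferMinimalContainer"


def prefer_header(include: Sequence[str] = (), omit: Sequence[str] = ()) -> str:
    """Build the value of a Prefer header with LDP include/omit hints.

    Multiple IRIs inside one hint are separated by single spaces.
    """
    parts = ["return=representation"]
    if include:
        parts.append('include="%s"' % " ".join(include))
    if omit:
        parts.append('omit="%s"' % " ".join(omit))
    return "; ".join(parts)


# Ask for the containment triples to be included.
containment_only = prefer_header(include=[PREFER_CONTAINMENT])

# Ask for the container's own description without its listing.
without_listing = prefer_header(omit=[PREFER_MEMBERSHIP, PREFER_CONTAINMENT])
```

A client would send the resulting string as the `Prefer` header on a GET of the container and should check the response to see whether the preference was actually applied.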

Conceivably, we could also say that the representation returned without a Prefer header, i.e. a "default representation", is not the full representation, you'd need to add a Prefer header to get it all, and then metadata could be in the full representation but not in the "default representation".

Therefore, we have discussed that idea a bit, and it may be what @csarven is alluding to. I didn't catch that at first.

But let me be the first to say "OMG!", because I certainly did not intend to say that we should use Prefer in connection with access control as a general pattern, I don't think that would be appropriate at all. I think that is an interesting mechanism only in the case of containment triples because it is already well defined by LDP and as a common "File Scan" access mode.

Going further in that direction would open the whole Selective Attribute Disclosure issue, which is something we certainly have to open, but only with extreme care. For now, I think the only reasonable direction we could take is that metadata that needs protection goes into a separate resource and that resource has its own access control resource. We currently have resource-centric access control, and that's what we should use now.

jeff-zucker commented 2 years ago

Apologies if this has already been covered, but is there a reason we can't simply say that a user's view of a Container's containees is restricted to ones they have Read access to? Why list them at all if the user can't access them? Why divulge their existence and their URIs? If only people with Read permissions can see the Container listing of the resource, there is no longer a privacy problem with listing things like media-type and no need for complex requests to get information that should be easy to get.

jeff-zucker commented 2 years ago

Okay, then is the current thinking that containment triples are a MUST and everything else is a MAY; that if there is eventually more included, apps can send prefer headers to limit it to containment; and that eventually there will be a possibility to specify an auxiliary user-controlled resource which contains a profile of the information to be displayed by the container?

bourgeoa commented 2 years ago

It seems that we are on the way to saying: any information is bad. I do not see this as Solid's purpose.

I see a regression over the last 2 years. The specification is about sharing; not everything is a risk, and this is not a bank protocol.

kjetilk commented 2 years ago

So, my opinion, just to state that clearly, is that the resolution of this issue is that we design that aux resource type now, specifically an aux resource type that has its own access control, so that we have that mechanism in place as we need it. Then, we can discuss what data the server should put into that resource.

csarven commented 2 years ago

@kjetilk

Perhaps it was awkwardly formulated. You do agree that, from this discussion, there appears to be consensus that we need to make sure different access controls can apply to the metadata than to the container representation, right?

AFAICT, different approaches are being studied. Do you feel that there is "consensus" on the note item https://github.com/solid/specification/issues/227#issue-801334244 :

Associated resource for contained resources
Instead of the container resource, the associated description resource of a container (ie. target of describedby) could include information about the contained resources. Doesn't violate best practice on self-describing documents per se but it is perhaps not the most intuitive place to look for additional information about the contained resources.

(s/describedby/solid:containsMeta or whatever)

Naively, we can formulate that as: it must be possible to put that metadata in an aux resource, and that resource has its own ACL, right?

Maybe. We need to consider what it can/should contain first; then what the access controls can be is easy to answer for:

Specific requirements
* Any information (eg. human-readable label of resources) that may be client or server-managed.
* Server-managed (controlled) information (eg. last-modified, resource size, resource types for controlled interaction models)

To me the key thing is whether what's being described is client's annotations or server's description of its own resources i.e., info we don't want anyone (even "owners") to be able to alter:

Authoritative resource metadata
Applications should place their trust on the server about the resources and not a third-party (i.e. another application).

As for:

So, saying that an agent with Control can control access to the metadata is just saying that without saying that it is an aux resource with an ACL. Those points are a solution to the problem; I tried to formulate it in more abstract terms, but that apparently failed to be sufficiently clear.

Ack. I still think the use cases are different for who/what should "control" what. To divide and conquer:

I propose that we first acknowledge authoritative resource metadata because it influences the default/basic behaviour for clients. The UCRs in the first comment highlight that need. This information about contained resources (possibly in an associated resource) MUST only be controlled by the server. The information that is generated depends on recipients' access privileges on the contained resources, and can be read by the recipient that is properly authorized. So, for example, if recipient has Read on /foo , /bar , but not on /baz, the associated resource will not expose any information about /baz.
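A minimal Python sketch of the /foo, /bar, /baz example follows. It assumes the server already holds metadata per contained resource and can answer a Read check; the function `describe_for`, the `can_read` callback, and the WebID and metadata keys are hypothetical placeholders, not part of any specification.

```python
from typing import Callable, Dict

Metadata = Dict[str, Dict[str, str]]


def describe_for(
    recipient: str,
    metadata: Metadata,
    can_read: Callable[[str, str], bool],
) -> Metadata:
    """Return only the descriptions of resources the recipient may Read.

    Resources without Read access are left out entirely, so the response
    says nothing about them.
    """
    return {
        url: dict(description)  # copy, so callers cannot alter server state
        for url, description in metadata.items()
        if can_read(recipient, url)
    }


# Hypothetical server-managed metadata and permissions.
server_metadata = {
    "/foo": {"modified": "2021-09-01T10:00:00Z", "size": "120"},
    "/bar": {"modified": "2021-09-02T11:30:00Z", "size": "2048"},
    "/baz": {"modified": "2021-09-03T09:15:00Z", "size": "77"},
}
read_grants = {"https://alice.example/#me": {"/foo", "/bar"}}


def has_read(agent: str, url: str) -> bool:
    return url in read_grants.get(agent, set())


visible = describe_for("https://alice.example/#me", server_metadata, has_read)
```

Here `visible` contains entries for /foo and /bar only; an agent with no grants gets an empty description.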

I said this "doesn't violate best practice on self-describing documents per se" because it is part of the associated resource as opposed to container's description. If we "violate" that, which is not (really) the end of the world, it can be part of container's description - and so that eliminates an additional resource type.

Aside: I think this is similar to Trellis' trellis:PreferServerManaged.


Principle: Server-managed information exposure.

Good practice: Servers are strongly discouraged from exposing information about resources to recipients that are not properly authorized.

Developer translation: information about resources exposed to recipients anywhere in the system should not be more than the information exposed to them in a 401/403 response when those resources are requested.


Second: client-managed data is typically either part of resource description or associated resource targeted with the describedby relation (or some other specific type.) The first option goes with self-describing documents, the second option is for general annotations. So including a human-readable label for a resource (including a container) is in the description of the resource itself. We might want to consider if statements including certain client-managed properties in contained resources should be exposed in container's associated resource. I suggest we hold off a bit on solving this until other stuff is in place because it is "complicated". So, coming back to "an agent with Control can control access to the metadata": I don't see a clear need for this just yet.


Consideration:

Principle: Separation of resource management.

Good practice: When defining new resource types, avoid inconsistent resource management by not allowing both servers and clients to update different parts of the resource state.

It would definitely simplify access controls. Unfortunately that ship has sailed for some of the existing stuff e.g., a container can include both server and client-managed data. Solid could come up with a new requirement that makes a container non-updatable by clients. That would however restrict simple things like human-readable labels for containers in a self-described way. So, it would really have to be property-based e.g., client can update specific properties like rdfs:label but not ldp:contains, dcterms:creator.
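A property-based rule of that kind could look roughly like the following Python sketch. The split into server-managed and client-writable predicates is taken from the example in the paragraph above (rdfs:label versus ldp:contains and dcterms:creator); the function name and the triple representation are hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"

# Assumption for illustration: predicates only the server may write.
SERVER_MANAGED = frozenset({LDP_CONTAINS, DCTERMS_CREATOR})


def rejected_triples(client_update: Iterable[Triple]) -> List[Triple]:
    """Return the triples in a client update that touch server-managed predicates."""
    return [triple for triple in client_update if triple[1] in SERVER_MANAGED]


update = [
    ("/photos/", RDFS_LABEL, "Holiday photos"),
    ("/photos/", LDP_CONTAINS, "/photos/fake.png"),
]
blocked = rejected_triples(update)
```

In this sketch the label change would be accepted and the `ldp:contains` triple would be reported back as not allowed; how a server signals that to the client is left open.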


Can we put more effort into having specific answers to the questions/issues raised in the first comment and elsewhere before calling "consensus"?

elf-pavlik commented 2 years ago

Principle: Separation of resource management

Good practice: When defining new resource types, avoid inconsistent resource management by not allowing both servers and clients to update different parts of the resource state.

It would definitely simplify access controls. Unfortunately that ship has sailed for some of the existing stuff e.g., a container can include both server and client-managed data.

I already mentioned it in https://github.com/solid/authorization-panel/issues/253#issuecomment-910293611 and @kjetilk's response confirmed the 'ship has sailed' status. Did we really pass the point of no return on that? If we ever want to reconsider the decision to mix server- and client-managed triples in a container description, it might be worth giving it another evaluation now and weighing all the pros and cons that have come up since that decision was made.

Having those two separate might be a much better starting point. If multiple requests (even with HTTP/2 broadly available) are a disadvantage, dataset-based content types could nicely combine the two into a single response. I think having server-managed statements as an auxiliary resource would make the overall design more consistent, while still keeping paths open for various optimizations, especially since, to my understanding, we assume resource-level access control.

acoburn commented 2 years ago

Even though we have historically included containment triples in the container resource itself, I can see huge advantages to putting all the server-managed containment triples into an auxiliary resource. I would be especially interested in the possibilities that structure enables for paging and sorting.

I realize that such a change would significantly affect server implementations and client tooling, but we should at least consider that possibility. If the specifications do change in this direction, making the change now would be much less disruptive than making that change after the ~TR stage of the protocol spec.