inrupt / wac-ldp

A central component for Solid servers, handles Web Access Control and Linked Data Platform concerns.
MIT License
12 stars 5 forks source link

Concern about conflating WAC and LDP #30

Closed pmcb55 closed 5 years ago

pmcb55 commented 5 years ago

Both myself and Aaron Coburn have concerns about explicitly coupling WAC and LDP into a single component/module/repo. Basically I just don't see the need or advantage in doing so (an answer of 'performance' would I think be a classic example of premature optimization, so I assume there may be other reasons...?)

I think Aaron expresses it perfectly in relation to Trellis:

In the OAuth2 model, Trellis is a pure and simple "resource server", so it assumes that authN happens externally. I tend to be a big fan of externalizing an IdP, because it means that you can (1) federate authN over multiple providers, (2) choose your own authN stack and (3) avoid storing user secrets. That said, having a particular IdP system for testing and/or deployment could be good. Just as an example, for cloud-based deployments, I'd want to make sure that Trellis can fit right into whatever provider's IAM system already exists (e.g. AWS Cognito). And for enterprise deployments, companies will have strong opinions about particular IdP tech.

michielbdejong commented 5 years ago

Thanks for bringing this up Pat, I can see your point, so let's explore what it would take to split this module into two modules, one for just wac, and one the rest.

Suppose we would start by splitting out the src/lib/auth/ folder to a separate git repo / npm module, would that be an improvement, you think?

The reason I hadn't done so thus far is (I think) obvious when you look at what the interface of that WAC module would be; four functions for which it feels unlikely that anybody apart from this module would reuse them:

function determineWebId (bearerToken: string, audience: string): Promise<string | undefined>
function readAcl (resourcePath: Path, storage: BlobTree)
function determineAllowedModesForAgent (task: AgentCheckTask): Promise<AccessModes>
function determineAllowedModesForOrigin (task: OriginCheckTask): Promise<AccessModes>

This is the 'auth' component in https://github.com/inrupt/wac-ldp/blob/wac-parts-rebased/README.md#code-structure

michielbdejong commented 5 years ago

Another option would be to basically use a http server as the storage, which was the idea we started out with 6 weeks ago, and which we iterated on several times to arrive at the current architecture.

The reason we departed from that idea was that there are a number of things that need to live "above" the storage:

Things we could move from core to storage:

And by that time, we would have split this module into two on a rather arbitrary line, because both resulting modules would be aware of some business logic concerns but not of other. A storage layer should not be aware of ACLs, but then by that same line of thinking, it should probably not be aware of ETags either, right?

To make the http server more useful we could add a custom header so that it does execute the PATCH command but can be instructed to refuse non-append operations, that would definitely help to make the split-line between the two halves more reasonable.

michielbdejong commented 5 years ago

I'm willing to do this refactor if there are enough requests for it, and if there are not too many other refactor requests from people, so will wait for a bit to see what else comes in, and then decide. Again, thanks for submitting, Pat!

RubenVerborgh commented 5 years ago

See also https://github.com/solid/solid-architecture/blob/master/server/request-flow.md for data-level separation and interface between auth and LDP.

dsebastien commented 5 years ago

I also agree about the need to separate those concerns.

Just a few ideas around this:

I think that Ruben's request flow can be a good basis for the layering of the architecture. Each step in the request handling flow could potentially be handled by a separate logical layer in the system, communicating through clearly defined interfaces.

Actually to me, step 1 and 2 can be part of the same logical module, but in any case I think it is important to leave HTTP and work only with internal data structures as soon as we leave those upper layers. This is beneficial because then any other external facing interface that we could later plug next to the HTTP layer could work in the same way.

After step 2, we can already pass down an internal data structure, independent of the incoming HTTP request. This structure is still conceptually a "command" in the sense that it hasn't been accepted yet by our system.

Once step 3 is passed, then the command has been accepted and is passed down to the persistence layer.

Once step 4 is done, the command has been executed (successfully or not) and we can move back up the layers and respond to the initial request (back to HTTP and out).

While moving upwards through the layers, we're actually returning events stating what has happened with the command that was just processed, not having to know what/who will care.

BTW, while moving downwards we can also generate events for auditing/logging purposes (e.g., incoming request received in step 1, parsing failures in step 2, authc checks in step 3, etc).

We can also make the architecture modular by declaratively configuring the chain (e.g., through DI/IoC), a message bus/queue,etc.

For example, the "module" handling step 1 will need an implementation of some PersonalDataStoreRequestHandler. In step 2, the PersonalDataStoreRequestHandler will need an AuthenticationService and an AuthorizationService. It can use those to perform the necessary checks before passing the command to the PersonalDataStoreService.

With this in place, different implementations of each layer could be plugged in easily without impacting the system. Each could integrate with external systems if needed (e.g., the authc/authz parts).

We could very well imagine having a dedicated package for each of those layers and a Solid server just assembling those by configuring the dependency chain.

michielbdejong commented 5 years ago

Yes, that's pretty close to what the current implementation looks like. All http-api concerns are in the src/lib/api/http/ component, and auth concerns are in the src/lib/auth/ folder. The internal "command representation" is WacLdpTask. See https://github.com/inrupt/wac-ldp/tree/wac-parts-rebased/src/lib

dmitrizagidulin commented 5 years ago

I also think that the separation of the two concerns into just separate folders (rather than separate repos) is sufficient.

Cause those are two separate issues, right. One is - should the layers be cleanly separated, architecturally? Absolutely (barring the couple of annoying parts where they mix slightly, like the glob or PATCH semantics). (And that includes the ability to turn off the access control part, and just use the LDP functionality.)

And the second one is - should that separation be modeled as "keep the code in separate repos", or will separate folders be sufficient? And there, I definitely think that at this stage, separate folders is enough. Splitting them out to a separate library may make sense some time in the future (when multiple libraries will want the ACL-handling stuff)? But for the moment, will just make changing code and making releases unnecessarily difficult (we ran into that with the multiple auth modules in NSS).

michielbdejong commented 5 years ago

Great! So that would mean we can keep the current code structure, then. @RubenVerborgh and @pmcb55 do you agree?

RubenVerborgh commented 5 years ago

No concerns from me. I mentioned in another issue (which I can't find) that the interfaces between the modules are much more important than their code structure, so similar to what @dmitrizagidulin remarks above.

pmcb55 commented 5 years ago

So I guess I'm fine with just separating the clearly different concerns of WAC and LDP by placing their source code in different folders, but really my underlying concern is reflected in things that I'm seeing directly in the code, such as the use of interfaces named 'WacLdpResponse' and 'WacLdpTask', or a property named 'wacLdpTaskType' (all of which are referenced in HttpParser (https://github.com/inrupt/wac-ldp/blob/master/src/lib/api/http/HttpParser.ts)).

In other words, how is an interface named 'WacLdpResponse' demonstrating a clear interface separation between WAC and LDP...? In other words, if I want to use this code to host just an LDP server without any notion of WAC, would I still need to be working with 'WacLdpResponse' instances in my code?

And even more generally, why is a 'HTTP Parser' class even aware of either LDP or WAC (which I would consider far 'higher-level' concepts? i.e. shouldn't it just be parsing a HTTP request into an object that only represents the things a HTTP request can contain (like the body, the headers, the query param, the cookies, etc.)...?

(Please forgive me if I'm totally missing a broader context - I'm just scanning the code quickly here :) !)

michielbdejong commented 5 years ago

if I want to use this code to host just an LDP server without any notion of WAC, would I still need to be working with 'WacLdpResponse' instances in my code?

Good point! No, WacLdpResponse is not exposed by the module, only the following interfaces are:

function makeHandler (storage: BlobTree, aud: string, skipWac: boolean): void
interface AccessCheckTask {
  path: Path,
  isContainer: boolean,
  webId: string,
  origin: string,
  wacLdpTaskType: TaskType
  storage: BlobTree
}

async function checkAccess (task: AccessCheckTask): boolean

async function determineWebId (bearerToken: string, audience: string): Promise<string | undefined>

class Path {
  constructor (segments: Array<string>)
  toString (): string
  toContainerPathPrefix (): string
  toChild (segment: string)
  isRoot (): boolean
  toParent (): Path
  hasSuffix (suffix: string): boolean
  removeSuffix (suffix: string): void
}

export interface BlobTree extends events.EventEmitter {
  getContainer (path: Path): Container
  getBlob (path: Path): Blob
}

export class BlobTreeInMem extends events.EventEmitter {
  getContainer (path: Path): Container
  getBlob (path: Path): Blob
}

enum TaskType {
  containerRead,
  containerMemberAdd,
  containerDelete,
  globRead,
  blobRead,
  blobWrite,
  blobUpdate,
  blobDelete,
  getOptions,
  unknown
}

parsing a HTTP request into an object that only represents the things a HTTP request can contain (like the body, the headers, the query param, the cookies, etc.).

We have that kind of object too, but it's http.IncomingMessage, which is part of NodeJS, and it's the input (not the output) of the HttpParser. So that part of the 'http parsing' (just like the https TLS handling etc.) is provided by NodeJS out of the box.

But what you're saying is a function like updateBlob (which executes a PATCH query on a blob) should take as its input an http.IncomingMessage object, and produce an http.ServerResponse object, instead of going from the more abstract objects WacLdpTask to WacLdpResponse? To me it makes sense to have the src/lib/api/http/ component handle all the interaction with the 'outside world of http', and for instance handling a request like:

GET /data/?query=SELECT%20*%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20.%20%7D HTTP/1.1
Host: example.org

Goes in 5 steps: step 1 (NodeJS native): read the raw http from the wire, into a http.IncomingMessage object, with method: 'GET' and url: '/data/?query=SELECT%20*%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20.%20%7D' step 2 (http-api component): parse from http.IncomingMessage to wacLdpTask, with taskType: TaskType.updateBlob and sparqlQuery: 'SELECT * WHERE { ?s ?p ?o . }' step 3 (operations component): load the current blob contents, parse content, execute sparql, store new state, report result in a WacLdpResponse with resultType: OkayWithBody and body: (the sparql query result) step 4 (http-api component): convert the WacLdpResponse into a http.ServerResponse with status: 200, headers: { ... }, body: '...'. step 5 (NodeJS native): write the ServerResponse onto the wire, in http format.

michielbdejong commented 5 years ago

@acoburn what are your thoughts on this? Since this issue started with:

Both myself and Aaron Coburn have concerns about explicitly coupling WAC and LDP into a single component/module/repo.

TL;DR: this module is split into components, and there is an auth component which can be seen as the 'WAC' part, but the 'LDP' part is split out over the core, api::http, and operations components.

The auth component only makes sense through the way it plugs into the core component, so that's why we think it doesn't make much sense to split it out into a separate repository.

If you want to do LDP without WAC, there's a SKIP_WAC environment variable that will just make it skip the call to checkAccess, and then the behaviour is just LDP without WAC.

If you want to do WAC without LDP then there's no real way to do that, because access checking only makes sense in the context of the various operations and the access modes they require. Access checking is done lazily, so if the LDP request only requires read, and an ACL rule that provide read access is quickly found, then the rest of the ACL rules are not parsed. Putting the access check into a separate step would mean that for each request (even an OPTIONS request, if you really want to separate it), it would have to determine which access modes are allowed, and pass all that information to the LDP layer, which would discard most of that information in many cases. This is especially expensive in the light of https://github.com/solid/web-access-control-spec/issues/43#issuecomment-488289734 so I would like to keep the current code structure, or if needed, only change it in a way that doesn't make things too horrible.

RubenVerborgh commented 5 years ago

Let's solve this at the architectural/interface level, with a sufficiently detailed architectural diagram.

acoburn commented 5 years ago

@michielbdejong the way I see it, there are three components:

  1. The LDP resource server
  2. The WebACL authorization module
  3. The Identity Provider

Assuming that the IdP is an entirely separate concern, possibly (likely?) running on a separate server with its own tech stack, it's really a question of the interaction between LDP and authorization. I see these two as separate concerns but with a lot of overlap.

First of all, an ACL resource (as exposed over HTTP) is -- or at least can be considered -- an LDP resource. And for those resources, it makes sense to me to offer the same interaction patterns that LDP defines for all of the other, non-ACL resources: PATCH, PUT, etc. The authorization algorithm also needs to know something about LDP containment: if a resource doesn't have an ACL, the parent container needs to be checked for an ACL, etc. Furthermore, those ACL resources will need to be persisted somewhere, and unless you choose to create an entirely separate persistence mechanism, which would have its own overhead, I have found it to be exceedingly convenient to store the ACL graph within the same persistence structure as the other resources (i.e. in Trellis, the RDBMS persistence stores ACL resources in their own table but uses the same database; likewise the triplestore backend uses the same triplestore but separate named graphs for ACL resources).

On the other hand, resource management and authorization are indeed separate concerns. What I do in Trellis is the following (and it relies heavily on the JavaEE (JAX-RS) features that come with the platform):

  1. Lowest level: there is a ResourceService that manages access and updates to resources; resources, at this low level, are partitioned (via the Java type system) into various named graphs, of which the ACL resource is just one (LDP containment is another, as is LDP membership). This layer knows nothing of HTTP

2a. WebAC layer: the authorization layer is a completely separate module, though it uses the ResourceService from above. Its interface is basically: given this resource IRI and this webID, what access modes are available. The implementation also uses an internal cache to make these lookups fast -- there are significant performance implications here.

2b. LDP layer: this sits adjacent to the WebAC layer, and it brokers all of the LDP interactions with resources using HTTP (GET, PUT, POST, PATCH, etc), including WebAC resources, so it, too, depends on the ResourceService.

  1. HTTP runtime: this brings together both the WebAC layer and the LDP layer. The WebAC layer is implemented as a separate "filter" in the JavaEE system, so it is executed before the LDP layer. Given an HTTP method, a WebID (this could be an unauthenticated user) and a resource IRI, this filter just determines whether the operation is permitted (true or false). If the operation is permitted, only then does the LDP layer do its thing; otherwise the entire request fails with a 401 or 403.

So, in Trellis, WebAC and LDP are separate modules, but they are very, very related. I can imagine that it would be possible to separate them more or bring them together more, but the structure that I'm using seems to work pretty well. And given that WebAC is an independent "filter" in the JAX-RS runtime, it is also really easy to just turn on or off.

michielbdejong commented 5 years ago

OK, fixed, the auth module is split out now with a well-defined Auth Interface, see https://github.com/inrupt/wac-ldp#wac-ldp

kjetilk commented 5 years ago

Yeah, so I'm noted as "con" and indeed, this is how I see it: There shouldn't be a "The Server", it should be something that can and will be pillaged for parts, even though it is complete. So, if I'm implementing Solid for "MyCode" framework, I wouldn't have to write the whole Solid thing, at least not at once. I could simply take the other parts. Perhaps I could do a ResourceStore, perhaps I have enough to just grab the WAC, and make some changes so that LDP is supported, and that's all. "MyCode" here could be Nextcloud for example. Or it could be my Attean-based implementation, or any of the gazillion things that people out there might want to do.

I wouldn't want to run the whole thing, I would want to install a well-defined package that only implements the stuff that I need. Implemented as a service, I would probably have to code an interface around it, implemented as a proxy, it would fit in my pipeline without modification. The interface I might need to make might even be that proxy.

So, indeed, I'd like a more radical separation of components.

michielbdejong commented 5 years ago

Yes, that is indeed a valid reason to be in favour of your Auth*n Proxy, above Ruben's Authorizer Module from https://rubenverborgh.github.io/solid-server-architecture/solid-architecture-v1-0-0.pdf

RubenVerborgh commented 5 years ago

See detailed discussion at https://github.com/RubenVerborgh/solid-server-architecture/issues/11.