solid / specification

Solid Technical Reports
https://solidproject.org/TR/

Act on the latest version of the resource state #91

Closed michielbdejong closed 3 years ago

michielbdejong commented 4 years ago

Just had a chat about this with @acoburn. I think there are a number of points where a Solid server should support concurrency of HTTP requests. A naive but valid way of dealing with this is to queue up all HTTP requests in a single queue at the gate, and only ever execute one at a time, to completion, before starting the next one. But of course that would be a bottleneck, and most server implementations will probably want to do better than that. So another model is to have one such queue per resource, instead of a single queue for the whole server. Or maybe one queue for each pod / user.
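As a rough sketch (illustrative only, not from any implementation discussed here) of the "one queue per resource" model: requests for the same path are chained onto a per-path promise so they run one at a time, while requests for different paths proceed concurrently. `handlePut` is a hypothetical handler.

```typescript
// Illustrative per-resource request queue: one promise chain per path.
const queues = new Map<string, Promise<unknown>>();

function enqueue<T>(path: string, task: () => Promise<T>): Promise<T> {
  const previous = queues.get(path) ?? Promise.resolve();
  // Chain the new task after whatever is already queued for this path;
  // swallow earlier failures so the queue keeps draining.
  const next = previous.catch(() => undefined).then(task);
  queues.set(path, next);
  return next;
}

// Usage (handlePut is hypothetical):
// enqueue('/foo', () => handlePut('/foo', body));
```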

One restriction we should probably specify is that the server should only report success if the authoritative state of a resource was deterministically updated. It doesn't mean that all cached copies have to be updated before the write request is completed, but they do need to update to the new authoritative state eventually.

Reads are pretty easy, they don't necessarily need to report the latest state of the server, as long as they report a state that is consistent with itself. So for instance if I PUT /foo, then PUT /bar, then GET /, then <> ldp:contains </foo> is ok even though it's outdated, but <> ldp:contains </bar> is incorrect.

There needs to be a single queue of operations that affect a given container; they can be:

There are also requests that affect more than one resource. I think with the deprecation of globbing and the rejection of recursive deletes, the remaining ones will be:

Note that mkdir-p might also conflict with PUT of a non-container.

For the same non-container resource, if PUT, PATCH, and DELETE requests can be handled concurrently, instead of one-by-one from a single queue, that could greatly increase performance.

I think it's inevitable to introduce some concept of subtree locking to support all of this including mkdir-p. If we drop the requirement for server-side mkdir-p (move it to the client, basically), then concurrency becomes easier, and except for container delete (see below) you only ever have to lock either one resource, or two: a resource and the containment triple for it, in the container that directly contains it. Note that you don't have to lock the whole container, just the triple that pertains to the contained resource in question.

This leads to an attractive property: if two requests affect a different URL, then they can be handled concurrently without coordination.
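A minimal sketch of that idea, assuming an in-process lock table keyed by strings (not any particular server's design): a write to a resource takes two lock keys, one for the resource itself and one for the single containment triple in its direct parent, acquired in sorted order to avoid deadlock.

```typescript
// Derive the two lock keys for a write to `resourcePath`.
// Only the containment triple for this resource is locked,
// not the whole parent container.
function lockKeysFor(resourcePath: string): string[] {
  const parent = resourcePath.replace(/[^/]+\/?$/, ''); // '/a/b' -> '/a/'
  return [
    `resource:${resourcePath}`,
    `containment:${parent}#${resourcePath}`,
  ].sort(); // acquire in sorted order so concurrent writers never deadlock
}

// Two requests for different URLs produce disjoint key sets and can run
// concurrently without any coordination.
```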

For container deletion, we could say: if the DELETE of a container starts at a moment when that container is not empty, then the server may refuse the request.

If the DELETE of a container starts at a moment when that container is empty, then if any concurrent requests create resources that make it non-empty, the delete should overrule those new creations, and in that sense act as a recursive delete for those resources, without taking into account any ACL restrictions. So basically, if a POST and a DELETE happen concurrently, both may report success to the client, but the DELETE will prevail.

Or if we don't make those two changes to the spec (removing the mkdir-p feature and relaxing the situation around container-deletion) then we need to state in the implementation guide that all POST, PUT and DELETE requests should lock not only the resource but also the containment triples they affect.

@acoburn I think that covers all the points of concurrency in the current spec, did I miss anything?

acoburn commented 4 years ago

I would just add that concurrency over a distributed system leads to a lot of complexity, and one should be very careful about mandating certain specific behaviors in normative text. For example:

Reads are pretty easy, they don't necessarily need to report the latest state of the server, as long as they report a state that is consistent with itself. So for instance if I PUT /foo, then PUT /bar, then GET /, then <> ldp:contains </foo> is ok even though it's outdated, but <> ldp:contains </bar> is incorrect.

I would not agree with that assertion. A server could implement the propagation of containment triples asynchronously and in an async system, strict ordering is not guaranteed (I wrote such a system a while ago). So it would be entirely possible to see <> ldp:contains </bar> in the container before <> ldp:contains </foo> appears.

I think it's inevitable to introduce some concept of subtree locking to support all of this including mkdir-p

Mandating locking at the specification level would be highly problematic for any sort of distributed/high-performance system.

kjetilk commented 4 years ago

This sounds like a good topic for a master's or a Ph.D. paper to me...

michielbdejong commented 4 years ago

Mandating locking at the specification level would be highly problematic for any sort of distributed/high-performance system.

Well, not mandating it directly, but would you agree with the statements that 1) we're currently mandating mkdir-p (unless there has been a decision to deprecate that?) and 2) that this probably requires some sort of semaphores or locking, right?

dmitrizagidulin commented 4 years ago

@michielbdejong I'm probably missing something obvious - why does mkdir -p require locking?

michielbdejong commented 4 years ago

propagation of containment triples asynchronously

Interesting, OK, I can go with that. Especially if all the containers exist, it's easy. The only difficulty then is probably container deletion: what if you simultaneously delete an empty container and also add an item into it?

dmitrizagidulin commented 4 years ago

@michielbdejong

What if you simultaneously delete an empty container and also add an item into it?

So that question often comes up in any sort of eventual consistency conflict resolution system (like CRDTs). And basically, the policy there is - your conflict resolution strategy explicitly states what happens. Like, "if a delete and an add conflict, the add wins". (which results in less information loss than the other way around, incidentally.)
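A toy sketch of an "add wins" policy for a container's member set, in the spirit of an add-wins set CRDT (real CRDTs track causality with tags or vector clocks; this only shows the resolution rule):

```typescript
type Op =
  | { kind: 'add'; member: string }
  | { kind: 'remove'; member: string };

// Apply a batch of concurrent operations: removes first, then adds,
// so a concurrent add of the same member wins over its removal.
function resolveAddWins(concurrent: Op[], members: Set<string>): Set<string> {
  const result = new Set(members);
  for (const op of concurrent) if (op.kind === 'remove') result.delete(op.member);
  for (const op of concurrent) if (op.kind === 'add') result.add(op.member);
  return result;
}
```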

michielbdejong commented 4 years ago

@dmitrizagidulin suppose on an empty domain, you simultaneously add /foo and /foo/bar.

acoburn commented 4 years ago

we're currently mandating mkdir-p

In the 0.8 solid spec document perhaps, but it is an open question whether such behavior would be required for the 1.0 document. See #68

that probably requires some sort of semaphores or locking right?

Not necessarily. The basic problem here is that mkdir -p is not RESTful (multiple resources are changed with a single operation), and so the question relates to atomicity. Some backend systems can guarantee that level of atomicity across multiple resources (e.g. a database transaction). Other systems simply do not have that feature, so then the issue becomes one of conflict resolution.
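For illustration, the "database transaction" option for an mkdir -p style creation might look roughly like this; `Db`, `Tx`, and `ensureContainer` are hypothetical, and the point is only that all intermediate containers are created atomically or not at all.

```typescript
interface Tx { ensureContainer(path: string): Promise<void>; }
interface Db { transaction(fn: (tx: Tx) => Promise<void>): Promise<void>; }

// Create /a/, /a/b/, /a/b/c/ in one atomic unit.
async function createContainerHierarchy(db: Db, path: string): Promise<void> {
  const segments = path.split('/').filter(Boolean); // '/a/b/c/' -> ['a', 'b', 'c']
  await db.transaction(async (tx) => {
    let current = '/';
    for (const segment of segments) {
      current += `${segment}/`;
      await tx.ensureContainer(current); // no-op if it already exists
    }
  });
}
```

Backends without multi-resource transactions would instead fall back to the conflict-resolution approaches discussed above.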

Regarding https://github.com/solid/specification/issues/91#issuecomment-539530492, there are ways to handle that situation that do not lead to errors. E.g. if /foo is not an LDP-C, then ipso facto it cannot contain ldp:contains triples.

michielbdejong commented 4 years ago

"if a delete and an add conflict, the add wins"

OK, but what if you do POST /foo/ and DELETE /foo/ simultaneously, and both requests had an If-Match header for the same ETag? Then the server would be seen to execute the POST in violation of that condition. That's not the behaviour I would hope to see from a server.

michielbdejong commented 4 years ago

Wait, wow, so it sounds like you're both saying that we should allow servers to basically report success on all requests, even if some of them end up just not having been executed in the post-hoc view. I had always thought we wanted to be much stricter than that.

acoburn commented 4 years ago

Best case scenario: if the persistence layer handles conflict resolution (there are many potential mechanisms for this), then the first request would succeed and the second would fail.

Worst case scenario: you end up with a disconnected graph.

michielbdejong commented 4 years ago

Oh wow, that's definitely a much lower bar than the one I would like us to set. Good that we're having this discussion, we should at least make it explicit in the documentation whether clients can expect any kind of guarantee or whether it's all on a "best effort" basis.

To make it specific, let's imagine:

We could allow the server to create two contradicting universes, in which both requests succeed, and then later silently destroy one of those universes. So both would get a 2xx response, but when they check later, they may find out that their operation didn't have the desired effect at all. So despite the 2xx response that was given at the time, the change was later deselected. I call that erasing things from the history books, and I think a Solid server should not be allowed to do it.

michielbdejong commented 4 years ago

the first request would succeed and the second would fail.

That scenario is fine, and I think it's what the spec should demand.

dmitrizagidulin commented 4 years ago

@michielbdejong This argument is essentially "do we require strong consistency" in our servers (which would dictate some very specific architectural choices), or do we allow eventual consistency. (And I definitely think it should be the latter.)

michielbdejong commented 4 years ago

but at least leave the requests pending until we know whether they succeeded, right?

michielbdejong commented 4 years ago

Note that with offline-first you would still have client-side cached copies that would sync with the single server-side copy in a CRDT-like way. Just that inside the pod, there should be no room for confusion.

michielbdejong commented 4 years ago

See also http://unhosted.org/practice/28/Synchronizing-browser-storage-with-server-storage.html

dmitrizagidulin commented 4 years ago

Again, though, what you're describing is just one conflict resolution strategy (and others are possible). What you're describing is something like "when there is an un-resolvable conflict, let the timestamp win (and if the timestamp is the same, pick one at random)". But there are other valid resolution strategies (which should be up to the server operator).

dmitrizagidulin commented 4 years ago

(Excellent unhosted link, btw, very cogent analysis. I assume you wrote that?)

michielbdejong commented 4 years ago

Thanks! Yes. I particularly wanted to refer to the star-shaped versioning where the server keeps a non-regressing history. If it helps, we could say the server acts as a ledger at the document level (although we don't need to require it to act as a single ledger at the server level)


kjetilk commented 4 years ago

With regard to LDP's requirement of manipulating several resources (a feature that has always rubbed me the wrong way): I'm possibly not thinking clearly enough, but isn't the containment just a temporally varying membership function? Wouldn't it be helpful to not think of it as materialized, but only computed upon request? Wouldn't that relieve us of the burden of atomically updating containment triples when members are deleted or added?

acoburn commented 4 years ago

@kjetilk that is precisely how Trellis tends to implement containment triples: they are dynamic and materialized only on request.

michielbdejong commented 4 years ago

@acoburn but there are backends that will make Trellis alter the request history post-hoc, right? So that's a problem if we want to prohibit that.

michielbdejong commented 4 years ago

E.g. decide post-hoc that a container deletion never happened, even if at the time a 2xx response was given.

acoburn commented 4 years ago

@michielbdejong that is not accurate. The state of a resource is strictly linearizable, though that state may not be immediately available to all clients: hence, eventual consistency.

michielbdejong commented 4 years ago

"state may not be immediately available to all clients" -> that's ok if the request hangs. It's not ok if the request is responded to by the server (other than a response that says service unavailable). Specifically, a 2xx response to a container DELETE request is not ok unless that delete operation really makes it into the resource's canonical linear state history.

michielbdejong commented 4 years ago

For read it's ok to act on outdated state. For write it is not.

TallTed commented 4 years ago

This discussion is re-discovering wiki edit conflicts, and source code change management, and ACID-style databasing, among other things. As developers in most of these areas have discovered, no single answer works for all applications, even within a single usage space.

For read it's ok to act on outdated state. For write it is not.

This is true for some usage scenarios, certainly not for all.

There are a few commonly understood isolation levels.

Typically, though the server may have a default isolation, it's best to let the user modify that. This might be done with a session switch (which might persist until changed), or an interactive decision on each collision, some mix of these, or some entirely different method.

michielbdejong commented 4 years ago

Good points! I think at the very least we want the spec to prescribe how the server should deal with outdated state.

So then the question is what should it prescribe. My proposal is it should prescribe what boils down to "For read it's ok to act on outdated state. For write it is not."

I really like your analysis there, but I'm a bit hesitant: your proposal might be more versatile, but it is also more complex than mine. For instance, when there is 1 Solid pod and n Solid apps accessing it, who would you say is the 'user' who can modify isolation levels and/or take interactive decisions?

TallTed commented 4 years ago

@michielbdejong - When there is 1 DBMS instance, and n ODBC/JDBC/OLE DB/dotNET/native apps accessing it, each app (and sometimes each user's session of that app) can modify isolation levels and/or take interactive decisions. This is application logic, not data-store logic.

Yes, my proposal/suggestion has some complexity. Simple implementations have limited utility, and either become more complex over time, as usage and demands increase, or cannot deliver everything that users/developers want (including many things discussed in this and neighboring threads).

acoburn commented 4 years ago

So then the question is what should it prescribe

I would urge caution here. As has been described already, concurrency is complicated (especially in distributed environments) and there is no single concurrency model that fits every usage scenario.

If a particular concurrency model is prescribed by the Solid specification, any normative text should consider at least:

michielbdejong commented 4 years ago

We would be prescribing consistency requirements, but yes, the multi-client concurrency model would be roughly implied by (and have led to) those requirements.

Regarding testing, it will be hard to catch all glitches, but the least the test suite could do is send a lot of concurrent conditional POST and DELETE requests, and see whether any of the POST requests are ever successful while at the same time a DELETE request is also successful.

If some backend technology stacks are not suitable for building a Solid server, then that may be a reason to lower the bar, yes.

I don't think it will affect horizontally-distributed implementations if you shard by pod (or more generally, by container subtree). If your goal is to build a scalable pod hosting service and you don't shard by pod, then you're probably doing it wrong. :) We should definitely add some implementation guidelines to help server developers on their way with that.
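A sketch of what "shard by pod" could mean in practice, assuming one pod per origin (the URL below is made up for illustration):

```typescript
import { createHash } from 'node:crypto';

// All resources in the same pod map to the same shard, so per-pod
// request ordering can be handled by a single node.
function shardFor(resourceUrl: string, shardCount: number): number {
  const podRoot = new URL(resourceUrl).origin; // one pod per origin, by assumption
  const digest = createHash('sha256').update(podRoot).digest();
  return digest.readUInt32BE(0) % shardCount;
}

// shardFor('https://alice.pods.example/notes/todo.ttl', 16) routes to the
// same shard as every other resource in alice's pod.
```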

But here too, if we lower the bar then it becomes easier to build a server that meets our requirements. So in particular, yes, there may be external reasons not to shard by container. For instance, because the backend has additional interfaces apart from the Solid one, and it needs to be queryable across pods. Maybe you have an external requirement that makes it very attractive to shard alphabetically by resource filename, or even to store the different triples from a single document on different server instances. Such requirements would definitely be easier to combine if we lower Solid's own requirements.

But will enough be left if we set Solid's own bar so low? I think it would be quite a burden on the client to deal with such a fundamental lack of consistency. For instance, I think it was @dmitrizagidulin who proposed the option where, in case of conflict, a POST always wins over a DELETE. But what if the POST request only had append access? Should it then be allowed to undo the folder deletion, even if it doesn't have permission to re-create it? And what about If-Match headers? If the POST request was conditional, but gets applied post-hoc, wouldn't that be a violation of that condition?

kjetilk commented 4 years ago

As far as I can tell, there are a number of different issues here.

In the completely generic case where there is a requirement to manipulate several resources atomically, I think we will not address that in the current spec, as that was a subject for WebDAV, and saw the introduction of the LOCK and UNLOCK verbs, etc. There exist papers that analyze the architectural styles used in the construction of WebDAV, DeltaV, Subversion, etc. that we could look at, but that I think is well beyond the scope of the current specification effort. Thus my initial comment on this being a fine master's :-)

More constrained use cases that can be addressed with etags, etc in line with this old note is something we could address.

The case of mkdir -p has become much clearer with the proposed resolution of #98, as we have clarified that Solid is strictly hierarchical. Things become much easier by shifting the frame of mind from "create a container hierarchy" (which consists of several steps to create stuff and thus the requirement of atomic updates) to "enumerate which containers exist when needed".

Once the hierarchy is strict, then we know that if /foo/bar/baz exists, then /foo/bar/ is a container, and dereferencing it will return the containment triple

</foo/bar/> ldp:contains </foo/bar/baz> .

(make other relative URIs as appropriate :-) ).

The containment hierarchy emerges as a side effect (which is OK with HTTP, but not pretty when considered in terms of REST), but it is also required by the strict interpretation of the hierarchy.

The good thing about it is that it resolves a lot of the problems of atomic updates: it isn't a list of write operations that has to be done atomically, but a temporally varying membership function, where the membership is determined when a container is dereferenced.

The simplest implementations may therefore decide to never materialize the containment triples and then will not need to be concerned with atomicity. More advanced implementations may want to materialize them anyway, but for them, atomic updates of the containment triples are an implementation detail they will have to resolve themselves.
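As a sketch of the "never materialize" option: containment triples can be computed on GET from the set of stored resource paths; `storedPaths` here stands in for whatever path index the backend keeps.

```typescript
// Compute the ldp:contains triples of a container from stored paths.
function containmentTriples(containerPath: string, storedPaths: string[]): string[] {
  return storedPaths
    .filter((p) => p !== containerPath && p.startsWith(containerPath))
    // keep only direct children: exactly one path segment below the container
    .filter((p) => /^[^/]+\/?$/.test(p.slice(containerPath.length)))
    .map((p) => `<${containerPath}> <http://www.w3.org/ns/ldp#contains> <${p}> .`);
}

// containmentTriples('/foo/bar/', ['/foo/bar/baz', '/foo/bar/qux/', '/foo/bar/qux/x'])
// -> triples for /foo/bar/baz and /foo/bar/qux/ only.
```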

I think this strict interpretation that we proposed in #98 simplifies this field a whole lot.

michielbdejong commented 4 years ago

Very insightful! So given that a server may choose between generating containment triples either on read or on write, what I'm interested in is how information would propagate through a cluster of server nodes. i.e. the node handling a certain request may not have received the latest information yet from other nodes in the cluster, and may be acting on outdated data.

Is it allowed for a server to act on outdated data while producing a container representation? And is it allowed for a server to act on outdated data while processing a container deletion?

kjetilk commented 4 years ago

Is it allowed for a server to act on outdated data while producing a container representation? And is it allowed for a server to act on outdated data while processing a container deletion?

Yeah, so they are orthogonal questions to the hierarchy discussion. I think we need to address them, and I'm open either way. Perhaps it is a best-practice question?

michielbdejong commented 4 years ago

That matches what @csarven told me in Ghent: it is best practice, but a server MAY respond with a 202, and it's up to apps to deal with that. In fact, according to Sarven, apps should already deal with servers that may respond with 202, since the HTTP spec offers it and the Solid spec doesn't prohibit it (if I'm conveying his words correctly).

kjetilk commented 4 years ago

That's a good point.

OTOH, a case could also be made that, since it is simple to achieve atomicity and consistency in the simplest case, the burden of ensuring ACI(D) for the larger clustered server implementations would not be an undue one.

michielbdejong commented 3 years ago

There is a technique which one could call 'ETag-chaining' (there may be other names for it, I don't know), where in each request you set an If-Match request header, matching the ETag response header from the previous request. If a request creates a resource that previously didn't exist, you set If-None-Match: *. With this technique, a client can be sure that if another client is attempting to edit the same resource, one of them will receive a 412.
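A sketch of that ETag-chaining flow with fetch, assuming a server that honours If-Match and If-None-Match on writes (the URL and bodies are placeholders):

```typescript
async function createThenUpdate(url: string, first: string, second: string) {
  // Create only if the resource does not exist yet.
  const create = await fetch(url, {
    method: 'PUT',
    headers: { 'Content-Type': 'text/turtle', 'If-None-Match': '*' },
    body: first,
  });
  if (create.status === 412) throw new Error('resource already exists');

  // Update only if nobody changed the resource since we created it.
  const etag = create.headers.get('ETag');
  const update = await fetch(url, {
    method: 'PUT',
    headers: { 'Content-Type': 'text/turtle', ...(etag ? { 'If-Match': etag } : {}) },
    body: second,
  });
  if (update.status === 412) throw new Error('concurrent edit detected; refetch and retry');
}
```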

This technique will only work with servers that act on the latest version of the resource state, right?

For servers that only offer eventual consistency, does a similar method exist that clients can use to avoid overwriting edits of other clients?

And is there a way for a client to know whether the server they are interacting with can guarantee the intended safety guarantees of ETag-chaining?

elf-pavlik commented 3 years ago

MDN has decent documentation on it:

Avoiding the lost update problem with optimistic locking
[sequence diagram from the MDN article]

Dealing with the first upload of a resource
[sequence diagram from the MDN article]

kjetilk commented 3 years ago

The tricky part is when you're not quite as fortunate as they are in the MDN article :-) The behavior there follows from our requirement to support conditional requests, but modify the latter diagram to the situation (mermaid doesn't work in comments, right?) where Client 2 sends their request before the server has had the chance to return 201 to the first client. Then what? Do you linearize, do you let the first win, or the last win? How do you determine who's first and last?

I'm not sure we are ready to tackle that question on a spec level, we need more implementation experience.

elf-pavlik commented 3 years ago

mermaid doesn't work in comments, right?

It doesn't. I sometimes just open the mermaid live editor, take a screenshot and paste it in a comment, optionally also adding a link to the live editor version of the diagram if someone wants to modify it.

Client 2 sends their request before the server has the chance to return 201 to the first client. Then what? Do you linearize, do you let the first win, the last win? How do you determine who's first and last?

Do you think the spec should address this specific case? I think this could be left to implementation. As long as one request gets a 201 and the other one a 412, implementations could decide which one succeeds.

michielbdejong commented 3 years ago

Indeed, there is no way for Client 2 to know if Client 1 has finished yet. In a single-client world, it is OK to ask the one client not to start a new request if the previous one is not finished yet.

Note that NSS, PSS and Nextcloud all support @kjetilk's concurrent scenario as well.

I'll make a note in the test suite that not all servers support multiple concurrent clients, since the spec does not (yet) require it.

kjetilk commented 3 years ago

Yes, I think it should be left to implementation, but implementations should document carefully how they behave.

Personally, I would have implemented it by locking the resource until the change was committed and returning 412 in that period, but there are many different considerations, so it makes sense to leave it to implementation. We should consider adding a note about it though.