ETag validity on composite resource

Dragomir-Ivanov commented 7 years ago

Lets suppose we have this request:

GET http://192.168.0.103:3001/api/clients/59a85126eaba9ebd0180ac55?fields=*,services,visits
If-None-Match: "11b22a28b07ddf482f8269050e9691d9"

If I execute this in a row, first time I get 200 then 304, which is as it is supposed to be. But then I change some field values of services and visits behind the scenes.

Then retry the query above, I still get 304, although the result of the query IS CHANGED.

ETag comparison has its place on updating concrete resources, but I think we should leave generation of ETags for compound resources, and for resouseLists to be generated in a middleware function executed after rest-layer.

rs commented 7 years ago

Do you mean *,services{*},visits{*}? Because without {} you should just get the object ids.

This is a very valid point. What about computing a composite ETag?

cc @smyrman

Dragomir-Ivanov commented 7 years ago

@rs services and visits are bound to clients by schema.Reference and not schema.Connection so *,services,visits results in an arrays of services:[] and visits:[] attached to the returned clients item.

Composite ETag can't be cached in the DB, so we need to recompute it on the fly, compare with the supplied ETag and return 304 if they match. Is all this work related to rest-layer at all, aren't we putting too many responsibilities on it? Other frameworks will do this in a middle-ware function executed after the frameworks.

I guess if we have some context ( but not Go context.Context) exported in the request, and we put some important values there to let the following middle-ware functions make decisions based on them. I am against context.Context because it is hierarchical, and adding values in a child ctx doesn't propagate to parent ctx.

Dragomir-Ivanov commented 7 years ago

Okay, I think now I am understanding why middle-ware path is not considered here. In Go http.ResponseWriter is single shot, and once you Write() there is no turning back. That is why Alice middle-ware can be used only for modifying the http.Request object, but not the response. Please confirm that my understanding is correct.

rs commented 7 years ago

Yes. That’s why you can set a custom ResponseSender on rest.Handler.

Computing a checksum of all the Etags shouldn’t be too expensive.

Dragomir-Ivanov commented 7 years ago

@rs Thanks, then customResponseSender is the way to go.

Computing a checksum of all the Etags shouldn’t be too expensive.

What do you mean by this? I thought the ETags are check-sums themselves. My thinking is computing ETag for composite and list resources will be the right way ( and is cheap, compared to network latency), and this might be done in the ResponseSender mentioned above. However we shouldn't enforce that, but maybe put an example for this particular case in the examples.

rs commented 7 years ago

Etag is an opac string. We can combine all etags of the (sub-)resources using the hash algo of our choosing to create a composite etag.

Why do you want to fix this issue outside of rest layer? It is a real problem that need to be addressed.

Dragomir-Ivanov commented 7 years ago

Well, now that it is clear we can't make post rest-layer middle-ware function to modify the response, we have no choice. I wanted because I have seen some Node frameworks doing in middle-ware functions, and it seemed self-contained problem.

Now, why would you want to combine all ETags, is it speed the only concern? Wouldn't be the same if you only calculate ETag (md5 hash) to the composite resource response? Combining multiple-ETags seems more work.

rs commented 7 years ago

Summing n string is less expensive than summing n maps with arbitrary number of fields (and subfields) stored as interface{}. We could checksum the JSON representation, it would be in-between in term of complexity but it would also be too late and tie the Etag computation to the representation which is handled by the rest package instead of resource package.

Functionally, I agree, both approaches are equivalent.

rs commented 7 years ago

Just looked at the code. Accessing the Etag during projection evaluation might be tricky actually. I think query.Projection.Eval should return a list of resources used during the evaluation for instance, so the caller can do additional work. I means that Etag comparison won't be able to happen before projection evaluation.

We have the same problem with If-Modified-Since and Last-Modified actually. In this case, we should return the min(resource, sub-resources).

Dragomir-Ivanov commented 7 years ago

What summing algorithm do you propose for Etags? Append all Etags and then do final md5 on the whole byte array, or XORing them all?

rs commented 7 years ago

I would go for md5. The xor route would be interesting but would require to either guarantee all Etags are same length or use padding. All rest-layer Etags are currently same size but it could change in the future or we may want to support pre-existing Etags with different format.

smyrman commented 7 years ago

Will the md5 hashed Etag be set in the header, while the raw MD5s are made available as _etag in each resource? Will the original Etag be available as a different header? How does the system respond if a resource is missing an E-tag, e.g. as discussed in rs/rest-layer-mongo#20.

One -case to be aware of:

A user fetches a resource with some other resource embedded GET http://192.168.0.103:3001/api/clients/59a85126eaba9ebd0180ac55?fields=*,services,visit
A user try to update the root resource if it has not changed PUT http://192.168.0.103:3001/api/clients/59a85126eaba9ebd0180ac55 If-Match X (X retrieved via 1)

rs commented 7 years ago

One solution would be to have a composite Etag with a separator like <root resource etag>-<composite projection etags>. The full Etag would be used by browser for caching, but we could strip the second part when the item is edited.

Dragomir-Ivanov commented 7 years ago

I think it is best if each resource brings its own _etag field just like in the itemList case. Then we will have ETags to update freely the root resource or each of the projected resources. I guess we will need to supply _updated as well. Also when PUT/POST-ing a resource, we can look for If-Match HTTP header, if non we can fallback to trying to find _etag field within the resource, and trying to use that. This will not help browser caching, but will prevent concurrent resource update, which is more important.

rs commented 7 years ago

I’m not a fan of poluting the payload with metadata.

Dragomir-Ivanov commented 7 years ago

Okay, but we are polluting the payload in itemList case, aren't we? Adding _etag in the referenced resources as well would service the same purpose. Pollution of every kind is not wanted, but can't we make this configurable, or better supplying enough of the data, so custom ResponseFormatter will do the job.

rs commented 7 years ago

What’s wrong with the composite etag described above?

Dragomir-Ivanov commented 7 years ago

There is one particular use-case, where you get root resource, and a referenced resource item/items. Then you want to PUT/PATCH a referenced resource item. In your proposal, we can't do that, because we don't have the ETag for this individual item, but just for the root resource and for all projected resource items Etag combined sum. If we want to update a referenced resource item then we will need to GET it again, then PUT. Not needing doing this is nice optimization, and will be in line of itemList case, where we have array of items, each with _etag field. In the root resource we also have array of referenced resources items, but without _etag field..

I understand that all this is optional, so maybe the best place to do this formatting is in the ResponseFormatter so the user can overwrite it.

smyrman commented 7 years ago

I suppose we have three or four cases for the composite e-tag:

list view : only composite etag ("root item?")
list view with embeds : another composite etag
detail view : only root item etag
detail view with embeds : root item and conposite e-tag

I suppose this will let you update the root-resource without a data race, but not an embedded sub-resource, e.g. if you want to fetch all resources through one big request with embeds, and update individual resources later without a re-fetch.

I’m not a fan of poluting the payload with metadata.

Neither am I. However, as @Dragomir-Ivanov points out, the rest package already include an "_etag" field in some cases. Why not do so consistently, and let that field be used for updates, and the header value E-tag for GET?

smyrman commented 7 years ago

Btw, making the header values e-tag of format <root resource etag>-<composite projection etags> when applicable is still nice in order to make it less confusing to the user why the E-tag in the header and payload sometimes do not match; at least he can see that the header E-tag actually includes the resource E-tag, and that it's of a longer format.

Dragomir-Ivanov commented 7 years ago

@rs Any thoughts on "polluting" the sub-resource payload with _etag?

rs commented 7 years ago

I have no better solution to propose so let’s do that :/

Dragomir-Ivanov commented 7 years ago

I was tinkering with the Etag for the composite resource, and it turned out that during the resource and sub-resource projection evaluation different parts of sub-resource projections can come from DB at different times, because some sub-resources are obtained with go-routines. This is a problem because for the same request, sometimes sub-resources can come in A,B,C order, but for other times they can come in B,C,A order, thus using md5.Hash() function producting different composite resource hasg/ETag, for exactly the same result. @rs Can you rethink your decision on using XOR of all ETags instead. It will produce the same result no matter the order.

smyrman commented 7 years ago

Can you rethink your decision on using XOR of all ETags instead.

I suppose that means that rs/rest-layer-mongo#20 would need to change to a compatible format?

sometimes sub-resources can come in A,B,C order, but for other times they can come in B,C,A order,

Have you actually observed this? From me just skimming the code, it looks like the order is pretty much guaranteed (unless they come in a random order from the Storer layer).

Expanded references are inserted at a given map field location, so evaluation order should not matter.

https://github.com/rs/rest-layer/blob/2c4c847a7e4d13e1245c93990665aaeb1a35cfe2/schema/query/projection_evaluator.go#L95

Connections are inserted at a given slice index:

https://github.com/rs/rest-layer/blob/2c4c847a7e4d13e1245c93990665aaeb1a35cfe2/schema/query/projection_evaluator.go#L121

smyrman commented 7 years ago

I suppose it is true that they can come from the DB at different times, but does that matter?

Dragomir-Ivanov commented 7 years ago

Maybe I am not understanding the code enough, but for composite resource's ETag we will need to hash with MD5: root resource ETag, sub-resource A ETag, sub-resource B ETag, etc. We will need to apply the MD5 hash in the same order, for us to have the same composite-resource ETag. We will need ether store&compute this hash after all sub-resources have been obtained from storage, or use algorithm where ordering doesn't matter (XOR). @smyrman The snippets above seem to be executed by separate go-routines, so there is no guarantees about their scheduling, and which finishes when. I may be wrong, but that is my understanding for now.

smyrman commented 7 years ago

We will need ether store&compute this hash after all sub-resources have been obtained from storage, or use algorithm where ordering doesn't matter (XOR).

Ok, I follow you now. I was assuming you would do the first, compute the hash after all sub-resources have been fetched. If you want to do the incremental hash while you are fetching resources, then yes, you would need to use another algorithm than MD5.

smyrman commented 7 years ago

From me just skimming the code, it looks like the order is pretty much guaranteed (unless they come in a random order from the Storer layer).

I was thinking about the order being consistent in the final JSON response, not in terms of evaluation order, which is what you ask for -- sorry for the confusion.

rs commented 7 years ago

In the light of this, xor is the way to go. We can handle different sizes using padding when necessary.

smyrman commented 7 years ago

We can handle different sizes using padding when necessary.

That only works if the non-conforment ID has a length shorter then the typical MD5 sub. Maybe better to just do a MD5-sum of any non-conferment IDs (length not == 32 chars) right before the XOR?

smyrman commented 7 years ago

Also, some final control questions before we dive into XOR.

What is the colission rate for using XOR for hash combinations. Is it significant?
XOR mapps identical hashes to zero, e.g. XOR(a,a) and XOR(b,b) both result in zero. Presumably this may be an issue for an object that potentially maps to the same object twice: {"author": "a", "commiter": "a"}, {"author": "b", "commiter": "b"}. Most likely though, it would still be a case of XOR(c,a,a) != XOR(d,b,b)
As the order is lost in XOR, is there any plausible situation where the order in the response actually matters, so that an unordered algorithm becomes inconsistent per unique URL including query parameters? I can't think of any myself.

Dragomir-Ivanov commented 7 years ago

Here is some discussion on the topic: https://crypto.stackexchange.com/questions/1368/is-it-a-good-idea-to-use-bitwise-xor-on-a-set-of-md5-sums

I guess collecting all the _etag fileds, sort them, and then do the final MD5 sum is another way of doing it.

smyrman commented 7 years ago

@Dragomir-Ivanov , I found the same link, and it seams the answer to 1 is about 2.938735877055719e-39 (given the math/data in the Stack Exchange answer is correct). That is probably more than good enough.

I guess collecting all the _etag fileds, sort them then do the final MD5 sum is another way of doing it.

If we still don't care about point 3, then yes. If we do, then it has many of the same problems.

However, I think XOR will work nicely. I just want to ask some control questions in case there is anything we haven't thought about:-)

Dragomir-Ivanov commented 7 years ago

On point 2: XOR(c,a,a) == XOR(c) Don't know how big is the chance of occurrence in rest-layer, but there is potential for it. I guess we need to choose carefully, because future changes of the ETag algorithm, will invalidate already build client caches. rest-layer is not 1.0, but still.

smyrman commented 7 years ago

That is a very good point. I suppose it means that for the previous example of{"author": "a", "commiter": "a"}, the E-tag from GET /commits?limit=1 would be the same E-tag as the one returned from GET /commits?limit=1,fields=comitter,author.

However, those are different URLs, so it should be OK.

If you change the fields commiter and author to "b", the commit item should get a new e-tag, and then there is also no issue.

smyrman commented 7 years ago

Can you think of any other cases where XOR(c,a,a) == XOR(c) could be an issue?

smyrman commented 7 years ago

I suppose this is actually a real case:

GET /commits?limit=1,fields=comitter,author E-tag is generated as XOR(c,a,a)
We change the object linked to by committer and author, e.g. we change the user's email address.
We do another . GET /commits?limit=1,fields=comitter,author E-tag is generated as XOR(c,b,b), and BOOM.

If you then set limt from 1 to 1000, there is actually a pretty high potential error rate for the E-tag generation via XOR, as any change to the linked (user) objects here are ignored.

smyrman commented 7 years ago

That leaves us with some options:

Delayed and alphanumerically sorted MD5 hash of MD5 hashes, as suggested by @Dragomir-Ivanov . Would ignore ordering in the response (point 3 in my control question list).
MD5 hash of the full JSON payload. This has the side-effect of allowing strong E-tags (we can guarantee byte compliant responses), although that's probably not a goal in itself. May come with a greater performance penalty.

Dragomir-Ivanov commented 7 years ago

Difference of the above options: (1) Will ignore ordering of appearance in the sub-resources. For example, having sub resource field "commits": [ c1, c2, c3], on a different run the DB can return these commits in a different order say: [c2,c3,c1], but since the individual _etags will be the same, option (1) will produce the same composite ETag. Option (2) will compute different ETag for each slightest change in the ordering.

There is a problem with adding _etag to each sub-resource item: If we add it, then a change in the resource outside current projection ( say we project ?fields=a,b , but something else changes c ), we will have different _etag so the composite ETag will be different, although the requested fields are going to be the same. I think we should make adding sub-resource _etag optional via HTTP header maybe(X-RS-Etags). If we had the * field selection we could do it like ?fields=*,_etag, where _etag could be a hidden field.

There are a lot of trade-offs we have to consider.

smyrman commented 7 years ago

There is a problem with adding _etag to each sub-resource item: If we add it, then a change in the resource outside current projection ( say we project ?fields=a,b , but something else changes c ),

This only matters for MD5 hash of the full JSON payload though, as the semantic weak E-tags generated by a hash of an alphanumerically sorted list of E-tags would change no matter if _etag is included in the response or not.

Besides that, I like the idea of making the _etag field optional to fetch by making it hidden by default. This also would resolve @rs concern about adding meta-data to the response body. That's probably handled as a separate issue/PR?

I am leaning a bit against staring out with strong E-tags of the full JSON payload if the performance is not significantly reduced by this. I found some benchmarks here: http://webcache.googleusercontent.com/search?q=cache:hqoueUAiJNcJ:blog.madewithdrew.com/post/benchmark-md5/+&cd=1&hl=no&ct=clnk&gl=no&client=safari

We might want to do some quick benchmarks ourselves. If it's really less than 1ms extra for a 500k response on some reference hardware, that doesn't sound too alarming.

rs commented 7 years ago

If we do strong etags we will lose the ability to make conditional write requests. Or we need to always keep the root resource stored etag as a prefix to be used for writes.

Dragomir-Ivanov commented 7 years ago

@smyrman Currently there is no way of obtaining Hidden fields. If you include them in the ?fields= parameter rest-layer returns an error. However I see a benefit of having "hidden" fields forcibly show via query, so I guess we can relax this, and maybe add another type of Forbidden field type, where it will behave like current Hidden field type.

I guess there are a lot of corner cases to cover. For example if someone choose not to include _etag in resource items ( root/sub ) then we can freely switch from MD5 of all resource _etag to MD5 to full JSON. In doing so we can calculate the same ETag for modified root/sub, if that modification is not evident in the composite resource projection. Doing so, we will not be able to use ETag GET HTTP response header for root resource modification, since it will be fingerprint for the whole response payload. If one wants to update single item be it the root one, or one of the sub-resources, it will need to use the _etag field from the item itself. I understand this is tough decision to make, so don't rush. I am sure we will make the right decision in the end.

smyrman commented 7 years ago

If we do strong etags we will lose the ability to make conditional write requests.

@rs, we will loose that anyway for the compound E-tag, either it's a strong E-tag or a hash of hashes, we would have to (and already have to for the list view), use the embedded _etag field for conditional writes. It's an open question if we want to use compound indexes for a single item view (with or without embeds), or just for the list view (with and without embeds).

Or we need to always keep the root resource stored etag as a prefix to be used for writes.

~We can not prefix a string E-tag, can we? Would not the browser just consider it a as one E-tag?~

smyrman commented 7 years ago

@smyrman Currently there is no way of obtaining Hidden fields. If you include them in the ?fields= parameter rest-layer returns an error. However I see a benefit of having "hidden" fields forcibly show via query, so I guess we can relax this, and maybe add another type of Forbidden field type, where it will behave like current Hidden field type.

At the moment, the _etag is not a Schema field at all, but included into the response by the DefaultResponseFormatter, and for the list-view only:

https://github.com/rs/rest-layer/blob/5b33327a0fb07b058fa7f55e5c7798c1a9b07164/rest/response.go#L102

Anyway, maybe we can leave the extra _etag field out for now then (just include the root _etag in the list view as we do today), and add the compound E-tag for the list view only -- maybe even just calcualte/set it in the DefaultFormatResponse directly.

Then this is a pretty small change, and it's easy to benchmark any performance loss.

Dragomir-Ivanov commented 7 years ago

There is already PR for that #162

rs commented 7 years ago

We can have an etag that is composed of two part, one which is the etag of the root object (as stored in the db) and another which is the compound etag (like checksum of the response).

The full etag will be used by the client for caching. But when it comes to write operations, the server is the one choosing how to extract the relevant information from the etag to fulfill the If-Match condition. In this occurrence, the server would just ignore the second part of the etag that is not related to the root document, and only use the first part for comparison.

Does it make sense?

Dragomir-Ivanov commented 7 years ago

I think when client is writing compound items, it gets more complicated. For one client must remove all sub-resources, because he obviously can't PUT the whole projection. Doing so, we already enter the realm of "not that trivial", so why we don't proceed further and include the _etag in the root resource.

That way compound ETag will be left solely as a fingerprint for the whole resource, and server can also check for _etag in the root resource body, and use it if no If-Match is provided. That way clients can update a resource safely, without any HTTP header modifications. Of-course all this _etag in resource body could be explicitly requested via custom HTTP header.

For distinct resources, say /commits/12345, we can set the ETag as the _etag, and probably omit the _etag from the resource, because it will be a duplication.

smyrman commented 7 years ago

The full etag will be used by the client for caching. But when it comes to write operations, the server is the one choosing how to extract the relevant information from the etag to fulfill the If-Match condition. In this occurrence, the server would just ignore the second part of the etag that is not related to the root document, and only use the first part for comparison.

Does it make sense?

Yes - for the single item GET view, this makes perfect sense. I suppose those would also be marked as strong then.

@Dragomir-Ivanov commented on the review.

smyrman commented 7 years ago

I think when client is writing compound items, it gets more complicated. For one client must remove all sub-resources, because he obviously can't PUT the whole projection. Doing so, we already enter the realm of "not that trivial",

Thats a very good point actually!

Dragomir-Ivanov commented 7 years ago

Since we have cleared all outstanding PRs, are you willing to continue discussion on this topic. I believe it is important and this can make rest-layer very flexible :) I guess the 3 of us can discuss on Slack for better latency.

rs / rest-layer

ETag validity on composite resource #157