Open Dragomir-Ivanov opened 7 years ago
Do you mean *,services{*},visits{*}
? Because without {}
you should just get the object ids.
This is a very valid point. What about computing a composite ETag?
cc @smyrman
@rs services
and visits
are bound to clients
by schema.Reference
and not schema.Connection
so *,services,visits
results in an arrays of services:[]
and visits:[]
attached to the returned clients item.
Composite ETag can't be cached in the DB, so we need to recompute it on the fly, compare with the supplied ETag and return 304 if they match.
Is all this work related to rest-layer
at all, aren't we putting too many responsibilities on it? Other frameworks will do this in a middle-ware function executed after the frameworks.
I guess if we have some context ( but not Go context.Context) exported in the request, and we put some important values there to let the following middle-ware functions make decisions based on them.
I am against context.Context
because it is hierarchical, and adding values in a child ctx
doesn't propagate to parent ctx
.
Okay, I think now I am understanding why middle-ware path is not considered here. In Go http.ResponseWriter
is single shot, and once you Write()
there is no turning back. That is why Alice
middle-ware can be used only for modifying the http.Request
object, but not the response.
Please confirm that my understanding is correct.
Yes. That’s why you can set a custom ResponseSender on rest.Handler
.
Computing a checksum of all the Etags shouldn’t be too expensive.
@rs Thanks, then customResponseSender
is the way to go.
Computing a checksum of all the Etags shouldn’t be too expensive.
What do you mean by this? I thought the ETags are check-sums themselves. My thinking is computing ETag for composite and list resources will be the right way ( and is cheap, compared to network latency), and this might be done in the ResponseSender
mentioned above. However we shouldn't enforce that, but maybe put an example for this particular case in the examples.
Etag is an opac string. We can combine all etags of the (sub-)resources using the hash algo of our choosing to create a composite etag.
Why do you want to fix this issue outside of rest layer? It is a real problem that need to be addressed.
Well, now that it is clear we can't make post rest-layer
middle-ware function to modify the response, we have no choice. I wanted because I have seen some Node frameworks doing in middle-ware functions, and it seemed self-contained problem.
Now, why would you want to combine all ETags, is it speed the only concern? Wouldn't be the same if you only calculate ETag (md5 hash) to the composite resource response? Combining multiple-ETags seems more work.
Summing n string is less expensive than summing n maps with arbitrary number of fields (and subfields) stored as interface{}
. We could checksum the JSON representation, it would be in-between in term of complexity but it would also be too late and tie the Etag computation to the representation which is handled by the rest
package instead of resource
package.
Functionally, I agree, both approaches are equivalent.
Just looked at the code. Accessing the Etag during projection evaluation might be tricky actually. I think query.Projection.Eval
should return a list of resources used during the evaluation for instance, so the caller can do additional work. I means that Etag comparison won't be able to happen before projection evaluation.
We have the same problem with If-Modified-Since
and Last-Modified
actually. In this case, we should return the min(resource, sub-resources).
What summing algorithm do you propose for Etags? Append all Etags and then do final md5 on the whole byte array, or XORing them all?
I would go for md5. The xor route would be interesting but would require to either guarantee all Etags are same length or use padding. All rest-layer Etags are currently same size but it could change in the future or we may want to support pre-existing Etags with different format.
Will the md5 hashed Etag be set in the header, while the raw MD5s are made available as _etag
in each resource? Will the original Etag be available as a different header? How does the system respond if a resource is missing an E-tag, e.g. as discussed in rs/rest-layer-mongo#20.
One -case to be aware of:
GET http://192.168.0.103:3001/api/clients/59a85126eaba9ebd0180ac55?fields=*,services,visit
PUT http://192.168.0.103:3001/api/clients/59a85126eaba9ebd0180ac55 If-Match X
(X retrieved via 1)One solution would be to have a composite Etag with a separator like <root resource etag>-<composite projection etags>
. The full Etag would be used by browser for caching, but we could strip the second part when the item is edited.
I think it is best if each resource brings its own _etag
field just like in the itemList
case.
Then we will have ETags to update freely the root resource
or each of the projected resources
.
I guess we will need to supply _updated
as well.
Also when PUT/POST-ing a resource, we can look for If-Match
HTTP header, if non we can fallback to trying to find _etag
field within the resource, and trying to use that. This will not help browser caching, but will prevent concurrent resource update, which is more important.
I’m not a fan of poluting the payload with metadata.
Okay, but we are polluting the payload in itemList
case, aren't we? Adding _etag
in the referenced resources
as well would service the same purpose. Pollution of every kind is not wanted, but can't we make this configurable, or better supplying enough of the data, so custom ResponseFormatter
will do the job.
What’s wrong with the composite etag described above?
There is one particular use-case, where you get root resource
, and a referenced resource
item/items. Then you want to PUT/PATCH a referenced resource
item. In your proposal, we can't do that, because we don't have the ETag for this individual item, but just for the root resource
and for all projected resource
items Etag combined sum.
If we want to update a referenced resource
item then we will need to GET it again, then PUT. Not needing doing this is nice optimization, and will be in line of itemList
case, where we have array of items, each with _etag
field. In the root resource
we also have array of referenced resources
items, but without _etag
field..
I understand that all this is optional, so maybe the best place to do this formatting is in the ResponseFormatter
so the user can overwrite it.
I suppose we have three or four cases for the composite e-tag:
I suppose this will let you update the root-resource without a data race, but not an embedded sub-resource, e.g. if you want to fetch all resources through one big request with embeds, and update individual resources later without a re-fetch.
I’m not a fan of poluting the payload with metadata.
Neither am I. However, as @Dragomir-Ivanov points out, the rest
package already include an "_etag" field in some cases. Why not do so consistently, and let that field be used for updates, and the header value E-tag for GET?
Btw, making the header values e-tag of format <root resource etag>-<composite projection etags>
when applicable is still nice in order to make it less confusing to the user why the E-tag in the header and payload sometimes do not match; at least he can see that the header E-tag actually includes the resource E-tag, and that it's of a longer format.
@rs Any thoughts on "polluting" the sub-resource payload with _etag
?
I have no better solution to propose so let’s do that :/
I was tinkering with the Etag
for the composite resource, and it turned out that during the resource
and sub-resource
projection evaluation different parts of sub-resource
projections can come from DB at different times, because some sub-resources are obtained with go-routines. This is a problem because for the same request, sometimes sub-resources can come in A,B,C order, but for other times they can come in B,C,A order, thus using md5.Hash()
function producting different composite resource hasg/ETag, for exactly the same result. @rs Can you rethink your decision on using XOR of all ETags instead. It will produce the same result no matter the order.
Can you rethink your decision on using XOR of all ETags instead.
I suppose that means that rs/rest-layer-mongo#20 would need to change to a compatible format?
sometimes sub-resources can come in A,B,C order, but for other times they can come in B,C,A order,
Have you actually observed this? From me just skimming the code, it looks like the order is pretty much guaranteed (unless they come in a random order from the Storer layer).
Expanded references are inserted at a given map field location, so evaluation order should not matter.
Connections are inserted at a given slice index:
I suppose it is true that they can come from the DB at different times, but does that matter?
Maybe I am not understanding the code enough, but for composite resource's ETag we will need to hash with MD5: root resource ETag, sub-resource A ETag, sub-resource B ETag, etc. We will need to apply the MD5 hash in the same order, for us to have the same composite-resource ETag. We will need ether store&compute this hash after all sub-resources have been obtained from storage, or use algorithm where ordering doesn't matter (XOR). @smyrman The snippets above seem to be executed by separate go-routines, so there is no guarantees about their scheduling, and which finishes when. I may be wrong, but that is my understanding for now.
We will need ether store&compute this hash after all sub-resources have been obtained from storage, or use algorithm where ordering doesn't matter (XOR).
Ok, I follow you now. I was assuming you would do the first, compute the hash after all sub-resources have been fetched. If you want to do the incremental hash while you are fetching resources, then yes, you would need to use another algorithm than MD5.
From me just skimming the code, it looks like the order is pretty much guaranteed (unless they come in a random order from the Storer layer).
I was thinking about the order being consistent in the final JSON response, not in terms of evaluation order, which is what you ask for -- sorry for the confusion.
In the light of this, xor is the way to go. We can handle different sizes using padding when necessary.
We can handle different sizes using padding when necessary.
That only works if the non-conforment ID has a length shorter then the typical MD5 sub. Maybe better to just do a MD5-sum of any non-conferment IDs (length not == 32 chars) right before the XOR?
Also, some final control questions before we dive into XOR.
{"author": "a", "commiter": "a"}
, {"author": "b", "commiter": "b"}
. Most likely though, it would still be a case of XOR(c,a,a) != XOR(d,b,b)
Here is some discussion on the topic: https://crypto.stackexchange.com/questions/1368/is-it-a-good-idea-to-use-bitwise-xor-on-a-set-of-md5-sums
I guess collecting all the _etag
fileds, sort them, and then do the final MD5 sum is another way of doing it.
@Dragomir-Ivanov , I found the same link, and it seams the answer to 1 is about 2.938735877055719e-39
(given the math/data in the Stack Exchange answer is correct). That is probably more than good enough.
I guess collecting all the _etag fileds, sort them then do the final MD5 sum is another way of doing it.
If we still don't care about point 3, then yes. If we do, then it has many of the same problems.
However, I think XOR will work nicely. I just want to ask some control questions in case there is anything we haven't thought about:-)
On point 2: XOR(c,a,a) == XOR(c)
Don't know how big is the chance of occurrence in rest-layer
, but there is potential for it. I guess we need to choose carefully, because future changes of the ETag algorithm, will invalidate already build client caches. rest-layer
is not 1.0, but still.
That is a very good point. I suppose it means that for the previous example of{"author": "a", "commiter": "a"}
, the E-tag from GET /commits?limit=1
would be the same E-tag as the one returned from GET /commits?limit=1,fields=comitter,author
.
However, those are different URLs, so it should be OK.
If you change the fields commiter
and author
to "b", the commit item should get a new e-tag, and then there is also no issue.
Can you think of any other cases where XOR(c,a,a) == XOR(c)
could be an issue?
I suppose this is actually a real case:
GET /commits?limit=1,fields=comitter,author
E-tag is generated as XOR(c,a,a)
GET /commits?limit=1,fields=comitter,author
E-tag is generated as XOR(c,b,b)
, and BOOM.If you then set limt from 1
to 1000
, there is actually a pretty high potential error rate for the E-tag generation via XOR, as any change to the linked (user) objects here are ignored.
That leaves us with some options:
Difference of the above options:
(1) Will ignore ordering of appearance in the sub-resources
. For example, having sub resource field "commits": [ c1, c2, c3]
, on a different run the DB can return these commits in a different order say: [c2,c3,c1]
, but since the individual _etags
will be the same, option (1) will produce the same composite ETag. Option (2) will compute different ETag for each slightest change in the ordering.
There is a problem with adding _etag
to each sub-resource item: If we add it, then a change in the resource outside current projection ( say we project ?fields=a,b
, but something else changes c
), we will have different _etag
so the composite ETag will be different, although the requested fields are going to be the same. I think we should make adding sub-resource _etag
optional via HTTP header maybe(X-RS-Etags
). If we had the *
field selection we could do it like ?fields=*,_etag
, where _etag
could be a hidden field.
There are a lot of trade-offs we have to consider.
There is a problem with adding _etag to each sub-resource item: If we add it, then a change in the resource outside current projection ( say we project ?fields=a,b , but something else changes c ),
This only matters for MD5 hash of the full JSON payload though, as the semantic weak E-tags generated by a hash of an alphanumerically sorted list of E-tags would change no matter if _etag
is included in the response or not.
Besides that, I like the idea of making the _etag
field optional to fetch by making it hidden by default. This also would resolve @rs concern about adding meta-data to the response body. That's probably handled as a separate issue/PR?
I am leaning a bit against staring out with strong E-tags of the full JSON payload if the performance is not significantly reduced by this. I found some benchmarks here: http://webcache.googleusercontent.com/search?q=cache:hqoueUAiJNcJ:blog.madewithdrew.com/post/benchmark-md5/+&cd=1&hl=no&ct=clnk&gl=no&client=safari
We might want to do some quick benchmarks ourselves. If it's really less than 1ms extra for a 500k response on some reference hardware, that doesn't sound too alarming.
If we do strong etags we will lose the ability to make conditional write requests. Or we need to always keep the root resource stored etag as a prefix to be used for writes.
@smyrman Currently there is no way of obtaining Hidden
fields. If you include them in the ?fields=
parameter rest-layer returns an error. However I see a benefit of having "hidden" fields forcibly show via query, so I guess we can relax this, and maybe add another type of Forbidden
field type, where it will behave like current Hidden
field type.
I guess there are a lot of corner cases to cover. For example if someone choose not to include _etag
in resource items ( root/sub ) then we can freely switch from MD5 of all resource _etag
to MD5 to full JSON.
In doing so we can calculate the same ETag for modified root/sub, if that modification is not evident in the composite resource projection.
Doing so, we will not be able to use ETag
GET HTTP response header for root resource modification, since it will be fingerprint for the whole response payload. If one wants to update single item be it the root
one, or one of the sub-resources
, it will need to use the _etag
field from the item itself.
I understand this is tough decision to make, so don't rush. I am sure we will make the right decision in the end.
If we do strong etags we will lose the ability to make conditional write requests.
@rs, we will loose that anyway for the compound E-tag, either it's a strong E-tag or a hash of hashes, we would have to (and already have to for the list view), use the embedded _etag
field for conditional writes. It's an open question if we want to use compound indexes for a single item view (with or without embeds), or just for the list view (with and without embeds).
Or we need to always keep the root resource stored etag as a prefix to be used for writes.
~We can not prefix a string E-tag, can we? Would not the browser just consider it a as one E-tag?~
@smyrman Currently there is no way of obtaining Hidden fields. If you include them in the ?fields= parameter rest-layer returns an error. However I see a benefit of having "hidden" fields forcibly show via query, so I guess we can relax this, and maybe add another type of Forbidden field type, where it will behave like current Hidden field type.
At the moment, the _etag
is not a Schema field at all, but included into the response by the DefaultResponseFormatter
, and for the list-view only:
Anyway, maybe we can leave the extra _etag
field out for now then (just include the root _etag
in the list view as we do today), and add the compound E-tag for the list view only -- maybe even just calcualte/set it in the DefaultFormatResponse
directly.
Then this is a pretty small change, and it's easy to benchmark any performance loss.
There is already PR for that #162
We can have an etag that is composed of two part, one which is the etag of the root object (as stored in the db) and another which is the compound etag (like checksum of the response).
The full etag will be used by the client for caching. But when it comes to write operations, the server is the one choosing how to extract the relevant information from the etag to fulfill the If-Match condition. In this occurrence, the server would just ignore the second part of the etag that is not related to the root document, and only use the first part for comparison.
Does it make sense?
I think when client is writing compound items, it gets more complicated. For one client must remove all sub-resources
, because he obviously can't PUT the whole projection. Doing so, we already enter the realm of "not that trivial", so why we don't proceed further and include the _etag
in the root resource.
That way compound ETag will be left solely as a fingerprint for the whole resource, and server can also check for _etag
in the root resource body, and use it if no If-Match
is provided. That way clients can update a resource safely, without any HTTP header modifications. Of-course all this _etag
in resource body could be explicitly requested via custom HTTP header.
For distinct resources, say /commits/12345, we can set the ETag as the _etag
, and probably omit the _etag
from the resource, because it will be a duplication.
The full etag will be used by the client for caching. But when it comes to write operations, the server is the one choosing how to extract the relevant information from the etag to fulfill the If-Match condition. In this occurrence, the server would just ignore the second part of the etag that is not related to the root document, and only use the first part for comparison.
Does it make sense?
Yes - for the single item GET
view, this makes perfect sense. I suppose those would also be marked as strong then.
@Dragomir-Ivanov commented on the review.
I think when client is writing compound items, it gets more complicated. For one client must remove all sub-resources, because he obviously can't PUT the whole projection. Doing so, we already enter the realm of "not that trivial",
Thats a very good point actually!
Since we have cleared all outstanding PRs, are you willing to continue discussion on this topic. I believe it is important and this can make rest-layer
very flexible :) I guess the 3 of us can discuss on Slack for better latency.
Lets suppose we have this request:
If I execute this in a row, first time I get 200 then 304, which is as it is supposed to be. But then I change some field values of
services
andvisits
behind the scenes.Then retry the query above, I still get 304, although the result of the query IS CHANGED.
ETag comparison has its place on updating concrete resources, but I think we should leave generation of ETags for compound resources, and for resouseLists to be generated in a middleware function executed after
rest-layer
.