Open nitriques opened 8 years ago
jit now issues 2 HTTP 200 responses before sending the 304
I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?
If yes, do you mean that this has been introduced in 2.0? Or is it the same in 1.7?
BTW: I just discovered a bug with JIT 1.7 when I tried to test the reponses using cURL. (I don't know for sure if JIT 2.0 has the same behaviour.) The "If-Modified-Since" logic is not working properly. I only get a 304 response if the value of the If-Modified-Since
header matches the Last-Modified
response header exactly. Now if I add some time to the If-Modified-Since
value (asking for a later point in time), it should still be a 304, but it is not. I get a 200 response instead.
The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...
I wouldn't care about this, because HEAD requests are standardized, AFAIK, and should work with any HTTP server. As you said, you can't rely on getting any cache headers from the remote server, (but that is the same with 200 responses).
Apart from that, none of your solutions sounds good to me. "Randomly poking" a remote server is not a clear solution, and it scales terribly. If you have some thousand external images in the cache, you will generate a lot of requests/traffic even if you don't need to serve a single images to a client (because it is used in old, archived articles, for example). This might mean less traffic than with the current behaviour, but it might as well be the other way 'round, depending on the percentage (and frequency) of cached images being actually served by JIT.
More questions:
If the latter is the case, there may be reasons for this (for example: servers not sending cache headers, as you mentioned above).
The more I think about it, the more I want performance.
Agreed. I once tested putting all my image files to Amazon AWS and serving them via JIT. Yes, this was significantly slower, you could see that in your browser when loading a page (without browser cache, of course). Using CloudFront made it a lot faster and smoother (but still a bit slower than with local files). At that time I also thought about speeding things up with external images, but I couldn't find a proper solution. I think that checking if a file still exists is the least JIT should do even with external images. So I decided to go for local images (and a bigger hard drive on the server)...
(The best idea I had was delivering the cache file first, then checking for the external file. This would solve any speed issues, but it would mean that you deliver a cache file for at least one more time if the original file has been removed. That felt wrong to me.)
About ETags: I personally don't use them at all, I remove all of them (in Apache):
If you're not taking advantage of the flexible validation model that ETags provide, it's better to just remove the ETag altogether. The Last-Modified header validates based on the component's timestamp. āhttps://developer.yahoo.com/blogs/ydn/high-performance-sites-rule-13-configure-etags-7211.html
Last question: Does JIT really use the remote files's ETag and cache headers? (I don't know, because I even reset image cache headers in Apache.)
Sorry for posting a lot of questions and no solutions. But it's a delicate matter (and I love JIT).
But it's a delicate matter (and I love JIT).
Same here !
I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?
You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)
The "If-Modified-Since" logic is not working properly.
I know! That the origin of my work on that extension haha!
Are we sure that JIT sends two requests to the remote server?
Totally.
Does this only occur once or with every request?
Right now, it's when there is no cache. When a cache file is created, it lasts forever, without ever checking if the url got modified. I was trying to solve #136 and stumble upon this for external images.
"Randomly poking" a remote server is not a clear solution, and it scales terribly.
Indeed. But how can you choose to check if you cache is still valid ?
Does JIT really use the remote files's ETag and cache headers?
Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.
You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)
Sorry, I still don't get it. Do you mean that JIT itself sends 3 answers to a single request??? Can you explain in simple words what is going on (on "Remote server", "JIT server" and "Client")?
When a cache file is created, it lasts forever, without ever checking if the url got modified.
So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file. (I vaguely remember that I myself implemented this a long time ago.)
"Randomly poking" a remote server is not a clear solution, and it scales terribly.
Indeed. But how can you choose to check if you cache is still valid ?
By sending a request to the remote server (using "If-Modifed since"), see above.
Does JIT really use the remote files's ETag and cache headers?
Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.
Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.
Can you explain in simple words
Haha!
Forget the remote server. The remote server always send 200 responses since we never sent it any cache headers.
So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file.
Yes!
By sending a request to the remote server (using "If-Modifed since")
Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?
Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.
Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.
- First request: The server requests the external image, cache it and serve it as a 200.
Let's assume that the JIT response to the client is: 200
and Last-Modified: Tue, 19 Jul 2016 10:43:29 GMT
- Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).
Ah, I see. It should be different, of course. If it is the same client, it will send If-Modified-Since: Tue, 19 Jul 2016 10:43:29 GMT
. In this case JIT should send a 304
repsonse. (JIT 1.7 does this, according to my tests.)
By sending a request to the remote server (using "If-Modifed since")
Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?
You are right. We only check if the file is still available under the given URL. (Which is probably sufficient for services like Flickr, for example, which have unique filenames.) My idea would make things more complicated, so I am not sure if it's a good one. :-)
Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.
Yep.
JIT 1.7 does this, according to my tests.
Cool. 2.0 does not :( But only for external images. The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...
I knew you and I would have fun with this when I posted it yesterday night ;)
I knew you and I would have fun with this when I posted it yesterday night ;)
:-)
The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...
AFAIK, you don't need to send any date with a 304 response. (As far as I see, JIT sets the header anyway, but either my Apache or my nginx frontend server seem to remove it if the status code is 304.)
Anyway, for external images, JIT 1.7 will always do a HEAD request to the remote server; this is the important line:
$last_modified = strtotime(Image::getHttpHeaderFieldValue($image_path, 'Last-Modified'));
A bit later:
if($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last_modified_gmt || str_replace('"', NULL, stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == $etag){
...
my Apache or my nginx frontend server seem to remove it if the status code is 304.
Apache does it. nginx most likely also does it. It's in the spec: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5
This link is also in our codebase ;)
For external images, JIT 1.7 will always do a HEAD request
Indeed. That's what I am trying to remove while not breaking things as I currently did.
By caching the file on the spot, we could avoid using the data from the remote server but we must still be able to avoid having an infinite cache. A proxy would only use the max-age won't it ?
That's what I am trying to remove while not breaking things as I currently did.
Aha, so we get to the point, finally! :-))
This functionality has been implemented on purpose in order to no longer deliver a cache file if the original (external) file no longer exists. I think it is a bad idea to avoid the HEAD request, because in this case you will serve the cache file (at least for a while) although the original file has been removed from the interwebs.
I totally agree with you. But right now, this HEAD request causes problems since it loads the network card with request under high load.
It's funny because there was lots of code to avoid serving bad cache for external resources, but we would gladly sent bad cache for our local files ;)
I'd rather send invalid cache file and be able to support more requests
but we would gladly sent bad cache for our local files ;)
AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).
I'd rather send invalid cache file and be able to support more requests
I understand that this might be desired in certain systems. But not always. So what about making it optional? Just adding an option like "Don't check if external images still exist". The simplest implementation would deliver the cache file forever. A more advanced ā like mentioned above ā might deliver the cache file first, then do the HEAD request (and remove the cache file if the original file has gone away).
AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).
No I was referring to #136 which states that we serve the cache even if the file changed.
So what about making it optional
Let me bring random back! Set the configurable factor to one, the HEAD request is always made, bad cache is never used. That would be the default since it's the old behaviour. Set it to 0.5 or even 0.1 to lessen the chances of issuing a HEAD request to validate the cache but may send an invalid response.
Otherwise, we need to hit the Database or other storage to expire the cache on a url basis. Like I said, performance is key, so I think this is a no-go.
Like you, I am not fond of involving the database. So yes, your proposal sounds like a solution. (Documentation is king, of course.)
Great!
@michael-e Do not rush anything, just a reminder. I am not in a rush to release this, I'll wait for you.
@michael-e again, no rush, just wanted to let you know that my random thingny works really great. I'll deploy it in production tomorrow š
Cool. Tell us how it works in production!
Still doing extensive testing, and it looks promising š
I have only done a short test (looking at the debug headers), and it seems to work "as advertised". But I guess in this case your production environment is a much better test!
When fetching an external image, jit now issues 2 HTTP 200 responses before sending the 304. Here's why:
I wanted to be able to server the cache from the external server #130 and #136
The two request are shaped like this:
I can't seem to figure out how we can get rid of one or the other, but I think we should never do both, as it's the way right now. Should we:
The more I think about it, the more I want performance. The simplest thing would be to number 4 including the following:
Never cache external images that do not send proper cache headers ?See #139