symphonycms / jit_image_manipulation

Just in Time Image Manipulation for Symphony CMS
http://symphonyextensions.com/extensions/jit_image_manipulation/
Other
25 stars 42 forks source link

Two 200 before the 304 #138

Open nitriques opened 8 years ago

nitriques commented 8 years ago

When fetching an external image, jit now issues 2 HTTP 200 responses before sending the 304. Here's why:

  1. The modified time and etag are calculated with a 2nd http request to get the headers (I wonder why, more on that later)
  2. I've added things to use the filemtime of the cache file, which then creates new last modified and etag (so our logic fails)
  3. You then get a proper 304.

I wanted to be able to server the cache from the external server #130 and #136

The two request are shaped like this:

  1. The first one requests the whole file, skipping the headers
  2. The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...

I can't seem to figure out how we can get rid of one or the other, but I think we should never do both, as it's the way right now. Should we:

  1. Do the HEAD request always, to make sure the cache is valid ? What about servers that do not return cache headers ?
  2. Store this info in the DB, to be able to regenerate the original etag ?
  3. Give up ? I mean, this behavior occurs only for the first person to hit the url, since all others will get the cache and never see the original response ?
  4. Change our we cache external images ?

The more I think about it, the more I want performance. The simplest thing would be to number 4 including the following:

  1. [x] Never cache external images that do not send proper cache headers ? See #139
  2. [x] Always respond with the cache file's mtime and etag and hide the one from the origin ?
  3. [x] Randomly poke the remote server to validate the cache ?
michael-e commented 8 years ago

jit now issues 2 HTTP 200 responses before sending the 304

I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?

If yes, do you mean that this has been introduced in 2.0? Or is it the same in 1.7?

BTW: I just discovered a bug with JIT 1.7 when I tried to test the reponses using cURL. (I don't know for sure if JIT 2.0 has the same behaviour.) The "If-Modified-Since" logic is not working properly. I only get a 304 response if the value of the If-Modified-Since header matches the Last-Modified response header exactly. Now if I add some time to the If-Modified-Since value (asking for a later point in time), it should still be a 304, but it is not. I get a 200 response instead.

The second one only request the headers. Based on libcurl's documentation, doing so sends a HEAD http request and should (if the remote server's smart enough) only sends headers...

I wouldn't care about this, because HEAD requests are standardized, AFAIK, and should work with any HTTP server. As you said, you can't rely on getting any cache headers from the remote server, (but that is the same with 200 responses).

Apart from that, none of your solutions sounds good to me. "Randomly poking" a remote server is not a clear solution, and it scales terribly. If you have some thousand external images in the cache, you will generate a lot of requests/traffic even if you don't need to serve a single images to a client (because it is used in old, archived articles, for example). This might mean less traffic than with the current behaviour, but it might as well be the other way 'round, depending on the percentage (and frequency) of cached images being actually served by JIT.

More questions:

If the latter is the case, there may be reasons for this (for example: servers not sending cache headers, as you mentioned above).

The more I think about it, the more I want performance.

Agreed. I once tested putting all my image files to Amazon AWS and serving them via JIT. Yes, this was significantly slower, you could see that in your browser when loading a page (without browser cache, of course). Using CloudFront made it a lot faster and smoother (but still a bit slower than with local files). At that time I also thought about speeding things up with external images, but I couldn't find a proper solution. I think that checking if a file still exists is the least JIT should do even with external images. So I decided to go for local images (and a bigger hard drive on the server)...

(The best idea I had was delivering the cache file first, then checking for the external file. This would solve any speed issues, but it would mean that you deliver a cache file for at least one more time if the original file has been removed. That felt wrong to me.)

About ETags: I personally don't use them at all, I remove all of them (in Apache):

If you're not taking advantage of the flexible validation model that ETags provide, it's better to just remove the ETag altogether. The Last-Modified header validates based on the component's timestamp. ā€”https://developer.yahoo.com/blogs/ydn/high-performance-sites-rule-13-configure-etags-7211.html

Last question: Does JIT really use the remote files's ETag and cache headers? (I don't know, because I even reset image cache headers in Apache.)

Sorry for posting a lot of questions and no solutions. But it's a delicate matter (and I love JIT).

nitriques commented 8 years ago

But it's a delicate matter (and I love JIT).

Same here !

I assume you mean the following: JIT issues two requests to the remote server and waits for the responses before managing to send a 304 response. Right?

You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)

The "If-Modified-Since" logic is not working properly.

I know! That the origin of my work on that extension haha!

Are we sure that JIT sends two requests to the remote server?

Totally.

Does this only occur once or with every request?

Right now, it's when there is no cache. When a cache file is created, it lasts forever, without ever checking if the url got modified. I was trying to solve #136 and stumble upon this for external images.

"Randomly poking" a remote server is not a clear solution, and it scales terribly.

Indeed. But how can you choose to check if you cache is still valid ?

Does JIT really use the remote files's ETag and cache headers?

Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.

michael-e commented 8 years ago

You are right, but this is not what I've met: The request are those handle on the server where jit runs. I do not know if it's new to 2.0 (but it really looks like it)

Sorry, I still don't get it. Do you mean that JIT itself sends 3 answers to a single request??? Can you explain in simple words what is going on (on "Remote server", "JIT server" and "Client")?

When a cache file is created, it lasts forever, without ever checking if the url got modified.

So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file. (I vaguely remember that I myself implemented this a long time ago.)

"Randomly poking" a remote server is not a clear solution, and it scales terribly.

Indeed. But how can you choose to check if you cache is still valid ?

By sending a request to the remote server (using "If-Modifed since"), see above.

Does JIT really use the remote files's ETag and cache headers?

Not at all. Only the Last Modified value is used. And then we calculate a etag based on that value and the filename.

Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.

nitriques commented 8 years ago

Can you explain in simple words

Haha!

Forget the remote server. The remote server always send 200 responses since we never sent it any cache headers.

  1. First request: The server requests the external image, cache it and serve it as a 200.
  2. Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).
  3. Third request: The 304 is properly sent.

So this has been broken in JIT 2? I am rather sure that JIT 1.7 sends a HEAD request to the remote server before delivering the cache file.

Yes!

By sending a request to the remote server (using "If-Modifed since")

Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?

Ah, I see. It's unclear to me why ETags have been implemented at all. They don't add any value in this scenario.

Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.

michael-e commented 8 years ago
  1. First request: The server requests the external image, cache it and serve it as a 200.

Let's assume that the JIT response to the client is: 200 and Last-Modified: Tue, 19 Jul 2016 10:43:29 GMT

  1. Second request: The server gets the cache headers from the client, is not able to use it since the cache has a different last modified value than the one send in request number 1. This also returns a 200 (which I am trying to transform into a 304).

Ah, I see. It should be different, of course. If it is the same client, it will send If-Modified-Since: Tue, 19 Jul 2016 10:43:29 GMT. In this case JIT should send a 304 repsonse. (JIT 1.7 does this, according to my tests.)

By sending a request to the remote server (using "If-Modifed since")

Yeah good idea, but that was never implemented. We do not store the original date from the remote server's response. Could we use the cache file's time ?

You are right. We only check if the file is still available under the given URL. (Which is probably sufficient for services like Flickr, for example, which have unique filenames.) My idea would make things more complicated, so I am not sure if it's a good one. :-)

Indeed! We should only use the date, parse it into a timestamp and check if it's earlier than, not strictly equals.

Yep.

nitriques commented 8 years ago

JIT 1.7 does this, according to my tests.

Cool. 2.0 does not :( But only for external images. The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...

nitriques commented 8 years ago

I knew you and I would have fun with this when I posted it yesterday night ;)

michael-e commented 8 years ago

I knew you and I would have fun with this when I posted it yesterday night ;)

:-)

The big question is: how can we send the right Date the first time ? And all other times too and be able to invalidate it...

AFAIK, you don't need to send any date with a 304 response. (As far as I see, JIT sets the header anyway, but either my Apache or my nginx frontend server seem to remove it if the status code is 304.)

Anyway, for external images, JIT 1.7 will always do a HEAD request to the remote server; this is the important line:

$last_modified = strtotime(Image::getHttpHeaderFieldValue($image_path, 'Last-Modified'));

A bit later:

if($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last_modified_gmt || str_replace('"', NULL, stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == $etag){
    ...
nitriques commented 8 years ago

my Apache or my nginx frontend server seem to remove it if the status code is 304.

Apache does it. nginx most likely also does it. It's in the spec: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.5

This link is also in our codebase ;)

For external images, JIT 1.7 will always do a HEAD request

Indeed. That's what I am trying to remove while not breaking things as I currently did.

By caching the file on the spot, we could avoid using the data from the remote server but we must still be able to avoid having an infinite cache. A proxy would only use the max-age won't it ?

michael-e commented 8 years ago

That's what I am trying to remove while not breaking things as I currently did.

Aha, so we get to the point, finally! :-))

This functionality has been implemented on purpose in order to no longer deliver a cache file if the original (external) file no longer exists. I think it is a bad idea to avoid the HEAD request, because in this case you will serve the cache file (at least for a while) although the original file has been removed from the interwebs.

nitriques commented 8 years ago

I totally agree with you. But right now, this HEAD request causes problems since it loads the network card with request under high load.

It's funny because there was lots of code to avoid serving bad cache for external resources, but we would gladly sent bad cache for our local files ;)

nitriques commented 8 years ago

I'd rather send invalid cache file and be able to support more requests

michael-e commented 8 years ago

but we would gladly sent bad cache for our local files ;)

AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).

I'd rather send invalid cache file and be able to support more requests

I understand that this might be desired in certain systems. But not always. So what about making it optional? Just adding an option like "Don't check if external images still exist". The simplest implementation would deliver the cache file forever. A more advanced ā€” like mentioned above ā€” might deliver the cache file first, then do the HEAD request (and remove the cache file if the original file has gone away).

nitriques commented 8 years ago

AFAIK, we don't serve a cache file if the original local file has been removed (which is the topic here).

No I was referring to #136 which states that we serve the cache even if the file changed.

So what about making it optional

Let me bring random back! Set the configurable factor to one, the HEAD request is always made, bad cache is never used. That would be the default since it's the old behaviour. Set it to 0.5 or even 0.1 to lessen the chances of issuing a HEAD request to validate the cache but may send an invalid response.

Otherwise, we need to hit the Database or other storage to expire the cache on a url basis. Like I said, performance is key, so I think this is a no-go.

michael-e commented 8 years ago

Like you, I am not fond of involving the database. So yes, your proposal sounds like a solution. (Documentation is king, of course.)

nitriques commented 8 years ago

Great!

nitriques commented 8 years ago

@michael-e Do not rush anything, just a reminder. I am not in a rush to release this, I'll wait for you.

nitriques commented 8 years ago

@michael-e again, no rush, just wanted to let you know that my random thingny works really great. I'll deploy it in production tomorrow šŸ‘

michael-e commented 8 years ago

Cool. Tell us how it works in production!

nitriques commented 8 years ago

Still doing extensive testing, and it looks promising šŸ‘

michael-e commented 8 years ago

I have only done a short test (looking at the debug headers), and it seems to work "as advertised". But I guess in this case your production environment is a much better test!