matthewwithanm / django-imagekit

Automated image processing for Django. Currently v4.0
http://django-imagekit.rtfd.org/
BSD 3-Clause "New" or "Revised" License
2.27k stars 275 forks source link

ImageKit probing S3 excessively for existence of cache files #402

Closed mgalgs closed 7 years ago

mgalgs commented 7 years ago

Disclaimer: I asked this question on the google group a few days ago but the group seems pretty dead... I don't have any conclusive evidence that there is a problem in imagekit. This is more of a general discussion / asking for advice, so please feel free to spend as little time on it as you would like.


Recently I've been noticing that a HUGE portion of my S3 bill is from API requests (as opposed to actual data transfer/storage). For example, last month my S3 bill breakdown was:

PUT, COPY, POST, or LIST requests       851,391,339 Requests    $4,682.65
GET and all other requests              1,388,299,727 Requests  $610.85
first 50 TB / month of storage used     3,484.621 GB-Mo         $90.60

The most interesting one is the PUT, COPY, POST, LIST line. ImageKit incurs a LIST request (by virtue of the underlying django-s3-storage library, in my case) each time it checks for the existence of a cache file. I don't think there are any COPY's going on, and uploading images will be a POST (or a PUT, not quite sure). I incurred a total of 850M+ requests in this category last month, but the total number of objects I have stored in S3 only increased by 25 million, and I'm not deleting anything, so the remaining ~825M requests must have been LIST requests. I assume most (or all?) of those LIST requests were from ImageKit checking for the existence of cached files.

I enabled logging on my S3 bucket and noticed that, indeed, there are a ton of duplicate LIST requests coming in for the same files in a relatively short time span:

... str-media [04/Feb/2017:14:05:31 +0000] 52.52.139.202 ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 17 16 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-106-generic" -
... str-media [04/Feb/2017:14:41:42 +0000] 52.52.158.136 ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 17 16 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-106-generic" -
... str-media [04/Feb/2017:15:35:56 +0000] 52.52.154.85  ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 40 39 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-106-generic" -
... str-media [04/Feb/2017:15:43:57 +0000] 52.52.158.136 ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 15 14 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-106-generic" -
... str-media [04/Feb/2017:16:00:42 +0000] 52.52.216.112 ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 18 17 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-107-generic" -
... str-media [04/Feb/2017:16:20:32 +0000] 52.9.95.192   ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 13 12 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-106-generic" -
... str-media [04/Feb/2017:16:42:56 +0000] 52.52.158.136 ... ... REST.GET.BUCKET - "GET /str-media/?delimiter=/&prefix=media/CACHE/images/invitem_images/IMG_7309_WMBam4N/28db3a4ea2bb49134756c53f31e9560e.JPG HTTP/1.1" 200 - 751 - 17 16 "-" "Boto/2.40.0 Python/2.7.6 Linux/3.13.0-106-generic" -

(REST.GET.BUCKET appears to correspond to LIST requests in the logs)

I haven't changed IMAGEKIT_CACHE_BACKEND, which means it's using my default cache backend, which is memcached. As I understand it, the only reason we should be seeing these duplicate existence checks is if the info about the cached files is being evicted from memcached before the next time that it's needed. But looking at the logs above it just seems like those duplicate requests are too close together to be explained by eviction... I could be totally wrong on that and haven't yet validated that theory. I do see dozens of evictions per second in memcached, so maybe this really is what's happening...

Anyways, I was just hoping to get a second opinion on what might be going on here and ways to diagnose it before I take the somewhat drastic action of introducing another less ephemeral cache backend (e.g. redis) to my project.

Thanks in advance.

/cc @dkeitel

vstoykov commented 7 years ago

Hello @mgalgs you are using django-imagekit from github and the latest released version right? (I need to release the new version soon sorry for the delay).

If you are not using latest django-imagekit from github then caching is not enabled by default and you need to configure it with IMAGEKIT_CACHE_BACKEND.

Also if you are not set default timeout for the cache backend to be very big then it is by default only 5 minutes and I see that these requests are rare then 5 minutes.

You can try with configuring the cache and see if it will help

mgalgs commented 7 years ago

Hello @mgalgs you are using django-imagekit from github and the latest released version right? (I need to release the new version soon sorry for the delay).

If you are not using latest django-imagekit from github then caching is not enabled by default and you need to configure it with IMAGEKIT_CACHE_BACKEND.

I'm using 3.3, yes.

Also if you are not set default timeout for the cache backend to be very big then it is by default only 5 minutes and I see that these requests are rare then 5 minutes.

Yeah I had thought about that but looking at the implementation it seems like the timeout only applies when the file does not exist. So we should only see the existence check once, then the cache file should actually be generated, then the fact that the cachefile exists should remain forever (or until it's evicted from the cache backend) with no further existence checks, right?

def set_state(self, file, state):
    key = self.get_key(file)
    if state == CacheFileState.DOES_NOT_EXIST:
        self.cache.set(key, state, self.existence_check_timeout)
    else:
        self.cache.set(key, state)

(source)

vstoykov commented 7 years ago

@mgalgs I'm not sure how you configure your CACHES and IMAGEKIT_CACHE_BACKEND but I just made new pull request which will enable more aggressive caching by default (1 day when in production) #403.

Can you install imagekit from my branch and test it? I have one more place where I thing that caching can be implemented. Look at the docs https://django-imagekit.readthedocs.io/en/latest/caching.html#cache-file-backend and more precisely at the Note bellow

Like regular Django ImageFields, IK doesn’t currently cache width and height values, so accessing those will always result in a read. That will probably change soon though.

Probably this also need to be cached and calls to storage backend can be reduced.

mgalgs commented 7 years ago

@mgalgs I'm not sure how you configure your CACHES and IMAGEKIT_CACHE_BACKEND but I just made new pull request which will enable more aggressive caching by default (1 day when in production) #403.

I see that you added a timeout when the cache file does exist. I don't think that's actually helpful... If it exists it's better to just leave it there indefinitely.

Can you install imagekit from my branch and test it? I have one more place where I thing that caching can be implemented. Look at the docs https://django-imagekit.readthedocs.io/en/latest/caching.html#cache-file-backend and more precisely at the Note bellow

Like regular Django ImageFields, IK doesn’t currently cache width and height values, so accessing those will always result in a read. That will probably change soon though. Probably this also need to be cached and calls to storage backend can be reduced.

Sounds like a good enhancement, but separate from this issue, FWIW. Since the symptom here is excessive existence checks, not reads.

vstoykov commented 7 years ago

I see that you added a timeout when the cache file does exist. I don't think that's actually helpful... If it exists it's better to just leave it there indefinitely.

The timeout that I was adding is when the file actually exists. Currently it uses default timeout provided from the cache backend (and by default in Django it is 5 minutes if not configured). I wanted to have sane defaults but anybody to have the ability to tune it by theirs own needs.

I will change the logic by default to set the timeout to never expire when DEBUG = False instead of 1 day.

vstoykov commented 7 years ago

Actually in order to see if this change will have any effect in your setup you can make your cache configuration now as follows:

CACHES = {
    'default': {
        # your default cache config
    },
    'imagekit': {
        'BACKEND': 'desired backend',
        'LOCATION': 'backend location',
        'TIMEOUT': None,
    }
}

IMAGEKIT_CACHE_BACKEND = 'imagekit'

If this works with current version of ImageKit on your setup then the changes that I want to introduce in this merge request will have positive effect by default for other people in your situation. If this config does not help you then these changes will not make any difference for your case.

mgalgs commented 7 years ago

Ah sorry, you're absolutely right. I was thinking the django default was to never expire. Let me try changing the default to never expire like you mentioned.

vstoykov commented 7 years ago

@mgalgs can you say if there is any benefit with the caching if configured as in my previous comment or you need to wait for next billing?

mgalgs commented 7 years ago

Trying it tomorrow. I'll know by the end of the day. Will report back.

On Sun, Feb 12, 2017, 09:43 Venelin Stoykov notifications@github.com wrote:

@mgalgs https://github.com/mgalgs can you say if there is any benefit with the caching if configured as in my previous comment or you need to wait for next billing?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/matthewwithanm/django-imagekit/issues/402#issuecomment-279234794, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJRzjN1OIrXfSfiKFLmWj1mksIFpQ-pks5rb0TJgaJpZM4L6f-l .

vstoykov commented 7 years ago

thanks

mgalgs commented 7 years ago

This seems to be working! I'm just judging based on the logs, but I'll know for sure at the end of the day tomorrow when I can check what my total API usage was for this type of request.

vstoykov commented 7 years ago

@mgalgs did you checked your total API usage to confirm that correctly configured cache reduces the load on S3?

If so can you check the wording in #403 if it is understandable what will be the new default behavior about caching in ImageKit?

mgalgs commented 7 years ago

Yeah I actually logged on yesterday and couldn't figure out a way to see my API requests over time... I forgot to make a note of my total requests before the change so I won't have a great way of confirming for sure that this made an impact until the end of the month... However, I'm 95% sure that this is going to reduce my API usage drastically, just based on the log files.

I meant to submit a PR documenting this in the README since it can save your users potentially thousands of dollars, but since you might be solving it a different way I decided to hold off.

vstoykov commented 7 years ago

@mgalgs I think that after #403 I can close this issue.