matthewwithanm / django-imagekit

Automated image processing for Django. Currently v4.0
http://django-imagekit.rtfd.org/
BSD 3-Clause "New" or "Revised" License

Memory issue #462

Closed: apiljic closed this issue 6 years ago

apiljic commented 6 years ago

Hello,

I am working on a news aggregator site and use django-imagekit to create news article thumbnails.

The application is hosted on Heroku. Over time, I noticed that the application consumes more and more memory. According to Heroku metrics, the app initially used about 300 MB and now uses about 800 MB. Currently, there are about 8000 images in the database. The images are stored in an S3 bucket.

I believe the problem has to do with images. If I create a clone of the application and remove all images from the clone database, the problem is gone. If I remove images from templates, the problem is gone and the memory profile goes down.

The models:

class Article(models.Model):
    title = models.CharField(max_length=255)
    …
    photo = models.ForeignKey(Photo, blank=True, null=True, related_name='+', on_delete=models.SET_NULL)
    …

class Photo(models.Model):
    name = models.TextField()
    …
    photo = models.ImageField(upload_to='user/photos/%Y/%m/%d', max_length=255)
    …

    thumb = ImageSpecField(source='photo',
            processors=[resize.ResizeToFit(131, 131),],
            options={'quality': 90})
    thumbnail_image = ImageSpecField(source='photo',
            processors=[resize.ResizeToFill(100, 100),],
            options={'quality': 90})
    news_small = ImageSpecField(source='photo',
            processors=[resize.ResizeToFill(125, 94),],
            format='JPEG',
            options={'quality': 90})
    …

Template example:

 <a href="{{ item.get_absolute_url }}"><img src="{{ item.photo.news_small.url }}" alt=""></a>

settings.py:

redis_url = urlparse(os.environ.get('REDIS_URL'))
CACHES = {
    'default': {
        'BACKEND': 'redis_cache.RedisCache',
        'LOCATION': '%s:%s' % (redis_url.hostname, redis_url.port),
        'OPTIONS': {
            'DB': 0,
            'PARSER_CLASS': 'redis.connection.HiredisParser',
            'PASSWORD': redis_url.password
        }
    }
}

On the homepage, about 25 thumbnails are shown. But the problem also occurs on a page that includes only one image.

Current versions: Django==1.8.17 django-imagekit==4.0.2

The problem might be with the way I implemented django-imagekit. However, I don’t understand what I am doing wrong, so I was hoping someone here might recognize the issue and help. I would appreciate any advice.

vstoykov commented 6 years ago

@apiljic did you manage to see where the problem is?

apiljic commented 6 years ago

I think I managed to narrow it down. But I wasn't able to fix the problem.

The memory consumption in my case seems to depend on the number of thumbnails in the CACHE folder on S3. I managed to bring the memory well below the limit of a 1x dyno (512 MB) by simply deleting all thumbnails from the CACHE folder.

There were 40,000 thumbnails, which had been created over time. Now the memory is low, but as those thumbnails get recreated over time, I assume the memory is slowly going to climb back up.

Deleting thumbnails from CACHE is obviously not a long term solution, but I did it to better understand the problem.

Also, when I flushed redis, but kept the thumbnails in the CACHE folder, the memory jumped as well. So to me it looks like the number of thumbnails in the CACHE folder has an impact on memory when django-imagekit checks if the thumbnail exists or not, not when it is created.

Is this information helpful?

vstoykov commented 6 years ago

40,000 thumbnails is a lot. Could it be that the storage is not cleaning up after itself when it checks whether the file is there or not? It could also be a memory leak in ImageKit itself.

As a workaround, you could find a way to reload/restart the worker after some time or after a number of requests. This is not a real solution, and if you can investigate the leak I will be very grateful.
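One way to apply the restart workaround, assuming the app is served by Gunicorn (the thread does not say which server is used), is to let each worker recycle itself after a fixed number of requests. A minimal sketch of a gunicorn.conf.py follows; the numbers are placeholders, not recommendations:

```python
# gunicorn.conf.py (assumes Gunicorn; the thread does not name the server)

# Restart a worker after it has handled this many requests.
max_requests = 500

# Add a random offset of up to this many requests so that workers
# do not all restart at the same moment.
max_requests_jitter = 50
```

This only caps how long a leaking worker lives; it does not remove the underlying leak.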

apiljic commented 6 years ago

Yes, 40,000 is a lot, but since the application aggregates news from many sources, that number can go up quite quickly.

I am still not sure where exactly the problem is, but restarting the worker doesn't help. This is what happens after a restart:

1) If you open a page with no thumbnail, memory is low.
2) If you upload an image to that page, then the first time the page is loaded with a thumbnail (created on demand), memory jumps very high on that single request (not to the maximum, but close).
3) If you restart the worker again and reload the page with the thumbnail, memory stays low.
4) If you flush Redis and reload the same page with the thumbnail, memory jumps just as high as in 2).
5) Once all thumbnails are removed from the S3 CACHE folder, memory stays low all the time (probably until thousands of thumbnails have been created again).

Maybe the problem is in imagekit, you are right.

So far I have not been able to figure it out. But I was hoping to hear from others what kind of memory profile they see and whether they see any dependence on the number of cached thumbnails.

If you have any more ideas, please let me know.

apiljic commented 6 years ago

Update:

I tested sorl-thumbnail on a single template. I can observe exactly the same memory jump as with django-imagekit. If the S3 CACHE folder contains several thousand thumbnails, with both packages it is enough to create a single thumbnail for memory to jump high up. If the CACHE folder is empty or contains only a few thumbnails, the memory does not jump (or at least not significantly).

This suggests the problem lies elsewhere and this issue can be closed. However, I would still appreciate any suggestion on what might be the problem here.

vstoykov commented 6 years ago

Thank you for your investigation.

Can I ask you one last thing? Can you simulate the same behavior on the local file system instead of S3? If the problem disappears, then it is related to your storage backend (or a library used by your storage backend) with which you access S3.
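A minimal sketch of what such a test setup could look like in settings.py, assuming the generated thumbnails follow Django's default file storage (which should be the case unless a separate storage was configured for ImageKit). The media directory is a placeholder:

```python
import os

# Use Django's built-in file system storage instead of the S3 backend.
DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'

# Placeholder location for uploaded photos and generated thumbnails.
MEDIA_ROOT = os.path.join(os.getcwd(), 'media')
MEDIA_URL = '/media/'
```

With this in place, the CACHE folder is created under MEDIA_ROOT, so it can be filled with a few thousand thumbnails to reproduce the S3 situation.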

apiljic commented 6 years ago

I could try that. But so far I have relied only on Heroku metrics. Any suggestions on how I could track the memory locally?
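One possible approach, assuming Python 3.4 or newer, is the standard library's tracemalloc module. The sketch below measures the peak of Python-level allocations around a single call. The helper measure_peak and the stand-in function list_keys are hypothetical names used only for illustration; in practice the call under test would be whatever triggers thumbnail generation.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) for Python allocations."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        # Stopping also clears the traces, so every call starts from zero.
        tracemalloc.stop()
    return result, peak


def list_keys(count):
    # Hypothetical stand-in for code that keeps one entry per stored file.
    return ['CACHE/images/thumb-%d.jpg' % i for i in range(count)]


if __name__ == '__main__':
    for count in (100, 40000):
        _, peak = measure_peak(list_keys, count)
        print('%6d keys: peak %.1f KiB' % (count, peak / 1024.0))
```

Note that tracemalloc only sees memory allocated through Python's allocator, so the numbers will not match the process size reported by Heroku; they are mainly useful for comparing one run against another.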

apiljic commented 6 years ago

@vstoykov As you suggested, the problem seems to have been in the django-storages/boto library I use for AWS (maybe related to https://github.com/jschneier/django-storages/issues/95). After switching to django-s3-storage, the problem appears to be gone and the memory stays at a similar level regardless of the size of the CACHE folder. Thank you very much for your feedback! I really appreciate it.
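For anyone making the same switch, a minimal sketch of the relevant settings.py entries for django-s3-storage might look like the following. The environment variable names, default region, and bucket name are assumptions for illustration, not the values used in this project:

```python
import os

# Store uploaded files (and therefore the generated thumbnails) via django-s3-storage.
DEFAULT_FILE_STORAGE = 'django_s3_storage.storage.S3Storage'

# Credentials and bucket are read from the environment (hypothetical variable names).
AWS_REGION = os.environ.get('AWS_REGION', 'us-east-1')
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID', '')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY', '')
AWS_S3_BUCKET_NAME = os.environ.get('AWS_S3_BUCKET_NAME', 'my-bucket')
```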