mozilla / bedrock

Making mozilla.org awesome, one pebble at a time

Bedrock performance tooling and initial optimisations #15513

Open stevejalim opened 1 week ago

stevejalim commented 1 week ago

This changeset adds support for profiling Bedrock using django-silk locally (or anywhere the bedrock_test image is used - but not in production).
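
For anyone who hasn't used django-silk before, the wiring is roughly the sketch below, gated behind a flag so it can never run in production. The flag name (`ENABLE_SILK`) and the exact placement are illustrative assumptions, not necessarily how this changeset does it:

```python
# Sketch of enabling django-silk behind a flag. ENABLE_SILK is an illustrative
# env var name, not an existing Bedrock setting.
import os

ENABLE_SILK = os.environ.get("ENABLE_SILK", "false").lower() == "true"

if ENABLE_SILK:
    # Assumes INSTALLED_APPS and MIDDLEWARE are defined earlier in the settings file.
    INSTALLED_APPS.append("silk")
    # Silk's middleware should run early so it can time the whole request/response cycle.
    MIDDLEWARE.insert(0, "silk.middleware.SilkyMiddleware")

# And in urls.py, expose the profiling UI only when the flag is on:
#
#     if settings.ENABLE_SILK:
#         urlpatterns += [path("silk/", include("silk.urls", namespace="silk"))]
```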

It also contains some optimisations - via caching - to reduce the DB queries executed on the busiest pages.
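
The general pattern behind those query-saving commits is Django's low-level cache API; a minimal sketch (the model, key and TTL below are made up for illustration, not lifted from this PR):

```python
from django.core.cache import cache

# from bedrock.someapp.models import Banner  # hypothetical model import

CACHE_TTL = 5 * 60  # seconds; illustrative TTL, not the value the PR uses


def get_active_banners():
    """Return active banners, only hitting the DB on a cache miss.

    `Banner` is a hypothetical model used purely for illustration.
    """
    cache_key = "banners:active"
    banners = cache.get(cache_key)
    if banners is None:
        # list() forces evaluation so we cache the results, not a lazy queryset.
        banners = list(Banner.objects.filter(is_active=True))
        cache.set(cache_key, banners, CACHE_TTL)
    return banners
```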

Significant changes and points to review

Please review this PR commit by commit, paying sceptical attention to the usage of caching etc. The commit messages contain details of the number of queries saved.

I know that at the moment Bedrock is using a version of Django's LocMemCache backend - this means that with 45 pods in production, we'll still get plenty of cache misses until every pod has served a request that warms its cache. It might be that, given the TTL of the cached items, we never really get to that point in prod, but we will in Dev and Stage, where there are far fewer pods.

We'd certainly get more caching uplift if we had a shared cache backend, such as Redis. Given we now have Redis in play for the rq backend, we could switch to it now (expanding this PR) or in a separate change - opinions welcome!
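
For reference, the settings change for a shared backend would be small - something like the sketch below, using Django's built-in Redis cache backend (Django 4.0+). The env var name and URL are placeholders, and django-redis would look much the same:

```python
# Sketch only: REDIS_CACHE_URL is a placeholder env var, not an existing Bedrock setting.
import os

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": os.environ.get("REDIS_CACHE_URL", "redis://localhost:6379/1"),
        "TIMEOUT": 300,  # default TTL in seconds; individual cache.set() calls can override it
    },
}
```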

Issue / Bugzilla link

#15505

Testing

Passing unit tests should be enough here, but feel free to follow the notes in profiling/hit_popular_pages.py to test-drive things yourself.
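
For a rough idea of what that kind of drive-the-site script looks like, see the sketch below - this is purely illustrative and not the contents of profiling/hit_popular_pages.py; the host and URL list are made up:

```python
# Illustrative only -- not the actual profiling/hit_popular_pages.py from this PR.
import requests

BASE_URL = "http://localhost:8000"  # assumed local dev server
POPULAR_PATHS = [  # hypothetical sample of busy pages
    "/en-US/",
    "/en-US/firefox/new/",
    "/en-US/about/",
]


def hit_pages(repeats=3):
    """Request each page a few times so django-silk records both cold and warm runs."""
    for _ in range(repeats):
        for path in POPULAR_PATHS:
            resp = requests.get(f"{BASE_URL}{path}", timeout=30)
            print(resp.status_code, path)


if __name__ == "__main__":
    hit_pages()
```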

Questions

Is the addition of django-silk worth mentioning in the formal documentation?

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 90.47619% with 6 lines in your changes missing coverage. Please review.

Project coverage is 78.89%. Comparing base (e775d32) to head (9e6ee0f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bedrock/settings/base.py | 44.44% | 5 Missing :warning: |
| bedrock/urls.py | 50.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main   #15513      +/-   ##
==========================================
+ Coverage   78.82%   78.89%   +0.07%
==========================================
  Files         158      158
  Lines        8282     8338      +56
==========================================
+ Hits         6528     6578      +50
- Misses       1754     1760       +6
```

:umbrella: View full report in Codecov by Sentry.

stevejalim commented 6 days ago

Having talked with @bkochendorfer, we're both OK with the idea of using Redis for the web deployments too (the same Redis we use for the CMS task queues, since there's little real-world risk of task eviction if Redis gets low on memory). As such, I will follow this PR with another one focused on using Redis as an FE cache backend.

@pmac @robhudson Are you aware of any issues that are likely if we switch from LocMem to Redis for a cache - is there anywhere we specifically exploit the local nature of it?

pmac commented 6 days ago

I don't know of any issues specific to locmem other than this will be a lot slower. I am a little concerned about using the same redis for cache and queue. We are doing this with basket, but that's a much lower volume. My understanding is that you'd use very different settings in redis for a cache vs. queue.
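
To illustrate the "very different settings" point: a cache-only Redis is typically run with an eviction policy such as allkeys-lru, while a queue's Redis needs noeviction so jobs are never silently dropped. One way to keep the two concerns apart on the Django side is sketched below, with placeholder env vars and django-rq-style queue settings (Bedrock's real config may differ):

```python
# Sketch only: env var names and URLs are placeholders, not Bedrock settings.
import os

# Cache Redis: safe to run with maxmemory-policy=allkeys-lru, because losing a
# cached entry only costs an extra DB query.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": os.environ.get("CACHE_REDIS_URL", "redis://cache-redis:6379/0"),
    },
}

# Queue Redis: should run with maxmemory-policy=noeviction, because an evicted
# key here would be a silently lost task.
RQ_QUEUES = {
    "default": {
        "URL": os.environ.get("QUEUE_REDIS_URL", "redis://queue-redis:6379/0"),
    },
}
```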

stevejalim commented 5 days ago

> I don't know of any issues specific to locmem other than this will be a lot slower. I am a little concerned about using the same redis for cache and queue. We are doing this with basket, but that's a much lower volume. My understanding is that you'd use very different settings in redis for a cache vs. queue.

I hear you. I'm happy to keep these cache uses on LocMemCache for now and see what the uplift is like. My leaning towards a shared cache was to reduce cache misses from pods with cold local-memory caches, but doing things one step at a time is good - we may not need a shared cache at all once the pods are warm.