mozilla / bedrock

Making mozilla.org awesome, one pebble at a time
https://www.mozilla.org
Mozilla Public License 2.0
1.18k stars 920 forks source link

Restrict CDN cache to only certain query params #10428

Open pmac opened 3 years ago

pmac commented 3 years ago

Our CDN has the capability to include the query parameters for requests in the cache key or not. It also will allow us to specify which parameter names to include or exclude. We currently include all query params in the cache key, but to improve we should only include those which we use on the server side in our views. Let's get a list of these params together and update the CDN config to only include those in the list.

alexgibson commented 3 years ago

Parameter list:

Note: I'm mostly discovering these params via searching for request.GET.get.

alexgibson commented 3 years ago

@pmac here's a WIP list, but I need to do a bit more checking still I think.

Quick question for validation: the params we're interested in are only the ones that we use to render different content, right? Also: do I need to worry about things like the newsletter preference center, or stub attribution endpoint? I'm assuming those aren't cached by the CDN already.

pmac commented 3 years ago

@pmac here's a WIP list, but I need to do a bit more checking still I think.

Great start! Thanks!

Quick question for validation: the params we're interested in are only the ones that we use to render different content, right?

Correct.

Also: do I need to worry about things like the newsletter preference center, or stub attribution endpoint? I'm assuming those aren't cached by the CDN already.

I believe that is correct. I'll have to check the stub attribution one as in theory that could be cached for people who request a signature for the same set of attributes.

pmac commented 3 years ago

We don't need to respect the geo param tough right? We are specifically not respecting it for the prod domain.

alexgibson commented 3 years ago

We don't need to respect the geo param tough right? We are specifically not respecting it for the prod domain.

We are for stage though, right? I thought that uses the CDN also.

pmac commented 3 years ago

True. I guess we should keep the stage and prod CDN configs as similar as possible. I was thinking we'd only do this query param thing on prod, since it's really meant to improve our cache hit ratio and we don't have any issue with that on stage.

alexgibson commented 3 years ago

Yeah, I also think from a testing perspective doing this on stage first before rolling it out makes sense.

alexgibson commented 3 years ago

@pmac ok I think the list above looks to be about it. Some of the items in the list I wasn't 100% sure about being effected, but I included them anyway to be on the safe side. Let me know what you think.

I also didn't include any params in the newsletter app, or the stub attribution endpoint. Maybe we can exclude those URLs if needed?

pmac commented 3 years ago

I did check stub attribution and it does allow itself to be cached on the assumption that the query params would be unique per cache entry. I think the solution there is to just turn the cache off for that view.

stevejalim commented 2 years ago

For the benefit of ticket archaeologists: the CDN in use when this ticket was written was Cloudflare. Since then, the CDN was switched to AWS Cloudfront.

alexgibson commented 2 years ago

@stevejalim note that this issue was never completed originally (hence it's still open). Right now I believe we still include all query params in the cache key, which has always been the case up until now as far as I'm aware.

stevejalim commented 2 years ago

@alexgibson just checked and yeah, you're right - thanks for the poke! we do appear to be cacheing all querystrings, still, even though we've swapped CDN

alexgibson commented 1 year ago

@pmac is this something we still need / want to do for GCP?

pmac commented 1 year ago

We're not quite on the GCP CDN yet, but it probably wouldn't hurt to check in with @bkochendorfer when we are on this to see if it'd help, or if we even need this sort of help.

bkochendorfer commented 1 year ago

It looks like gcp cloud cdn does support adding particular query keys to the cache key. As @pmac mentioned we aren't using GCP CDN quite yet but will in the near future. The way it is currently configured is the same as AWS Cloudfront in that we are caching all query keys.

Can you all help me understand what the intent of this change would be? I see in the originating comment that we want to improve but I'm not sure if that is a performance concern or something else.

pmac commented 1 year ago

IIRC the idea is that there's only a subset of query params that we use to affect the page served from our servers. The rest are either for analytics (UTM params) or front-end things. So we could improve our cache-hit rate by only allowing those params which we need to affect the cache. If our hit-rate is good enough, it is probably not worth the effort to make this happen. But the hypothesis is that we often use UTM params in our campaigns that end up on www.m.o, and so if those requests were handled via cache more often that might be better for the backend.

bkochendorfer commented 1 year ago

@pmac Ok understood. Right now our cache rate moves between 95-99%, so fairly high. Happy to experiment with this if we think we can pull that up even higher. If we know the headers we do not want to add we can add those to exclusion list (UTM) which might make this easier.

pmac commented 1 year ago

If there's an exclusion list then just excluding the several utm_* params would be most of it I think.