thechangelog / changelog.com

Changelog is news and podcast for developers. This is our open source platform.
https://changelog.com/posts/changelog-is-open-source
Other
2.69k stars 245 forks source link

CDN cache misses increased 7x after Oct. 8, 2023 #486

Closed gerhard closed 6 months ago

gerhard commented 11 months ago

What is the problem?

After Oct. 8, 2023, CDN cache misses increased by 7x, from 0.75m to 5.2m

image

This seems to be coming from the app origin, specifically F_fly_2022_03_13

If we group by url, we notice that /feed, / & /feed/ requests are the ones with the highest increase in cache misses:

image

What is the Fastly response if we enable debug headers?

image

This looks like a clean cache miss, and we do not have shielding enabled. Fixed it.

Note While at it, I did the same for the feeds origin.

Does this fix it?

This looks a lot better now:

image

👉 Let's give this a few hours before we check the hits vs misses again.

Questions

Do the cache headers look correct?

Looking at a specific URL endpoint served by the origin, we see the following:

image

This looks correct to me. surrogate-control is set and should have precedence over cache-control. Reference: Understand how cache control headers work.

If the surrogate-control headers are respected, why does this content have so many misses? cc @jamesarosen

Could this config change be related to the increased cache misses?

We had a two VCL changes earlier that day:

The second one looks suspicious, but on closer inspection we are simply setting surrogate-control to AWS S3 headers which are prefixed with x-amz-meta-:

image

Could the above change be related to the increased cache misses @jamesarosen?

To be continued...

jerodsanto commented 11 months ago

Hmmm I never set up /feed to be served directly from R2 even though we are pre-generating it like all the other feeds now. No need to hit the app origin for this. Should I make that change @gerhard and see if this makes it all better?

jerodsanto commented 11 months ago

I went ahead and made this change so now every /feed request is being sent to R2 origin instead of the app origin:

curl -I https://changelog.com/feed
HTTP/2 200
content-type: application/xml
etag: W/"59b2e72e41e688450a1ca572543e9d20"
last-modified: Sun, 22 Oct 2023 13:17:46 GMT
cf-cache-status: DYNAMIC
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=DRcCdyhiCjkzqQ%2FdYsrHq%2BQqAbg%2FKpWp1k35Qeckvca3COcoQB8P8JJ3fAwp8aWKb12c2Tt0C3kL6RL0ETFFFDsX181Hal%2BK9%2F4OrxKkapP1ds2IBIguMea5q5jyPfVB5w4myw%2BJoLc%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
cache-control: max-age=3600, s-maxage=604800, stale-while-revalidate=604800, stale-if-error=604800
server: cloudflare
cf-ray: 81a20c6928694e03-MCI
access-control-allow-origin: *
accept-ranges: bytes
date: Sun, 22 Oct 2023 13:22:32 GMT
via: 1.1 varnish
age: 12
x-served-by: cache-mci680073-MCI
x-cache: HIT
x-cache-hits: 1
x-timer: S1697980952.206757,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
content-length: 224492
gerhard commented 11 months ago

OK! That is an improvement for the /feed endpoint: https://ui.honeycomb.io/changelog/datasets/fastly/result/CogXHT7GnNF

image

Want to do the same for /feed/ & /feed.json ? FTR: https://ui.honeycomb.io/changelog/datasets/fastly/result/F3QLxZ2CAtp

image

gerhard commented 11 months ago

@jerodsanto I have a placeholder Fastly config that re-enables shielding for both the app & the feeds origin: https://manage.fastly.com/configure/services/7gKbcKSKGDyqU7IuDr43eG/diff/169,170

Are you OK to have this enabled? I ask because I'm not sure if removing shielding on the app in https://manage.fastly.com/configure/services/7gKbcKSKGDyqU7IuDr43eG/diff/167,169 was intentional.

BTW, I just noticed that I never activated v168: https://manage.fastly.com/configure/services/7gKbcKSKGDyqU7IuDr43eG/diff/167,168

gerhard commented 11 months ago

Even after the /feed switch to R2, I still don't understand why / cache misses increased more than 10x: https://ui.honeycomb.io/changelog/datasets/fastly/result/FQjroVdMo3E

image

Time to double-check with Fastly support...

jamesarosen commented 11 months ago

The second one looks suspicious, but on closer inspection we are simply setting surrogate-control to AWS S3 headers which are prefixed with x-amz-meta-

The diff shows you deleting that logic, not adding it.

Want to do the same for /feed/ & /feed.json

What's the difference between /feed and /feed/? Instead of serving the same content from both, you'll get better caching if you rewrite or redirect the path at the edge.

removing shielding on the app

Why? Given your global audience, shielding is definitely going to help you keep CHR high.

Even after the /feed switch to R2, I still don't understand why / cache misses increased more than 10x

Can you show the curl / httpstat for the origin request for / and /feed?

gerhard commented 11 months ago

The diff shows you deleting that logic, not adding it.

Yes, we deleted that logic, because we no longer need it (we replaced S3 with R2).

Since app responses do not use x-amz-meta- headers, deleting that logic shouldn't have had an impact on this, and yet it looks like they did - problems started happening at the same time, as captured in my previous comment.

What's the difference between /feed and /feed/? Instead of serving the same content from both, you'll get better caching if you rewrite or redirect the path at the edge.

/feed is served by R2 (that's the origin), while /feed/ is served by the app. They should both be served by R2. Rewriting /feed/ to /feed at the edge makes most sense, thanks for the reminder!

Given your global audience, shielding is definitely going to help you keep CHR high.

Yes, I'm pretty sure that we want to keep shielding for all origins. Want to double-check with @jerodsanto in case he had a good reason to disable it on the app. Perhaps it's related to the Websockets work that he did a few weeks back: https://changelog.slack.com/archives/C03SA8VE2/p1695996344381799 (the following 10 messages have more context).

Can you show the curl / httpstat for the origin request for / and /feed?

Let's ignore /feed now, since that origin has changed to R2. Let's focus on / instead.

This is what the request to origin currently returns:

httpstat https://changelog-2022-03-13.fly.dev/
Connected to 66.51.126.203:443 from 192.168.88.18:61248

HTTP/2 200
cache-control: no-store, must-revalidate
content-length: 60923
content-type: text/html; charset=utf-8
date: Tue, 24 Oct 2023 07:10:40 GMT
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
surrogate-control: max-age=60, stale-while-revalidate=60, stale-if-error=604800
surrogate-key: pages
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5D5jDFXU3h2Ar8CmxuB
via: 2 fly.io
fly-request-id: 01HDG8AW9KDVKY0Y5QGTBGH22S-lhr

Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpndfxmb7l

  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[     6ms    |       4ms      |      7ms      |       355ms       |       157ms      ]
             |                |               |                   |                  |
    namelookup:6ms            |               |                   |                  |
                        connect:10ms          |                   |                  |
                                    pretransfer:17ms              |                  |
                                                      starttransfer:372ms            |
                                                                                 total:529ms

And here is the same request to Fastly:

httpstat https://changelog.com/
Connected to 151.101.129.162:443 from 192.168.88.18:52383

HTTP/2 200
cache-control: no-store, must-revalidate
content-type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
x-changelog-vanity-redirect: false
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5B91xd3BIDdNqgBIcCh
via: 1.1 fly.io, 1.1 varnish
fly-request-id: 01HDC6KYH20EX58WHF21A24427-lhr
access-control-allow-origin: *
accept-ranges: bytes
date: Sun, 22 Oct 2023 17:24:55 GMT
age: 73
x-served-by: cache-lon4265-LON
x-cache: HIT
x-cache-hits: 1
x-timer: S1697995496.598494,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
content-length: 60890

Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpccsfo978

  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[     7ms    |       4ms      |      6ms      |        4ms        |        4ms       ]
             |                |               |                   |                  |
    namelookup:7ms            |               |                   |                  |
                        connect:11ms          |                   |                  |
                                    pretransfer:17ms              |                  |
                                                      starttransfer:21ms             |
                                                                                 total:25ms

And another request to Fastly:

httpstat https://changelog.com/
Connected to 151.101.129.162:443 from 192.168.88.18:52384

HTTP/2 200
cache-control: no-store, must-revalidate
content-type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
x-changelog-vanity-redirect: false
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5B96CwMXESe9kwAjrRi
via: 1.1 fly.io, 1.1 varnish
fly-request-id: 01HDC6P65JHTS4GRNVBVEH6SZ5-lhr
access-control-allow-origin: *
accept-ranges: bytes
date: Sun, 22 Oct 2023 17:25:07 GMT
age: 11
x-served-by: cache-lon420124-LON
x-cache: HIT
x-cache-hits: 1
x-timer: S1697995507.000441,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
content-length: 60862

Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpaxw8mrmh

  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[     6ms    |       3ms      |     10ms      |        5ms        |        4ms       ]
             |                |               |                   |                  |
    namelookup:6ms            |               |                   |                  |
                        connect:9ms           |                   |                  |
                                    pretransfer:19ms              |                  |
                                                      starttransfer:24ms             |
                                                                                 total:28ms

While this looks good, we continue seeing a high rate of misses: 449 misses vs 289 hits in the last 1h: https://ui.honeycomb.io/changelog/datasets/fastly/result/oJz39TUUryB

image

This is how that changed in the last 14 days vs 14 days prior: https://ui.honeycomb.io/changelog/datasets/fastly/result/4nwqn3VMREu

image

gerhard commented 11 months ago

To make debugging this easier, I am attaching the current VCL config v169 7gKbcKSKGDyqU7IuDr43eG_169.vcl.txt

Yes, I definitely intend to version control this in our repository.

jamesarosen commented 11 months ago

https://changelog-nightly-2023-10-10.fly.dev/ doesn't have any cache-control or surrogate-control headers. Should it? Even if it changes pretty frequently, you could get some benefit from something like

cache-control: private, max-age=3600, must-revalidate
surrogate-control: public, max-age=60, stale-while-revalidate=3600
gerhard commented 11 months ago

Sorry @jamesarosen, that paste was my bad (I just fixed it).

The origin paste should have been:

httpstat https://changelog-2022-03-13.fly.dev/
Connected to 66.51.126.203:443 from 192.168.88.18:61248

HTTP/2 200
cache-control: no-store, must-revalidate
content-length: 60923
content-type: text/html; charset=utf-8
date: Tue, 24 Oct 2023 07:10:40 GMT
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
surrogate-control: max-age=60, stale-while-revalidate=60, stale-if-error=604800
surrogate-key: pages
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5D5jDFXU3h2Ar8CmxuB
via: 2 fly.io
fly-request-id: 01HDG8AW9KDVKY0Y5QGTBGH22S-lhr

Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpndfxmb7l

  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[     6ms    |       4ms      |      7ms      |       355ms       |       157ms      ]
             |                |               |                   |                  |
    namelookup:6ms            |               |                   |                  |
                        connect:10ms          |                   |                  |
                                    pretransfer:17ms              |                  |
                                                      starttransfer:372ms            |
                                                                                 total:529ms
jerodsanto commented 11 months ago

@gerhard me disabling shielding when adding /feed to R2 was unintentional. I believe it was because I cloned the wrong version of the config or something...

I just now activated Version 170 now, which re-enables shielding on both the app and feed origins.

GIPeon commented 11 months ago

Sorry to hijack does this have anything to do with 404s on https://changelog.com/master/feed ? Thanks

jerodsanto commented 11 months ago

Yes it does @GIPeon thanks for mentioning.

@gerhard all feed endpoints were 404ing with shielding on. It appears that Fastly was forwarding to Fly for these instead of Cloudflare. I disabled shielding on the feeds origin, cleared cache, and we back with 200s once again.

I'm not sure why the shielding would cause this, though...

jamesarosen commented 11 months ago

Usually, if enabling shielding causes 404s, it’s because some routing condition you expect to be true is no longer true because you’re modified the request at the edge.

Less often, it’s because you’ve cached 404 responses in the shield.

jerodsanto commented 11 months ago

The logic we're using to send requests over to R2 is:

if (req.http.host == "changelog.com" && table.contains(feeds, req.url.path)) {
    set req.backend = F_Cloudflare_R2_Feeds_Host;
}

How exactly does turning on shielding "modify the request at the edge"? What changes that might make this logic fail?

jerodsanto commented 11 months ago

@gerhard I also just disabled shielding to the Fly origin because when it's on we lose the ability to detect the original host requested (eg - changelog.com, x.changelog.com, etc).

The code I'm using to do that (which maybe we can adjust to still work with shielding on) is this:

def get_host(conn) do
    host = get_header(conn, "x-forwarded-host") || get_header(conn, "host") || Map.get(conn, :host, nil)

    host
    |> to_string()
    |> String.split(":")
    |> List.first()
  end

This works with shielding off, doesn't with it on.

jamesarosen commented 11 months ago

This works with shielding off, doesn't with it on.

This is an excellent hint!

My hypothesis

For requests that don't use the shield, you're getting

host: x.changelog.com

But for requests that use the shield, you've overridden the host.

This configuration causes Fastly to change the host header:

backend F_fly_2022_03_13 {
    .always_use_host_header = true;
    .host = "changelog-2022-03-13.fly.dev";
    .host_header = "changelog-2022-03-13.fly.dev";

So if you set req.backend = F_fly_2022_03_13 at the edge, the request at shield will have host: changelog-2022-03-13.fly.dev and won't match routing rules that check req.host.

Potential Solutions

Most of the host-based logic doesn't really need to run at both edge and shield. You could add req.restarts == 0 to the condition to make sure those things only happen at edge, and not at shield.

You could save off the original host very early in the request:

if (req.restarts == 0 && !req.http.original-host) {
  req.http.original-host = req.http.host;
}

and then use req.http.original-host to do your logic.

See Specifying an override host

gerhard commented 10 months ago

all feed endpoints were 404ing with shielding on. It appears that Fastly was forwarding to Fly for these instead of Cloudflare. I disabled shielding on the feeds origin, cleared cache, and we back with 200s once again.

Got it! This confirms the problem: https://ui.honeycomb.io/changelog/datasets/fastly/result/JMomht9BuJu

image

We now have a trigger in Honeycomb that will send me a notification if that happens again: https://ui.honeycomb.io/changelog/datasets/fastly/triggers/8pENCV13L76 . If you want @jerodsanto, we can notify you as well.

I also just disabled shielding to the Fly origin because when it's on we lose the ability to detect the original host requested (eg - changelog.com, x.changelog.com, etc).

Interesting.

@jamesarosen explanation makes sense to me. To double-check that I understood it correctly:

Do the changes look good to you @jamesarosen?

In the meantime, if you need to make another VCL change @jerodsanto, please ensure that you are starting from v175, otherwise we may end up with a config change that is not ready to go out.

jamesarosen commented 10 months ago

Do the changes look good to you @jamesarosen?

I think they're worth a shot! They certainly look safe enough to try. And I don't have any other specific ideas yet.

gerhard commented 9 months ago

OK, I have just activated v176, let's see how this goes!

gerhard commented 9 months ago

That didn't work as expected 🤔 https://ui.honeycomb.io/changelog/datasets/fastly/result/8fAqcgcYS4s

image

It took a while for the trigger to kick in, but it worked as expected: image

All OK now: image

jamesarosen commented 9 months ago

I'm glad things have stabilized. Configuration changes like this often result in a lot of cache misses at first, so the brief spike isn't surprising.

The only change I might make would be from

if (req.restarts == 0 && !req.http.original-host) {

to

if (req.restarts == 0) {

That way, an attacker can't inject a fake original-host header. In this particular case, I don't think it will matter. It's just a good general practice to be defensive about your variables.

gerhard commented 9 months ago

I should have been clearer in my previous message. I had to revert back to v175 in order for feeds to work correctly. There must be something missing in the diff which is causing feeds origin from serving 404s.

Shortly after I enabled config v176, my first screenshot shows elevated 404 responses for all feeds. The trigger notified me and I reverted back to v175. All back to normal afterwards.

My next step is to set this up on my replica setup, and figure out what I'm missing there, instead of introducing more errors in production.

In case you want to help debug this further @jamesarosen, I have attached both v175 & v176 in one of my previous comments. This week is full-on for me, cannot make it sync. Next week looks much better though in case you are up for it @jamesarosen. Happy to continue this via DMs.

gerhard commented 6 months ago

I no longer think that it's worth spending time on this.

As a follow-up to the https://github.com/thechangelog/changelog.com/discussions/485 conversation, I am running a few experiments:

  1. Compare a different CDN to Fastly 🤓
  2. Start building our own CDN 🧐

In the context of the first experiment, I followed up on @lawik's suggestion and tried out https://bunny.net as a drop-in replacement. I then set up https://hyperping.io monitoring for the root url - / - for all three instances:

  1. Fly.io
  2. Fastly.com
  3. Bunny.net

After 10 days, we can answer:

What is the global average response time in the last 7 days, as monitored by https://hyperping.io/ ?

This is what we have learned:

https://changelog.com/ https://changelog-2024-01-12.fly.dev/
Screenshot 2024-02-25 at 14 22 18 Screenshot 2024-02-25 at 14 26 34

According to the above, serving requests directly from the origin - Fly.io - is faster globally than when using the Fastly CDN. While the 20% difference - 327ms vs 273ms - doesn't sound like a lot, every continent apart from Europe is slower when served via Fastly.

As for rhe global average response time in the last 7 days when served via Bunny.net:

Screenshot 2024-02-25 at 15 23 25

At 72ms, this is a clear winner and something that I would like us to looks into further.

I am also curious about how Cloudflare compares to the above. That is my next action item.


As for the homepage hit ratio - the reason for this issue - this sums it up:

image

👆 The above is a 99.85% cache hit ratio for /

As for Fastly, the cache hit ratio for / was 44% in the same 7 day period, Feb. 18-25, 2024:

image

FWIW, I found this Bunny.net view really useful in a nerdy way:

image
gerhard commented 6 months ago

Last action item follow-up: I set up https://cloudflare.changelog.place on Cloudflare, but I cannot complete the integration since Host Header Override is only available to Enterprise users - source:

image

Without this feature, Cloudflare is sending the cloudflare.changelog.place host to Fly.io which it does not recognise, and it redirects to changelog.com. This results in an infinite redirect loop:

image

Following-up privately with @adamstac re Cloudflare.

gerhard commented 6 months ago

Closing since this is done from my perspective.

OK to reopen if following-up in this context makes sense.