Closed gerhard closed 6 months ago
Hmmm I never set up /feed
to be served directly from R2 even though we are pre-generating it like all the other feeds now. No need to hit the app origin for this. Should I make that change @gerhard and see if this makes it all better?
I went ahead and made this change so now every /feed
request is being sent to R2 origin instead of the app origin:
curl -I https://changelog.com/feed
HTTP/2 200
content-type: application/xml
etag: W/"59b2e72e41e688450a1ca572543e9d20"
last-modified: Sun, 22 Oct 2023 13:17:46 GMT
cf-cache-status: DYNAMIC
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=DRcCdyhiCjkzqQ%2FdYsrHq%2BQqAbg%2FKpWp1k35Qeckvca3COcoQB8P8JJ3fAwp8aWKb12c2Tt0C3kL6RL0ETFFFDsX181Hal%2BK9%2F4OrxKkapP1ds2IBIguMea5q5jyPfVB5w4myw%2BJoLc%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
cache-control: max-age=3600, s-maxage=604800, stale-while-revalidate=604800, stale-if-error=604800
server: cloudflare
cf-ray: 81a20c6928694e03-MCI
access-control-allow-origin: *
accept-ranges: bytes
date: Sun, 22 Oct 2023 13:22:32 GMT
via: 1.1 varnish
age: 12
x-served-by: cache-mci680073-MCI
x-cache: HIT
x-cache-hits: 1
x-timer: S1697980952.206757,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
content-length: 224492
OK! That is an improvement for the /feed
endpoint: https://ui.honeycomb.io/changelog/datasets/fastly/result/CogXHT7GnNF
Want to do the same for /feed/
& /feed.json
? FTR: https://ui.honeycomb.io/changelog/datasets/fastly/result/F3QLxZ2CAtp
@jerodsanto I have a placeholder Fastly config that re-enables shielding for both the app & the feeds origin: https://manage.fastly.com/configure/services/7gKbcKSKGDyqU7IuDr43eG/diff/169,170
Are you OK to have this enabled? I ask because I'm not sure if removing shielding on the app in https://manage.fastly.com/configure/services/7gKbcKSKGDyqU7IuDr43eG/diff/167,169 was intentional.
BTW, I just noticed that I never activated v168
: https://manage.fastly.com/configure/services/7gKbcKSKGDyqU7IuDr43eG/diff/167,168
Even after the /feed
switch to R2, I still don't understand why /
cache misses increased more than 10x
: https://ui.honeycomb.io/changelog/datasets/fastly/result/FQjroVdMo3E
Time to double-check with Fastly support...
The second one looks suspicious, but on closer inspection we are simply setting surrogate-control to AWS S3 headers which are prefixed with x-amz-meta-
The diff shows you deleting that logic, not adding it.
Want to do the same for /feed/ & /feed.json
What's the difference between /feed
and /feed/
? Instead of serving the same content from both, you'll get better caching if you rewrite or redirect the path at the edge.
removing shielding on the app
Why? Given your global audience, shielding is definitely going to help you keep CHR high.
Even after the /feed switch to R2, I still don't understand why / cache misses increased more than 10x
Can you show the curl
/ httpstat
for the origin request for /
and /feed
?
The diff shows you deleting that logic, not adding it.
Yes, we deleted that logic, because we no longer need it (we replaced S3 with R2).
Since app responses do not use x-amz-meta-
headers, deleting that logic shouldn't have had an impact on this, and yet it looks like they did - problems started happening at the same time, as captured in my previous comment.
What's the difference between /feed and /feed/? Instead of serving the same content from both, you'll get better caching if you rewrite or redirect the path at the edge.
/feed
is served by R2 (that's the origin), while /feed/
is served by the app. They should both be served by R2. Rewriting /feed/
to /feed
at the edge makes most sense, thanks for the reminder!
Given your global audience, shielding is definitely going to help you keep CHR high.
Yes, I'm pretty sure that we want to keep shielding for all origins. Want to double-check with @jerodsanto in case he had a good reason to disable it on the app. Perhaps it's related to the Websockets work that he did a few weeks back: https://changelog.slack.com/archives/C03SA8VE2/p1695996344381799 (the following 10 messages have more context).
Can you show the
curl
/httpstat
for the origin request for/
and/feed
?
Let's ignore /feed
now, since that origin has changed to R2. Let's focus on /
instead.
This is what the request to origin currently returns:
httpstat https://changelog-2022-03-13.fly.dev/
Connected to 66.51.126.203:443 from 192.168.88.18:61248
HTTP/2 200
cache-control: no-store, must-revalidate
content-length: 60923
content-type: text/html; charset=utf-8
date: Tue, 24 Oct 2023 07:10:40 GMT
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
surrogate-control: max-age=60, stale-while-revalidate=60, stale-if-error=604800
surrogate-key: pages
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5D5jDFXU3h2Ar8CmxuB
via: 2 fly.io
fly-request-id: 01HDG8AW9KDVKY0Y5QGTBGH22S-lhr
Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpndfxmb7l
DNS Lookup TCP Connection TLS Handshake Server Processing Content Transfer
[ 6ms | 4ms | 7ms | 355ms | 157ms ]
| | | | |
namelookup:6ms | | | |
connect:10ms | | |
pretransfer:17ms | |
starttransfer:372ms |
total:529ms
And here is the same request to Fastly:
httpstat https://changelog.com/
Connected to 151.101.129.162:443 from 192.168.88.18:52383
HTTP/2 200
cache-control: no-store, must-revalidate
content-type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
x-changelog-vanity-redirect: false
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5B91xd3BIDdNqgBIcCh
via: 1.1 fly.io, 1.1 varnish
fly-request-id: 01HDC6KYH20EX58WHF21A24427-lhr
access-control-allow-origin: *
accept-ranges: bytes
date: Sun, 22 Oct 2023 17:24:55 GMT
age: 73
x-served-by: cache-lon4265-LON
x-cache: HIT
x-cache-hits: 1
x-timer: S1697995496.598494,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
content-length: 60890
Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpccsfo978
DNS Lookup TCP Connection TLS Handshake Server Processing Content Transfer
[ 7ms | 4ms | 6ms | 4ms | 4ms ]
| | | | |
namelookup:7ms | | | |
connect:11ms | | |
pretransfer:17ms | |
starttransfer:21ms |
total:25ms
And another request to Fastly:
httpstat https://changelog.com/
Connected to 151.101.129.162:443 from 192.168.88.18:52384
HTTP/2 200
cache-control: no-store, must-revalidate
content-type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
x-changelog-vanity-redirect: false
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5B96CwMXESe9kwAjrRi
via: 1.1 fly.io, 1.1 varnish
fly-request-id: 01HDC6P65JHTS4GRNVBVEH6SZ5-lhr
access-control-allow-origin: *
accept-ranges: bytes
date: Sun, 22 Oct 2023 17:25:07 GMT
age: 11
x-served-by: cache-lon420124-LON
x-cache: HIT
x-cache-hits: 1
x-timer: S1697995507.000441,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
content-length: 60862
Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpaxw8mrmh
DNS Lookup TCP Connection TLS Handshake Server Processing Content Transfer
[ 6ms | 3ms | 10ms | 5ms | 4ms ]
| | | | |
namelookup:6ms | | | |
connect:9ms | | |
pretransfer:19ms | |
starttransfer:24ms |
total:28ms
While this looks good, we continue seeing a high rate of misses: 449
misses vs 289
hits in the last 1h
: https://ui.honeycomb.io/changelog/datasets/fastly/result/oJz39TUUryB
This is how that changed in the last 14 days vs 14 days prior: https://ui.honeycomb.io/changelog/datasets/fastly/result/4nwqn3VMREu
To make debugging this easier, I am attaching the current VCL config v169
7gKbcKSKGDyqU7IuDr43eG_169.vcl.txt
Yes, I definitely intend to version control this in our repository.
https://changelog-nightly-2023-10-10.fly.dev/
doesn't have any cache-control
or surrogate-control
headers. Should it? Even if it changes pretty frequently, you could get some benefit from something like
cache-control: private, max-age=3600, must-revalidate
surrogate-control: public, max-age=60, stale-while-revalidate=3600
Sorry @jamesarosen, that paste was my bad (I just fixed it).
The origin paste should have been:
httpstat https://changelog-2022-03-13.fly.dev/
Connected to 66.51.126.203:443 from 192.168.88.18:61248
HTTP/2 200
cache-control: no-store, must-revalidate
content-length: 60923
content-type: text/html; charset=utf-8
date: Tue, 24 Oct 2023 07:10:40 GMT
permissions-policy: interest-cohort=()
referrer-policy: strict-origin-when-cross-origin
server: Fly/55f8eaa0 (2023-10-09)
surrogate-control: max-age=60, stale-while-revalidate=60, stale-if-error=604800
surrogate-key: pages
x-content-type-options: nosniff
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: F5D5jDFXU3h2Ar8CmxuB
via: 2 fly.io
fly-request-id: 01HDG8AW9KDVKY0Y5QGTBGH22S-lhr
Body stored in: /var/folders/z5/n5y08g4j073_ld2d9019j3_80000gn/T/tmpndfxmb7l
DNS Lookup TCP Connection TLS Handshake Server Processing Content Transfer
[ 6ms | 4ms | 7ms | 355ms | 157ms ]
| | | | |
namelookup:6ms | | | |
connect:10ms | | |
pretransfer:17ms | |
starttransfer:372ms |
total:529ms
@gerhard me disabling shielding when adding /feed
to R2 was unintentional. I believe it was because I cloned the wrong version of the config or something...
I just now activated Version 170 now, which re-enables shielding on both the app and feed origins.
Sorry to hijack does this have anything to do with 404s on https://changelog.com/master/feed ? Thanks
Yes it does @GIPeon thanks for mentioning.
@gerhard all feed endpoints were 404ing with shielding on. It appears that Fastly was forwarding to Fly for these instead of Cloudflare. I disabled shielding on the feeds origin, cleared cache, and we back with 200s once again.
I'm not sure why the shielding would cause this, though...
Usually, if enabling shielding causes 404s, it’s because some routing condition you expect to be true is no longer true because you’re modified the request at the edge.
Less often, it’s because you’ve cached 404 responses in the shield.
The logic we're using to send requests over to R2 is:
if (req.http.host == "changelog.com" && table.contains(feeds, req.url.path)) {
set req.backend = F_Cloudflare_R2_Feeds_Host;
}
How exactly does turning on shielding "modify the request at the edge"? What changes that might make this logic fail?
@gerhard I also just disabled shielding to the Fly origin because when it's on we lose the ability to detect the original host requested (eg - changelog.com, x.changelog.com, etc).
The code I'm using to do that (which maybe we can adjust to still work with shielding on) is this:
def get_host(conn) do
host = get_header(conn, "x-forwarded-host") || get_header(conn, "host") || Map.get(conn, :host, nil)
host
|> to_string()
|> String.split(":")
|> List.first()
end
This works with shielding off, doesn't with it on.
This works with shielding off, doesn't with it on.
This is an excellent hint!
For requests that don't use the shield, you're getting
host: x.changelog.com
But for requests that use the shield, you've overridden the host.
This configuration causes Fastly to change the host
header:
backend F_fly_2022_03_13 {
.always_use_host_header = true;
.host = "changelog-2022-03-13.fly.dev";
.host_header = "changelog-2022-03-13.fly.dev";
So if you set req.backend = F_fly_2022_03_13
at the edge, the request at shield will have host: changelog-2022-03-13.fly.dev
and won't match routing rules that check req.host
.
Most of the host-based logic doesn't really need to run at both edge and shield. You could add req.restarts == 0
to the condition to make sure those things only happen at edge, and not at shield.
You could save off the original host very early in the request:
if (req.restarts == 0 && !req.http.original-host) {
req.http.original-host = req.http.host;
}
and then use req.http.original-host
to do your logic.
all feed endpoints were 404ing with shielding on. It appears that Fastly was forwarding to Fly for these instead of Cloudflare. I disabled shielding on the feeds origin, cleared cache, and we back with 200s once again.
Got it! This confirms the problem: https://ui.honeycomb.io/changelog/datasets/fastly/result/JMomht9BuJu
We now have a trigger in Honeycomb that will send me a notification if that happens again: https://ui.honeycomb.io/changelog/datasets/fastly/triggers/8pENCV13L76 . If you want @jerodsanto, we can notify you as well.
I also just disabled shielding to the Fly origin because when it's on we lose the ability to detect the original host requested (eg - changelog.com, x.changelog.com, etc).
Interesting.
@jamesarosen explanation makes sense to me. To double-check that I understood it correctly:
v175
v176
, WHICH IS NOT YET ACTIVEDo the changes look good to you @jamesarosen?
In the meantime, if you need to make another VCL change @jerodsanto, please ensure that you are starting from v175
, otherwise we may end up with a config change that is not ready to go out.
Do the changes look good to you @jamesarosen?
I think they're worth a shot! They certainly look safe enough to try. And I don't have any other specific ideas yet.
OK, I have just activated v176
, let's see how this goes!
That didn't work as expected 🤔 https://ui.honeycomb.io/changelog/datasets/fastly/result/8fAqcgcYS4s
It took a while for the trigger to kick in, but it worked as expected:
All OK now:
I'm glad things have stabilized. Configuration changes like this often result in a lot of cache misses at first, so the brief spike isn't surprising.
The only change I might make would be from
if (req.restarts == 0 && !req.http.original-host) {
to
if (req.restarts == 0) {
That way, an attacker can't inject a fake original-host
header. In this particular case, I don't think it will matter. It's just a good general practice to be defensive about your variables.
I should have been clearer in my previous message. I had to revert back to v175
in order for feeds to work correctly. There must be something missing in the diff which is causing feeds origin from serving 404s.
Shortly after I enabled config v176
, my first screenshot shows elevated 404
responses for all feeds. The trigger notified me and I reverted back to v175
. All back to normal afterwards.
My next step is to set this up on my replica setup, and figure out what I'm missing there, instead of introducing more errors in production.
In case you want to help debug this further @jamesarosen, I have attached both v175
& v176
in one of my previous comments. This week is full-on for me, cannot make it sync. Next week looks much better though in case you are up for it @jamesarosen. Happy to continue this via DMs.
I no longer think that it's worth spending time on this.
As a follow-up to the https://github.com/thechangelog/changelog.com/discussions/485 conversation, I am running a few experiments:
In the context of the first experiment, I followed up on @lawik's suggestion and tried out https://bunny.net as a drop-in replacement. I then set up https://hyperping.io monitoring for the root url - /
- for all three instances:
After 10 days, we can answer:
What is the global average response time in the last 7 days, as monitored by https://hyperping.io/ ?
This is what we have learned:
https://changelog.com/ | https://changelog-2024-01-12.fly.dev/ |
---|---|
According to the above, serving requests directly from the origin - Fly.io - is faster globally than when using the Fastly CDN. While the 20% difference - 327ms
vs 273ms
- doesn't sound like a lot, every continent apart from Europe is slower when served via Fastly.
As for rhe global average response time in the last 7 days when served via Bunny.net:
At 72ms
, this is a clear winner and something that I would like us to looks into further.
I am also curious about how Cloudflare compares to the above. That is my next action item.
As for the homepage hit ratio - the reason for this issue - this sums it up:
👆 The above is a 99.85%
cache hit ratio for /
As for Fastly, the cache hit ratio for /
was 44%
in the same 7 day period, Feb. 18-25, 2024:
FWIW, I found this Bunny.net view really useful in a nerdy way:
Last action item follow-up: I set up https://cloudflare.changelog.place on Cloudflare, but I cannot complete the integration since Host Header Override
is only available to Enterprise users - source:
Without this feature, Cloudflare is sending the cloudflare.changelog.place
host to Fly.io which it does not recognise, and it redirects to changelog.com
. This results in an infinite redirect loop:
Following-up privately with @adamstac re Cloudflare.
Closing since this is done from my perspective.
OK to reopen if following-up in this context makes sense.
What is the problem?
After Oct. 8, 2023, CDN cache misses increased by
7x
, from0.75m
to5.2m
This seems to be coming from the app origin, specifically
F_fly_2022_03_13
If we group by
url
, we notice that/feed
,/
&/feed/
requests are the ones with the highest increase in cache misses:What is the Fastly response if we enable debug headers?
This looks like a clean cache miss, and we do not have shielding enabled. Fixed it.
Does this fix it?
This looks a lot better now:
👉 Let's give this a few hours before we check the hits vs misses again.
Questions
Do the cache headers look correct?
Looking at a specific URL endpoint served by the origin, we see the following:
This looks correct to me.
surrogate-control
is set and should have precedence overcache-control
. Reference: Understand how cache control headers work.If the
surrogate-control
headers are respected, why does this content have so many misses? cc @jamesarosenCould this config change be related to the increased cache misses?
We had a two VCL changes earlier that day:
The second one looks suspicious, but on closer inspection we are simply setting
surrogate-control
to AWS S3 headers which are prefixed withx-amz-meta-
:Could the above change be related to the increased cache misses @jamesarosen?
To be continued...