Downloads going over cdn.dl.k8s.io are much slower than direct downloads from the bucket #5755

Closed · xmudrii closed 5 months ago

xmudrii commented 1 year ago

I've observed that downloads using curl going over cdn.dl.k8s.io (dl.k8s.io) are much slower than direct downloads from the bucket (storage.googleapis.com/kubernetes-release).

For example, downloading kubelet v1.28.1 directly from the bucket yields the following results:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.28.1/bin/linux/amd64/kubelet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  105M  100  105M    0     0  23.8M      0  0:00:04  0:00:04 --:--:-- 23.8M

The download took 4 seconds in total. However, downloading via the CDN yields very different results:

curl -LO https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   138  100   138    0     0    744      0 --:--:-- --:--:-- --:--:--   745
100  105M  100  105M    0     0  1643k      0  0:01:05  0:01:05 --:--:-- 1784k

It took one minute and five seconds to download the same file.

Update: it turns out that cache-miss downloads are slow, while cache-hit downloads are fast. This can be determined from the x-cache: MISS and x-cache: HIT headers. Once the file is cached on Fastly's side, downloads are fast, but prior to that, downloads are insanely slow.
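
A quick way to tell which case you're in, without downloading the whole file, is to inspect only the headers (a minimal check, assuming the Fastly debug headers shown further down in this thread are returned):

# HEAD request only; -L follows the dl.k8s.io redirect to the CDN
curl -sIL https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet | grep -i x-cache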

/sig k8s-infra
/priority important-soon
/kind bug

cc @ameukam @BenTheElder

xrstf commented 1 year ago

It might be related, but the CDN is not just slow, it's also inconsistent. 1.29.0-alpha.1 was released yesterday, but depending on where you perform a curl -L https://dl.k8s.io/release/latest-1.29.txt, you receive either alpha.0 or alpha.1.

The result can even change on the same computer if you just re-run the same curl command a few seconds later. I'm not sure if individual CDN servers "downgrade" their data or if I'm just hitting tons of random CDN nodes that all have inconsistent state, but it's weird and sadly unreliable :/
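
A simple way to reproduce this is just re-running the same command from above in a loop and watching the answer flip between versions:

for i in 1 2 3 4 5; do
  curl -sL https://dl.k8s.io/release/latest-1.29.txt
  echo
  sleep 2
done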

These two requests happened basically at the same time:

< HTTP/2 200 
< x-guploader-uploadid: ADPycdutDBgx7kyHbX7GUaTmNyxVRNVE82erWSx3_jmUaV5c01OeI7dkYmcu9pfg9gj5BTsgpYgYhWRUMYxkNtP4PVKi26f6HtKM
< expires: Sun, 24 Sep 2023 12:42:09 GMT
< last-modified: Wed, 26 Jul 2023 09:06:19 GMT
< etag: "9b59bd47d18f2395481cf230a43a56e0"
< content-type: text/plain
< cache-control: private, no-store
< accept-ranges: bytes
< date: Tue, 26 Sep 2023 10:40:55 GMT
< via: 1.1 varnish
< age: 165525
< x-served-by: cache-fra-etou8220117-FRA
< x-cache: HIT
< x-cache-hits: 1
< access-control-allow-origin: *
< content-length: 15
< 
* Connection #1 to host cdn.dl.k8s.io left intact
v1.29.0-alpha.0

and

< HTTP/2 200
< x-guploader-uploadid: ADPycds7gWeT690zb-SSaamOrnGHAi6AgaV_K0SWCSe5XMLoJ1zFIE0NiJNe0v8Nr0STrfLXh5GwEv5JBgB6RhU6cqOdVHcHyJIy
< expires: Tue, 26 Sep 2023 07:08:47 GMT
< last-modified: Mon, 25 Sep 2023 20:56:50 GMT
< etag: "7d852bf327f00c76b50173de7dbaebf6"
< content-type: text/plain
< cache-control: private, no-store
< accept-ranges: bytes
< date: Tue, 26 Sep 2023 10:40:50 GMT
< via: 1.1 varnish
< age: 12723
< x-served-by: cache-muc13944-MUC
< x-cache: HIT
< x-cache-hits: 1
< access-control-allow-origin: *
< content-length: 15
<
* Connection #1 to host cdn.dl.k8s.io left intact
v1.29.0-alpha.1

Both claim a cache hit, but return different results.

xmudrii commented 1 year ago

This can lead to serious issues. It looks like you're being served from FRA and MUC, and these nodes might indeed have different caches. I think we should not cache version markers; they can change often, especially the latest ones.

ameukam commented 1 year ago

Yeah. We are not specific about file extensions in the cache configuration.

I'll open a PR to fix it this week. Another option could be to serve those version markers directly from the nginx instance instead of the CDN provider.
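
Once that lands, one way to verify would be to check the cache headers on a version marker, something like this (a hypothetical check, assuming the headers stay as shown above):

# a marker excluded from caching should report x-cache: MISS (or no-store) on every request
curl -sIL https://dl.k8s.io/release/latest-1.29.txt | grep -iE 'x-cache:|cache-control'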

ameukam commented 1 year ago

@xrstf can you open a new issue with what you described, to better track what's happening? Thanks!

xrstf commented 1 year ago

Can do, done => #5900.

ameukam commented 1 year ago

We increased the TTL for the different objects in https://github.com/kubernetes/k8s.io/pull/5871. Hopefully the situation is better now.

The current CDN is a "pull-through" cache, so a MISS is expected for the first request for any object at the POP close to the client. Our real issue is the number of objects that need to be cached at the edge. We have a lot of objects (in this case binaries) that are rarely pulled. I don't think there is an efficient mechanism to warm all the POPs of the CDN provider for all the objects we currently host, but I'm open to any suggestions.
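
To illustrate why warming doesn't really work here: a naive warm-up script like the sketch below (the artifact list is made up) only populates the POP serving the machine that runs it, not the rest of Fastly's network:

# naive warm-up: each fetch only caches the object at the POP nearest to this machine
for v in v1.28.1 v1.28.2; do
  curl -sL -o /dev/null "https://dl.k8s.io/${v}/bin/linux/amd64/kubelet"
done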

Note that our cache hit ratio is currently over 99%. I don't think we can do more than that.

[screenshot: cache hit ratio above 99%]

BenTheElder commented 1 year ago

IIRC a mid-level cache was mentioned when talking to Fastly previously?

ameukam commented 1 year ago

> IIRC a mid-level cache was mentioned when talking to Fastly previously?

Maybe you're talking about Origin Shield? If that's the case, the feature is mostly effective with regional buckets, which is not the case for gs://kubernetes-release. I'll ask about the exact requirements for this feature.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with issue triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam commented 8 months ago

@xmudrii is the problem still happening?

xmudrii commented 8 months ago

@ameukam I'll check and get back to you

xmudrii commented 8 months ago

@ameukam This is still an issue for non-cached artifacts downloaded over dl.k8s.io, see the screenshot:

[screenshot: slow download of a non-cached artifact]

xmudrii commented 8 months ago

/remove-lifecycle stale

ameukam commented 8 months ago

Non-cached artifacts going through Fastly will always be slow for the first request at the POP close to the requester. Fastly doesn't replicate all objects over its entire network; objects are cached based on requests. If an object is not present at the Fastly edge, fetching it will always be slower than hitting the origin directly.

xmudrii commented 8 months ago

@ameukam Is there anything we can do to make it at least a little faster? The difference is huge: it takes 5 seconds when downloading directly from the bucket, but about 1 minute and 30 seconds when downloading from the CDN. Subsequent requests might be slow as well, because there's a chance you get routed to some other edge location.
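
For a rough side-by-side comparison, something like this (using curl's built-in timing variables) makes the gap easy to measure:

# compare total download time for the bucket vs. the CDN
for url in \
  https://storage.googleapis.com/kubernetes-release/release/v1.28.1/bin/linux/amd64/kubelet \
  https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet; do
  curl -sL -o /dev/null -w "%{url_effective}: %{time_total}s (%{speed_download} bytes/s)\n" "$url"
done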

ameukam commented 8 months ago

One possibility could be Fastly Origin Shield, but we would need to switch the origin to a regional bucket.

xmudrii commented 8 months ago

Even cached requests are much slower for me. Something that takes 3-5 seconds when downloaded directly from the bucket takes 30-40 seconds when downloaded via the CDN. I double-checked with @xrstf and he sees okay speeds on the 2nd and 3rd try (the 1st try is also slow for him), but that's not the case for me.
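
One way to check whether different edge nodes are involved (assuming the x-served-by header shown earlier in this thread): repeat a header-only request and compare the serving POP across runs:

for i in 1 2 3; do
  curl -sIL https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet | grep -iE 'x-served-by|x-cache:'
done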

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with issue triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

xmudrii commented 5 months ago

I think this has been mostly fixed; I didn't observe it for a while. Closing the issue for now.

/close

k8s-ci-robot commented 5 months ago

@xmudrii: Closing this issue.

In response to [this](https://github.com/kubernetes/k8s.io/issues/5755#issuecomment-2120351539):

> I think this has been mostly fixed; I didn't observe it for a while. Closing the issue for now.
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.