rancher / rke

Rancher Kubernetes Engine (RKE) is an extremely simple, lightning-fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

DockerHub rate limiting affecting image pulls #2228

Closed: immanuelfodor closed this issue 3 years ago

immanuelfodor commented 4 years ago

DockerHub rate limiting (https://docs.docker.com/docker-hub/download-rate-limit/) will take effect on Nov 1, allowing anonymous users only 100 manifest pulls per 6 hours.

Please provide guidance / a best-practice description of how to reduce DockerHub registry checks to the bare minimum within an RKE cluster. Currently, a 3-node cluster with some deployed workloads makes about 960+ registry-1.docker.io and 580+ auth.docker.io DNS queries in 24 hours (according to the PiHole DNS log), which is well above the upcoming limit.

Some of the options I can think of:

- Turn off the AlwaysPullImages admission controller in cluster.yml:

  kube-api:
      always_pull_images: false

- Do not use latest image tags.
- Set imagePullPolicy: IfNotPresent on workloads (see the deployment sketch below).

Then recreate all pods to apply the changes.
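For illustration, a minimal deployment sketch combining a pinned (non-latest) tag with IfNotPresent; the names and image tag below are placeholders, not taken from the actual cluster:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: busybox:1.32.0          # pinned tag instead of :latest
        imagePullPolicy: IfNotPresent  # pull only if the image is missing on the node
        command: ["sleep", "3600"]
```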

Should these be enough, or are there any more steps one can do to ensure DockerHub registry checking is only done on intentional/explicit image pulls initiated by kubectl operations?

How can one debug when and why an RKE cluster is checking the registry for new images?

RKE version:

```
$ rke version
INFO[0000] Running RKE version: v1.1.6
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
```

Docker version: (docker version, docker info preferred)

```
$ docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```

```
$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 101
  Running: 85
  Paused: 0
  Stopped: 16
 Images: 46
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem:
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.18.0-147.8.1.el8_1.x86_64
 Operating System: CentOS Linux 8 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 5.661GiB
 Name: node1
 ID: ....
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

```
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
```

```
$ uname -r
4.18.0-147.8.1.el8_1.x86_64
```

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

Proxmox (KVM/QEMU)

cluster.yml file:

```
# If you intened to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 192.168.1.6
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: node1
  user: centos
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_ed25519
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.1.7
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: node2
  user: centos
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_ed25519
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.1.8
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: node3
  user: centos
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_ed25519
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 1000
    gid: 1000
    snapshot: true
    retention: 48h
    creation: 6h
    backup_config:
      interval_hours: 12
      retention: 6
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config:
      enabled: true
    audit_log:
      enabled: true
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args:
      max-pods: 150
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include:
- ./dashboard/k8s-dash-recommended.yml
- ./dashboard/dashboard-adminuser.yml
system_images:
  etcd: rancher/coreos-etcd:v3.4.3-rancher1
  alpine: rancher/rke-tools:v0.1.64
  nginx_proxy: rancher/rke-tools:v0.1.64
  cert_downloader: rancher/rke-tools:v0.1.64
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.64
  kubedns: rancher/k8s-dns-kube-dns:1.15.2
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.2
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.2
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  coredns: rancher/coredns-coredns:1.6.9
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.18.6-rancher1
  flannel: rancher/coreos-flannel:v0.12.0
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.13.4
  calico_cni: rancher/calico-cni:v3.13.4
  calico_controllers: rancher/calico-kube-controllers:v3.13.4
  calico_ctl: rancher/calico-ctl:v3.13.4
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  canal_node: rancher/calico-node:v3.13.4
  canal_cni: rancher/calico-cni:v3.13.4
  canal_flannel: rancher/coreos-flannel:v0.12.0
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  weave_node: weaveworks/weave-kube:2.6.4
  weave_cni: weaveworks/weave-npc:2.6.4
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.32.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.6
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: ~/.ssh/id_ed25519
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: nginx
  options:
    use-forwarded-headers: "true"
    proxy-body-size: "80M"
    use-http2: "true"
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
cluster_name: "test"
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
restore:
  restore: false
  snapshot_name: ""
dns:
  provider: coredns
  upstreamnameservers:
  - 192.168.1.2
  - 192.168.1.3
```

Steps to Reproduce:

Check PiHole DNS logs.

Results:

960+ registry-1.docker.io and 580+ auth.docker.io DNS queries within the last 24 hours.

immanuelfodor commented 4 years ago

Even though I've implemented all of the above changes (turned off the admission controller, no latest tags, imagePullPolicy set to IfNotPresent), the cluster is still querying DockerHub periodically (roughly every 5 minutes; only one node is shown in the screenshot, but all nodes do this, so it's 3x):

(screenshot of the PiHole query log showing the periodic registry-1.docker.io / auth.docker.io lookups)

What more could I do to limit the DockerHub requests? I have only one latest tag, on a busybox init container; maybe that causes it? But it's not in use, as the parent pod is running fine. How can I debug what triggers the DockerHub requests?

alexMillerVince commented 4 years ago

Same issue here!

immanuelfodor commented 3 years ago

Hmm, is nobody else worried that RKE clusters might/will get banned from DockerHub from Nov 1?

immanuelfodor commented 3 years ago

Harbor Docker registry updated its docs to address the rate limiting: https://goharbor.io/docs/2.1.0/administration/configure-proxy-cache/

As of Harbor v2.1.1, Harbor proxy cache fires a HEAD request to determine whether any layer of a cached image has been updated in the Docker Hub registry. Using this method to check the target registry will not trigger the Docker Hub rate limiter. If any image layer was updated, the proxy cache will pull the new image, which will count towards the Docker Hub rate limiter.
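If one went the proxy-cache route, workloads would reference images through the Harbor project instead of Docker Hub directly. A rough sketch with a hypothetical Harbor hostname and project name (both made up for illustration):

```
apiVersion: v1
kind: Pod
metadata:
  name: cached-busybox          # placeholder name
spec:
  containers:
  - name: busybox
    # hypothetical Harbor host and proxy-cache project; "library/" is the
    # Docker Hub namespace for official images
    image: harbor.example.local/dockerhub-proxy/library/busybox:1.32.0
    imagePullPolicy: IfNotPresent
    command: ["sleep", "3600"]
```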

Maybe a HEAD request is what RKE does, and that's why there are DNS queries for DockerHub? Or is it blindly trying to pull the image, regardless of the current rate limit status for that IP address? Does anybody know how checking the registry for new images works in RKE?

immanuelfodor commented 3 years ago

Hi @superseb, you seem to be a core contributor; could you please help with this time-critical issue? If not, could you point me to whom I should ask, or where I should ask around?

superseb commented 3 years ago

I've set up a single-node RKE cluster and monitored DNS queries, but I don't see the behavior you are describing.

First of all, RKE is a binary that you run manually to create/provision a cluster; it runs ad hoc and does nothing outside of provisioning. After RKE is done running, the components are upstream Kubernetes components with our settings. If those are causing the behavior, we can certainly look into it, but we first need to isolate where the behavior is coming from.

I've run a default cluster.yml and started a few pods, but I don't see recurring requests towards Docker Hub. The only way I could reproduce your behavior was by running a pod with a nonexistent image, which then went into ImagePullBackOff; the retry interval eventually reaches the 5 minute mark and keeps firing from there. So to analyze the issue, we need verbose kubelet logs from your cluster.

The way to enable verbose logging is by using the following in cluster.yml:

services:
  kubelet:
    extra_args:
      v: 9

Let me know if I missed anything.
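For reference, the ImagePullBackOff reproduction mentioned above can be approximated with a throwaway pod like the following (the name and the nonexistent tag are made up for illustration):

```
apiVersion: v1
kind: Pod
metadata:
  name: pullbackoff-test          # throwaway pod, delete after testing
spec:
  containers:
  - name: test
    # nonexistent tag: the kubelet keeps retrying the pull, backing off to
    # roughly every 5 minutes, which shows up as recurring registry traffic
    image: busybox:this-tag-does-not-exist
  restartPolicy: Never
```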

immanuelfodor commented 3 years ago

Thank you very much, this was exactly what I needed to debug it. It turns out that the cluster was also running a private Docker registry, and it had a long-forgotten replication job that regularly checked for image updates. When I removed the registry, the DNS queries stopped, just as you predicted: RKE doesn't check images with kube-api.always_pull_images: false in cluster.yml and imagePullPolicy: IfNotPresent on deployments. Since Harbor v2.1.1 only uses HEAD requests (https://github.com/goharbor/harbor/issues/13112), the DNS queries will come back if I upgrade the registry, but the rate limit won't be hit.

One more question: would kube-api.always_pull_images: true and/or imagePullPolicy: Always also use HEAD requests against DockerHub in RKE? As I understand it, always pulling images is a security best practice, but I also don't want to be rate limited.

vincent99 commented 3 years ago

RKE doesn't make API calls to DockerHub (or any other registry). It asks the Docker daemons on nodes to pull images or run containers, then the docker daemon makes whatever API calls it wants to the registry to accomplish that.

You would think their client would use their preferred method to talk to their own registry, but to confirm that you'd need an SSL man-in-the-middle proxy.

immanuelfodor commented 3 years ago

I see, thanks for the explanation. Then I need to check what method Docker is using in my version.

```
$ docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```