siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Image mirror on private network not working during initial installation #8050

Open XLordalX opened 11 months ago

XLordalX commented 11 months ago

Bug Report

Description

Image mirror on private network not working during initial installation.

  registries:
    mirrors:
      docker.io:
        endpoints:
          - http://10.44.253.162:5000
      registry.k8s.io:
        endpoints:
          - http://10.44.253.162:5001
      gcr.io:
        endpoints:
          - http://10.44.253.162:5002
      ghcr.io:
        endpoints:
          - http://10.44.253.162:5003

I tried configuring the network interface, with no luck:

network:
  interfaces:
    - interface: ens18
      routes:
        - network: 10.44.0.0/16
          gateway: 10.44.0.1

It works perfectly fine when I enable it after bootstrap, so the registry is definitely working.

Logs

NODE            SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
10.44.253.200   apid         Running   OK       46m51s ago    Health check successful
10.44.253.200   containerd   Running   OK       47m8s ago     Health check successful
10.44.253.200   cri          Running   OK       46m51s ago    Health check successful
10.44.253.200   dashboard    Running   ?        47m7s ago     Process Process(["/sbin/dashboard"]) started with PID 1453
10.44.253.200   etcd         Failed    ?        26m45s ago    Failed to run pre stage: failed to pull image "gcr.io/etcd-development/etcd:v3.5.10": 1 error(s) occurred:
                timeout
10.44.253.200   kubelet      Failed    ?        26m51s ago    Failed to run pre stage: 2 error(s) occurred:
                timeout
                failed to pull image "ghcr.io/siderolabs/kubelet:v1.28.3": failed to resolve reference "ghcr.io/siderolabs/kubelet:v1.28.3": failed to do request: Head "http://10.44.253.162:5003/v2/siderolabs/kubelet/manifests/v1.28.3?ns=ghcr.io": dial tcp 10.44.253.162:5003: i/o timeout
10.44.253.200   machined     Running   OK       47m13s ago    Health check successful
10.44.253.200   trustd       Running   OK       46m51s ago    Health check successful
10.44.253.200   udevd        Running   OK       47m13s ago    Health check successful

smira commented 11 months ago

So what is the problem here? The logs you posted are not from the installation, and they point to a network error.

During the installation Talos only pulls the installer image.

Registry mirror configuration works during the initial install as well.

XLordalX commented 10 months ago

@smira The network error only occurs during installation. If I remove the mirror config before installation and add it back after installation, it works fine.

smira commented 10 months ago

Please provide the error during the installation.

XLordalX commented 10 months ago

@smira Actually, it looks like I was wrong about being able to reach the mirror after installation. It seems that Talos is unable to reach the local network at 10.44.0.0/16 at all.

This is my network configuration:

machine:
  network:
    interfaces:
      - interface: eth0
        routes:
          - network: 10.44.0.0/16
            gateway: 10.44.0.1

From a pod I can ping the gateway just fine, but I cannot ping any other server on the same local network. In the other direction, I can ping the Talos node from other servers on the network. Any ideas on how to debug this?

smira commented 10 months ago

The configuration looks reasonable to me, but I don't know how your network is supposed to be configured.

What is strange is that there's no address assigned to the machine, only a route.
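
For illustration, a minimal sketch of an interface that has both an address and a route. The address is an assumption: it reuses the node IP from the service list earlier in the thread, and the /16 prefix is inferred from the route in the posted configuration. The default route is also an assumption about this network, not something confirmed in the issue.

```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        # Assumption: static address built from the node IP in the service list;
        # the /16 prefix length is inferred from the route posted in the issue.
        addresses:
          - 10.44.253.200/16
        routes:
          # Assumption: 10.44.0.1 is the default gateway for this network.
          - network: 0.0.0.0/0
            gateway: 10.44.0.1
```

With an address inside 10.44.0.0/16 on the interface, Linux adds a directly connected route for that subnet on its own, so hosts on the local network would normally be reached without an explicit route through the gateway. If the address comes from DHCP instead, `dhcp: true` on the interface would replace the `addresses` list.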

Usually debugging involves taking packet dumps at different points to see what might be wrong.

soulwhisper commented 10 months ago

I have this bug too. My mirrors are on internal domains. I have tested them with 'docker pull', yet bootstrap still gets stuck at "failed to pull image "gcr.io/etcd-development/etcd:v3.5.11": failed to resolve reference "gcr.io/etcd-development/etcd:v3.5.11": gcr.io/etcd-development/etcd:v3.5.11: not found"

smira commented 10 months ago

@soulwhisper your error is different and not related to the above, and that looks like misconfiguration.

soulwhisper commented 10 months ago

@smira So how can I dig deeper into what happens while bootstrapping a Talos cluster? In theory, a containerd image pull should try the mirror first and then the upstream registry directly, and that is where I get this error. How can I see the log for the failed mirror pull?

smira commented 10 months ago

First, you should probably open a separate issue or a GitHub discussion.

Second, the error is right there in your message: "not found". The image is not in your mirror (it does exist on gcr.io, so the problem is with your mirror). Talos doesn't fall back to the upstream registry unless you configure it to do so. You can check your mirror logs to see what is wrong, and you can also look into the docs.

jakejx commented 5 months ago

@smira Piggybacking off this, how does one configure Talos to fall back to the upstream registry? I could not find anything in the docs or config reference that describes this.

smira commented 5 months ago

By adding the endpoint of the upstream registry as the last entry in the endpoints list.
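
As a rough sketch of what that could look like, here is the mirror configuration from the original report with an upstream endpoint appended to each list. The mirror addresses come from the issue; the upstream URLs are the registries' usual public endpoints and are added here for illustration only.

```yaml
machine:
  registries:
    mirrors:
      docker.io:
        endpoints:
          - http://10.44.253.162:5000        # private mirror, tried first
          - https://registry-1.docker.io     # upstream, tried last
      registry.k8s.io:
        endpoints:
          - http://10.44.253.162:5001
          - https://registry.k8s.io
      gcr.io:
        endpoints:
          - http://10.44.253.162:5002
          - https://gcr.io
      ghcr.io:
        endpoints:
          - http://10.44.253.162:5003
          - https://ghcr.io
```

Endpoints are tried in the order listed, so the private mirror stays the preferred source and the upstream registry is only contacted after it.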

jakejx commented 5 months ago

Thanks, I gave that a shot. It seems that Talos doesn't fall back if the failure is a non-network error.