siderolabs / cluster-api-control-plane-provider-talos

A control plane provider for CAPI + Talos

Bootstrap request to first cpn takes a long time #133

Open · magicite opened this issue 2 years ago

magicite commented 2 years ago

Might be the same as #109.

I'm creating a cluster on some old Dell R620s with Sidero, and I'm noticing that once the first control plane node reaches the point where it should receive the bootstrap request, the request takes a variable amount of time to arrive; I've seen anywhere between 6 and 15 minutes. If I kill the cacppt-controller-manager pod, that seems to kickstart things.
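
For the record, the "kill" I use as a workaround is just a restart of the provider's controller deployment; a minimal sketch, where the deployment name is my assumption from the usual CAPI provider naming:

# Assumed deployment name; verify first with: kubectl -n cacppt-system get deploy
kubectl -n cacppt-system rollout restart deployment cacppt-controller-manager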

I'm running the latest of everything.

[root@dill04 sidero]# clusterctl upgrade plan --kubeconfig-context admin@ben-sidero-demo-2
Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-talos         cabpt-system    BootstrapProvider        v0.5.4            Already up to date
control-plane-talos     cacppt-system   ControlPlaneProvider     v0.4.6            Already up to date
cluster-api             capi-system     CoreProvider             v1.2.0            Already up to date
infrastructure-sidero   sidero-system   InfrastructureProvider   v0.5.2            Already up to date

You are already up to date!

Attached is the cacppt-controller-manager log. It covers two cluster provisions: the first one does not hit the "takes a long time" issue, while the second one does. I think the interesting bits start at timestamp 1.658505243132774e+09 (the last "failed" message).

Attachments: cacppt-controller-manager-delayed-bootstrap.txt, cpn_console.txt
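
To jump straight to that spot in the attached log (the timestamp is zap's epoch-seconds format), something like this works:

grep -nF '1.658505243' cacppt-controller-manager-delayed-bootstrap.txt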

smira commented 2 years ago

@Unix4ever any ideas?

magicite commented 1 year ago

A few updates in case they're helpful.

Here's how I'm creating the Docker-based management cluster:

talosctl cluster create \
  --name bootstrap \
  --kubernetes-version 1.24.3 \
  -p 69:69/udp,8081:8081/tcp,51821:51821/udp \
  --memory 4096 \
  --workers 0 \
  --nameservers 16.110.135.51,16.110.135.52 \
  --registry-mirror docker.io=http://dill04.us.cray.com:2022 \
  --registry-mirror k8s.gcr.io=http://dill04.us.cray.com:2023 \
  --registry-mirror quay.io=http://dill04.us.cray.com:2024 \
  --registry-mirror gcr.io=http://dill04.us.cray.com:2025 \
  --registry-mirror ghcr.io=http://dill04.us.cray.com:2026 \
  --registry-mirror registry.k8s.io=http://dill04.us.cray.com:2027 \
  --with-cluster-discovery=false \
  --config-patch @env.yaml \
  --config-patch-control-plane @env.yaml \
  --config-patch-worker @env.yaml \
  --endpoint $HOST_IP
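
After the create returns, I sanity-check the management cluster before installing the providers; a minimal sketch, assuming the default docker node IP of 10.5.0.2:

# Wait for the cluster to report healthy, then confirm the node registered.
talosctl --nodes 10.5.0.2 health --wait-timeout 10m
kubectl get nodes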

The patch file (env.yaml) referenced above is:

- op: add
  path: /machine/env
  value:
    http_proxy: xxx
    https_proxy: xxx
    no_proxy: xxx
- op: add
  path: /cluster/allowSchedulingOnMasters
  value: true
- op: add
  path: /machine/time
  value:
    servers:
    - 16.110.135.123
    - 16.229.168.10
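
As a sanity check, a patch like this can be rendered locally before it ever touches a cluster; a minimal sketch, with a placeholder cluster name and endpoint:

# Generate configs into the current directory with the patch applied,
# then confirm the env block made it into the control plane config.
talosctl gen config bootstrap https://10.5.0.2:6443 --config-patch @env.yaml
grep -A 4 'env:' controlplane.yaml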

smira commented 1 year ago

Are you using the latest versions of the providers? We've had some fixes since then.

magicite commented 1 year ago

Yes - I am using the latest released versions of the providers.

[root@dill04 demo-1.2]# clusterctl --kubeconfig-context admin@bootstrap upgrade plan
Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-talos         cabpt-system    BootstrapProvider        v0.5.5            Already up to date
control-plane-talos     cacppt-system   ControlPlaneProvider     v0.4.10           Already up to date
cluster-api             capi-system     CoreProvider             v1.2.4            Already up to date
infrastructure-sidero   sidero-system   InfrastructureProvider   v0.5.5            Already up to date

You are already up to date!