rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

Stuck in "waitApplied" while trying to deploy a service #684

Open wofr opened 2 years ago

wofr commented 2 years ago

I tried to deploy the simple app from https://github.com/rancher/fleet-examples/tree/master/simple to one of my clusters (running on GKE). Unfortunately, it always gets stuck in "WaitApplied", no matter which cluster I pick for the deployment. The only exception is the local cluster (where the Rancher server is running), where the deployment went fine.

[screenshot]

The YAML of the deploy job looks like this:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  creationTimestamp: "2021-12-13T09:56:59Z"
  generation: 1
  managedFields:
  - apiVersion: fleet.cattle.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:branch: {}
        f:insecureSkipTLSVerify: {}
        f:paths: {}
        f:repo: {}
        f:targets: {}
    manager: rancher
    operation: Update
    time: "2021-12-13T09:56:59Z"
  - apiVersion: fleet.cattle.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:commit: {}
        f:conditions: {}
        f:desiredReadyClusters: {}
        f:display:
          .: {}
          f:readyBundleDeployments: {}
          f:state: {}
        f:gitJobStatus: {}
        f:lastSyncedImageScanTime: {}
        f:observedGeneration: {}
        f:readyClusters: {}
        f:resourceCounts:
          .: {}
          f:desiredReady: {}
          f:missing: {}
          f:modified: {}
          f:notReady: {}
          f:orphaned: {}
          f:ready: {}
          f:unknown: {}
          f:waitApplied: {}
        f:resources: {}
        f:summary:
          .: {}
          f:desiredReady: {}
          f:nonReadyResources: {}
          f:ready: {}
          f:waitApplied: {}
    manager: fleetcontroller
    operation: Update
    time: "2021-12-13T09:57:05Z"
  name: test-app-guestbook
  namespace: fleet-default
  resourceVersion: "3654240"
  uid: a70409c1-04c6-4903-97c5-0ec940149691
spec:
  branch: master
  insecureSkipTLSVerify: false
  paths:
  - simple
  repo: https://github.com/rancher/fleet-examples.git
  targets:
  - clusterGroup: energy-generator
status:
  commit: 43d73d518157fd7cca661ac950a54e61690a0cde
  conditions:
  - lastUpdateTime: "2021-12-13T09:57:05Z"
    message: WaitApplied(1) [Bundle test-app-guestbook-simple]
    status: "False"
    type: Ready
  - lastUpdateTime: "2021-12-13T10:17:19Z"
    status: "True"
    type: Accepted
  - lastUpdateTime: "2021-12-13T09:57:00Z"
    status: "True"
    type: ImageSynced
  - lastUpdateTime: "2021-12-13T09:57:01Z"
    status: "False"
    type: Reconciling
  - lastUpdateTime: "2021-12-13T09:57:00Z"
    status: "False"
    type: Stalled
  - lastUpdateTime: "2021-12-13T10:17:19Z"
    status: "True"
    type: Synced
  desiredReadyClusters: 1
  display:
    readyBundleDeployments: 0/1
    state: WaitApplied
  gitJobStatus: Current
  lastSyncedImageScanTime: null
  observedGeneration: 1
  readyClusters: 0
  resourceCounts:
    desiredReady: 6
    missing: 0
    modified: 0
    notReady: 0
    orphaned: 0
    ready: 0
    unknown: 0
    waitApplied: 6
  resources:
  - apiVersion: apps/v1
    id: default/frontend
    kind: Deployment
    name: frontend
    namespace: default
    state: WaitApplied
    type: apps.deployment
  - apiVersion: apps/v1
    id: default/redis-master
    kind: Deployment
    name: redis-master
    namespace: default
    state: WaitApplied
    type: apps.deployment
  - apiVersion: apps/v1
    id: default/redis-slave
    kind: Deployment
    name: redis-slave
    namespace: default
    state: WaitApplied
    type: apps.deployment
  - apiVersion: v1
    id: default/frontend
    kind: Service
    name: frontend
    namespace: default
    state: WaitApplied
    type: service
  - apiVersion: v1
    id: default/redis-master
    kind: Service
    name: redis-master
    namespace: default
    state: WaitApplied
    type: service
  - apiVersion: v1
    id: default/redis-slave
    kind: Service
    name: redis-slave
    namespace: default
    state: WaitApplied
    type: service
  summary:
    desiredReady: 1
    nonReadyResources:
    - bundleState: WaitApplied
      name: test-app-guestbook-simple
    ready: 0
    waitApplied: 1
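
For readers skimming the status above, here is a minimal sketch that condenses such a status into the fields that matter for this report. It is not part of Fleet: `summarize_gitrepo_status` and the `STATUS` excerpt are hypothetical names, the excerpt is trimmed to two of the six resources, and the `"unknown"` fallback for a missing `state` is an assumption. The same dictionary could presumably be loaded from the JSON output of `kubectl get ... -o json` for the GitRepo.

```python
from collections import Counter

# Trimmed excerpt of the status pasted above (two of the six resources).
STATUS = {
    "conditions": [
        {"type": "Ready", "status": "False",
         "message": "WaitApplied(1) [Bundle test-app-guestbook-simple]"},
        {"type": "Accepted", "status": "True"},
    ],
    "desiredReadyClusters": 1,
    "readyClusters": 0,
    "resources": [
        {"kind": "Deployment", "name": "frontend", "state": "WaitApplied"},
        {"kind": "Service", "name": "frontend", "state": "WaitApplied"},
    ],
}


def summarize_gitrepo_status(status):
    """Condense a GitRepo status dict into a few diagnostic fields."""
    # Pick the "Ready" condition; fall back to an empty dict if it is absent.
    ready = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Ready"),
        {},
    )
    # Count resources per state; "unknown" is an assumed placeholder.
    states = Counter(r.get("state", "unknown") for r in status.get("resources", []))
    return {
        "ready": ready.get("status") == "True",
        "ready_message": ready.get("message", ""),
        "clusters": f'{status.get("readyClusters", 0)}/{status.get("desiredReadyClusters", 0)}',
        "resource_states": dict(states),
    }


print(summarize_gitrepo_status(STATUS))
```

For the status in this issue, the summary shows a single bundle waiting to be applied and no ready cluster, which suggests the problem lies in delivery to the target cluster rather than in the workload itself.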

The log of the gitjob on my local cluster looks suspicious:

time="2021-12-13T08:09:12Z" level=error msg="error syncing 'fleet-default/fleet-examples-app': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/fleet-examples-app-8bafe for sync-repo fleet-default/fleet-examples-app, requeuing"
time="2021-12-13T08:58:54Z" level=error msg="error syncing 'fleet-default/test': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-8bafe for sync-repo fleet-default/test, requeuing"
time="2021-12-13T08:59:05Z" level=error msg="error syncing 'fleet-default/test': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-8bafe for sync-repo fleet-default/test, requeuing"
time="2021-12-13T09:00:10Z" level=error msg="error syncing 'fleet-default/test': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-8bafe for sync-repo fleet-default/test, requeuing"
time="2021-12-13T09:11:36Z" level=error msg="error syncing 'fleet-default/test': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-8bafe for sync-repo fleet-default/test, requeuing"
time="2021-12-13T09:56:07Z" level=error msg="error syncing 'fleet-default/test-deploy-simple-app': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-deploy-simple-app-8bafe for sync-repo fleet-default/test-deploy-simple-app, requeuing"
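
To see which GitRepos these errors refer to, the lines can be grouped by repo name. The following is a rough sketch, assuming the log format shown above. `group_sync_errors` and `SAMPLE_LOG` are hypothetical names, and the sample contains only three of the lines from the log.

```python
import re

# Matches the timestamp, level, and the quoted "namespace/name" of the GitRepo.
LINE_RE = re.compile(
    r"""time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="error syncing '(?P<repo>[^']+)'"""
)

# Three of the log lines from above, copied verbatim.
SAMPLE_LOG = """
time="2021-12-13T08:58:54Z" level=error msg="error syncing 'fleet-default/test': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-8bafe for sync-repo fleet-default/test, requeuing"
time="2021-12-13T08:59:05Z" level=error msg="error syncing 'fleet-default/test': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-8bafe for sync-repo fleet-default/test, requeuing"
time="2021-12-13T09:56:07Z" level=error msg="error syncing 'fleet-default/test-deploy-simple-app': handler sync-repo: DesiredSet - Replace Wait batch/v1, Kind=Job fleet-default/test-deploy-simple-app-8bafe for sync-repo fleet-default/test-deploy-simple-app, requeuing"
"""


def group_sync_errors(lines):
    """Map each GitRepo (namespace/name) to the timestamps of its sync errors."""
    errors = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            errors.setdefault(match.group("repo"), []).append(match.group("time"))
    return errors


for repo, times in group_sync_errors(SAMPLE_LOG.strip().splitlines()).items():
    print(repo, len(times), times[0], times[-1])
```

Grouped this way, the errors in the log refer to earlier GitRepos (`fleet-examples-app`, `test`, `test-deploy-simple-app`); none of them mentions `test-app-guestbook`, the GitRepo whose status is pasted above.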

I also found a completed job on my local cluster, which shows the following information:

[screenshot]

The log of the job itself looks like this:

{"level":"info","ts":1639389424.3984752,"caller":"git/git.go:157","msg":"Successfully cloned https://github.com/rancher/fleet-examples.git @ 43d73d518157fd7cca661ac950a54e61690a0cde (grafted, HEAD) in path /workspace/source"}
{"level":"info","ts":1639389424.502895,"caller":"git/git.go:198","msg":"Successfully initialized and updated submodules in path /workspace/source"}
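
The `ts` fields in these job log lines appear to be Unix epoch seconds. A small sketch for converting them, so they can be compared with the RFC 3339 timestamps in the GitRepo YAML (`epoch_to_utc` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: float) -> str:
    """Format Unix epoch seconds as a UTC timestamp, truncated to whole seconds."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(epoch_to_utc(1639389424.3984752))  # 2021-12-13T09:57:04Z
```

That places the clone at 09:57:04Z, between the GitRepo's creation (09:56:59Z) and the fleetcontroller status update (09:57:05Z), so the git clone step itself seems to have completed for this GitRepo.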

One more screenshot from the Rancher UI: [screenshot]

So to me it seems the deployment never actually reaches the destination cluster. It looks like I already have problems making it available on the "local" system.

manno commented 1 year ago

This looks like the cluster registration is broken. fleet-agent tries to read the registration secret (ClusterRegistrationToken.status.secretName) and doesn't have access. In Fleet versions < 0.6.0, any user was allowed to retrieve secrets from the system registration namespace, although the secrets did have random names.

Maybe this cluster is hardened?