kubestellar / kubestellar

KubeStellar - a flexible solution for multi-cluster configuration management for edge, multi-cloud, and hybrid cloud
https://kubestellar.io
Apache License 2.0

bug: secret vc-vcluster lacks config-incluster #2717

Open. MikeSpreitzer opened this issue 1 week ago

MikeSpreitzer commented 1 week ago

Describe the bug

While investigating #2579, #2700, #2701, #2702, #2715, and #2716, I did some more testing. This was on the same IBM Cloud VM, using commit 45fc718da51dd5b193840584e3c2db0e271d3892. This time, after many successful iterations, the script hung waiting for the its-with-clusteradm Job to complete. But that Job will never complete; it failed all allowed retries. It failed because the Secret named "vc-vcluster" in the "its1-system" namespace in the KubeFlex hosting cluster does NOT have a data entry named "config-incluster", yet that Job's pods are defined to use it as their kubeconfig. In particular, they have an environment variable definition for KUBECONFIG whose value is the pathname where that data entry would appear (in a volume mounted from that Secret).

I extracted the log of the KubeFlex controller-manager with the following command.

kubectl  --context kind-kubeflex logs -n kubeflex-system kubeflex-controller-manager-7db8894656-f6nsk > kfcm.log

I have attached that log, along with the full typescript.

kfcm.log

fail8.log

Output from KubeStellar-Snapshot.sh

mspreitz@mjs-dev7a:~$ bash <(curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/refs/heads/main/scripts/kubestellar-snapshot.sh) -V -Y -L
KubeStellar Snapshot v0.2.0{COLOR_NONE}

Script run on 2025-01-11_15:06:05
Checking script dependencies:
✔ kubectl version v1.29.10 at /usr/local/bin/kubectl
✔ helm version v3.16.3 at /usr/sbin/helm
✔ jq version jq-1.6 at /usr/bin/jq
Using kubeconfig(s): /home/mspreitz/.kube/config
Validating contexts(s): 
✔ cluster1 
✔ cluster2 
✔ its1 *
✔ kind-kubeflex 
✔ wds1 

KubeStellar:
- Helm chart ks-core (v0.26.0-alpha.3) in namespace default in context kind-kubeflex
  - Secret=sh.helm.release.v1.ks-core.v1 in namespace default

KubeFlex:
- kubeflex-system namespace in context kind-kubeflex
- controller-manager: version=0.7.2, pod=kubeflex-controller-manager-7db8894656-f6nsk, status=running
- postgres-postgresql-0: pod=postgres-postgresql-0, status=running

Control Planes:
- its1: type=vcluster, pch=its-with-clusteradm, context=kind-kubeflex, namespace=its1-system
  - Post Create Hook: pod=its-with-clusteradm-2kcvg
its-with-clusteradm-qgqd4, ns=its1-system, status=
  - Status addon controller: pod=, ns=its1-system, version=, status=
Error from server (NotFound): namespaces "open-cluster-management" not found
  - Open-cluster-manager: not found
error: only one of -c or an inline [CONTAINER] arg is allowed

Steps To Reproduce

I used the following loop.

while time bash <(curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/45fc718da51dd5b193840584e3c2db0e271d3892/scripts/create-kubestellar-demo-env.sh); do
    date
    sleep 1
done; date

Expected Behavior

No sign of failure

Additional Context

No response

MikeSpreitzer commented 1 week ago

/cc @pdettori

pdettori commented 1 week ago

@MikeSpreitzer do you also have a log for the pod running the (failed) Job? Also, could this be related to https://github.com/kubestellar/kubeflex/issues/276 ?

MikeSpreitzer commented 1 week ago

To be clear, the vc-vcluster Secret exists, and has the data element named "config"; it just lacks the data element named "config-incluster".

MikeSpreitzer commented 1 week ago

The state still exists on that VM; I have not made any changes yet.

I looked at the logs from the two failed containers before opening this issue; they complained about authorization failures, saying that the its1-system:default ServiceAccount is not authorized for various things. Which is true, but beside the point (which is that the intended credentials were not picked up, and thus the wrong identity was being used).
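
(For context on that fallback: client-go's deferred kubeconfig loader, which produces the loader.go / merged_client_builder.go messages quoted in the next comment, silently falls back to the in-cluster ServiceAccount when the file named by KUBECONFIG does not exist. A minimal Go sketch of that loading path, for illustration only:)

package example

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// loadConfig mirrors the fallback seen in the clusteradm logs: the default
// loading rules honor $KUBECONFIG, and when that file does not exist,
// ClientConfig() falls back to the in-cluster configuration, i.e. the pod's
// ServiceAccount (its1-system:default in this case).
func loadConfig() (*rest.Config, error) {
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	deferred := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(rules, &clientcmd.ConfigOverrides{})
	return deferred.ClientConfig()
}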

MikeSpreitzer commented 1 week ago

mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get pods -n its1-system
NAME                                                READY   STATUS      RESTARTS   AGE
coredns-68559449b6-jnmwn-x-kube-system-x-vcluster   1/1     Running     0          2d9h
its-with-clusteradm-2kcvg                           0/2     Error       0          2d9h
its-with-clusteradm-qgqd4                           0/2     Error       0          2d9h
update-cluster-info-s6ckj                           0/1     Error       0          2d9h
update-cluster-info-tgwpx                           0/1     Completed   0          2d9h
vcluster-0                                          2/2     Running     0          2d9h

Following are the logs from the two containers in the older Pod (the info above does not establish which is older, but other info does).

mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex logs -n its1-system its-with-clusteradm-qgqd4
Defaulted container "its-with-clusteradm-clusteradm" out of: its-with-clusteradm-clusteradm, its-with-clusteradm-statusaddon
flag v has been set
flag wait has been set
I0111 10:09:01.591465       1 exec.go:41] "init options:" dry-run=false force=false output-file=""
I0111 10:09:01.591639       1 loader.go:141] Config not found: /etc/kube/config-incluster
I0111 10:09:01.592021       1 merged_client_builder.go:121] Using in-cluster configuration
I0111 10:09:01.592685       1 loader.go:141] Config not found: /etc/kube/config-incluster
I0111 10:09:01.592932       1 merged_client_builder.go:121] Using in-cluster configuration
I0111 10:09:01.593122       1 loader.go:141] Config not found: /etc/kube/config-incluster
I0111 10:09:01.593325       1 merged_client_builder.go:121] Using in-cluster configuration
I0111 10:09:01.593523       1 loader.go:141] Config not found: /etc/kube/config-incluster
Preflight check: HubApiServer check Failed with 0 warnings and 1 errors
Preflight check: cluster-info check Failed with 0 warnings and 1 errors
Error: [preflight] Some fatal errors occurred:
    [ERROR HubApiServer check]: failed to find the given Current Context in Contexts of the kubeconfig
    [ERROR cluster-info check]: configmaps "cluster-info" is forbidden: User "system:serviceaccount:its1-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-public"

E0111 10:09:01.602695       1 clusteradm.go:132] "Error:" err=<
    [preflight] Some fatal errors occurred:
        [ERROR HubApiServer check]: failed to find the given Current Context in Contexts of the kubeconfig
        [ERROR cluster-info check]: configmaps "cluster-info" is forbidden: User "system:serviceaccount:its1-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-public"
 >

mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex logs -n its1-system its-with-clusteradm-qgqd4 -c its-with-clusteradm-statusaddon
Error: query: failed to query with labels: secrets is forbidden: User "system:serviceaccount:its1-system:default" cannot list resource "secrets" in API group "" in the namespace "open-cluster-management"

MikeSpreitzer commented 1 week ago

Looking now, I see that the data element named "config-incluster" was added almost a day after that Secret was created. Following is the output from kubectl --context kind-kubeflex get secret -n its1-system vc-vcluster -o yaml --show-managed-fields.

I strongly recommend that clients set the FieldManager to something meaningful when making calls on the apiserver.

apiVersion: v1
data:
  certificate-authority: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVIyZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWpNU0V3SHdZRFZRUUREQmhyTTNNdGMyVnkKZG1WeUxXTmhRREUzTXpZMU9UQXhNVGd3SGhjTk1qVXdNVEV4TVRBd09ETTRXaGNOTXpVd01UQTVNVEF3T0RNNApXakFqTVNFd0h3WURWUVFEREJock0zTXRjMlZ5ZG1WeUxXTmhRREUzTXpZMU9UQXhNVGd3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFTaVYvZHNLeUpkaWtkSUVHek5peXNuek1VQzFORGlleFRaK2EzMElRQjIKdmRneEZlZmdmek1hVGQ5WU8vYmtkajZIVEN0bHJMRm1NcHZaR0RjcGdrT3dvMEl3UURBT0JnTlZIUThCQWY4RQpCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZEJnTlZIUTRFRmdRVVkxMmpqL0VkcE9YMFhHTGRLVlZJCjB3MnB5cVF3Q2dZSUtvWkl6ajBFQXdJRFJ3QXdSQUlnYWpqa3hNQTlQYjFpYkIvQ3IyMlZsVExsOGVVbkxNMlQKOENMQ1lIN2RJVk1DSUFJM012ZmRIb01SY3ZGNXZOblhJQ1pPamFRNGRkcHRObE9VZlpPeTdQcGMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
  client-certificate: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJrVENDQVRlZ0F3SUJBZ0lJZlJSdFJGR09ZWm93Q2dZSUtvWkl6ajBFQXdJd0l6RWhNQjhHQTFVRUF3d1kKYXpOekxXTnNhV1Z1ZEMxallVQXhOek0yTlRrd01URTRNQjRYRFRJMU1ERXhNVEV3TURnek9Gb1hEVEkyTURFeApNVEV3TURnek9Gb3dNREVYTUJVR0ExVUVDaE1PYzNsemRHVnRPbTFoYzNSbGNuTXhGVEFUQmdOVkJBTVRESE41CmMzUmxiVHBoWkcxcGJqQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJMRTRLTmJMZGx4WVJoT3QKcUxueGNSdzVZL0lVSDBTaE9vWEw0Q05xNWR5T2lhT1phcUhRcWVrV085TE9KRHphdjF2YlBFTVJMR1g0Uk1YcApPb01PZ0tlalNEQkdNQTRHQTFVZER3RUIvd1FFQXdJRm9EQVRCZ05WSFNVRUREQUtCZ2dyQmdFRkJRY0RBakFmCkJnTlZIU01FR0RBV2dCUnV6cUo3WHNPcldVWmRDOE9Vc1E0Tjl0UlJmekFLQmdncWhrak9QUVFEQWdOSUFEQkYKQWlCMTdpZWZrQ1JJU0NCcTJWcWh3ekVXc0tlSGNsVkFid2ZYTnd1TVBXNFJsZ0loQUpoazRmOTkyaUhFZ0Y5VApoSm43YWVITGw2aTZUa2xXcE8rQzBxZnFpWTVPCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0KLS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkekNDQVIyZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWpNU0V3SHdZRFZRUUREQmhyTTNNdFkyeHAKWlc1MExXTmhRREUzTXpZMU9UQXhNVGd3SGhjTk1qVXdNVEV4TVRBd09ETTRXaGNOTXpVd01UQTVNVEF3T0RNNApXakFqTVNFd0h3WURWUVFEREJock0zTXRZMnhwWlc1MExXTmhRREUzTXpZMU9UQXhNVGd3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFRMUUyLzRZMEFSckFyamhVOFJOUTlvS3pucVFEaGVPQVR5VVpLNnFHcFAKVVhkZWo0UlliZHBJTEdrLzNqY0dVb2tsTEpyYXNlVE1YaklkNHFpUlp3Y3NvMEl3UURBT0JnTlZIUThCQWY4RQpCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZEJnTlZIUTRFRmdRVWJzNmllMTdEcTFsR1hRdkRsTEVPCkRmYlVVWDh3Q2dZSUtvWkl6ajBFQXdJRFNBQXdSUUloQUlSRHdDUWJHQ1ZFdDE3aFZuN2lyL0huMGpUSGx1RVIKZVhuYTIycWhEUGN1QWlBbWdpeFB0Y1NWWVFRNHVLdFUySTJiMnlDeXJpLy80aHhTVmo4K1I5SXp3dz09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  client-key: <snip/>
  config: <snip/>
  config-incluster: <snip/>
kind: Secret
metadata:
  creationTimestamp: "2025-01-11T10:08:49Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:certificate-authority: {}
        f:client-certificate: {}
        f:client-key: {}
        f:config: {}
      f:metadata:
        f:ownerReferences:
          .: {}
          k:{"uid":"07ef3d00-332e-4b0a-8176-1232da13bb1a"}: {}
      f:type: {}
    manager: vcluster
    operation: Update
    time: "2025-01-11T10:08:49Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        f:config-incluster: {}
    manager: Go-http-client
    operation: Update
    time: "2025-01-12T05:00:16Z"
  name: vc-vcluster
  namespace: its1-system
  ownerReferences:
  - apiVersion: v1
    controller: false
    kind: Service
    name: vcluster
    uid: 07ef3d00-332e-4b0a-8176-1232da13bb1a
  resourceVersion: "135412"
  uid: 47c851ea-fb58-40c0-8e19-5e3c95482e4a
type: Opaque
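
(Regarding the FieldManager recommendation above: a minimal client-go sketch, with an illustrative manager name; this is not code that exists in KubeStellar or KubeFlex today.)

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateSecret writes a Secret with an explicit FieldManager, so the entry in
// managedFields is attributed to a meaningful name rather than one derived from
// the HTTP user agent (e.g. the "Go-http-client" manager shown above).
func updateSecret(ctx context.Context, cs kubernetes.Interface, s *corev1.Secret) (*corev1.Secret, error) {
	return cs.CoreV1().Secrets(s.Namespace).Update(ctx, s,
		metav1.UpdateOptions{FieldManager: "kubeflex-controller-manager"})
}

With a controller-runtime client, the equivalent is passing client.FieldOwner("...") as an option to Update or Patch.
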
MikeSpreitzer commented 1 week ago

Following is the output from kubectl --context kind-kubeflex get pch its-with-clusteradm -o yaml.

apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
kind: PostCreateHook
metadata:
  annotations:
    meta.helm.sh/release-name: ks-core
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-01-11T10:07:48Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    kflex.kubestellar.io/cptype: its
  name: its-with-clusteradm
  resourceVersion: "766"
  uid: e6991989-d68e-4277-ad08-d0f9cf1fe9f9
spec:
  templates:
  - apiVersion: batch/v1
    kind: Job
    metadata:
      name: '{{.HookName}}'
    spec:
      backoffLimit: 1
      template:
        spec:
          containers:
          - args:
            - init
            - -v=5
            - --wait
            env:
            - name: KUBECONFIG
              value: /etc/kube/{{.ITSkubeconfig}}
            image: quay.io/kubestellar/clusteradm:0.9.0
            name: '{{.HookName}}-clusteradm'
            volumeMounts:
            - mountPath: /etc/kube
              name: kubeconfig
              readOnly: true
          - args:
            - upgrade
            - --install
            - status-addon
            - oci://ghcr.io/kubestellar/ocm-status-addon-chart
            - --version
            - v0.2.0-rc14
            - --namespace
            - open-cluster-management
            - --create-namespace
            - --set
            - controller.verbosity=5
            - --set
            - agent.hub_burst=10
            - --set
            - agent.hub_qps=5
            - --set
            - agent.local_burst=10
            - --set
            - agent.local_qps=5
            - --set
            - agent.log_flush_frequency=5s
            - --set
            - agent.logging_format=text
            - --set
            - agent.metrics_bind_addr=:8080
            - --set
            - agent.pprof_bind_addr=:8082
            - --set
            - agent.v=5
            - --set
            - agent.vmodule=
            env:
            - name: HELM_CONFIG_HOME
              value: /tmp
            - name: HELM_CACHE_HOME
              value: /tmp
            - name: KUBECONFIG
              value: /etc/kube/{{.ITSkubeconfig}}
            image: quay.io/kubestellar/helm:3.16.1
            name: '{{.HookName}}-statusaddon'
            volumeMounts:
            - mountPath: /etc/kube
              name: kubeconfig
              readOnly: true
          restartPolicy: Never
          volumes:
          - name: kubeconfig
            secret:
              secretName: '{{.ITSSecretName}}'

MikeSpreitzer commented 1 week ago

Investigating Helm chart values:

mspreitz@mjs-dev7a:~$ helm list -A
NAME        NAMESPACE       REVISION    UPDATED                                 STATUS      CHART                       APP VERSION   
ks-core     default         1           2025-01-11 10:07:46.688841712 +0000 UTC deployed    core-chart-0.26.0-alpha.3   0.26.0-alpha.3
postgres    kubeflex-system 1           2025-01-11 10:08:00.672202675 +0000 UTC deployed    postgresql-13.1.5           16.0.0        
vcluster    its1-system     1           2025-01-11 10:08:29.984697297 +0000 UTC deployed    vcluster-0.16.4             0.16.4        

mspreitz@mjs-dev7a:~$ helm get values ks-core
USER-SUPPLIED VALUES:
ITSes:
- name: its1
WDSes:
- name: wds1
- name: wds2
  type: host
verbosity:
  default: 5

mspreitz@mjs-dev7a:~$ helm get values -n its1-system vcluster
USER-SUPPLIED VALUES:
syncer:
  extraArgs:
  - --tls-san=its1.localtest.me
  - --out-kube-config-server=https://its1.localtest.me:9443
  - --tls-san=kubeflex-control-plane
vcluster:
  image: rancher/k3s:v1.27.2-k3s1
pdettori commented 1 week ago

Based on the supplied logs and debug info, it looks like the issue is related to this code: https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/vcluster/reconciler.go#L127-L131

It looks like, in the scenario you showed, it ran "almost a day after that Secret was created". Perhaps the check should be changed to just verify that the Secret exists, rather than waiting for the available condition. What do you think?
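
(A sketch of what that check could look like, assuming a controller-runtime client; the function and names here are illustrative, not the actual kubeflex code.)

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// secretExists reports whether the vcluster kubeconfig Secret (e.g. "vc-vcluster"
// in "its1-system") has been created, without waiting for the vcluster to be
// reported as available; per the next comment, the Secret's existence implies
// that its "config" data key is already present.
func secretExists(ctx context.Context, c client.Client, namespace, name string) (bool, error) {
	var s corev1.Secret
	err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &s)
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}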

MikeSpreitzer commented 1 week ago

If the existence of the Secret implies that it has the needed "config" data element, then yes, the call to r.ReconcileKubeconfigSecret need wait no longer.

But still, it is very disturbing that it took about 19 hours for the ControlPlane to become "available". Surely that is a problem too?

MikeSpreitzer commented 1 week ago

Here are the logs from the two containers of the Pod named "vcluster-0" (the only one from its StatefulSet) in namespace "its1-system" in the KubeFlex hosting cluster.

2717-syncer.log 2717-vcluster.log

MikeSpreitzer commented 1 week ago

Following is the output from kubectl --context kind-kubeflex get -n its1-system pod vcluster-0 -o yaml --show-managed-fields.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2025-01-11T10:08:32Z"
  generateName: vcluster-
  labels:
    app: vcluster
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: vcluster-565cbcbcdd
    release: vcluster
    statefulset.kubernetes.io/pod-name: vcluster-0
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
          f:apps.kubernetes.io/pod-index: {}
          f:controller-revision-hash: {}
          f:release: {}
          f:statefulset.kubernetes.io/pod-name: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"72ed06ea-bf00-41d2-9e8e-26aab3ab6203"}: {}
      f:spec:
        f:containers:
          k:{"name":"syncer"}:
            .: {}
            f:args: {}
            f:env:
              .: {}
              k:{"name":"CONFIG"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"POD_IP"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef: {}
              k:{"name":"VCLUSTER_NODE_NAME"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef: {}
              k:{"name":"VCLUSTER_TELEMETRY_CONFIG"}:
                .: {}
                f:name: {}
                f:value: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:securityContext:
              .: {}
              f:allowPrivilegeEscalation: {}
              f:runAsGroup: {}
              f:runAsUser: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/.cache/helm"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/data"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/etc/coredns/custom"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
              k:{"mountPath":"/manifests/coredns"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
              k:{"mountPath":"/tmp"}:
                .: {}
                f:mountPath: {}
                f:name: {}
          k:{"name":"vcluster"}:
            .: {}
            f:args: {}
            f:command: {}
            f:env:
              .: {}
              k:{"name":"SERVICE_CIDR"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:configMapKeyRef: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:securityContext:
              .: {}
              f:allowPrivilegeEscalation: {}
              f:runAsGroup: {}
              f:runAsUser: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/data"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/etc/rancher"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:hostname: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:subdomain: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"config"}:
            .: {}
            f:emptyDir: {}
            f:name: {}
          k:{"name":"coredns"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
            f:name: {}
          k:{"name":"custom-config-volume"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
              f:optional: {}
            f:name: {}
          k:{"name":"data"}:
            .: {}
            f:name: {}
            f:persistentVolumeClaim:
              .: {}
              f:claimName: {}
          k:{"name":"helm-cache"}:
            .: {}
            f:emptyDir: {}
            f:name: {}
          k:{"name":"tmp"}:
            .: {}
            f:emptyDir: {}
            f:name: {}
    manager: kube-controller-manager
    operation: Update
    time: "2025-01-11T10:08:32Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"PodReadyToStartContainers"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:hostIPs: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.244.0.15"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    subresource: status
    time: "2025-01-11T10:08:49Z"
  name: vcluster-0
  namespace: its1-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: vcluster
    uid: 72ed06ea-bf00-41d2-9e8e-26aab3ab6203
  resourceVersion: "1219"
  uid: 58cecae2-d288-48d8-9a80-ee8ecfb1be54
spec:
  containers:
  - args:
    - -c
    - /bin/k3s server --write-kubeconfig=/data/k3s-config/kube-config.yaml --data-dir=/data
      --disable=traefik,servicelb,metrics-server,local-storage,coredns --disable-network-policy
      --disable-agent --disable-cloud-controller --flannel-backend=none --kube-apiserver-arg=bind-address=127.0.0.1
      --disable-scheduler --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle,-ttl
      --kube-apiserver-arg=endpoint-reconciler-type=none --service-cidr=$(SERVICE_CIDR)
      && true
    command:
    - /bin/sh
    env:
    - name: SERVICE_CIDR
      valueFrom:
        configMapKeyRef:
          key: cidr
          name: vc-cidr-vcluster
    image: rancher/k3s:v1.27.2-k3s1
    imagePullPolicy: IfNotPresent
    name: vcluster
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 200m
        memory: 256Mi
    securityContext:
      allowPrivilegeEscalation: false
      runAsGroup: 0
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/rancher
      name: config
    - mountPath: /data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xp8ds
      readOnly: true
  - args:
    - --name=vcluster
    - --kube-config=/data/k3s-config/kube-config.yaml
    - --service-account=vc-workload-vcluster
    - --kube-config-context-name=my-vcluster
    - --leader-elect=false
    - --sync=-ingressclasses
    - --tls-san=its1.localtest.me
    - --out-kube-config-server=https://its1.localtest.me:9443
    - --tls-san=kubeflex-control-plane
    env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: VCLUSTER_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CONFIG
      value: '---'
    - name: VCLUSTER_TELEMETRY_CONFIG
      value: '{"disabled":false,"instanceCreator":"helm","instanceCreatorUID":""}'
    image: ghcr.io/loft-sh/vcluster:0.16.4
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 60
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      initialDelaySeconds: 60
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 1
    name: syncer
    readinessProbe:
      failureThreshold: 60
      httpGet:
        path: /readyz
        port: 8443
        scheme: HTTPS
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: 20m
        memory: 64Mi
    securityContext:
      allowPrivilegeEscalation: false
      runAsGroup: 0
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /.cache/helm
      name: helm-cache
    - mountPath: /tmp
      name: tmp
    - mountPath: /manifests/coredns
      name: coredns
      readOnly: true
    - mountPath: /etc/coredns/custom
      name: custom-config-volume
      readOnly: true
    - mountPath: /data
      name: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xp8ds
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: vcluster-0
  nodeName: kubeflex-control-plane
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: vc-vcluster
  serviceAccountName: vc-vcluster
  subdomain: vcluster-headless
  terminationGracePeriodSeconds: 10
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-vcluster-0
  - emptyDir: {}
    name: helm-cache
  - emptyDir: {}
    name: tmp
  - emptyDir: {}
    name: config
  - configMap:
      defaultMode: 420
      name: vcluster-coredns
    name: coredns
  - configMap:
      defaultMode: 420
      name: coredns-custom
      optional: true
    name: custom-config-volume
  - name: kube-api-access-xp8ds
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-01-11T10:08:38Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-01-11T10:08:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-01-11T10:08:49Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-01-11T10:08:49Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-01-11T10:08:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6a75ae588524904afd13576b80c05b85b9d9d7971871de14c1692e0416e0c602
    image: ghcr.io/loft-sh/vcluster:0.16.4
    imageID: docker.io/library/import-2025-01-11@sha256:0c78512e6ad01541738353962c224ed291f0b5a1bc73a3848aa9a0d754576676
    lastState: {}
    name: syncer
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-01-11T10:08:37Z"
  - containerID: containerd://02ed8641b158dc1d6cad375e0315f58bfd7e700207823ef4fbfc8d03d6727ba3
    image: docker.io/rancher/k3s:v1.27.2-k3s1
    imageID: docker.io/library/import-2025-01-11@sha256:5b19941558d264f5e244bdb6fd74fde0f6992629bc0e95d501c9f6cad74cb7d9
    lastState: {}
    name: vcluster
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-01-11T10:08:38Z"
  hostIP: 172.18.0.4
  hostIPs:
  - ip: 172.18.0.4
  phase: Running
  podIP: 10.244.0.15
  podIPs:
  - ip: 10.244.0.15
  qosClass: Burstable
  startTime: "2025-01-11T10:08:36Z"

MikeSpreitzer commented 1 week ago

Here is a more recent extraction of kubectl --context kind-kubeflex logs -n kubeflex-system kubeflex-controller-manager-7db8894656-f6nsk > kfcm2.log. Note that it extends the earlier one with two log entries at 05:00:16 Jan 12 UTC, the time at which the config-incluster data item was written to the Secret.

kfcm2.log

MikeSpreitzer commented 1 week ago

Following is the output from kubectl --context kind-kubeflex get ControlPlane its1 -o yaml --show-managed-fields.

apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
kind: ControlPlane
metadata:
  annotations:
    meta.helm.sh/release-name: ks-core
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-01-11T10:07:48Z"
  finalizers:
  - kflex.kubestellar.org/finalizer
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    kflex.kubestellar.io/cptype: its
  managedFields:
  - apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/managed-by: {}
      f:spec:
        .: {}
        f:backend: {}
        f:postCreateHook: {}
        f:postCreateHookVars:
          .: {}
          f:ITSSecretName: {}
          f:ITSkubeconfig: {}
        f:type: {}
    manager: helm
    operation: Update
    time: "2025-01-11T10:07:48Z"
  - apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"kflex.kubestellar.org/finalizer": {}
        f:labels:
          f:kflex.kubestellar.io/cptype: {}
    manager: Go-http-client
    operation: Update
    time: "2025-01-11T10:08:49Z"
  - apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
        f:observedGeneration: {}
        f:postCreateHooks:
          .: {}
          f:its-with-clusteradm: {}
        f:secretRef:
          .: {}
          f:inClusterKey: {}
          f:key: {}
          f:name: {}
          f:namespace: {}
    manager: Go-http-client
    operation: Update
    subresource: status
    time: "2025-01-14T20:16:21Z"
  name: its1
  resourceVersion: "579804"
  uid: c3cada0a-8ba8-4df4-bd50-57c40c87bb3d
spec:
  backend: shared
  postCreateHook: its-with-clusteradm
  postCreateHookVars:
    ITSSecretName: vc-vcluster
    ITSkubeconfig: config-incluster
  type: vcluster
status:
  conditions:
  - lastTransitionTime: "2025-01-14T20:16:21Z"
    lastUpdateTime: "2025-01-14T20:16:21Z"
    message: ""
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-01-14T20:16:21Z"
    lastUpdateTime: "2025-01-14T20:16:21Z"
    message: ""
    reason: ReconcileSuccess
    status: "True"
    type: Synced
  observedGeneration: 0
  postCreateHooks:
    its-with-clusteradm: true
  secretRef:
    inClusterKey: config-incluster
    key: config
    name: vc-vcluster
    namespace: its1-system

MikeSpreitzer commented 1 week ago

https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/internal/controller/controlplane_controller.go#L165-L172 shows the types of objects that can trigger that controller to Reconcile.

MikeSpreitzer commented 1 week ago

kubectl --context kind-kubeflex get cm -n kubeflex-system kubeflex-config -o yaml --show-managed-fields shows that the ConfigMap was created at 2025-01-11T10:07:47Z and not updated in any later second. Following are its data items.

data:
  domain: localtest.me
  externalPort: "9443"
  hostContainer: kubeflex-control-plane
  isOpenShift: "false"
MikeSpreitzer commented 1 week ago

kubectl --context kind-kubeflex get ns its1-system -o yaml --show-managed-fields shows the Namespace was created at 2025-01-11T10:08:24Z and not updated in any later second.

pdettori commented 1 week ago

https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/internal/controller/controlplane_controller.go#L165-L172 shows the types of objects that can trigger that controller to Reconcile.

Changes of state in the StatefulSet should have triggered reconciliation, but perhaps that event was missed (?)

MikeSpreitzer commented 1 week ago

Kubernetes informers are eventually consistent with the apiservers. The only way an informer can stay stale for 19 hours is for there to be communication problems and/or scheduling starvation for 19 hours. There was no CPU overload inside the VM in question.

MikeSpreitzer commented 1 week ago

BTW, https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/shared/reconciler.go#L70 explains why every reconcile updates the Synced Condition on that ControlPlane; tenancyv1alpha1.ConditionReconcileSuccess() constructs a Condition with LastTransitionTime and LastUpdateTime == now.
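
(To illustrate the pattern being described: the type below is hypothetical, shaped after the lastTransitionTime/lastUpdateTime fields visible in the ControlPlane status dumps in this thread; it is not the actual kubeflex type.)

package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// condition is a hypothetical stand-in for the kubeflex condition type.
type condition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime metav1.Time
	LastUpdateTime     metav1.Time
}

// conditionReconcileSuccess stamps both timestamps with the current time on
// every call, so writing its result back on each reconcile always modifies the
// ControlPlane status, whether or not anything actually transitioned.
func conditionReconcileSuccess() condition {
	now := metav1.Now()
	return condition{
		Type:               "Synced",
		Status:             "True",
		Reason:             "ReconcileSuccess",
		LastTransitionTime: now,
		LastUpdateTime:     now,
	}
}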

MikeSpreitzer commented 1 week ago

Likewise, https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/internal/controller/controlplane_controller.go#L137-L141 shows why every reconcile updates the Ready Condition.

MikeSpreitzer commented 1 week ago

For comparison, I looked at the KubeFlex controller-manager log from a normal run of the demo environment create script in #2719; following are the log entries around the finishing of the setup of its1.

2025-01-11T10:08:38Z    INFO    Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "9ed80b37-18dd-4f13-b227-89f2b960191c"}
2025-01-11T10:08:49Z    INFO    Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "684b3da2-868f-4300-92de-88cdfb90ff9b"}
2025-01-11T10:08:49Z    INFO    Running ReconcileUpdatePostCreateHook   {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "684b3da2-868f-4300-92de-88cdfb90ff9b", "post-create-hook": "its-with-clusteradm"}
2025-01-11T10:08:49Z    INFO    Applying    {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "684b3da2-868f-4300-92de-88cdfb90ff9b", "object": "[] job.batch/its-with-clusteradm"}
2025-01-11T10:08:49Z    INFO    Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "34ebdbb2-5817-4445-9994-3d639bc4f8c1"}
2025-01-11T10:08:57Z    INFO    Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"wds1"}, "namespace": "", "name": "wds1", "reconcileID": "7d2b91a4-3423-4d36-b050-686e6c2a0fef"}

Note: no evidence of what changed to trigger the update at 2025-01-11T10:08:49Z.

MikeSpreitzer commented 1 week ago

BTW, https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/shared/postcreate_hook.go#L136 always shows the empty string as the namespace. Following is an example log message where the actual namespace is not empty but the log message omits it.

2025-01-11T10:09:08Z    INFO    Applying    {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"wds1"}, "namespace": "", "name": "wds1", "reconcileID": "d44caa9d-1022-40bb-a091-57027617ec08", "object": "[] deployment.apps/kubestellar-controller-manager"}

This is because obj is constructed without a specified namespace; the namespace is injected later, at https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/shared/postcreate_hook.go#L142 .
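
(One possible way to make that log line show the real namespace, sketched against an unstructured object; this is not the actual kubeflex code, and the helper name is made up.)

package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// describeForApplyLog injects the target namespace before formatting the
// "Applying" message, so namespaced objects are no longer reported as
// "[] kind/name".
func describeForApplyLog(obj *unstructured.Unstructured, namespace string) string {
	if obj.GetNamespace() == "" {
		obj.SetNamespace(namespace)
	}
	return fmt.Sprintf("[%s] %s/%s", obj.GetNamespace(), obj.GetKind(), obj.GetName())
}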

MikeSpreitzer commented 6 days ago

So I updated #2719 to make it pass --v=3 to the kube-apiserver in the KubeFlex hosting cluster, hoping to get more evidence the next time this problem happens. I also considered turning on audit logging in that kube-apiserver, but did not see a way to provide the necessary audit config file without polluting the user's filesystem.

I resumed testing, and the problem happened again, on the same IBM Cloud VM.

mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get pods -A
NAMESPACE            NAME                                                READY   STATUS             RESTARTS       AGE
default              ks-core-8t27z                                       0/1     Completed          0              7h2m
ingress-nginx        ingress-nginx-admission-create-9s8jk                0/1     Completed          0              7h3m
ingress-nginx        ingress-nginx-admission-patch-q7gvk                 0/1     Completed          0              7h3m
ingress-nginx        ingress-nginx-controller-778b6cb6c7-8l2nw           1/1     Running            0              7h3m
its1-system          coredns-68559449b6-8b27n-x-kube-system-x-vcluster   1/1     Running            0              7h1m
its1-system          its-with-clusteradm-7dld6                           0/2     Error              0              7h1m
its1-system          its-with-clusteradm-zr5d2                           0/2     Error              0              7h
its1-system          update-cluster-info-28mb8                           0/1     Completed          0              7h1m
its1-system          update-cluster-info-k6xbk                           0/1     Error              0              7h1m
its1-system          vcluster-0                                          2/2     Running            0              7h1m
kube-system          coredns-76f75df574-j56rb                            1/1     Running            0              7h3m
kube-system          coredns-76f75df574-kdh28                            1/1     Running            0              7h3m
kube-system          etcd-kubeflex-control-plane                         1/1     Running            0              7h4m
kube-system          kindnet-w6jbs                                       1/1     Running            0              7h3m
kube-system          kube-apiserver-kubeflex-control-plane               1/1     Running            0              7h4m
kube-system          kube-controller-manager-kubeflex-control-plane      1/1     Running            0              7h4m
kube-system          kube-proxy-ktkxt                                    1/1     Running            0              7h3m
kube-system          kube-scheduler-kubeflex-control-plane               1/1     Running            0              7h4m
kubeflex-system      kubeflex-controller-manager-7db8894656-z9jfs        2/2     Running            0              7h2m
kubeflex-system      postgres-postgresql-0                               1/1     Running            0              7h2m
local-path-storage   local-path-provisioner-7577fdbbfb-92wnm             1/1     Running            0              7h3m
... the rest don't matter

MikeSpreitzer commented 6 days ago

mspreitz@mjs-dev7a:~$ kubectl  --context kind-kubeflex get ControlPlane its1 -o yaml --show-managed-fields
apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
kind: ControlPlane
metadata:
  annotations:
    meta.helm.sh/release-name: ks-core
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-01-18T22:28:23Z"
  finalizers:
  - kflex.kubestellar.org/finalizer
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    kflex.kubestellar.io/cptype: its
  managedFields:
  - apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/managed-by: {}
      f:spec:
        .: {}
        f:backend: {}
        f:postCreateHook: {}
        f:postCreateHookVars:
          .: {}
          f:ITSSecretName: {}
          f:ITSkubeconfig: {}
        f:type: {}
    manager: helm
    operation: Update
    time: "2025-01-18T22:28:23Z"
  - apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"kflex.kubestellar.org/finalizer": {}
        f:labels:
          f:kflex.kubestellar.io/cptype: {}
    manager: Go-http-client
    operation: Update
    time: "2025-01-18T22:29:32Z"
  - apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
        f:observedGeneration: {}
        f:postCreateHooks:
          .: {}
          f:its-with-clusteradm: {}
        f:secretRef:
          .: {}
          f:inClusterKey: {}
          f:key: {}
          f:name: {}
          f:namespace: {}
    manager: Go-http-client
    operation: Update
    subresource: status
    time: "2025-01-18T22:29:32Z"
  name: its1
  resourceVersion: "1217"
  uid: 682c6ced-9b3c-4bf2-bd2e-05b9629d1c44
spec:
  backend: shared
  postCreateHook: its-with-clusteradm
  postCreateHookVars:
    ITSSecretName: vc-vcluster
    ITSkubeconfig: config-incluster
  type: vcluster
status:
  conditions:
  - lastTransitionTime: "2025-01-18T22:29:32Z"
    lastUpdateTime: "2025-01-18T22:29:32Z"
    message: ""
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-01-18T22:29:32Z"
    lastUpdateTime: "2025-01-18T22:29:32Z"
    message: Secret "vc-vcluster" not found
    reason: ReconcileError
    status: "False"
    type: Synced
  observedGeneration: 0
  postCreateHooks:
    its-with-clusteradm: true
  secretRef:
    inClusterKey: config-incluster
    key: config
    name: vc-vcluster
    namespace: its1-system

MikeSpreitzer commented 6 days ago

No config-incluster yet. Note also that this Secret exists, and was created hours ago, despite the fact that the its1 ControlPlane's Conditions say that the last reconcile failed due to this Secret not existing!

mspreitz@mjs-dev7a:~$ kubectl  --context kind-kubeflex get secret -n its1-system vc-vcluster -o yaml --show-managed-fields
apiVersion: v1
data:
  certificate-authority: <snip/>
  client-certificate: <snip/>
  client-key: <snip/>
  config: <snip/>
kind: Secret
metadata:
  creationTimestamp: "2025-01-18T22:29:32Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:certificate-authority: {}
        f:client-certificate: {}
        f:client-key: {}
        f:config: {}
      f:metadata:
        f:ownerReferences:
          .: {}
          k:{"uid":"6c34a6c2-a7da-4482-afdf-1e48630f400e"}: {}
      f:type: {}
    manager: vcluster
    operation: Update
    time: "2025-01-18T22:29:32Z"
  name: vc-vcluster
  namespace: its1-system
  ownerReferences:
  - apiVersion: v1
    controller: false
    kind: Service
    name: vcluster
    uid: 6c34a6c2-a7da-4482-afdf-1e48630f400e
  resourceVersion: "1220"
  uid: 5c794702-a7d0-47d7-b37c-ce2258bbe7d8
type: Opaque

MikeSpreitzer commented 6 days ago

mspreitz@mjs-dev7a:~$ kubectl  --context kind-kubeflex get jobs -A
NAMESPACE       NAME                             COMPLETIONS   DURATION   AGE
default         ks-core                          1/1           8s         7h9m
ingress-nginx   ingress-nginx-admission-create   1/1           15s        7h11m
ingress-nginx   ingress-nginx-admission-patch    1/1           16s        7h11m
its1-system     its-with-clusteradm              0/1           7h8m       7h8m
its1-system     update-cluster-info              1/1           31s        7h8m

MikeSpreitzer commented 6 days ago

mspreitz@mjs-dev7a:~$ bash <(curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/refs/heads/main/scripts/kubestellar-snapshot.sh) -V -Y -L
KubeStellar Snapshot v0.2.0{COLOR_NONE}

Script run on 2025-01-19_05:39:46
Checking script dependencies:
✔ kubectl version v1.29.10 at /usr/local/bin/kubectl
✔ helm version v3.16.3 at /usr/sbin/helm
✔ jq version jq-1.6 at /usr/bin/jq
Using kubeconfig(s): /home/mspreitz/.kube/config
Validating contexts(s): 
✔ cluster1 
✔ cluster2 
✔ its1 *
✔ kind-kubeflex 
✔ wds1 

KubeStellar:
- Helm chart ks-core (v0.26.0-alpha.4) in namespace default in context kind-kubeflex
  - Secret=sh.helm.release.v1.ks-core.v1 in namespace default

KubeFlex:
- kubeflex-system namespace in context kind-kubeflex
- controller-manager: version=0.7.2, pod=kubeflex-controller-manager-7db8894656-z9jfs, status=running
- postgres-postgresql-0: pod=postgres-postgresql-0, status=running

Control Planes:
- its1: type=vcluster, pch=its-with-clusteradm, context=kind-kubeflex, namespace=its1-system
  - Post Create Hook: pod=its-with-clusteradm-7dld6
its-with-clusteradm-zr5d2, ns=its1-system, status=
  - Status addon controller: pod=, ns=its1-system, version=, status=
Error from server (NotFound): namespaces "open-cluster-management" not found
  - Open-cluster-manager: not found
error: expected 'logs [-f] [-p] (POD | TYPE/NAME) [-c CONTAINER]'.
POD or TYPE/NAME is a required argument for the logs command
See 'kubectl logs -h' for help and examples

MikeSpreitzer commented 6 days ago

mspreitz@mjs-dev7a:~$ kubectl --context its1 get ns
NAME              STATUS   AGE
default           Active   7h12m
kube-system       Active   7h12m
kube-public       Active   7h12m
kube-node-lease   Active   7h12m

MikeSpreitzer commented 5 days ago

I used kubectl --context kind-kubeflex logs -n kube-system kube-apiserver-kubeflex-control-plane > /tmp/kas1.log to capture the KubeFlex hosting cluster apiserver log, and attach it below. Unfortunately, it only covers a few minutes. So this will not have the evidence I seek unless I get really lucky.

kas1.log

MikeSpreitzer commented 5 days ago

Aha! https://github.com/kubestellar/kubeflex/blob/v0.7.2/internal/controller/controlplane_controller.go#L170 makes the controller sensitive only to Secret objects whose owner is a ControlPlane. As shown in https://github.com/kubestellar/kubestellar/issues/2717#issuecomment-2600605292 , the vc-vcluster Secret is not owned by a ControlPlane (its only owner is the vcluster Service).

See the comment on the Owns method at https://github.com/kubernetes-sigs/controller-runtime/blob/v0.15.0/pkg/builder/controller.go#L106-L113

MikeSpreitzer commented 5 days ago

Adding an insignificant label, just to prod the controller, gets the ControlPlane into a good state.

mspreitz@mjs-dev7a:~$ date; kubectl --context kind-kubeflex get controlplanes
Sun Jan 19 06:15:05 UTC 2025
NAME   SYNCED   READY   TYPE       AGE
its1   False    True    vcluster   7h46m
wds1   True     True    k8s        7h46m
wds2   True     True    host       7h46m

mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex label controlplane its1 kick=me; date
controlplane.tenancy.kflex.kubestellar.org/its1 labeled
Sun Jan 19 06:15:32 UTC 2025

mspreitz@mjs-dev7a:~$ date; kubectl --context kind-kubeflex get controlplanes
Sun Jan 19 06:15:39 UTC 2025
NAME   SYNCED   READY   TYPE       AGE
its1   True     True    vcluster   7h47m
wds1   True     True    k8s        7h47m
wds2   True     True    host       7h47m

And that got the config-incluster data member added to the vc-vcluster Secret.

pdettori commented 4 days ago

Thank you @MikeSpreitzer, great catch! So, would you recommend using builder.MatchEveryOwner in the Owns method for Secrets to fix this?
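
(For readers following along, a sketch of the wiring being discussed; the import path and function shape are assumptions, not the actual kubeflex code.)

package example

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	tenancyv1alpha1 "github.com/kubestellar/kubeflex/api/v1alpha1" // assumed import path
)

// By default, Owns() enqueues a reconcile only for Secrets whose *controller*
// owner reference is a ControlPlane. The builder.MatchEveryOwner option widens
// that to any owner reference, but events still map back to a ControlPlane only
// if the Secret actually carries a ControlPlane owner reference, which the
// vc-vcluster Secret currently does not (its owner is the vcluster Service).
func setupWithManager(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&tenancyv1alpha1.ControlPlane{}).
		Owns(&corev1.Secret{}, builder.MatchEveryOwner).
		Complete(r)
}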

pdettori commented 4 days ago

Created PR https://github.com/kubestellar/kubeflex/pull/309 for that.

MikeSpreitzer commented 4 days ago

@pdettori: that seems a bit excessive. How about adding the ControlPlane as an owner of the Secret? Or does vcluster have a higher-level thing that reflects the creation of that Secret?

@francostellari: There is another problem revealed here too. The ITS initialization Job depends on the KubeFlex controller augmenting the vc-vcluster Secret but does not wait on that.
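
(On that last point, a sketch of what such a wait could look like if implemented as a small Go step with read access to the Secret; the names, durations, and overall approach are illustrative only and do not exist in the chart today.)

package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForKubeconfigKey polls until the named Secret carries the expected data
// key (e.g. "config-incluster" in "vc-vcluster"), so the hook's containers
// would only proceed once the KubeFlex controller has augmented the Secret.
func waitForKubeconfigKey(ctx context.Context, cs kubernetes.Interface, ns, name, key string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			s, err := cs.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return false, nil
			}
			if err != nil {
				return false, err
			}
			_, ok := s.Data[key]
			return ok, nil
		})
}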