Open MikeSpreitzer opened 1 week ago
/cc @pdettori
@MikeSpreitzer do you also have a log for the pod running the (failed) job? Also, could this be related to https://github.com/kubestellar/kubeflex/issues/276 ?
To be clear, the vc-vcluster Secret exists, and has the data element named "config"; it just lacks the data element named "config-incluster".
The state still exists on that VM; I have not made any changes yet.
I looked at the logs from the two failed containers before opening this issue; they complained about authorization failures, saying that the its1-system:default ServiceAccount is not authorized for various things. Which is true, but beside the point (which is that the intended credentials were not picked up, and thus the wrong identity was being used).
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get pods -n its1-system
NAME READY STATUS RESTARTS AGE
coredns-68559449b6-jnmwn-x-kube-system-x-vcluster 1/1 Running 0 2d9h
its-with-clusteradm-2kcvg 0/2 Error 0 2d9h
its-with-clusteradm-qgqd4 0/2 Error 0 2d9h
update-cluster-info-s6ckj 0/1 Error 0 2d9h
update-cluster-info-tgwpx 0/1 Completed 0 2d9h
vcluster-0 2/2 Running 0 2d9h
Following are the logs from the two containers in the older Pod (the info above does not establish which is older, but other info does).
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex logs -n its1-system its-with-clusteradm-qgqd4
Defaulted container "its-with-clusteradm-clusteradm" out of: its-with-clusteradm-clusteradm, its-with-clusteradm-statusaddon
flag v has been set
flag wait has been set
I0111 10:09:01.591465 1 exec.go:41] "init options:" dry-run=false force=false output-file=""
I0111 10:09:01.591639 1 loader.go:141] Config not found: /etc/kube/config-incluster
I0111 10:09:01.592021 1 merged_client_builder.go:121] Using in-cluster configuration
I0111 10:09:01.592685 1 loader.go:141] Config not found: /etc/kube/config-incluster
I0111 10:09:01.592932 1 merged_client_builder.go:121] Using in-cluster configuration
I0111 10:09:01.593122 1 loader.go:141] Config not found: /etc/kube/config-incluster
I0111 10:09:01.593325 1 merged_client_builder.go:121] Using in-cluster configuration
I0111 10:09:01.593523 1 loader.go:141] Config not found: /etc/kube/config-incluster
Preflight check: HubApiServer check Failed with 0 warnings and 1 errors
Preflight check: cluster-info check Failed with 0 warnings and 1 errors
Error: [preflight] Some fatal errors occurred:
[ERROR HubApiServer check]: failed to find the given Current Context in Contexts of the kubeconfig
[ERROR cluster-info check]: configmaps "cluster-info" is forbidden: User "system:serviceaccount:its1-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-public"
E0111 10:09:01.602695 1 clusteradm.go:132] "Error:" err=<
[preflight] Some fatal errors occurred:
[ERROR HubApiServer check]: failed to find the given Current Context in Contexts of the kubeconfig
[ERROR cluster-info check]: configmaps "cluster-info" is forbidden: User "system:serviceaccount:its1-system:default" cannot get resource "configmaps" in API group "" in the namespace "kube-public"
>
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex logs -n its1-system its-with-clusteradm-qgqd4 -c its-with-clusteradm-statusaddon
Error: query: failed to query with labels: secrets is forbidden: User "system:serviceaccount:its1-system:default" cannot list resource "secrets" in API group "" in the namespace "open-cluster-management"
Looking now, I see that the data element named "config-incluster" was added almost a day after that Secret was created. Following is the output from kubectl --context kind-kubeflex get secret -n its1-system vc-vcluster -o yaml --show-managed-fields.
I strongly recommend that clients set the FieldManager to something meaningful when making calls on the apiserver.
apiVersion: v1
data:
certificate-authority: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVIyZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWpNU0V3SHdZRFZRUUREQmhyTTNNdGMyVnkKZG1WeUxXTmhRREUzTXpZMU9UQXhNVGd3SGhjTk1qVXdNVEV4TVRBd09ETTRXaGNOTXpVd01UQTVNVEF3T0RNNApXakFqTVNFd0h3WURWUVFEREJock0zTXRjMlZ5ZG1WeUxXTmhRREUzTXpZMU9UQXhNVGd3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFTaVYvZHNLeUpkaWtkSUVHek5peXNuek1VQzFORGlleFRaK2EzMElRQjIKdmRneEZlZmdmek1hVGQ5WU8vYmtkajZIVEN0bHJMRm1NcHZaR0RjcGdrT3dvMEl3UURBT0JnTlZIUThCQWY4RQpCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZEJnTlZIUTRFRmdRVVkxMmpqL0VkcE9YMFhHTGRLVlZJCjB3MnB5cVF3Q2dZSUtvWkl6ajBFQXdJRFJ3QXdSQUlnYWpqa3hNQTlQYjFpYkIvQ3IyMlZsVExsOGVVbkxNMlQKOENMQ1lIN2RJVk1DSUFJM012ZmRIb01SY3ZGNXZOblhJQ1pPamFRNGRkcHRObE9VZlpPeTdQcGMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
client-certificate: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJrVENDQVRlZ0F3SUJBZ0lJZlJSdFJGR09ZWm93Q2dZSUtvWkl6ajBFQXdJd0l6RWhNQjhHQTFVRUF3d1kKYXpOekxXTnNhV1Z1ZEMxallVQXhOek0yTlRrd01URTRNQjRYRFRJMU1ERXhNVEV3TURnek9Gb1hEVEkyTURFeApNVEV3TURnek9Gb3dNREVYTUJVR0ExVUVDaE1PYzNsemRHVnRPbTFoYzNSbGNuTXhGVEFUQmdOVkJBTVRESE41CmMzUmxiVHBoWkcxcGJqQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJMRTRLTmJMZGx4WVJoT3QKcUxueGNSdzVZL0lVSDBTaE9vWEw0Q05xNWR5T2lhT1phcUhRcWVrV085TE9KRHphdjF2YlBFTVJMR1g0Uk1YcApPb01PZ0tlalNEQkdNQTRHQTFVZER3RUIvd1FFQXdJRm9EQVRCZ05WSFNVRUREQUtCZ2dyQmdFRkJRY0RBakFmCkJnTlZIU01FR0RBV2dCUnV6cUo3WHNPcldVWmRDOE9Vc1E0Tjl0UlJmekFLQmdncWhrak9QUVFEQWdOSUFEQkYKQWlCMTdpZWZrQ1JJU0NCcTJWcWh3ekVXc0tlSGNsVkFid2ZYTnd1TVBXNFJsZ0loQUpoazRmOTkyaUhFZ0Y5VApoSm43YWVITGw2aTZUa2xXcE8rQzBxZnFpWTVPCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0KLS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkekNDQVIyZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWpNU0V3SHdZRFZRUUREQmhyTTNNdFkyeHAKWlc1MExXTmhRREUzTXpZMU9UQXhNVGd3SGhjTk1qVXdNVEV4TVRBd09ETTRXaGNOTXpVd01UQTVNVEF3T0RNNApXakFqTVNFd0h3WURWUVFEREJock0zTXRZMnhwWlc1MExXTmhRREUzTXpZMU9UQXhNVGd3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFRMUUyLzRZMEFSckFyamhVOFJOUTlvS3pucVFEaGVPQVR5VVpLNnFHcFAKVVhkZWo0UlliZHBJTEdrLzNqY0dVb2tsTEpyYXNlVE1YaklkNHFpUlp3Y3NvMEl3UURBT0JnTlZIUThCQWY4RQpCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZEJnTlZIUTRFRmdRVWJzNmllMTdEcTFsR1hRdkRsTEVPCkRmYlVVWDh3Q2dZSUtvWkl6ajBFQXdJRFNBQXdSUUloQUlSRHdDUWJHQ1ZFdDE3aFZuN2lyL0huMGpUSGx1RVIKZVhuYTIycWhEUGN1QWlBbWdpeFB0Y1NWWVFRNHVLdFUySTJiMnlDeXJpLy80aHhTVmo4K1I5SXp3dz09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
client-key: <snip/>
config: <snip/>
config-incluster: <snip/>
kind: Secret
metadata:
creationTimestamp: "2025-01-11T10:08:49Z"
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:data:
.: {}
f:certificate-authority: {}
f:client-certificate: {}
f:client-key: {}
f:config: {}
f:metadata:
f:ownerReferences:
.: {}
k:{"uid":"07ef3d00-332e-4b0a-8176-1232da13bb1a"}: {}
f:type: {}
manager: vcluster
operation: Update
time: "2025-01-11T10:08:49Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:data:
f:config-incluster: {}
manager: Go-http-client
operation: Update
time: "2025-01-12T05:00:16Z"
name: vc-vcluster
namespace: its1-system
ownerReferences:
- apiVersion: v1
controller: false
kind: Service
name: vcluster
uid: 07ef3d00-332e-4b0a-8176-1232da13bb1a
resourceVersion: "135412"
uid: 47c851ea-fb58-40c0-8e19-5e3c95482e4a
type: Opaque
Following is the output from kubectl --context kind-kubeflex get pch its-with-clusteradm -o yaml.
apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
kind: PostCreateHook
metadata:
annotations:
meta.helm.sh/release-name: ks-core
meta.helm.sh/release-namespace: default
creationTimestamp: "2025-01-11T10:07:48Z"
generation: 1
labels:
app.kubernetes.io/managed-by: Helm
kflex.kubestellar.io/cptype: its
name: its-with-clusteradm
resourceVersion: "766"
uid: e6991989-d68e-4277-ad08-d0f9cf1fe9f9
spec:
templates:
- apiVersion: batch/v1
kind: Job
metadata:
name: '{{.HookName}}'
spec:
backoffLimit: 1
template:
spec:
containers:
- args:
- init
- -v=5
- --wait
env:
- name: KUBECONFIG
value: /etc/kube/{{.ITSkubeconfig}}
image: quay.io/kubestellar/clusteradm:0.9.0
name: '{{.HookName}}-clusteradm'
volumeMounts:
- mountPath: /etc/kube
name: kubeconfig
readOnly: true
- args:
- upgrade
- --install
- status-addon
- oci://ghcr.io/kubestellar/ocm-status-addon-chart
- --version
- v0.2.0-rc14
- --namespace
- open-cluster-management
- --create-namespace
- --set
- controller.verbosity=5
- --set
- agent.hub_burst=10
- --set
- agent.hub_qps=5
- --set
- agent.local_burst=10
- --set
- agent.local_qps=5
- --set
- agent.log_flush_frequency=5s
- --set
- agent.logging_format=text
- --set
- agent.metrics_bind_addr=:8080
- --set
- agent.pprof_bind_addr=:8082
- --set
- agent.v=5
- --set
- agent.vmodule=
env:
- name: HELM_CONFIG_HOME
value: /tmp
- name: HELM_CACHE_HOME
value: /tmp
- name: KUBECONFIG
value: /etc/kube/{{.ITSkubeconfig}}
image: quay.io/kubestellar/helm:3.16.1
name: '{{.HookName}}-statusaddon'
volumeMounts:
- mountPath: /etc/kube
name: kubeconfig
readOnly: true
restartPolicy: Never
volumes:
- name: kubeconfig
secret:
secretName: '{{.ITSSecretName}}'
Investigating Helm chart values:
mspreitz@mjs-dev7a:~$ helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
ks-core default 1 2025-01-11 10:07:46.688841712 +0000 UTC deployed core-chart-0.26.0-alpha.3 0.26.0-alpha.3
postgres kubeflex-system 1 2025-01-11 10:08:00.672202675 +0000 UTC deployed postgresql-13.1.5 16.0.0
vcluster its1-system 1 2025-01-11 10:08:29.984697297 +0000 UTC deployed vcluster-0.16.4 0.16.4
mspreitz@mjs-dev7a:~$ helm get values ks-core
USER-SUPPLIED VALUES:
ITSes:
- name: its1
WDSes:
- name: wds1
- name: wds2
type: host
verbosity:
default: 5
mspreitz@mjs-dev7a:~$ helm get values -n its1-system vcluster
USER-SUPPLIED VALUES:
syncer:
extraArgs:
- --tls-san=its1.localtest.me
- --out-kube-config-server=https://its1.localtest.me:9443
- --tls-san=kubeflex-control-plane
vcluster:
image: rancher/k3s:v1.27.2-k3s1
Based on the supplied logs and debug info, it looks like the issue is related to this code: https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/vcluster/reconciler.go#L127-L131
It looks like in the scenario you showed, it ran "almost a day after that Secret was created". Perhaps the check should be changed to just verify that the Secret exists, rather than wait for the Available condition. What do you think?
If the existence of the Secret implies that it has the needed "config" data element, then yes, the call to r.ReconcileKubeconfigSecret need not wait any longer.
But still, it is very disturbing that it took about 19 hours for the ControlPlane to become "available". Surely that is a problem too?
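For concreteness, here is a minimal sketch (hypothetical helper, not the actual KubeFlex code) of the kind of gate being discussed: check for the Secret and for its "config" data element, instead of waiting for the ControlPlane's Available condition.

package vcluster

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// kubeconfigSecretReady reports whether the vcluster-generated Secret already
// carries the "config" data element that ReconcileKubeconfigSecret needs as
// its input. (Hypothetical helper; names and placement are illustrative.)
func kubeconfigSecretReady(ctx context.Context, c client.Client, name, namespace string) (bool, error) {
	var secret corev1.Secret
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, &secret); err != nil {
		if client.IgnoreNotFound(err) == nil {
			return false, nil // Secret not created yet; requeue and retry later
		}
		return false, fmt.Errorf("getting secret %s/%s: %w", namespace, name, err)
	}
	_, hasConfig := secret.Data["config"]
	return hasConfig, nil
}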
Here are the logs from the two containers of the Pod named "vcluster-0" (the only one from its StatefulSet) in namespace "its1-system" in the KubeFlex hosting cluster.
Following is the output from kubectl --context kind-kubeflex get -n its1-system pod vcluster-0 -o yaml --show-managed-fields.
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2025-01-11T10:08:32Z"
generateName: vcluster-
labels:
app: vcluster
apps.kubernetes.io/pod-index: "0"
controller-revision-hash: vcluster-565cbcbcdd
release: vcluster
statefulset.kubernetes.io/pod-name: vcluster-0
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:generateName: {}
f:labels:
.: {}
f:app: {}
f:apps.kubernetes.io/pod-index: {}
f:controller-revision-hash: {}
f:release: {}
f:statefulset.kubernetes.io/pod-name: {}
f:ownerReferences:
.: {}
k:{"uid":"72ed06ea-bf00-41d2-9e8e-26aab3ab6203"}: {}
f:spec:
f:containers:
k:{"name":"syncer"}:
.: {}
f:args: {}
f:env:
.: {}
k:{"name":"CONFIG"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"POD_IP"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef: {}
k:{"name":"VCLUSTER_NODE_NAME"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef: {}
k:{"name":"VCLUSTER_TELEMETRY_CONFIG"}:
.: {}
f:name: {}
f:value: {}
f:image: {}
f:imagePullPolicy: {}
f:livenessProbe:
.: {}
f:failureThreshold: {}
f:httpGet:
.: {}
f:path: {}
f:port: {}
f:scheme: {}
f:initialDelaySeconds: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:name: {}
f:readinessProbe:
.: {}
f:failureThreshold: {}
f:httpGet:
.: {}
f:path: {}
f:port: {}
f:scheme: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:resources:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:securityContext:
.: {}
f:allowPrivilegeEscalation: {}
f:runAsGroup: {}
f:runAsUser: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/.cache/helm"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/data"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/etc/coredns/custom"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
k:{"mountPath":"/manifests/coredns"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
k:{"mountPath":"/tmp"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"name":"vcluster"}:
.: {}
f:args: {}
f:command: {}
f:env:
.: {}
k:{"name":"SERVICE_CIDR"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:configMapKeyRef: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:resources:
.: {}
f:limits:
.: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:securityContext:
.: {}
f:allowPrivilegeEscalation: {}
f:runAsGroup: {}
f:runAsUser: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/data"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/etc/rancher"}:
.: {}
f:mountPath: {}
f:name: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:hostname: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext: {}
f:serviceAccount: {}
f:serviceAccountName: {}
f:subdomain: {}
f:terminationGracePeriodSeconds: {}
f:volumes:
.: {}
k:{"name":"config"}:
.: {}
f:emptyDir: {}
f:name: {}
k:{"name":"coredns"}:
.: {}
f:configMap:
.: {}
f:defaultMode: {}
f:name: {}
f:name: {}
k:{"name":"custom-config-volume"}:
.: {}
f:configMap:
.: {}
f:defaultMode: {}
f:name: {}
f:optional: {}
f:name: {}
k:{"name":"data"}:
.: {}
f:name: {}
f:persistentVolumeClaim:
.: {}
f:claimName: {}
k:{"name":"helm-cache"}:
.: {}
f:emptyDir: {}
f:name: {}
k:{"name":"tmp"}:
.: {}
f:emptyDir: {}
f:name: {}
manager: kube-controller-manager
operation: Update
time: "2025-01-11T10:08:32Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"ContainersReady"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Initialized"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"PodReadyToStartContainers"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Ready"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
f:containerStatuses: {}
f:hostIP: {}
f:hostIPs: {}
f:phase: {}
f:podIP: {}
f:podIPs:
.: {}
k:{"ip":"10.244.0.15"}:
.: {}
f:ip: {}
f:startTime: {}
manager: kubelet
operation: Update
subresource: status
time: "2025-01-11T10:08:49Z"
name: vcluster-0
namespace: its1-system
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: vcluster
uid: 72ed06ea-bf00-41d2-9e8e-26aab3ab6203
resourceVersion: "1219"
uid: 58cecae2-d288-48d8-9a80-ee8ecfb1be54
spec:
containers:
- args:
- -c
- /bin/k3s server --write-kubeconfig=/data/k3s-config/kube-config.yaml --data-dir=/data
--disable=traefik,servicelb,metrics-server,local-storage,coredns --disable-network-policy
--disable-agent --disable-cloud-controller --flannel-backend=none --kube-apiserver-arg=bind-address=127.0.0.1
--disable-scheduler --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle,-ttl
--kube-apiserver-arg=endpoint-reconciler-type=none --service-cidr=$(SERVICE_CIDR)
&& true
command:
- /bin/sh
env:
- name: SERVICE_CIDR
valueFrom:
configMapKeyRef:
key: cidr
name: vc-cidr-vcluster
image: rancher/k3s:v1.27.2-k3s1
imagePullPolicy: IfNotPresent
name: vcluster
resources:
limits:
memory: 2Gi
requests:
cpu: 200m
memory: 256Mi
securityContext:
allowPrivilegeEscalation: false
runAsGroup: 0
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/rancher
name: config
- mountPath: /data
name: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xp8ds
readOnly: true
- args:
- --name=vcluster
- --kube-config=/data/k3s-config/kube-config.yaml
- --service-account=vc-workload-vcluster
- --kube-config-context-name=my-vcluster
- --leader-elect=false
- --sync=-ingressclasses
- --tls-san=its1.localtest.me
- --out-kube-config-server=https://its1.localtest.me:9443
- --tls-san=kubeflex-control-plane
env:
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: VCLUSTER_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CONFIG
value: '---'
- name: VCLUSTER_TELEMETRY_CONFIG
value: '{"disabled":false,"instanceCreator":"helm","instanceCreatorUID":""}'
image: ghcr.io/loft-sh/vcluster:0.16.4
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 60
httpGet:
path: /healthz
port: 8443
scheme: HTTPS
initialDelaySeconds: 60
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 1
name: syncer
readinessProbe:
failureThreshold: 60
httpGet:
path: /readyz
port: 8443
scheme: HTTPS
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "1"
memory: 512Mi
requests:
cpu: 20m
memory: 64Mi
securityContext:
allowPrivilegeEscalation: false
runAsGroup: 0
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /.cache/helm
name: helm-cache
- mountPath: /tmp
name: tmp
- mountPath: /manifests/coredns
name: coredns
readOnly: true
- mountPath: /etc/coredns/custom
name: custom-config-volume
readOnly: true
- mountPath: /data
name: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xp8ds
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: vcluster-0
nodeName: kubeflex-control-plane
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: vc-vcluster
serviceAccountName: vc-vcluster
subdomain: vcluster-headless
terminationGracePeriodSeconds: 10
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: data
persistentVolumeClaim:
claimName: data-vcluster-0
- emptyDir: {}
name: helm-cache
- emptyDir: {}
name: tmp
- emptyDir: {}
name: config
- configMap:
defaultMode: 420
name: vcluster-coredns
name: coredns
- configMap:
defaultMode: 420
name: coredns-custom
optional: true
name: custom-config-volume
- name: kube-api-access-xp8ds
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2025-01-11T10:08:38Z"
status: "True"
type: PodReadyToStartContainers
- lastProbeTime: null
lastTransitionTime: "2025-01-11T10:08:36Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2025-01-11T10:08:49Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2025-01-11T10:08:49Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2025-01-11T10:08:36Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://6a75ae588524904afd13576b80c05b85b9d9d7971871de14c1692e0416e0c602
image: ghcr.io/loft-sh/vcluster:0.16.4
imageID: docker.io/library/import-2025-01-11@sha256:0c78512e6ad01541738353962c224ed291f0b5a1bc73a3848aa9a0d754576676
lastState: {}
name: syncer
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2025-01-11T10:08:37Z"
- containerID: containerd://02ed8641b158dc1d6cad375e0315f58bfd7e700207823ef4fbfc8d03d6727ba3
image: docker.io/rancher/k3s:v1.27.2-k3s1
imageID: docker.io/library/import-2025-01-11@sha256:5b19941558d264f5e244bdb6fd74fde0f6992629bc0e95d501c9f6cad74cb7d9
lastState: {}
name: vcluster
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2025-01-11T10:08:38Z"
hostIP: 172.18.0.4
hostIPs:
- ip: 172.18.0.4
phase: Running
podIP: 10.244.0.15
podIPs:
- ip: 10.244.0.15
qosClass: Burstable
startTime: "2025-01-11T10:08:36Z"
Here is a more recent extraction of kubectl --context kind-kubeflex logs -n kubeflex-system kubeflex-controller-manager-7db8894656-f6nsk > kfcm2.log. Note that it extends the earlier one with two log entries at 05:00:16 Jan 12 UTC, the time at which the config-incluster data item was written to the Secret.
Following is the output from kubectl --context kind-kubeflex get ControlPlane its1 -o yaml --show-managed-fields.
apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
kind: ControlPlane
metadata:
annotations:
meta.helm.sh/release-name: ks-core
meta.helm.sh/release-namespace: default
creationTimestamp: "2025-01-11T10:07:48Z"
finalizers:
- kflex.kubestellar.org/finalizer
generation: 1
labels:
app.kubernetes.io/managed-by: Helm
kflex.kubestellar.io/cptype: its
managedFields:
- apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:meta.helm.sh/release-name: {}
f:meta.helm.sh/release-namespace: {}
f:labels:
.: {}
f:app.kubernetes.io/managed-by: {}
f:spec:
.: {}
f:backend: {}
f:postCreateHook: {}
f:postCreateHookVars:
.: {}
f:ITSSecretName: {}
f:ITSkubeconfig: {}
f:type: {}
manager: helm
operation: Update
time: "2025-01-11T10:07:48Z"
- apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"kflex.kubestellar.org/finalizer": {}
f:labels:
f:kflex.kubestellar.io/cptype: {}
manager: Go-http-client
operation: Update
time: "2025-01-11T10:08:49Z"
- apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:observedGeneration: {}
f:postCreateHooks:
.: {}
f:its-with-clusteradm: {}
f:secretRef:
.: {}
f:inClusterKey: {}
f:key: {}
f:name: {}
f:namespace: {}
manager: Go-http-client
operation: Update
subresource: status
time: "2025-01-14T20:16:21Z"
name: its1
resourceVersion: "579804"
uid: c3cada0a-8ba8-4df4-bd50-57c40c87bb3d
spec:
backend: shared
postCreateHook: its-with-clusteradm
postCreateHookVars:
ITSSecretName: vc-vcluster
ITSkubeconfig: config-incluster
type: vcluster
status:
conditions:
- lastTransitionTime: "2025-01-14T20:16:21Z"
lastUpdateTime: "2025-01-14T20:16:21Z"
message: ""
reason: Available
status: "True"
type: Ready
- lastTransitionTime: "2025-01-14T20:16:21Z"
lastUpdateTime: "2025-01-14T20:16:21Z"
message: ""
reason: ReconcileSuccess
status: "True"
type: Synced
observedGeneration: 0
postCreateHooks:
its-with-clusteradm: true
secretRef:
inClusterKey: config-incluster
key: config
name: vc-vcluster
namespace: its1-system
https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/internal/controller/controlplane_controller.go#L165-L172 shows the types of objects that can trigger that controller to Reconcile.
kubectl --context kind-kubeflex get cm -n kubeflex-system kubeflex-config -o yaml --show-managed-fields shows that the ConfigMap was created at 2025-01-11T10:07:47Z and not updated in any later second. Following are its data items.
data:
domain: localtest.me
externalPort: "9443"
hostContainer: kubeflex-control-plane
isOpenShift: "false"
kubectl --context kind-kubeflex get ns its1-system -o yaml --show-managed-fields shows that the Namespace was created at 2025-01-11T10:08:24Z and not updated in any later second.
https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/internal/controller/controlplane_controller.go#L165-L172 shows the types of objects that can trigger that controller to Reconcile.
Changes of state in the StatefulSet should have triggered the reconciliation, but perhaps that event was missed (?)
Kubernetes informers are eventually consistent with the apiservers. The only way an informer can stay stale for 19 hours is for there to be communication problems and/or scheduling starvation for 19 hours. There was no CPU overload inside the VM in question.
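As a hedged aside (not KubeFlex's current code): a generic controller-runtime pattern that bounds the damage from any missed watch event is to requeue with a delay whenever a reconcile pass finds its inputs not yet in place. The reconcileOnce field below is a hypothetical stand-in for the real reconcile body.

package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is illustrative only. reconcileOnce stands in for the
// real reconcile body and reports whether everything it needs (for example,
// the kubeconfig Secret) was present.
type exampleReconciler struct {
	reconcileOnce func(ctx context.Context, req ctrl.Request) (bool, error)
}

// Reconcile requeues with a delay while the control plane is not fully set
// up, so a single missed watch event cannot stall progress indefinitely.
func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	done, err := r.reconcileOnce(ctx, req)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !done {
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}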
BTW, https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/shared/reconciler.go#L70 explains why every reconcile updates the Synced Condition on that ControlPlane; tenancyv1alpha1.ConditionReconcileSuccess() constructs a Condition with LastTransitionTime and LastUpdateTime == now. Likewise, https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/internal/controller/controlplane_controller.go#L137-L141 shows why every reconcile updates the Ready Condition.
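For contrast, here is a minimal illustration using the standard apimachinery helper for metav1.Condition, which bumps LastTransitionTime only when the Status value actually flips. KubeFlex's own condition type also carries LastUpdateTime, which metav1.Condition lacks, so this is an analogy rather than a drop-in change.

package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markSynced records a Synced=True condition. If a Synced condition with
// Status=True is already present, meta.SetStatusCondition keeps its
// LastTransitionTime and only refreshes Reason, Message, and
// ObservedGeneration.
func markSynced(conditions *[]metav1.Condition, observedGeneration int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Synced",
		Status:             metav1.ConditionTrue,
		Reason:             "ReconcileSuccess",
		ObservedGeneration: observedGeneration,
	})
}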
For comparison, I looked at the KubeFlex controller-manager log from a normal run of the demo environment create script in #2719; following are the log entries around the finishing of the setup of its1.
2025-01-11T10:08:38Z INFO Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "9ed80b37-18dd-4f13-b227-89f2b960191c"}
2025-01-11T10:08:49Z INFO Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "684b3da2-868f-4300-92de-88cdfb90ff9b"}
2025-01-11T10:08:49Z INFO Running ReconcileUpdatePostCreateHook {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "684b3da2-868f-4300-92de-88cdfb90ff9b", "post-create-hook": "its-with-clusteradm"}
2025-01-11T10:08:49Z INFO Applying {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "684b3da2-868f-4300-92de-88cdfb90ff9b", "object": "[] job.batch/its-with-clusteradm"}
2025-01-11T10:08:49Z INFO Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"its1"}, "namespace": "", "name": "its1", "reconcileID": "34ebdbb2-5817-4445-9994-3d639bc4f8c1"}
2025-01-11T10:08:57Z INFO Got ControlPlane event! {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"wds1"}, "namespace": "", "name": "wds1", "reconcileID": "7d2b91a4-3423-4d36-b050-686e6c2a0fef"}
Note: no evidence of what changed to trigger the update at 2025-01-11T10:08:49Z.
BTW, https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/shared/postcreate_hook.go#L136 always shows the empty string as the namespace. Following is an example log message where the actual namespace is not empty but the log message omits it.
2025-01-11T10:09:08Z INFO Applying {"controller": "controlplane", "controllerGroup": "tenancy.kflex.kubestellar.org", "controllerKind": "ControlPlane", "ControlPlane": {"name":"wds1"}, "namespace": "", "name": "wds1", "reconcileID": "d44caa9d-1022-40bb-a091-57027617ec08", "object": "[] deployment.apps/kubestellar-controller-manager"}
This is because obj is constructed without a specified namespace; the namespace is injected later, at https://github.com/kubestellar/kubeflex/blob/2476b2ec34826b390d0980840c153b3ebcc585ba/pkg/reconcilers/shared/postcreate_hook.go#L142.
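A hypothetical sketch of one way to get the namespace into that log line: inject it into obj before emitting the message (names and log format here are illustrative, not the actual KubeFlex code).

package shared

import (
	"fmt"

	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// logApply is a hypothetical sketch: inject the target namespace into obj
// before emitting the "Applying" message, so the bracketed namespace is no
// longer empty. The real code and log format may differ.
func logApply(log logr.Logger, obj client.Object, namespace string) {
	obj.SetNamespace(namespace)
	log.Info("Applying", "object", fmt.Sprintf("[%s] %s/%s",
		obj.GetNamespace(),
		obj.GetObjectKind().GroupVersionKind().Kind,
		obj.GetName()))
}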
So I updated #2719 to make it pass --v=3 to the kube-apiserver in the KubeFlex hosting cluster, hoping to get more evidence the next time this problem happens. I considered also turning on audit logging in that kube-apiserver, but did not see a way to provide the necessary audit config file without polluting the user's filesystem.
I resumed testing, and the problem happened again, on the same IBM Cloud VM.
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default ks-core-8t27z 0/1 Completed 0 7h2m
ingress-nginx ingress-nginx-admission-create-9s8jk 0/1 Completed 0 7h3m
ingress-nginx ingress-nginx-admission-patch-q7gvk 0/1 Completed 0 7h3m
ingress-nginx ingress-nginx-controller-778b6cb6c7-8l2nw 1/1 Running 0 7h3m
its1-system coredns-68559449b6-8b27n-x-kube-system-x-vcluster 1/1 Running 0 7h1m
its1-system its-with-clusteradm-7dld6 0/2 Error 0 7h1m
its1-system its-with-clusteradm-zr5d2 0/2 Error 0 7h
its1-system update-cluster-info-28mb8 0/1 Completed 0 7h1m
its1-system update-cluster-info-k6xbk 0/1 Error 0 7h1m
its1-system vcluster-0 2/2 Running 0 7h1m
kube-system coredns-76f75df574-j56rb 1/1 Running 0 7h3m
kube-system coredns-76f75df574-kdh28 1/1 Running 0 7h3m
kube-system etcd-kubeflex-control-plane 1/1 Running 0 7h4m
kube-system kindnet-w6jbs 1/1 Running 0 7h3m
kube-system kube-apiserver-kubeflex-control-plane 1/1 Running 0 7h4m
kube-system kube-controller-manager-kubeflex-control-plane 1/1 Running 0 7h4m
kube-system kube-proxy-ktkxt 1/1 Running 0 7h3m
kube-system kube-scheduler-kubeflex-control-plane 1/1 Running 0 7h4m
kubeflex-system kubeflex-controller-manager-7db8894656-z9jfs 2/2 Running 0 7h2m
kubeflex-system postgres-postgresql-0 1/1 Running 0 7h2m
local-path-storage local-path-provisioner-7577fdbbfb-92wnm 1/1 Running 0 7h3m
... the rest don't matter
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get ControlPlane its1 -o yaml --show-managed-fields
apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
kind: ControlPlane
metadata:
annotations:
meta.helm.sh/release-name: ks-core
meta.helm.sh/release-namespace: default
creationTimestamp: "2025-01-18T22:28:23Z"
finalizers:
- kflex.kubestellar.org/finalizer
generation: 1
labels:
app.kubernetes.io/managed-by: Helm
kflex.kubestellar.io/cptype: its
managedFields:
- apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:meta.helm.sh/release-name: {}
f:meta.helm.sh/release-namespace: {}
f:labels:
.: {}
f:app.kubernetes.io/managed-by: {}
f:spec:
.: {}
f:backend: {}
f:postCreateHook: {}
f:postCreateHookVars:
.: {}
f:ITSSecretName: {}
f:ITSkubeconfig: {}
f:type: {}
manager: helm
operation: Update
time: "2025-01-18T22:28:23Z"
- apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"kflex.kubestellar.org/finalizer": {}
f:labels:
f:kflex.kubestellar.io/cptype: {}
manager: Go-http-client
operation: Update
time: "2025-01-18T22:29:32Z"
- apiVersion: tenancy.kflex.kubestellar.org/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:observedGeneration: {}
f:postCreateHooks:
.: {}
f:its-with-clusteradm: {}
f:secretRef:
.: {}
f:inClusterKey: {}
f:key: {}
f:name: {}
f:namespace: {}
manager: Go-http-client
operation: Update
subresource: status
time: "2025-01-18T22:29:32Z"
name: its1
resourceVersion: "1217"
uid: 682c6ced-9b3c-4bf2-bd2e-05b9629d1c44
spec:
backend: shared
postCreateHook: its-with-clusteradm
postCreateHookVars:
ITSSecretName: vc-vcluster
ITSkubeconfig: config-incluster
type: vcluster
status:
conditions:
- lastTransitionTime: "2025-01-18T22:29:32Z"
lastUpdateTime: "2025-01-18T22:29:32Z"
message: ""
reason: Available
status: "True"
type: Ready
- lastTransitionTime: "2025-01-18T22:29:32Z"
lastUpdateTime: "2025-01-18T22:29:32Z"
message: Secret "vc-vcluster" not found
reason: ReconcileError
status: "False"
type: Synced
observedGeneration: 0
postCreateHooks:
its-with-clusteradm: true
secretRef:
inClusterKey: config-incluster
key: config
name: vc-vcluster
namespace: its1-system
No config-incluster yet. Note also that this Secret exists, and was created hours ago, despite the fact that the its1 ControlPlane's Conditions say that the last reconcile failed due to this Secret not existing!
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get secret -n its1-system vc-vcluster -o yaml --show-managed-fields
apiVersion: v1
data:
certificate-authority: <snip/>
client-certificate: <snip/>
client-key: <snip/>
config: <snip/>
kind: Secret
metadata:
creationTimestamp: "2025-01-18T22:29:32Z"
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:data:
.: {}
f:certificate-authority: {}
f:client-certificate: {}
f:client-key: {}
f:config: {}
f:metadata:
f:ownerReferences:
.: {}
k:{"uid":"6c34a6c2-a7da-4482-afdf-1e48630f400e"}: {}
f:type: {}
manager: vcluster
operation: Update
time: "2025-01-18T22:29:32Z"
name: vc-vcluster
namespace: its1-system
ownerReferences:
- apiVersion: v1
controller: false
kind: Service
name: vcluster
uid: 6c34a6c2-a7da-4482-afdf-1e48630f400e
resourceVersion: "1220"
uid: 5c794702-a7d0-47d7-b37c-ce2258bbe7d8
type: Opaque
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex get jobs -A
NAMESPACE NAME COMPLETIONS DURATION AGE
default ks-core 1/1 8s 7h9m
ingress-nginx ingress-nginx-admission-create 1/1 15s 7h11m
ingress-nginx ingress-nginx-admission-patch 1/1 16s 7h11m
its1-system its-with-clusteradm 0/1 7h8m 7h8m
its1-system update-cluster-info 1/1 31s 7h8m
mspreitz@mjs-dev7a:~$ bash <(curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/refs/heads/main/scripts/kubestellar-snapshot.sh) -V -Y -L
KubeStellar Snapshot v0.2.0{COLOR_NONE}
Script run on 2025-01-19_05:39:46
Checking script dependencies:
✔ kubectl version v1.29.10 at /usr/local/bin/kubectl
✔ helm version v3.16.3 at /usr/sbin/helm
✔ jq version jq-1.6 at /usr/bin/jq
Using kubeconfig(s): /home/mspreitz/.kube/config
Validating contexts(s):
✔ cluster1
✔ cluster2
✔ its1 *
✔ kind-kubeflex
✔ wds1
KubeStellar:
- Helm chart ks-core (v0.26.0-alpha.4) in namespace default in context kind-kubeflex
- Secret=sh.helm.release.v1.ks-core.v1 in namespace default
KubeFlex:
- kubeflex-system namespace in context kind-kubeflex
- controller-manager: version=0.7.2, pod=kubeflex-controller-manager-7db8894656-z9jfs, status=running
- postgres-postgresql-0: pod=postgres-postgresql-0, status=running
Control Planes:
- its1: type=vcluster, pch=its-with-clusteradm, context=kind-kubeflex, namespace=its1-system
- Post Create Hook: pod=its-with-clusteradm-7dld6
its-with-clusteradm-zr5d2, ns=its1-system, status=
- Status addon controller: pod=, ns=its1-system, version=, status=
Error from server (NotFound): namespaces "open-cluster-management" not found
- Open-cluster-manager: not found
error: expected 'logs [-f] [-p] (POD | TYPE/NAME) [-c CONTAINER]'.
POD or TYPE/NAME is a required argument for the logs command
See 'kubectl logs -h' for help and examples
mspreitz@mjs-dev7a:~$ kubectl --context its1 get ns
NAME STATUS AGE
default Active 7h12m
kube-system Active 7h12m
kube-public Active 7h12m
kube-node-lease Active 7h12m
I used kubectl --context kind-kubeflex logs -n kube-system kube-apiserver-kubeflex-control-plane > /tmp/kas1.log to capture the KubeFlex hosting cluster apiserver log, and attached it below. Unfortunately, it only covers a few minutes, so this will not have the evidence I seek unless I get really lucky.
Aha!
https://github.com/kubestellar/kubeflex/blob/v0.7.2/internal/controller/controlplane_controller.go#L170 makes the controller sensitive only to Secret objects whose controller owner is a ControlPlane. As shown in https://github.com/kubestellar/kubestellar/issues/2717#issuecomment-2600605292, the vc-vcluster Secret has no such owner reference (its only owner is the vcluster Service).
See the comment on the Owns method at https://github.com/kubernetes-sigs/controller-runtime/blob/v0.15.0/pkg/builder/controller.go#L106-L113
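Per that Owns documentation, the default registration is roughly equivalent to the handler sketched below (the KubeFlex API import path is assumed): it enqueues the owner named in the Secret's controller owner reference, and only when that owner is a ControlPlane. The vc-vcluster Secret's only owner reference points at the vcluster Service, so its updates never map to a ControlPlane reconcile request.

package controller

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	// Import path assumed for illustration; the alias matches the KubeFlex sources.
	tenancyv1alpha1 "github.com/kubestellar/kubeflex/api/v1alpha1"
)

// secretToControlPlane is (roughly) the event handler that
// Owns(&corev1.Secret{}) wires up by default: enqueue the owner named in the
// Secret's *controller* owner reference, and only if that owner is a
// ControlPlane.
func secretToControlPlane(mgr ctrl.Manager) handler.EventHandler {
	return handler.EnqueueRequestForOwner(
		mgr.GetScheme(),
		mgr.GetRESTMapper(),
		&tenancyv1alpha1.ControlPlane{},
		handler.OnlyControllerOwner(), // the filter that excludes vc-vcluster
	)
}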
Adding an insignificant label, just to prod the controller, gets the ControlPlane into a good state.
mspreitz@mjs-dev7a:~$ date; kubectl --context kind-kubeflex get controlplanes
Sun Jan 19 06:15:05 UTC 2025
NAME SYNCED READY TYPE AGE
its1 False True vcluster 7h46m
wds1 True True k8s 7h46m
wds2 True True host 7h46m
mspreitz@mjs-dev7a:~$ kubectl --context kind-kubeflex label controlplane its1 kick=me; date
controlplane.tenancy.kflex.kubestellar.org/its1 labeled
Sun Jan 19 06:15:32 UTC 2025
mspreitz@mjs-dev7a:~$ date; kubectl --context kind-kubeflex get controlplanes
Sun Jan 19 06:15:39 UTC 2025
NAME SYNCED READY TYPE AGE
its1 True True vcluster 7h47m
wds1 True True k8s 7h47m
wds2 True True host 7h47m
And that got the config-incluster data member added to the vc-vcluster Secret.
Thank you @MikeSpreitzer, great catch! So, would you recommend using builder.MatchEveryOwner in the Owns method for Secrets to fix this?
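For reference, here is a sketch of what that change might look like in the SetupWithManager wiring (ControlPlane import path assumed; r is the existing reconciler). Note that builder.MatchEveryOwner widens the match from the controller owner reference to every owner reference of ControlPlane kind, so it only takes effect if a ControlPlane reference is actually present on the Secret.

package controller

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Import path assumed for illustration.
	tenancyv1alpha1 "github.com/kubestellar/kubeflex/api/v1alpha1"
)

// setupControlPlaneController sketches the suggested change: by default
// Owns() reacts only to the *controller* owner reference, while
// builder.MatchEveryOwner makes it follow every owner reference (still only
// those of ControlPlane kind).
func setupControlPlaneController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&tenancyv1alpha1.ControlPlane{}).
		Owns(&corev1.Secret{}, builder.MatchEveryOwner).
		Complete(r)
}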
Created PR https://github.com/kubestellar/kubeflex/pull/309 for that.
@pdettori: that seems a bit excessive. How about adding the ControlPlane as an owner of the Secret? Or does vcluster have a higher-level thing that reflects the creation of that Secret?
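A hedged sketch of that alternative (hypothetical helper, not the PR's code): have the KubeFlex reconciler add the ControlPlane as an owner of the Secret once the Secret exists. The Secret's existing Service owner reference has controller: false, so the controller slot is free; claiming it with SetControllerReference would make even the default Owns() watch fire, while a plain SetOwnerReference (shown here) would still need MatchEveryOwner.

package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	// Import path assumed for illustration.
	tenancyv1alpha1 "github.com/kubestellar/kubeflex/api/v1alpha1"
)

// adoptKubeconfigSecret adds the ControlPlane as a (non-controller) owner of
// the vcluster-generated Secret, so that owner-based watches can map Secret
// events back to the ControlPlane. Hypothetical helper, not the PR's code.
func adoptKubeconfigSecret(ctx context.Context, c client.Client, scheme *runtime.Scheme,
	cp *tenancyv1alpha1.ControlPlane, secret *corev1.Secret) error {
	if err := controllerutil.SetOwnerReference(cp, secret, scheme); err != nil {
		return err
	}
	return c.Update(ctx, secret)
}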
@francostellari: There is another problem revealed here too. The ITS initialization Job depends on the KubeFlex controller augmenting the vc-vcluster Secret but does not wait on that.
Describe the bug
While investigating #2579, #2700, #2701, #2702, #2715, and #2716, I did some more testing. This was on the same IBM Cloud VM, and commit 45fc718da51dd5b193840584e3c2db0e271d3892. This time, after many successful iterations, the script hung waiting for the its-with-clusteradm Job to be completed. But that Job will never complete; it exhausted all allowed retries. It failed because the Secret named "vc-vcluster" in the "its1-system" namespace in the KubeFlex hosting cluster does NOT have a data entry named "config-incluster" --- but that Job's pods are defined to use it as their kubeconfig. In particular, they have an environment variable definition for KUBECONFIG that is the pathname where that data entry would appear (in a mounted volume based on that Secret).
I extracted the log of the KubeFlex controller-manager with the following command.
I have attached that log, along with the full typescript.
kfcm.log
fail8.log
Output from KubeStellar-Snapshot.sh
Steps To Reproduce
I used the following loop.
Expected Behavior
No sign of failure
Want to contribute?
Additional Context
No response