openyurtio / openyurt

OpenYurt - Extending your native Kubernetes to edge (project under CNCF)
https://openyurt.io
Apache License 2.0

[BUG] kubevirt-deployed VM couldn't be restarted when network is disconnected #1400

Closed gnunu closed 5 months ago

gnunu commented 1 year ago

What happened: This is a use case of using kubevirt + OpenYurt.

On the worker node, we use kubevirt to deploy a VM while the node is successfully connected to the master node. Then we disconnect the network and reboot the worker node. The problem is that the previously deployed VM cannot be started.

What you expected to happen: The deployed VM can be restarted even when the network is disconnected.

How to reproduce it (as minimally and precisely as possible): 1) deploy an OpenYurt cluster with one worker node that supports virtualization. 2) deploy kubevirt. 3) deploy a VM on the worker node. 4) disconnect the network. 5) reboot the worker node and check whether the deployed VM runs. A minimal sketch of steps 4 and 5 is shown below.
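A sketch of steps 4 and 5, reusing the commands that appear later in this thread (the edge node IP is a placeholder):

# on the cloud node: cut traffic to the edge node (step 4)
sudo iptables -I OUTPUT -d <edge-node-ip> -j DROP

# on the edge node: reboot, and once it is back check whether the VM process came up (step 5)
sudo reboot
ps -ef | grep qemu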

Anything else we need to know?: We think this could be a cloud-edge collaboration issue within OpenYurt's scope, so we hope the OpenYurt community can solve it. :)

Environment:

others

/kind bug

rambohe-ch commented 1 year ago

@gnunu would you be able to upload the detailed logs of the yurthub and kubelet components?

gnunu commented 1 year ago

@gnunu would you be able to upload the detailed logs of the yurthub and kubelet components?

Details will be uploaded a little later by my colleagues.

joez commented 1 year ago

A two-node cluster: one control-plane (cloud) node and one worker (edge) node.

Versions of the key components (all nodes are the same):

box@joez-hce-ub20-vm-virt-m:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

box@joez-hce-ub20-vm-virt-m:~$ uname -a
Linux joez-hce-ub20-vm-virt-m 5.4.0-147-generic #164-Ubuntu SMP Tue Mar 21 14:23:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

box@joez-hce-ub20-vm-virt-m:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:09:57Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}

box@joez-hce-ub20-vm-virt-m:~$ docker version
Client: Docker Engine - Community
 Version:           23.0.4
 API version:       1.42
 Go version:        go1.19.8
 Git commit:        f480fb1
 Built:             Fri Apr 14 10:32:23 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          23.0.4
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.8
  Git commit:       cbce331
  Built:            Fri Apr 14 10:32:23 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.20
  GitCommit:        2806fc1057397dbaeefbea0e4e17bddfbd388f38
 runc:
  Version:          1.1.5
  GitCommit:        v1.1.5-0-gf19387a
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

box@joez-hce-ub20-vm-virt-m:~$ virtctl version
Client Version: version.Info{GitVersion:"v0.58.0", GitCommit:"6e41ae7787c1b48ac9a633c61a54444ea947242c", GitTreeState:"clean", BuildDate:"2022-10-13T00:33:22Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.58.0", GitCommit:"6e41ae7787c1b48ac9a633c61a54444ea947242c", GitTreeState:"clean", BuildDate:"2022-10-13T00:33:22Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}

Before shutting down the cloud node and rebooting the edge node, everything works fine:

box@joez-hce-ub20-vm-virt-m:~$ kubectl get po -A
NAMESPACE      NAME                                              READY   STATUS    RESTARTS       AGE
default        nginx-85b98978db-hvn2z                            1/1     Running   1 (20m ago)    68m
default        virt-launcher-testvm-7hxv5                        2/2     Running   0              34s
kube-flannel   kube-flannel-ds-dkxk2                             1/1     Running   1 (20m ago)    28h
kube-flannel   kube-flannel-ds-xh79b                             1/1     Running   1 (20m ago)    28h
kube-system    coredns-6d8c4cb4d-67l78                           1/1     Running   1 (20m ago)    28h
kube-system    coredns-6d8c4cb4d-mdhrt                           1/1     Running   1 (7m6s ago)   28h
kube-system    etcd-joez-hce-ub20-vm-virt-m                      1/1     Running   1 (7m6s ago)   28h
kube-system    kube-apiserver-joez-hce-ub20-vm-virt-m            1/1     Running   2 (20m ago)    61m
kube-system    kube-controller-manager-joez-hce-ub20-vm-virt-m   1/1     Running   2 (20m ago)    28h
kube-system    kube-proxy-jph4x                                  1/1     Running   1 (20m ago)    28h
kube-system    kube-proxy-sqqck                                  1/1     Running   1 (20m ago)    28h
kube-system    kube-scheduler-joez-hce-ub20-vm-virt-m            1/1     Running   2 (20m ago)    28h
kube-system    yurt-app-manager-b8677d956-4b9pf                  1/1     Running   6 (20m ago)    27h
kube-system    yurt-controller-manager-7787f67564-jmjcb          1/1     Running   2 (7m6s ago)   3h3m
kube-system    yurt-hub-joez-hce-ub20-vm-virt-w                  1/1     Running   1 (20m ago)    143m
kubevirt       virt-api-69d978dd67-rp8np                         1/1     Running   1 (20m ago)    37m
kubevirt       virt-api-69d978dd67-t4552                         1/1     Running   1 (20m ago)    37m
kubevirt       virt-controller-695cc98c56-fkzsx                  1/1     Running   1 (7m6s ago)   37m
kubevirt       virt-controller-695cc98c56-j4wxv                  1/1     Running   1 (20m ago)    37m
kubevirt       virt-handler-q5sqh                                1/1     Running   1 (20m ago)    37m
kubevirt       virt-handler-wdtp4                                1/1     Running   1 (7m6s ago)   37m
kubevirt       virt-operator-58cb8475bb-6mswb                    1/1     Running   1 (20m ago)    38m
kubevirt       virt-operator-58cb8475bb-t74df                    1/1     Running   1 (20m ago)    38m

box@joez-hce-ub20-vm-virt-w:~$ docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED              STATUS              PORTS     NAMES
149fa086b94e   quay.io/kubevirt/cirros-container-disk-demo         "/usr/bin/container-…"   About a minute ago   Up About a minute             k8s_volumecontainerdisk_virt-launcher-testvm-7hxv5_default_1d656c94-1395-4deb-89f8-0f844d989e52_0
cfdbd721ece4   a3a2b8b0c675                                        "/usr/bin/virt-launc…"   About a minute ago   Up About a minute             k8s_compute_virt-launcher-testvm-7hxv5_default_1d656c94-1395-4deb-89f8-0f844d989e52_0
0b81204195b8   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 About a minute ago   Up About a minute             k8s_POD_virt-launcher-testvm-7hxv5_default_1d656c94-1395-4deb-89f8-0f844d989e52_0
85951e033c26   nginx                                               "/docker-entrypoint.…"   7 minutes ago        Up 7 minutes                  k8s_nginx_nginx-85b98978db-hvn2z_default_b91fe8a0-253e-42e4-843f-199967d87a9d_1
d035475e97e4   c407633b131b                                        "virt-handler --port…"   7 minutes ago        Up 7 minutes                  k8s_virt-handler_virt-handler-q5sqh_kubevirt_e2d38ee6-80f3-4360-8321-cbf3b40d1985_1
ed2063dcca41   f76a3af5e135                                        "virt-controller --l…"   7 minutes ago        Up 7 minutes                  k8s_virt-controller_virt-controller-695cc98c56-j4wxv_kubevirt_1021a9cd-e233-470e-8ca3-4979315c31a4_1
90ec4ddecc9f   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_virt-controller-695cc98c56-j4wxv_kubevirt_1021a9cd-e233-470e-8ca3-4979315c31a4_4
fdd690ab00f0   a7186007b4a9                                        "/usr/local/bin/yurt…"   7 minutes ago        Up 7 minutes                  k8s_yurt-app-manager_yurt-app-manager-b8677d956-4b9pf_kube-system_8a4afc07-7d42-4e64-b0b9-27b344eec936_6
7d2562e58b79   e05304a0fbaf                                        "virt-operator --por…"   7 minutes ago        Up 7 minutes                  k8s_virt-operator_virt-operator-58cb8475bb-6mswb_kubevirt_46eb2665-d9b0-4c51-9827-f87ac1ab8985_1
69852e3414e7   943b496a674d                                        "virt-api --port 844…"   7 minutes ago        Up 7 minutes                  k8s_virt-api_virt-api-69d978dd67-rp8np_kubevirt_634e5490-d783-4fdb-ba11-3bf1558b37ae_1
c77366a0ac41   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_yurt-app-manager-b8677d956-4b9pf_kube-system_8a4afc07-7d42-4e64-b0b9-27b344eec936_3
fbaedb517e8b   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_virt-handler-q5sqh_kubevirt_e2d38ee6-80f3-4360-8321-cbf3b40d1985_4
3ebdbf1da63c   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_virt-operator-58cb8475bb-6mswb_kubevirt_46eb2665-d9b0-4c51-9827-f87ac1ab8985_4
6009f65b3444   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_nginx-85b98978db-hvn2z_default_b91fe8a0-253e-42e4-843f-199967d87a9d_3
bb946f591b78   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_virt-api-69d978dd67-rp8np_kubevirt_634e5490-d783-4fdb-ba11-3bf1558b37ae_3
702909f7a174   11ae74319a21                                        "/opt/bin/flanneld -…"   7 minutes ago        Up 7 minutes                  k8s_kube-flannel_kube-flannel-ds-dkxk2_kube-flannel_37eb9f49-5338-4fb6-bd97-563d0ff098be_1
3a8b0a92ff21   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_kube-flannel-ds-dkxk2_kube-flannel_37eb9f49-5338-4fb6-bd97-563d0ff098be_1
9c92d6fdedde   e03484a90585                                        "/usr/local/bin/kube…"   7 minutes ago        Up 7 minutes                  k8s_kube-proxy_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_1
7acff52536ea   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 7 minutes ago        Up 7 minutes                  k8s_POD_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_1
b0b78516b422   f4fba699ab86                                        "yurthub --v=2 --ser…"   20 minutes ago       Up 20 minutes                 k8s_yurt-hub_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_1
5d79eea5b086   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 20 minutes ago       Up 20 minutes                 k8s_POD_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_1

Then, shut down the cloud node:

box@joez-hce-ub20-vm-virt-m:~$ sudo shutdown now
Connection to joez-hce-ub20-vm-virt-m closed by remote host.
Connection to joez-hce-ub20-vm-virt-m closed.

After waiting more than 1 minute, both the nginx and kubevirt VM workloads are still running on the edge node:

box@joez-hce-ub20-vm-virt-w:~$ ps -ef | grep qemu
root       12371   12346  0 12:12 ?        00:00:00 /usr/bin/virt-launcher-monitor --qemu-timeout 241s --name testvm --uid e67efffd-d2d1-464a-8d3f-9ae347bd9c60 --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 45 --hook-sidecars 0 --ovmf-path /usr/share/OVMF --keep-after-failure
root       12390   12371  0 12:12 ?        00:00:00 /usr/bin/virt-launcher --qemu-timeout 241s --name testvm --uid e67efffd-d2d1-464a-8d3f-9ae347bd9c60 --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 45 --hook-sidecars 0 --ovmf-path /usr/share/OVMF
uuidd      12635   12371  3 12:12 ?        00:00:10 /usr/libexec/qemu-kvm -name guest=default_testvm,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-default_testvm/master-key.aes"}

box@joez-hce-ub20-vm-virt-w:~$ docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS          PORTS     NAMES
0ba2f3224321   f76a3af5e135                                        "virt-controller --l…"   8 seconds ago    Up 7 seconds              k8s_virt-controller_virt-controller-695cc98c56-j4wxv_kubevirt_1021a9cd-e233-470e-8ca3-4979315c31a4_3
981719c7d9ae   c407633b131b                                        "virt-handler --port…"   15 seconds ago   Up 14 seconds             k8s_virt-handler_virt-handler-q5sqh_kubevirt_e2d38ee6-80f3-4360-8321-cbf3b40d1985_2
149fa086b94e   quay.io/kubevirt/cirros-container-disk-demo         "/usr/bin/container-…"   6 minutes ago    Up 6 minutes              k8s_volumecontainerdisk_virt-launcher-testvm-7hxv5_default_1d656c94-1395-4deb-89f8-0f844d989e52_0
cfdbd721ece4   a3a2b8b0c675                                        "/usr/bin/virt-launc…"   6 minutes ago    Up 6 minutes              k8s_compute_virt-launcher-testvm-7hxv5_default_1d656c94-1395-4deb-89f8-0f844d989e52_0
0b81204195b8   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 6 minutes ago    Up 6 minutes              k8s_POD_virt-launcher-testvm-7hxv5_default_1d656c94-1395-4deb-89f8-0f844d989e52_0
85951e033c26   nginx                                               "/docker-entrypoint.…"   12 minutes ago   Up 12 minutes             k8s_nginx_nginx-85b98978db-hvn2z_default_b91fe8a0-253e-42e4-843f-199967d87a9d_1
90ec4ddecc9f   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_virt-controller-695cc98c56-j4wxv_kubevirt_1021a9cd-e233-470e-8ca3-4979315c31a4_4
69852e3414e7   943b496a674d                                        "virt-api --port 844…"   12 minutes ago   Up 12 minutes             k8s_virt-api_virt-api-69d978dd67-rp8np_kubevirt_634e5490-d783-4fdb-ba11-3bf1558b37ae_1
c77366a0ac41   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_yurt-app-manager-b8677d956-4b9pf_kube-system_8a4afc07-7d42-4e64-b0b9-27b344eec936_3
fbaedb517e8b   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_virt-handler-q5sqh_kubevirt_e2d38ee6-80f3-4360-8321-cbf3b40d1985_4
3ebdbf1da63c   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_virt-operator-58cb8475bb-6mswb_kubevirt_46eb2665-d9b0-4c51-9827-f87ac1ab8985_4
6009f65b3444   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_nginx-85b98978db-hvn2z_default_b91fe8a0-253e-42e4-843f-199967d87a9d_3
bb946f591b78   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_virt-api-69d978dd67-rp8np_kubevirt_634e5490-d783-4fdb-ba11-3bf1558b37ae_3
702909f7a174   11ae74319a21                                        "/opt/bin/flanneld -…"   12 minutes ago   Up 12 minutes             k8s_kube-flannel_kube-flannel-ds-dkxk2_kube-flannel_37eb9f49-5338-4fb6-bd97-563d0ff098be_1
3a8b0a92ff21   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_kube-flannel-ds-dkxk2_kube-flannel_37eb9f49-5338-4fb6-bd97-563d0ff098be_1
9c92d6fdedde   e03484a90585                                        "/usr/local/bin/kube…"   12 minutes ago   Up 12 minutes             k8s_kube-proxy_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_1
7acff52536ea   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 12 minutes ago   Up 12 minutes             k8s_POD_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_1
b0b78516b422   f4fba699ab86                                        "yurthub --v=2 --ser…"   25 minutes ago   Up 25 minutes             k8s_yurt-hub_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_1
5d79eea5b086   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 25 minutes ago   Up 25 minutes             k8s_POD_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_1

Now, restart the edge node while keeping the cloud node down:

box@joez-hce-ub20-vm-virt-w:~$ sudo reboot
[sudo] password for box:
Connection to 10.67.108.242 closed by remote host.
Connection to 10.67.108.242 closed.

After the reboot, neither the nginx pod nor the kubevirt VM is launched:

box@joez-hce-ub20-vm-virt-w:~$ docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS          PORTS     NAMES
9a5b19370a11   f4fba699ab86                                        "yurthub --v=2 --ser…"   13 minutes ago   Up 13 minutes             k8s_yurt-hub_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_2
cea81f7a1578   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 13 minutes ago   Up 13 minutes             k8s_POD_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_2
joez commented 1 year ago

Here are my steps to set up the OpenYurt cluster and deploy KubeVirt:

label nodes and activate node autonomy:

cloud_node=$(kubectl get node -l node-role.kubernetes.io/master -o name | sed -e s:node/::)
edge_node=$(kubectl get node -o name | grep -v $cloud_node | sed -e s:node/::)
kubectl label node $cloud_node openyurt.io/is-edge-worker=false
kubectl label node $edge_node openyurt.io/is-edge-worker=true
kubectl annotate node $edge_node node.beta.openyurt.io/autonomy=true
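
A quick check (not part of the original steps) that the label and the autonomy annotation took effect:

kubectl get node -L openyurt.io/is-edge-worker
kubectl get node $edge_node -o jsonpath='{.metadata.annotations.node\.beta\.openyurt\.io/autonomy}'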

deploy control-plane components on cloud node:

helm repo add openyurt https://openyurtio.github.io/openyurt-helm

# deploy yurt-app-manager first
helm upgrade --install -n kube-system yurt-app-manager openyurt/yurt-app-manager
# then yurt-controller-manager
helm upgrade --install -n kube-system --version 1.2.0 openyurt openyurt/openyurt

# check the result
helm list -A
# openyurt-1.2.0 1.2.0
#  yurt-app-manager-0.1.3  0.6.0
kubectl get po -A | grep yurt

setup yurthub on edge node:

# find your kube-apiserver and token
kube_api=10.67.108.194:6443
token=0ide56.gzkntj0zwbh2qhfe
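# (assumed) on the cloud node these values can be obtained with, e.g.:
#   kubectl get endpoints kubernetes   # shows the kube-apiserver address
#   sudo kubeadm token create          # prints a fresh bootstrap token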

# deploy yurthub
curl -LO https://raw.githubusercontent.com/openyurtio/openyurt/release-v1.2/config/setup/yurthub.yaml
sed "s/__kubernetes_master_address__/$kube_api/;s/__bootstrap_token__/$token/" yurthub.yaml | sudo tee /etc/kubernetes/manifests/yurthub.yaml

# create kubeconfig
sudo mkdir -p /var/lib/openyurt
cat << EOF | sudo tee /var/lib/openyurt/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    server: http://127.0.0.1:10261
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
EOF

# let kubelet use the new kubeconfig
sudo sed -i.bak 's#KUBELET_KUBECONFIG_ARGS=.*"#KUBELET_KUBECONFIG_ARGS=--kubeconfig=/var/lib/openyurt/kubelet.conf"#g' /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

# restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# check status
sudo systemctl status kubelet
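
As a sanity check, yurthub's local health endpoint can also be queried (the /v1/healthz path on yurthub's local port 10267 is an assumption based on the OpenYurt docs):

curl http://127.0.0.1:10267/v1/healthz
# should print OK when yurthub is healthy (assumed)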

deploy KubeVirt:

VERSION=v0.58.0
kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/kubevirt-operator.yaml
kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/kubevirt-cr.yaml

# wait until kubevirt.kubevirt.io/kubevirt is deployed
kubectl get -n kubevirt kv/kubevirt -w

Deploy virtctl:

VERSION=$(kubectl get kubevirt.kubevirt.io/kubevirt -n kubevirt -o=jsonpath="{.status.observedKubeVirtVersion}")
ARCH=$(uname -s | tr A-Z a-z)-$(uname -m | sed 's/x86_64/amd64/')
curl -L -o virtctl https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/virtctl-${VERSION}-${ARCH}
chmod +x virtctl
sudo install virtctl /usr/local/bin

Deploy VM for test:

kubectl apply -f https://kubevirt.io/labs/manifests/vm.yaml
kubectl get vms
# start VM
virtctl start testvm

# check status
kubectl get vmis

# access console
virtctl console testvm
joez commented 1 year ago

@rambohe-ch @gnunu

After booting up the cloud node again, all the pods are started again on the edge node. Here are the files in the cache:

root@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache# find kubelet/ -maxdepth 2
kubelet/
kubelet/leases.v1.coordination.k8s.io
kubelet/leases.v1.coordination.k8s.io/kube-node-lease
kubelet/events.v1.core
kubelet/events.v1.core/kubevirt
kubelet/events.v1.core/default
kubelet/events.v1.core/kube-flannel
kubelet/events.v1.core/kube-system
root@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache# find yurthub/ -maxdepth 3
yurthub/
yurthub/services.v1.core
yurthub/services.v1.core/kubevirt
yurthub/services.v1.core/kubevirt/kubevirt-operator-webhook
yurthub/services.v1.core/kubevirt/kubevirt-prometheus-metrics
yurthub/services.v1.core/kubevirt/virt-exportproxy
yurthub/services.v1.core/kubevirt/virt-api
yurthub/services.v1.core/default
yurthub/services.v1.core/default/nginx
yurthub/services.v1.core/default/kubernetes
yurthub/services.v1.core/kube-system
yurthub/services.v1.core/kube-system/pool-coordinator-etcd
yurthub/services.v1.core/kube-system/pool-coordinator-apiserver
yurthub/services.v1.core/kube-system/kube-dns
yurthub/services.v1.core/kube-system/yurt-app-manager-webhook
yurthub/nodepools.v1alpha1.apps.openyurt.io
yurthub/nodepools.v1alpha1.apps.openyurt.io/master
yurthub/configmaps.v1.core
yurthub/configmaps.v1.core/kube-system
yurthub/configmaps.v1.core/kube-system/yurt-hub-cfg
gnunu commented 1 year ago

@rambohe-ch @joez In this case, the master node is shut down; I am not sure whether that is fully considered in OpenYurt. When the master is down, is yurthub still healthy enough for kubelet?

Congrool commented 1 year ago

In this case, the master node is shut down; I am not sure whether that is fully considered in OpenYurt. When the master is down, is yurthub still healthy enough for kubelet?

Yes, we expect that pods on the edge can recover even when the master is down. @gnunu

First, I have to say that openyurt + kubevirt has not been tested yet. From my perspective, yurthub provides an edge-local cache for generic usage, so it can theoretically support the recovery of kubevirt. However, yurthub does not cache resources for all edge components, and in this case I think the kubevirt-related resources were not cached. You may check the cache_agents configmap to see whether you have enabled yurthub to cache for kubevirt:

$ kubectl get cm yurt-hub-cfg -nkube-system -oyaml
apiVersion: v1
data:
  cache_agents: ""
  discardcloudservice: ""
  masterservice: ""
  servicetopology: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2023-04-24T03:14:06Z"
  name: yurt-hub-cfg
  namespace: kube-system
  resourceVersion: "842"
  uid: ad2d8249-b16a-44f6-981b-c410ac93827b

cache_agents: "" means using the default settings. To enable the kubevirt cache, simply change it to cache_agents: "*", which enables caching for all edge components.
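
For example, a minimal sketch of one way to apply that change (a kubectl merge patch; editing the ConfigMap by hand works just as well):

kubectl patch cm yurt-hub-cfg -n kube-system --type merge -p '{"data":{"cache_agents":"*"}}'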

However, it seems that the openyurt cluster was in an abnormal situation. We expect the kubelet cache to contain pods, configmaps, and other resources that enable pod recovery when the master has shut down. It should look like the following:

root@openyurt-e2e-test-worker:/etc/kubernetes/cache# find kubelet/ -maxdepth 2
kubelet/
kubelet/leases.v1.coordination.k8s.io
kubelet/leases.v1.coordination.k8s.io/kube-node-lease
kubelet/nodes.v1.core
kubelet/nodes.v1.core/openyurt-e2e-test-worker
kubelet/csinodes.v1.storage.k8s.io
kubelet/csinodes.v1.storage.k8s.io/openyurt-e2e-test-worker
kubelet/csidrivers.v1.storage.k8s.io
kubelet/services.v1.core
kubelet/services.v1.core/default
kubelet/services.v1.core/kube-system
kubelet/events.v1.core
kubelet/events.v1.core/default
kubelet/events.v1.core/kube-system
kubelet/runtimeclasses.v1.node.k8s.io
kubelet/pods.v1.core
kubelet/pods.v1.core/kube-system
kubelet/configmaps.v1.core
kubelet/configmaps.v1.core/kube-system

Maybe there's something wrong in yurthub. Could you check the log of yurthub on the worker node while the master is running? It should cache these resources from the master when everything is OK. @joez
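
For reference, one way to capture that log on the worker node (assuming yurthub runs as the static pod shown in the docker ps output above):

docker logs $(docker ps | grep k8s_yurt-hub | grep -v POD | awk '{print $1}') > yurthub.log 2>&1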

joez commented 1 year ago

@Congrool Let us check the container workload (nginx) first, and the KubeVirt VM as the next step.

Here is the output when the master node is connected:

It seems there are no pod objects cached:

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ sudo find kubelet/ -maxdepth 2
kubelet/
kubelet/leases.v1.coordination.k8s.io
kubelet/leases.v1.coordination.k8s.io/kube-node-lease
kubelet/events.v1.core
kubelet/events.v1.core/kubevirt
kubelet/events.v1.core/default
kubelet/events.v1.core/kube-flannel
kubelet/events.v1.core/kube-system
box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ sudo find yurthub/ -maxdepth 3
yurthub/
yurthub/services.v1.core
yurthub/services.v1.core/kubevirt
yurthub/services.v1.core/kubevirt/kubevirt-operator-webhook
yurthub/services.v1.core/kubevirt/kubevirt-prometheus-metrics
yurthub/services.v1.core/kubevirt/virt-exportproxy
yurthub/services.v1.core/kubevirt/virt-api
yurthub/services.v1.core/default
yurthub/services.v1.core/default/nginx
yurthub/services.v1.core/default/kubernetes
yurthub/services.v1.core/kube-system
yurthub/services.v1.core/kube-system/pool-coordinator-etcd
yurthub/services.v1.core/kube-system/pool-coordinator-apiserver
yurthub/services.v1.core/kube-system/kube-dns
yurthub/services.v1.core/kube-system/yurt-app-manager-webhook
yurthub/nodepools.v1alpha1.apps.openyurt.io
yurthub/nodepools.v1alpha1.apps.openyurt.io/master
yurthub/configmaps.v1.core
yurthub/configmaps.v1.core/kube-system
yurthub/configmaps.v1.core/kube-system/yurt-hub-cfg

The log of yurthub: yurthub-normal.txt. And the node labels and annotations:

                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=joez-hce-ub20-vm-virt-w
                    kubernetes.io/os=linux
                    kubevirt.io/schedulable=true
                    openyurt.io/is-edge-worker=true
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"3e:c2:32:7b:8a:0b"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.67.108.242
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    kubevirt.io/heartbeat: 2023-04-24T06:21:10Z
                    node.alpha.kubernetes.io/ttl: 0
                    node.beta.openyurt.io/autonomy: true
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Addresses:
  InternalIP:  10.67.108.242
  Hostname:    joez-hce-ub20-vm-virt-w

Current yurt-hub-cfg:

box@joez-hce-ub20-vm-virt-m:~$ kubectl get cm yurt-hub-cfg -nkube-system -oyaml
apiVersion: v1
data:
  cache_agents: ""
  discardcloudservice: ""
  masterservice: ""
  servicetopology: ""
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: openyurt
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-04-22T01:09:02Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: yurt-hub-cfg
  namespace: kube-system
  resourceVersion: "208527"
  uid: 4754f00f-314b-4311-91bc-3e1778de2d95
joez commented 1 year ago

After enabling caching for all edge components by setting cache_agents: "*", I can see the cache as below:

Most of the objects are in the go-http-client/ folder

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ find kubelet/ go-http-client/ yurthub/ -maxdepth 2
kubelet/
kubelet/leases.v1.coordination.k8s.io
kubelet/leases.v1.coordination.k8s.io/kube-node-lease
kubelet/events.v1.core
kubelet/events.v1.core/kubevirt
kubelet/events.v1.core/default
kubelet/events.v1.core/kube-flannel
kubelet/events.v1.core/kube-system
go-http-client/
go-http-client/services.v1.core
go-http-client/services.v1.core/kubevirt
go-http-client/services.v1.core/default
go-http-client/services.v1.core/kube-system
go-http-client/leases.v1.coordination.k8s.io
go-http-client/leases.v1.coordination.k8s.io/kube-node-lease
go-http-client/csidrivers.v1.storage.k8s.io
go-http-client/csinodes.v1.storage.k8s.io
go-http-client/csinodes.v1.storage.k8s.io/joez-hce-ub20-vm-virt-w
go-http-client/pods.v1.core
go-http-client/pods.v1.core/kubevirt
go-http-client/pods.v1.core/default
go-http-client/pods.v1.core/kube-flannel
go-http-client/pods.v1.core/kube-system
go-http-client/secrets.v1.core
go-http-client/secrets.v1.core/kubevirt
go-http-client/secrets.v1.core/kube-system
go-http-client/runtimeclasses.v1.node.k8s.io
go-http-client/configmaps.v1.core
go-http-client/configmaps.v1.core/kubevirt
go-http-client/configmaps.v1.core/default
go-http-client/configmaps.v1.core/kube-flannel
go-http-client/configmaps.v1.core/kube-system
go-http-client/nodes.v1.core
go-http-client/nodes.v1.core/joez-hce-ub20-vm-virt-w
yurthub/
yurthub/services.v1.core
yurthub/services.v1.core/kubevirt
yurthub/services.v1.core/default
yurthub/services.v1.core/kube-system
yurthub/nodepools.v1alpha1.apps.openyurt.io
yurthub/nodepools.v1alpha1.apps.openyurt.io/master
yurthub/configmaps.v1.core
yurthub/configmaps.v1.core/kube-system

Then disconnect the edge node from the cloud node by applying iptables rules on the cloud node:

box@joez-hce-ub20-vm-virt-m:~$ kubectl get no -o wide
NAME                      STATUS   ROLES                  AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
joez-hce-ub20-vm-virt-m   Ready    control-plane,master   3d15h   v1.23.0   10.67.108.194   <none>        Ubuntu 20.04.5 LTS   5.4.0-147-generic   docker://23.0.4
joez-hce-ub20-vm-virt-w   Ready    <none>                 3d15h   v1.23.0   10.67.108.242   <none>        Ubuntu 20.04.5 LTS   5.4.0-147-generic   docker://23.0.4

box@joez-hce-ub20-vm-virt-m:~$ sudo iptables -I OUTPUT -d 10.67.108.242 -j DROP

box@joez-hce-ub20-vm-virt-m:~$ kubectl get node
NAME                      STATUS     ROLES                  AGE     VERSION
joez-hce-ub20-vm-virt-m   Ready      control-plane,master   3d15h   v1.23.0
joez-hce-ub20-vm-virt-w   NotReady   <none>                 3d15h   v1.23.0

After rebooting the edge node, more pods are launched, but most of them exit immediately:

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS          PORTS     NAMES
74d04d4c85a1   e03484a90585                                        "/usr/local/bin/kube…"   9 minutes ago    Up 9 minutes              k8s_kube-proxy_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_9
9c8abe396f2a   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 9 minutes ago    Up 9 minutes              k8s_POD_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_9
7a34be3ac8d4   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 9 minutes ago    Up 9 minutes              k8s_POD_kube-flannel-ds-dkxk2_kube-flannel_37eb9f49-5338-4fb6-bd97-563d0ff098be_9
01057f3db2a8   f4fba699ab86                                        "yurthub --v=2 --ser…"   10 minutes ago   Up 10 minutes             k8s_yurt-hub_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_9
cfc07e1e3a66   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 10 minutes ago   Up 10 minutes             k8s_POD_yurt-hub-joez-hce-ub20-vm-virt-w_kube-system_21482483ffe45101b48a34a036517322_9
box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker ps -a | head
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS                              PORTS     NAMES
860d95171ea6   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 1 second ago     Exited (0) Less than a second ago             k8s_POD_virt-controller-695cc98c56-j4wxv_kubevirt_1021a9cd-e233-470e-8ca3-4979315c31a4_523
1a115156e86e   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 1 second ago     Exited (0) Less than a second ago             k8s_POD_nginx-85b98978db-hvn2z_default_b91fe8a0-253e-42e4-843f-199967d87a9d_520
079a609b6ff5   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 1 second ago     Exited (0) Less than a second ago             k8s_POD_virt-operator-58cb8475bb-6mswb_kubevirt_46eb2665-d9b0-4c51-9827-f87ac1ab8985_518
ce7af82dc85b   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 1 second ago     Exited (0) Less than a second ago             k8s_POD_virt-api-69d978dd67-rp8np_kubevirt_634e5490-d783-4fdb-ba11-3bf1558b37ae_526
1ffa1e77096f   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 2 seconds ago    Exited (0) Less than a second ago             k8s_POD_yurt-app-manager-b8677d956-4b9pf_kube-system_8a4afc07-7d42-4e64-b0b9-27b344eec936_521
1a5d0c203575   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 2 seconds ago    Exited (0) Less than a second ago             k8s_POD_virt-handler-q5sqh_kubevirt_e2d38ee6-80f3-4360-8321-cbf3b40d1985_525
5681f0ea254c   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 3 seconds ago    Exited (0) 1 second ago                       k8s_POD_virt-api-69d978dd67-rp8np_kubevirt_634e5490-d783-4fdb-ba11-3bf1558b37ae_525
1f4dca4c6dae   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 3 seconds ago    Exited (0) 1 second ago                       k8s_POD_yurt-app-manager-b8677d956-4b9pf_kube-system_8a4afc07-7d42-4e64-b0b9-27b344eec936_520
efe33c3130f5   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                 3 seconds ago    Exited (0) 1 second ago                       k8s_POD_nginx-85b98978db-hvn2z_default_b91fe8a0-253e-42e4-843f-199967d87a9d_519
box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker logs 1a115156e86e
Shutting down, got signal: Terminated

Flannel fails to start:

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker ps -a | grep flannel
86fc9c59c029   11ae74319a21                                        "/opt/bin/flanneld -…"   29 seconds ago   Exited (1) 28 seconds ago               k8s_kube-flannel_kube-flannel-ds-dkxk2_kube-flannel_37eb9f49-5338-4fb6-bd97-563d0ff098be_19
...
box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker logs 86fc9c59c029
W0424 15:31:36.399679       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0424 15:31:36.587953       1 main.go:228] Failed to create SubnetManager: error retrieving pod spec for 'kube-flannel/kube-flannel-ds-dkxk2': Get "https://10.96.0.1:443/api/v1/namespaces/kube-flannel/pods/kube-flannel-ds-dkxk2": dial tcp 10.96.0.1:443: connect: connection refused

Can't connect to the api-server via its service cluster IP (handled by kube-proxy):

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ nc -zv 10.96.0.1 443
nc: connect to 10.96.0.1 port 443 (tcp) failed: Connection refused
box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ sudo iptables-save | grep -w 10.96.0.1

# OK on cloud node
box@joez-hce-ub20-vm-virt-m:~$ nc -zv 10.96.0.1 443
Connection to 10.96.0.1 443 port [tcp/https] succeeded!

box@joez-hce-ub20-vm-virt-m:~$ sudo iptables-save | grep -w 10.96.0.1
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 10.244.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ

Check kube-proxy:

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker ps -a | grep proxy
74d04d4c85a1   e03484a90585                                        "/usr/local/bin/kube…"   41 minutes ago           Up 41 minutes                                 k8s_kube-proxy_kube-proxy-sqqck_kube-system_90fecc3f-31b1-4ba6-a825-1c0fa2db64d6_9

box@joez-hce-ub20-vm-virt-w:/etc/kubernetes/cache$ docker logs 74d04d4c85a1 2>&1 | less
E0424 15:08:24.338500       1 node.go:152] Failed to retrieve node info: Get "https://10.67.108.194:6443/api/v1/nodes/joez-hce-ub20-vm-virt-w": dial tcp 10.67.108.194:6443: i/o timeout
E0424 15:09:11.170905       1 node.go:152] Failed to retrieve node info: Get "https://10.67.108.194:6443/api/v1/nodes/joez-hce-ub20-vm-virt-w": dial tcp 10.67.108.194:6443: i/o timeout
I0424 15:09:11.171167       1 server.go:843] "Can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag"
I0424 15:09:11.171257       1 server_others.go:138] "Detected node IP" address="127.0.0.1"
I0424 15:09:11.171865       1 server_others.go:561] "Unknown proxy mode, assuming iptables proxy" proxyMode=""
I0424 15:09:11.234842       1 server_others.go:206] "Using iptables Proxier"
I0424 15:09:11.234963       1 server_others.go:213] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0424 15:09:11.234992       1 server_others.go:214] "Creating dualStackProxier for iptables"
I0424 15:09:11.235069       1 server_others.go:491] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0424 15:09:11.237531       1 server.go:656] "Version info" version="v1.23.0"
I0424 15:09:11.243601       1 conntrack.go:52] "Setting nf_conntrack_max" nf_conntrack_max=131072
I0424 15:09:11.243792       1 conntrack.go:100] "Set sysctl" entry="net/netfilter/nf_conntrack_tcp_timeout_close_wait" value=3600
I0424 15:09:11.244956       1 config.go:317] "Starting service config controller"
I0424 15:09:11.245283       1 config.go:226] "Starting endpoint slice config controller"
I0424 15:09:11.245647       1 shared_informer.go:240] Waiting for caches to sync for service config
I0424 15:09:11.246236       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
W0424 15:09:41.247874       1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Get "https://10.67.108.194:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": dial tcp 10.67.108.194:6443: i/o timeout

The kube-proxy is still trying to get information from the cloud node instead of yurthub; is this the expected behavior?

Congrool commented 1 year ago

Thanks for your detailed logs. I'm not sure why the component "go-http-client" lists pods and configmaps, or what it is.

I0424 06:12:11.015474       1 util.go:248] go-http-client list pods: /api/v1/pods?fieldSelector=spec.nodeName%3Djoez-hce-ub20-vm-virt-w&limit=500&resourceVersion=0 with status code 200, spent 6.076415ms

It is expected to be "kubelet list pods". I noticed that the kubernetes cluster is v1.23.0; maybe it's a compatibility problem between openyurt v1.2.x and kubernetes v1.23.x. For comparison, a normal log looks like:

I0425 03:15:19.952553       1 util.go:255] kubelet list pods: /api/v1/pods?fieldSelector=spec.nodeName%3Dopenyurt-e2e-test-worker&limit=500&resourceVersion=0 with status code 200, spent 9.565219ms

I think this can explain some of the problems we encountered.

Why is the kubelet component cache incomplete?

Because kubelet also uses another User-Agent called go-http-client, which we do not cache by default.

Why does kube-proxy still connect to the cloud node?

This is not what we expected: kube-proxy should fetch resources through yurthub. We use a filter in yurthub to do this. In the normal case, we would find a yurthub log like:

I0425 03:15:19.965994       1 handler.go:79] kubeconfig in configmap(kube-system/kube-proxy) has been commented, new config.conf: 
  #kubeconfig: /var/lib/kube-proxy/kubeconfig.conf

So the configmap mounted by kube-proxy should be modified by yurthub to make kube-proxy use InClusterConfig, which enables it to fetch resources through yurthub. This configmap is fetched by kubelet, so yurthub identifies it using the User-Agent of the kubelet requests, which should originally be User-Agent: kubelet. However, the User-Agent seems to be go-http-client in v1.23.0 and cannot be recognized by yurthub. Thus, kube-proxy uses the unmodified kubeconfig, which connects directly to the cloud APIServer.
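
A quick way to see which version of the config kube-proxy actually received is to look at the config.conf mounted into its container on the edge node (a sketch; it assumes cat is available in the kube-proxy image and reuses the docker ps idiom from earlier in this thread):

docker exec $(docker ps | grep kube-proxy | grep -v POD | awk '{print $1}') cat /var/lib/kube-proxy/config.conf | grep kubeconfig
# when the yurthub filter has worked, the line should be commented out:
#   #kubeconfig: /var/lib/kube-proxy/kubeconfig.conf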

In summary, these problems seem to be introduced by kubernetes v1.23.x, and I think we have to find a solution for it. As a workaround for now, could you please give kubernetes v1.22.x a try? @joez

joez commented 1 year ago

Thanks for your explanation. I will give kubernetes 1.22.0 a try and keep you posted.

BTW, I chose 1.23.0 because the getting-started guide says it is supported:

OpenYurt supports Kubernetes versions up to 1.23. Using higher Kubernetes versions may cause compatibility issues.

joez commented 1 year ago

With the new cluster on k8s 1.22.0, the cache is as expected now.

box@joez-hce-ub20-vm-oykv-w:/etc/kubernetes/cache$ ls
_apis_discovery.k8s.io_v1  _apis_discovery.k8s.io_v1beta1  _internal  flanneld  kubelet  version  yurthub

box@joez-hce-ub20-vm-oykv-w:/etc/kubernetes/cache$ find kubelet/ -maxdepth 2
kubelet/
kubelet/services.v1.core
kubelet/services.v1.core/kubevirt
kubelet/services.v1.core/default
kubelet/services.v1.core/kube-system
kubelet/leases.v1.coordination.k8s.io
kubelet/leases.v1.coordination.k8s.io/kube-node-lease
kubelet/csidrivers.v1.storage.k8s.io
kubelet/csinodes.v1.storage.k8s.io
kubelet/csinodes.v1.storage.k8s.io/joez-hce-ub20-vm-oykv-w
kubelet/pods.v1.core
kubelet/pods.v1.core/kubevirt
kubelet/pods.v1.core/default
kubelet/pods.v1.core/kube-flannel
kubelet/pods.v1.core/kube-system
kubelet/secrets.v1.core
kubelet/secrets.v1.core/kubevirt
kubelet/secrets.v1.core/kube-system
kubelet/runtimeclasses.v1.node.k8s.io
kubelet/configmaps.v1.core
kubelet/configmaps.v1.core/kubevirt
kubelet/configmaps.v1.core/default
kubelet/configmaps.v1.core/kube-flannel
kubelet/configmaps.v1.core/kube-system
kubelet/events.v1.core
kubelet/events.v1.core/kubevirt
kubelet/events.v1.core/default
kubelet/events.v1.core/kube-flannel
kubelet/events.v1.core/kube-system
kubelet/nodes.v1.core
kubelet/nodes.v1.core/joez-hce-ub20-vm-oykv-w

But kube-proxy is still trying to connect to the kube-apiserver instead of yurthub. So I deployed yurthub on the cloud node too, and now I can see the configuration of kube-proxy has changed:

I0425 07:28:48.980817       1 filter.go:92] kubeconfig in configmap(kube-system/kube-proxy) has been commented, new config.conf:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
...
  #kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
  qps: 0
clusterCIDR: 10.244.0.0/16

And kube-proxy, as well as flannel and nginx, can be launched:

box@joez-hce-ub20-vm-oykv-w:~$ docker ps | grep nginx | grep -v POD | awk '{print $1}'
13887b771982
box@joez-hce-ub20-vm-oykv-w:~$ docker exec 13887b771982 cat /proc/net/fib_trie | awk '/32 host/ { print i } {i=$2}' | grep -v 127.0 | uniq
10.244.1.37
box@joez-hce-ub20-vm-oykv-w:~$ no_proxy='*' curl -s 10.244.1.37:80 | grep Welcome
<title>Welcome to nginx!</title>

But there are still two problems:

# no VM is running
box@joez-hce-ub20-vm-oykv-w:~$ ps -ef | grep qemu | grep -v grep

box@joez-hce-ub20-vm-oykv-w:~$ docker ps | grep virt-handler | grep -v POD | awk '{print $1}'
90da7040e86a
box@joez-hce-ub20-vm-oykv-w:~$ docker logs 90da7040e86a 2>&1 | less
W0425 14:03:54.447851    8650 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"Unable to mark node as unschedulable":"can not cache for go-http-client patch nodes: /api/v1/nodes/joez-hce-ub20-vm-oykv-w","component":"virt-handler","level":"error","pos":"virt-handler.go:179","timestamp":"2023-04-25T14:03:54.503437Z"}
{"component":"virt-handler","level":"info","msg":"set verbosity to 2","pos":"virt-handler.go:471","timestamp":"2023-04-25T14:03:54.505689Z"}
...

# kube-proxy can't get service objects

box@joez-hce-ub20-vm-oykv-w:~$ docker ps | grep kube-proxy | grep -v POD | awk '{print $1}'
440188760b30
box@joez-hce-ub20-vm-oykv-w:~$ docker logs 440188760b30 2>&1 | less
I0425 14:03:47.152925       1 server.go:553] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.

E0425 14:04:34.012669       1 node.go:161] Failed to retrieve node info: Get "https://169.254.2.1:10268/api/v1/nodes/joez-hce-ub20-vm-oykv-w": Service Unavailable
I0425 14:04:34.012744       1 server.go:836] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag
I0425 14:04:34.013173       1 server_others.go:140] Detected node IP 127.0.0.1
W0425 14:04:34.013292       1 server_others.go:565] Unknown proxy mode "", assuming iptables proxy
I0425 14:04:34.071984       1 server_others.go:206] kube-proxy running in dual-stack mode, IPv4-primary
I0425 14:04:34.072103       1 server_others.go:212] Using iptables Proxier.
I0425 14:04:34.072131       1 server_others.go:219] creating dualStackProxier for iptables.
W0425 14:04:34.072193       1 server_others.go:495] detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6
I0425 14:04:34.073932       1 server.go:649] Version: v1.22.0
I0425 14:04:34.078625       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0425 14:04:34.078684       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0425 14:04:34.079048       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0425 14:04:34.079245       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0425 14:04:34.080453       1 config.go:315] Starting service config controller
I0425 14:04:34.080490       1 shared_informer.go:240] Waiting for caches to sync for service config
I0425 14:04:34.080534       1 config.go:224] Starting endpoint slice config controller
I0425 14:04:34.080542       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
E0425 14:04:37.012366       1 event_broadcaster.go:262] Unable to write event: 'Post "https://169.254.2.1:10268/apis/events.k8s.io/v1/namespaces/default/events": Service Unavailable' (may retry after sleeping)
E0425 14:04:37.012666       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://169.254.2.1:10268/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": Service Unavailable
...
E0425 15:00:28.395624       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://169.254.2.1:10268/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": Service Unavailable

box@joez-hce-ub20-vm-oykv-w:~$ ip a
...
4: yurthub-dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 56:f9:05:f9:60:8b brd ff:ff:ff:ff:ff:ff
    inet 169.254.2.1/32 scope global yurthub-dummy0
       valid_lft forever preferred_lft forever

box@joez-hce-ub20-vm-oykv-w:~$ sudo ss -lntp
State             Recv-Q            Send-Q                       Local Address:Port                         Peer Address:Port            Process
LISTEN            0                 4096                           169.254.2.1:10261                             0.0.0.0:*                users:(("yurthub",pid=1600,fd=9))
LISTEN            0                 4096                             127.0.0.1:10261                             0.0.0.0:*                users:(("yurthub",pid=1600,fd=8))
LISTEN            0                 4096                         127.0.0.53%lo:53                                0.0.0.0:*                users:(("systemd-resolve",pid=690,fd=13))
LISTEN            0                 128                                0.0.0.0:22                                0.0.0.0:*                users:(("sshd",pid=1746,fd=3))
LISTEN            0                 4096                             127.0.0.1:10267                             0.0.0.0:*                users:(("yurthub",pid=1600,fd=7))
LISTEN            0                 4096                           169.254.2.1:10268                             0.0.0.0:*                users:(("yurthub",pid=1600,fd=10))
LISTEN            0                 4096                             127.0.0.1:10248                             0.0.0.0:*                users:(("kubelet",pid=718,fd=18))
LISTEN            0                 4096                             127.0.0.1:10249                             0.0.0.0:*                users:(("kube-proxy",pid=2686,fd=18))
LISTEN            0                 4096                             127.0.0.1:34505                             0.0.0.0:*                users:(("kubelet",pid=718,fd=12))
LISTEN            0                 4096                                     *:10256                                   *:*                users:(("kube-proxy",pid=2686,fd=19))
LISTEN            0                 128                                   [::]:22                                   [::]:*                users:(("sshd",pid=1746,fd=4))
LISTEN            0                 4096                                     *:10250                                   *:*                users:(("kubelet",pid=718,fd=35))
joez commented 1 year ago

@rambohe-ch @Congrool would you help check the kube-proxy issue? It prevents service-to-service communication from working.

As I mentioned last time, the kube-proxy does not work as expected:

box@joez-hce-ub20-vm-oykv-w:~$ sudo iptables -t nat -n -L KUBE-SERVICES
iptables: No chain/target/match by that name.

In the normal case it should have set up iptables rules, like the following:

box@joez-hce-ub20-vm-virt-w:~$ sudo iptables -t nat -n -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-SVC-OVTWZ4GROBJZO4C5  tcp  --  0.0.0.0/0            10.96.165.12         /* default/nginx:80-80 cluster IP */ tcp dpt:80
KUBE-SVC-EIEVNBW5YXUIDXZD  tcp  --  0.0.0.0/0            10.96.186.205        /* kubevirt/kubevirt-prometheus-metrics:metrics cluster IP */ tcp dpt:443
KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-LON7267IY6XCAPHT  tcp  --  0.0.0.0/0            10.96.62.36          /* kube-system/yurt-app-manager-webhook:https cluster IP */ tcp dpt:443
KUBE-SVC-GXXJIUUZRDUOXB4K  tcp  --  0.0.0.0/0            10.96.28.38          /* kubevirt/kubevirt-operator-webhook:webhooks cluster IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-UDPDOKU2AFJKWYNL  tcp  --  0.0.0.0/0            10.96.123.232        /* kubevirt/virt-api cluster IP */ tcp dpt:443
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

So maybe something is wrong that prevents kube-proxy from getting enough information from yurt-hub to set up the iptables rules.

Congrool commented 1 year ago

@joez Hey, sorry for the late reply. It seems that kube-proxy cannot access the yurthub server. We may need to check whether the yurthub server still works.

You can use the following command on your host joez-hce-ub20-vm-virt-w when the worker is disconnected from the master:

curl -H "User-Agent: kube-proxy" http://127.0.0.1:10261/api/v1/nodes/joez-hce-ub20-vm-virt-w

In the normal case, yurthub uses the node cache of the kube-proxy component under /etc/kubernetes/cache/kube-proxy/nodes.v1.core to respond to the request, and you should get JSON output for that node. You can also check whether such a cache exists at that path.

And, could you post your kube-proxy version? BTW, in my cluster, the kube-proxy is v1.22.7.

joez commented 1 year ago

I currently have two k8s clusters: joez-hce-ub20-vm-virt-{m,w} is v1.23.0 and joez-hce-ub20-vm-oykv-{m,w} is v1.22.0. Let us focus on the latter.

The kube-proxy version is v1.22.0

I0425 06:34:34.107118       1 server.go:649] Version: v1.22.0

There is no such cache

root@joez-hce-ub20-vm-oykv-w:/home/box# ls /etc/kubernetes/cache/kube-proxy/nodes.v1.core
ls: cannot access '/etc/kubernetes/cache/kube-proxy/nodes.v1.core': No such file or directory
root@joez-hce-ub20-vm-oykv-w:/home/box# ls /etc/kubernetes/cache/
_apis_discovery.k8s.io_v1  _apis_discovery.k8s.io_v1beta1  flanneld  go-http-client  _internal  kubelet  version  virt-api  virt-controller  yurt-app-manager  yurthub

Access to port 10261 is OK via cURL:

box@joez-hce-ub20-vm-oykv-w:~$ no_proxy='*' curl -H "User-Agent: kube-proxy" -o /dev/null -s -w '%{http_code}\n' http://127.0.0.1:10261/api/v1/nodes/joez-hce-ub20-vm-oykv-w
200

Is accessing 127.0.0.1:10261 the same as 169.254.2.1:10268? I see this error in the logs:

E0425 14:04:34.012669       1 node.go:161] Failed to retrieve node info: Get "https://169.254.2.1:10268/api/v1/nodes/joez-hce-ub20-vm-oykv-w": Service Unavailable

The no_proxy variable in the kube-proxy container does not cover 169.254.2.1; maybe I need to add the Automatic Private IP Addressing (APIPA) range to it.

box@joez-hce-ub20-vm-oykv-w:~$ docker exec 4bcb029c86f7 env | grep no_proxy
no_proxy=.svc,.svc.cluster.local,10.244.0.0/16,10.96.0.0/16,localhost,joez-hce-ub20-vm-openyurt-m,sh.intel.com,istio-system.svc,127.0.0.0/8,172.16.0.0/12,192.168.0.0/16,10.0.0.0/8
Congrool commented 1 year ago

There is no such cache

Well, it's strange that there's no cache for kube-proxy. It should be there to make kube-proxy work when the worker is offline, like:

# ls /etc/kubernetes/cache/
_apis_discovery.k8s.io_v1  _internal  coredns  kube-proxy  kubelet  version  yurthub

# ls /etc/kubernetes/cache/kube-proxy/
endpointslices.v1.discovery.k8s.io  events.v1.events.k8s.io  nodes.v1.core  services.v1.core

You can make the worker connect to the master and then restart kube-proxy on the worker node, at which point yurthub will cache the responses from the master. Could you give it a try? Once the cache has been created, kube-proxy can restart and work even when the worker is offline.
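
For example, a sketch of one way to do that (the k8s-app=kube-proxy label is the kubeadm default; the node name is the worker from this thread):

# while the worker can still reach the master, recreate its kube-proxy pod
kubectl -n kube-system delete pod -l k8s-app=kube-proxy --field-selector spec.nodeName=joez-hce-ub20-vm-oykv-w

# after the replacement pod starts, the cache should appear on the worker
ls /etc/kubernetes/cache/kube-proxy/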

Is accessing 127.0.0.1:10261 the same as 169.254.2.1:10268?

Yes, the yurthub server actually listens on both addresses with the same handler.

Access to port 10261 is OK via cURL

It should return not only the status code but also the JSON data of the node resource, like:

# curl -H "User-Agent: kube-proxy" http://127.0.0.1:10261/api/v1/nodes/openyurt-e2e-test-worker
{"kind":"Node","apiVersion":"v1","metadata":{"name":"openyurt-e2e-test-worker","uid":"fb53f206-0ba0-44c5-a0eb-253d953a925b","resourceVersion":"903","creationTimestamp":"2023-04-25T03:13:31Z","labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"openyurt-e2e-test-worker","kubernetes.io/os":"linux","openyurt.io/is-edge-worker":"true"},"annotations":{"kubeadm.alpha.kubernetes.io/cri-socket":"unix:///run/containerd/containerd.sock","node.alpha.kubernetes.io/ttl":"0","node.beta.openyurt.io/autonomy":"false","volumes.kubernetes.io/controller-managed-attach-detach":"true"},"managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2023-04-25T03:13:31Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:volumes.kubernetes.io/controller-managed-attach-detach":{}},"f:labels":{".":{},"f:beta.kubernetes.io/arch":{},"f:beta.kubernetes.io/os":{},"f:kubernetes.io/arch":{},"f:kubernetes.io/hostname":{},"f:kubernetes.io/os":{}}},"f:spec":{"f:providerID":{}}}},{"manager":"kubeadm","operation":"Update","apiVersion":"v1","time":"2023-04-25T03:13:32Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{"f:kubeadm.alpha.kubernetes.io/cri-socket":{}}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2023-04-25T03:14:31Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"DiskPressure\"}":{"f:lastHeartbeatTime":{}},"k:{\"type\":\"MemoryPressure\"}":{"f:lastHeartbeatTime":{}},"k:{\"type\":\"PIDPressure\"}":{"f:lastHeartbeatTime":{}},"k:{\"type\":\"Ready\"}":{"f:lastHeartbeatTime":{},"f:lastTransitionTime":{},"f:message":{},"f:reason":{},"f:status":{}}},"f:images":{}}},"subresource":"status"},{"manager":"yurtctl","operation":"Update","apiVersion":"v1","time":"2023-04-25T03:14:57Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{"f:node.beta.openyurt.io/autonomy":{}},"f:labels":{"f:openyurt.io/is-edge-worker":{}}}}},{"manager":"kube-controller-manager","operation":"Update","apiVersion":"v1","time":"2023-04-25T03:15:19Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{"f:node.alpha.kubernetes.io/ttl":{}}},"f:spec":{"f:podCIDR":{},"f:podCIDRs":{".":{},"v:\"10.244.1.0/24\"":{}}}}}]},"spec":{"podCIDR":"10.244.1.0/24","podCIDRs":["10.244.1.0/24"],"providerID":"kind://docker/openyurt-e2e-test/openyurt-e2e-test-worker"},"status":{"capacity":{"cpu":"8","ephemeral-storage":"102350Mi","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"40971612Ki","pods":"110"},"allocatable":{"cpu":"8","ephemeral-storage":"102350Mi","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"40971612Ki","pods":"110"},"conditions":[{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2023-04-25T03:15:19Z","lastTransitionTime":"2023-04-25T03:13:31Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2023-04-25T03:15:19Z","lastTransitionTime":"2023-04-25T03:13:31Z","reason":"KubeletHasNoDiskPressure","message":"kubelet has no disk pressure"},{"type":"PIDPressure","status":"False","lastHeartbeatTime":"2023-04-25T03:15:19Z","lastTransitionTime":"2023-04-25T03:13:31Z","reason":"KubeletHasSufficientPID","message":"kubelet has sufficient PID available"},{"type":"Ready","status":"True","lastHeartbeatTime":"2023-04-25T03:15:19Z","lastTransitionTime":"2023-04-25T03:15:19Z","reason":"KubeletReady","message":"kubelet is posting 
ready status"}],"addresses":[{"type":"InternalIP","address":"172.19.0.2"},{"type":"Hostname","address":"openyurt-e2e-test-worker"}],"daemonEndpoints":{"kubeletEndpoint":{"Port":10250}},"nodeInfo":{"machineID":"374ad63edf4d4470a07e6974619f9364","systemUUID":"6b87df04-568e-400a-82fe-8e6b79a81dcc","bootID":"6cfe89bd-e735-4b1c-90f7-3c683e412759","kernelVersion":"5.4.0-146-generic","osImage":"Ubuntu 21.10","containerRuntimeVersion":"containerd://1.5.10","kubeletVersion":"v1.22.7","kubeProxyVersion":"v1.22.7","operatingSystem":"linux","architecture":"amd64"},"images":[{"names":["k8s.gcr.io/kube-proxy:v1.22.7"],"sizeBytes":105458887},{"names":["k8s.gcr.io/etcd:3.5.0-0"],"sizeBytes":99868722},{"names":["k8s.gcr.io/kube-apiserver:v1.22.7"],"sizeBytes":74670034},{"names":["k8s.gcr.io/kube-controller-manager:v1.22.7"],"sizeBytes":67522360},{"names":["docker.io/openyurt/yurthub:v1.2.1"],"sizeBytes":57765800},{"names":["k8s.gcr.io/kube-scheduler:v1.22.7"],"sizeBytes":53923640},{"names":["docker.io/openyurt/yurt-tunnel-agent:v1.2.1"],"sizeBytes":44572610},{"names":["docker.io/kindest/kindnetd:v20211122-a2c10462"],"sizeBytes":40928505},{"names":["docker.io/openyurt/node-servant:v1.2.1"],"sizeBytes":38556748},{"names":["k8s.gcr.io/build-image/debian-base:buster-v1.7.2"],"sizeBytes":21133992},{"names":["k8s.gcr.io/coredns/coredns:v1.8.4"],"sizeBytes":13707249},{"names":["docker.io/rancher/local-path-provisioner:v0.0.14"],"sizeBytes":13367922},{"names":["k8s.gcr.io/pause:3.6"],"sizeBytes":301773}]}}
joez commented 1 year ago

The kube-proxy issue was caused by a wrong no_proxy setting; after adding the APIPA range (169.254.0.0/16) to it, kube-proxy works fine now:

box@joez-hce-ub20-vm-oykv-w:~$ docker exec 82963d60adbd env | grep no_proxy
no_proxy=.svc,.svc.cluster.local,10.244.0.0/16,10.96.0.0/16,localhost,joez-hce-ub20-vm-openyurt-m,sh.intel.com,istio-system.svc,127.0.0.0/8,169.254.0.0/16,172.16.0.0/12,192.168.0.0/16,10.0.0.0/8

box@joez-hce-ub20-vm-oykv-w:~$ docker logs 82963d60adbd
I0505 07:22:06.897972       1 server.go:553] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.
I0505 07:22:06.919089       1 node.go:172] Successfully retrieved node IP: 10.67.109.173
I0505 07:22:06.919132       1 server_others.go:140] Detected node IP 10.67.109.173
W0505 07:22:06.919172       1 server_others.go:565] Unknown proxy mode "", assuming iptables proxy
I0505 07:22:06.964667       1 server_others.go:206] kube-proxy running in dual-stack mode, IPv4-primary
I0505 07:22:06.964704       1 server_others.go:212] Using iptables Proxier.
I0505 07:22:06.964715       1 server_others.go:219] creating dualStackProxier for iptables.
W0505 07:22:06.964730       1 server_others.go:495] detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6
I0505 07:22:06.965123       1 server.go:649] Version: v1.22.0
I0505 07:22:06.969845       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0505 07:22:06.969873       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0505 07:22:06.969957       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0505 07:22:06.969989       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0505 07:22:06.970561       1 config.go:315] Starting service config controller
I0505 07:22:06.970704       1 config.go:224] Starting endpoint slice config controller
I0505 07:22:06.970710       1 shared_informer.go:240] Waiting for caches to sync for service config
I0505 07:22:06.970717       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0505 07:22:07.070911       1 shared_informer.go:247] Caches are synced for service config
I0505 07:22:07.071021       1 shared_informer.go:247] Caches are synced for endpoint slice config

box@joez-hce-ub20-vm-oykv-w:~$ sudo iptables -t nat -n -L KUBE-SERVICES
[sudo] password for box:
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-SVC-OVTWZ4GROBJZO4C5  tcp  --  0.0.0.0/0            10.96.193.48         /* default/nginx:80-80 cluster IP */ tcp dpt:80
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SVC-LON7267IY6XCAPHT  tcp  --  0.0.0.0/0            10.96.135.118        /* kube-system/yurt-app-manager-webhook:https cluster IP */ tcp dpt:443
KUBE-SVC-UDPDOKU2AFJKWYNL  tcp  --  0.0.0.0/0            10.96.221.157        /* kubevirt/virt-api cluster IP */ tcp dpt:443
KUBE-SVC-EIEVNBW5YXUIDXZD  tcp  --  0.0.0.0/0            10.96.197.127        /* kubevirt/kubevirt-prometheus-metrics:metrics cluster IP */ tcp dpt:443
KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0            10.96.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-GXXJIUUZRDUOXB4K  tcp  --  0.0.0.0/0            10.96.49.55          /* kubevirt/kubevirt-operator-webhook:webhooks cluster IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

box@joez-hce-ub20-vm-oykv-w:~$ ls /etc/kubernetes/cache/kube-proxy/
endpointslices.v1.discovery.k8s.io  events.v1.events.k8s.io  nodes.v1.core  services.v1.core

It's time to check virt-handler now; I think it is still trying to talk to kube-apiserver:

...
W0505 07:22:12.698969    6593 reflector.go:324] pkg/controller/virtinformers.go:331: failed to list *v1.VirtualMachineInstance: can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0
E0505 07:22:12.699042    6593 reflector.go:138] pkg/controller/virtinformers.go:331: Failed to watch *v1.VirtualMachineInstance: failed to list *v1.VirtualMachineInstance: can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: //pods/95f6251c-07e3-4987-a14a-ae0af2e0b43a/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","pos":"client.go:303","reason":"context deadline exceeded","timestamp":"2023-05-05T07:22:13.544301Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:526","reason":"context deadline exceeded","timestamp":"2023-05-05T07:22:13.544584Z"}
Congrool commented 1 year ago

To be honest, I'm not familiar with kubevirt, so I can only check the situation from the yurthub side. I'm not sure how many kinds of kubevirt-related components run on worker nodes. I saw that the cache already contains entries like virt-api and virt-controller. Does virt-handler use one of them? In other words, we need to check what User-Agent virt-handler uses when it sends requests.

And another question: does virt-handler have its own kubeconfig? If so, I think we can remove that kubeconfig so it falls back to InClusterConfig, which will make virt-handler send requests to yurthub instead of the apiserver.

joez commented 1 year ago

@Congrool Thank you very much. Maybe I am the first one to use KubeVirt on OpenYurt, but I think more and more users will choose KubeVirt if they want to orchestrate VM workloads (such as apps on Windows), and OpenYurt if they require edge autonomy.

I will check KubeVirt further; the solution should be similar to kube-proxy's, where we customize it to use InClusterConfig as you mentioned. But I don't understand how kube-proxy works with yurt-hub; could you show me some documentation with more details?

Congrool commented 1 year ago

I don't understand how kube-proxy works with yurt-hub; could you show me some documentation with more details?

@joez I can give you some details of it. You can check the doc of yurthub, which gives a rough description of the Data Filtering Framework, though it covers more filters than the ones discussed here. The two main filters that affect kube-proxy are the MasterService filter and the InClusterConfig filter.

The former mainly affects the kubelet when it creates pods and sets envs for them. To be specific, it changes the clusterIP and port of the kubernetes service that kubelet gets from kube-apiserver. You can verify it by

cat /etc/kubernetes/cache/kubelet/services.v1.core/default/kubernetes

whose spec.ClusterIP has been changed to 169.254.2.1 and spec.Ports to 10268. Then, when kubelet creates pods, it sets envs like KUBERNETES_SERVICE_PORT=10268 and KUBERNETES_SERVICE_HOST=169.254.2.1, which components using InClusterConfig will send their requests to. As a result, yurthub serves these requests. You can check these envs within the pods.

With the MasterService filter in place, what we need to do is make sure all components use InClusterConfig, while kube-proxy by default uses the kube-proxy configmap as its kubeconfig. So we also need the InClusterConfig filter, which takes the responsibility of removing kubeconfig.conf from the kube-proxy configmap. Thus, kubelet only gets the modified kube-proxy configmap and mounts it for the kube-proxy pod, enabling it to use InClusterConfig.
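For illustration, here is a minimal sketch (not kube-proxy's actual code) of what any component using InClusterConfig effectively does: client-go builds the API server address from the KUBERNETES_SERVICE_HOST/PORT env vars, so once the MasterService filter rewrites them to 169.254.2.1:10268, the component talks to yurthub transparently:

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// rest.InClusterConfig derives the apiserver address from the
	// KUBERNETES_SERVICE_HOST/PORT env vars injected by kubelet, which the
	// MasterService filter has rewritten to point at yurthub (169.254.2.1:10268).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	fmt.Println("talking to:", cfg.Host, "via", os.Getenv("KUBERNETES_SERVICE_HOST"))

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Every request goes through yurthub, which proxies to kube-apiserver when
	// connected and serves from its local cache when the network is down.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("nodes:", len(nodes.Items))
}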

joez commented 1 year ago

From the error message of virt-handler:

E0506 05:52:02.097983    6593 reflector.go:138] pkg/controller/virtinformers.go:331: Failed to watch *v1.VirtualMachineInstance: failed to list *v1.VirtualMachineInstance: can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0

The user agent is go-http-client, and I can find the cached object:

root@joez-hce-ub20-vm-oykv-w:/etc/kubernetes/cache# ls go-http-client/virtualmachines.v1alpha3.kubevirt.io/default
testvm

The kubernetes service address has already been set to yurt-hub:

root@joez-hce-ub20-vm-oykv-w:/etc/kubernetes/cache# docker exec  c487b5a6a2e7 env | grep KUBERNETES_SERVICE | sort
KUBERNETES_SERVICE_HOST=169.254.2.1
KUBERNETES_SERVICE_PORT=10268
KUBERNETES_SERVICE_PORT_HTTPS=10268

I don't know why it still fails to get the virtualmachineinstances object. Trying to get it via yurt-hub directly gives the same error:


root@joez-hce-ub20-vm-oykv-w:/etc/kubernetes/cache# no_proxy='*' curl -H "User-Agent: go-http-client" -v -L 'http://127.0.0.1:10261/apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0'
* Uses proxy env variable no_proxy == '*'
*   Trying 127.0.0.1:10261...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 10261 (#0)
> GET /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0 HTTP/1.1
> Host: 127.0.0.1:10261
> Accept: */*
> User-Agent: go-http-client
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Type: application/json
< Date: Sat, 06 May 2023 10:11:46 GMT
< Content-Length: 351
<
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29\u0026limit=500\u0026resourceVersion=0","reason":"BadRequest","code":400}
* Connection #0 to host 127.0.0.1 left intact

The error is from pkg/yurthub/proxy/local/local.go:

// localReqCache handles Get/List/Update requests when remote servers are unhealthy
func (lp *LocalProxy) localReqCache(w http.ResponseWriter, req *http.Request) error {
    if !lp.cacheMgr.CanCacheFor(req) {
        klog.Errorf("can not cache for %s", hubutil.ReqString(req))
        return apierrors.NewBadRequest(fmt.Sprintf("can not cache for %s", hubutil.ReqString(req)))
    }

@Congrool Would you shed some light on why the request can not be cached?

joez commented 1 year ago

I think I am getting close to our target, except for the "can not cache" error from yurthub mentioned last time:

@rambohe-ch @Congrool Would you help with this? I don't know why these resources can't be cached.

box@joez-hce-ub20-vm-oykv-w:~$ docker logs fb5695ee3d8b
...
{"component":"virt-handler","level":"info","msg":"STARTING informer vmiInformer-targets","pos":"virtinformers.go:330","timestamp":"2023-05-10T04:25:54.720361Z"}
W0510 04:25:54.751331    7102 reflector.go:324] pkg/controller/virtinformers.go:331: failed to list *v1.VirtualMachineInstance: can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0
E0510 04:25:54.751694    7102 reflector.go:138] pkg/controller/virtinformers.go:331: Failed to watch *v1.VirtualMachineInstance: failed to list *v1.VirtualMachineInstance: can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0

Here are the related logs from yurthub:

I0510 04:25:54.719455       1 util.go:289] start proxying: get /apis/apiextensions.k8s.io/v1/customresourcedefinitions?limit=500&resourceVersion=0, in flight requests: 26
I0510 04:25:54.726858       1 util.go:289] start proxying: get /api/v1/namespaces/kubevirt/configmaps?fieldSelector=metadata.name%3Dkubevirt-ca&limit=500&resourceVersion=0, in flight requests: 27
I0510 04:25:54.727759       1 util.go:248] go-http-client list configmaps: /api/v1/namespaces/kubevirt/configmaps?fieldSelector=metadata.name%3Dkubevirt-ca&limit=500&resourceVersion=0 with status code 200, spent 724.906µs
I0510 04:25:54.729074       1 util.go:289] start proxying: get /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0, in flight requests: 27
I0510 04:25:54.729183       1 util.go:289] start proxying: get /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0, in flight requests: 28
W0510 04:25:54.729291       1 cache_manager.go:769] list requests that have the same path but with different selector, skip cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0
E0510 04:25:54.729347       1 local.go:217] can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0
E0510 04:25:54.729383       1 local.go:87] could not proxy local for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0, can not cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0
I0510 04:25:54.729564       1 util.go:248] go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0 with status code 400, spent 295.259µs

Here is what I have done so far:

box@joez-hce-ub20-vm-oykv-w:~$ docker exec -it 2deb25f5272a sh
/ # nslookup nginx
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      nginx
Address 1: 10.96.2.116 nginx.default.svc.cluster.local
/ # nslookup virt-api.kubevirt
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      virt-api.kubevirt
Address 1: 10.96.206.147 virt-api.kubevirt.svc.cluster.local
/ # ping virt-api.kubevirt
PING virt-api.kubevirt (10.96.206.147): 56 data bytes
64 bytes from 10.96.206.147: seq=0 ttl=241 time=184.707 ms
64 bytes from 10.96.206.147: seq=1 ttl=241 time=202.301 ms
^C
--- virt-api.kubevirt ping statistics ---
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max = 184.707/193.504/202.301 ms
/ # exit

box@joez-hce-ub20-vm-oykv-w:~$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED         STATUS         PORTS     NAMES
fb5695ee3d8b   c407633b131b           "virt-handler --port…"   3 minutes ago   Up 3 minutes             k8s_virt-handler_virt-handler-7q56b_kubevirt_60b59abb-295d-455f-b8a1-a248a5656d7a_8
38068327ac79   alpine                 "/bin/sh"                3 minutes ago   Up 3 minutes             k8s_test_test_default_962446d6-f99d-41d8-b650-6af03f8a007f_10
d6a7f0c799bf   nginx                  "/docker-entrypoint.…"   3 minutes ago   Up 3 minutes             k8s_nginx_nginx-6799fc88d8-j7fwc_default_13d7e957-bc7d-4bfc-8db2-507c70fd240f_14
6e78501b808f   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-operator-55989d567c-p5nkl_kubevirt_d407b21d-1eb7-47ea-8bd0-6daec1bbe747_50
436b91002bab   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-handler-7q56b_kubevirt_60b59abb-295d-455f-b8a1-a248a5656d7a_47
0dc549a31e7a   943b496a674d           "virt-api --port 844…"   3 minutes ago   Up 3 minutes             k8s_virt-api_virt-api-5474cf649d-rlcrw_kubevirt_0218bfee-c273-4ed5-8626-43a88d0f5267_8
13a093ee9642   943b496a674d           "virt-api --port 844…"   3 minutes ago   Up 3 minutes             k8s_virt-api_virt-api-5474cf649d-xp658_kubevirt_f0805cfd-2132-4515-acbc-68c967ab2b22_8
2deb25f5272a   8c811b4aec35           "sleep 3600"             3 minutes ago   Up 3 minutes             k8s_debug_debug_default_8051f632-a8bf-4ca8-801a-4aeca8bcb824_4
d77d27497658   8d147537fb7d           "/coredns -conf /etc…"   3 minutes ago   Up 3 minutes             k8s_coredns_coredns-klpms_kube-system_9c4821ad-f91d-4174-af9a-dfecdbe2321e_5
71ae8a256f11   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-operator-55989d567c-t2n76_kubevirt_dc3b91e2-3751-48af-ac7c-2e5d060b0349_46
1cb7fbce79cf   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-api-5474cf649d-xp658_kubevirt_f0805cfd-2132-4515-acbc-68c967ab2b22_45
c6c7d9405810   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-api-5474cf649d-rlcrw_kubevirt_0218bfee-c273-4ed5-8626-43a88d0f5267_46
0fae5f38ff65   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-controller-7f8ff6cdc4-wcvvb_kubevirt_403a3601-b8e7-4df4-88e1-f93a6a94939c_47
ebd626f49650   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_debug_default_8051f632-a8bf-4ca8-801a-4aeca8bcb824_25
bf519a40f8ef   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_test_default_962446d6-f99d-41d8-b650-6af03f8a007f_77
ee5297767dd0   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_coredns-klpms_kube-system_9c4821ad-f91d-4174-af9a-dfecdbe2321e_37
57e2183d1c56   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_virt-controller-7f8ff6cdc4-hd4ft_kubevirt_1946d9c4-8aa8-498d-b6c0-7fa0812c2da9_51
dddc0e3fdba2   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_nginx-6799fc88d8-j7fwc_default_13d7e957-bc7d-4bfc-8db2-507c70fd240f_92
bdd0259f6407   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_yurt-app-manager-6fd8dcd6b4-9gp6n_kube-system_a117d8a4-da40-4568-ba1a-61b1979a76ed_98
83e7610f3578   11ae74319a21           "/opt/bin/flanneld -…"   3 minutes ago   Up 3 minutes             k8s_kube-flannel_kube-flannel-ds-9mrs8_kube-flannel_2459bd62-295b-4806-a751-ad70a2660c29_16
31ebde671e70   bbad1636b30d           "/usr/local/bin/kube…"   3 minutes ago   Up 3 minutes             k8s_kube-proxy_kube-proxy-9ktvr_kube-system_9404c203-bca0-4598-9aec-6f371e699df4_15
129fcf09327e   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_kube-proxy-9ktvr_kube-system_9404c203-bca0-4598-9aec-6f371e699df4_15
7daf002b5da3   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_kube-flannel-ds-9mrs8_kube-flannel_2459bd62-295b-4806-a751-ad70a2660c29_15
10f0ab49723e   60fb0e90cdfb           "yurthub --v=2 --ser…"   3 minutes ago   Up 3 minutes             k8s_yurt-hub_yurt-hub-joez-hce-ub20-vm-oykv-w_kube-system_dd10f5ec226508a076ff4cffac748add_15
65f925802c20   k8s.gcr.io/pause:3.5   "/pause"                 3 minutes ago   Up 3 minutes             k8s_POD_yurt-hub-joez-hce-ub20-vm-oykv-w_kube-system_dd10f5ec226508a076ff4cffac748add_15
Congrool commented 1 year ago

I saw this in the yurthub logs:

I0510 04:25:54.729074       1 util.go:289] start proxying: get /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0, in flight requests: 27
I0510 04:25:54.729183       1 util.go:289] start proxying: get /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FmigrationTargetNodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0, in flight requests: 28
W0510 04:25:54.729291       1 cache_manager.go:769] list requests that have the same path but with different selector, skip cache for go-http-client list virtualmachineinstances: /apis/kubevirt.io/v1alpha3/virtualmachineinstances?labelSelector=kubevirt.io%2FnodeName+in+%28joez-hce-ub20-vm-oykv-w%29&limit=500&resourceVersion=0

Yurthub emits this warning because it cannot cache list/watch requests from the same component for the same resource with different selectors. In this case, once go-http-client has listed/watched virtualmachineinstances with selector A, yurthub caches all virtualmachineinstances that match selector A. Then, when go-http-client wants to list/watch virtualmachineinstances with a different selector B, yurthub hits a conflict: should it cache the resources matching A or those matching B for go-http-client? Currently we only keep the cache for the first request, so when the second request comes, it is refused.

Now, let's come to the solution. First, I have to say that it seems this cannot be solved just through configuration. We should check whether virt-handler really needs to list/watch virtualmachineinstances with different selectors. I saw that we have virt-handler and virt-api running on the same node. We should check:

  1. whether they both use go-http-client as the User-Agent when sending requests to yurthub
  2. if 1 is true, whether they list/watch virtualmachineinstances with different selectors.

If 2 is true, we can change the User-Agent for virt-handler and virt-api to make them different. @joez

joez commented 1 year ago

@Congrool Sorry for the late reply. I spent some time learning the kubevirt code, and the conclusion is that the current implementation of yurt-hub can't support kubevirt very well: virt-handler needs support for list/watch on the same path with different selectors.

In the normal scenario, you can find this log line in virt-handler:

{"component":"virt-handler","level":"info","msg":"Starting virt-handler controller.","pos":"vm.go:1387"}

But you can't find it in the disconnected scenario. From the code in pkg/virt-handler/vm.go:

func (c *VirtualMachineController) Run(threadiness int, stopCh chan struct{}) {
    defer c.Queue.ShutDown()
    log.Log.Info("Starting virt-handler controller.")

    go c.deviceManagerController.Run(stopCh)

    cache.WaitForCacheSync(stopCh, c.domainInformer.HasSynced, c.vmiSourceInformer.HasSynced, c.vmiTargetInformer.HasSynced, c.gracefulShutdownInformer.HasSynced)
...

VirtualMachineController is created and run by virt-handler; the relevant code is in cmd/virt-handler/virt-handler.go:

func (app *virtHandlerApp) Run() {
...
    vmiSourceInformer := factory.VMISourceHost(app.HostOverride)
    vmiTargetInformer := factory.VMITargetHost(app.HostOverride)
...
    vmController := virthandler.NewController(
        recorder,
        app.virtCli,
        app.HostOverride,
        migrationIpAddress,
        app.VirtShareDir,
        app.VirtPrivateDir,
        vmiSourceInformer,
        vmiTargetInformer,
        domainSharedInformer,
        gracefulShutdownInformer,
...
    cache.WaitForCacheSync(stop, vmiSourceInformer.HasSynced, factory.CRD().HasSynced)

    go vmController.Run(10, stop)

In the disconnected scenario, virt-handler is blocked at cache.WaitForCacheSync; that is why the VM is not launched.

Both vmiSourceInformer and vmiTargetInformer need to sync successfully, but they hit the same target path and only the selector differs; the related code is in pkg/controller/virtinformers.go:


func (f *kubeInformerFactory) VMISourceHost(hostName string) cache.SharedIndexInformer {
    labelSelector, err := labels.Parse(fmt.Sprintf(kubev1.NodeNameLabel+" in (%s)", hostName))
    if err != nil {
        panic(err)
    }

    return f.getInformer("vmiInformer-sources", func() cache.SharedIndexInformer {
        lw := NewListWatchFromClient(f.restClient, "virtualmachineinstances", k8sv1.NamespaceAll, fields.Everything(), labelSelector)
        return cache.NewSharedIndexInformer(lw, &kubev1.VirtualMachineInstance{}, f.defaultResync, cache.Indexers{
            cache.NamespaceIndex: cache.MetaNamespaceIndexFunc,
            "node": func(obj interface{}) (strings []string, e error) {
                return []string{obj.(*kubev1.VirtualMachineInstance).Status.NodeName}, nil
            },
        })
    })
}

func (f *kubeInformerFactory) VMITargetHost(hostName string) cache.SharedIndexInformer {
    labelSelector, err := labels.Parse(fmt.Sprintf(kubev1.MigrationTargetNodeNameLabel+" in (%s)", hostName))
    if err != nil {
        panic(err)
    }

    return f.getInformer("vmiInformer-targets", func() cache.SharedIndexInformer {
        lw := NewListWatchFromClient(f.restClient, "virtualmachineinstances", k8sv1.NamespaceAll, fields.Everything(), labelSelector)
        return cache.NewSharedIndexInformer(lw, &kubev1.VirtualMachineInstance{}, f.defaultResync, cache.Indexers{
            cache.NamespaceIndex: cache.MetaNamespaceIndexFunc,
            "node": func(obj interface{}) (strings []string, e error) {
                return []string{obj.(*kubev1.VirtualMachineInstance).Status.NodeName}, nil
            },
        })
    })
}

Details are in the attached logs-openyurt-kubevirt.zip.

Congrool commented 1 year ago

the conclusion is that the current implementation of yurt-hub can't support kubevirt very well: virt-handler needs support for list/watch on the same path with different selectors

@joez Hi, I'm sorry to hear that. If you don't mind modifying the source code, there are still two solutions:

  1. enhance the cache capability of yurthub.
  2. split the client that lists/watches VirtualMachineInstance in virt-handler: one client per selector, and assign a different User-Agent to each, such as virt-handler-MigrationTargetNodeNameLabel and virt-handler-NodeNameLabel. Then yurthub can cache them separately.

To quickly work around it, option 2 is recommended. Option 1 is harder to push forward, because it may require refactoring the yurthub cache framework, which is a big job. Anyway, this is a cache limitation that the community has already recognized, and we need to come up with a final solution to remove it.
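For reference, a rough sketch of option 2 (the package, helper name and User-Agent strings below are hypothetical, not existing kubevirt code): copy the rest config used by virt-handler and give each informer's client a distinct User-Agent, so yurthub caches each list/watch separately:

package virthandler // hypothetical placement

import (
	"k8s.io/client-go/rest"
)

// configWithUserAgent clones the base rest config and overrides the User-Agent,
// so yurthub treats the resulting client as a separate component and keeps a
// separate cache directory for it.
func configWithUserAgent(base *rest.Config, userAgent string) *rest.Config {
	cfg := rest.CopyConfig(base)
	cfg.UserAgent = userAgent
	return cfg
}

// Usage sketch (hypothetical names):
//   srcCfg := configWithUserAgent(baseCfg, "virt-handler-NodeNameLabel")
//   tgtCfg := configWithUserAgent(baseCfg, "virt-handler-MigrationTargetNodeNameLabel")
// Build one rest client from each config and pass them to NewListWatchFromClient
// for the vmiInformer-sources and vmiInformer-targets informers respectively,
// instead of sharing the single f.restClient.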

joez commented 1 year ago

@Congrool Let me figure out which way is feasible; I will try option 2 first. This is a big challenge for me, because I have no programming experience with Kubernetes. May I know the main reason for the current yurt-hub design? Could we cache all the resources under a path and then filter on the fly when proxying, to support the current use case of the same path with different selectors?

Congrool commented 1 year ago

May I know the main reason for the current yurt-hub design? Could we cache all the resources under a path and then filter on the fly when proxying, to support the current use case of the same path with different selectors?

As far as I know, the original idea of yurthub was to cache as few resources as possible, considering the limited hardware resources of edge nodes. Thus, we separate the cache for different components, and only cache for some of them by default (e.g. kubelet, flannel, kube-proxy, coredns), which constitute the minimal infrastructure set that workloads depend on (failure recovery, container networking, service discovery, and DNS resolution, respectively). Other components, in this case virt-handler, were not taken into consideration in the cloud-edge scenario.

But, hm, this design dates back to a very early stage. Maybe @rambohe-ch can give more details.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Congrool commented 9 months ago

PR linked: #1614

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.