Closed ymc101 closed 8 months ago
@ymc101, please provide the following information for us to understand the problem:
sudo lsof (on master VM)
Hi @leokondrashov, the VMs I was previously running have been terminated. I will replicate the setup later and get back to you with the information.
Hi @leokondrashov , below is the requested information:
ifconfig -a (Worker VM):
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255
inet6 fe80::d05:3fef:736a:ea3e prefixlen 64 scopeid 0x20<link>
ether 08:00:27:f0:ba:60 txqueuelen 1000 (Ethernet)
RX packets 620461 bytes 870904986 (870.9 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 96079 bytes 6573845 (6.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 555 bytes 46847 (46.8 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 555 bytes 46847 (46.8 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth0-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::d0bc:5ff:fe84:7041 prefixlen 64 scopeid 0x20<link>
ether d2:bc:05:84:70:41 txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 59 bytes 7193 (7.1 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth1-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.5 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::b41b:b1ff:fe72:2311 prefixlen 64 scopeid 0x20<link>
ether b6:1b:b1:72:23:11 txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 59 bytes 7193 (7.1 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth2-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.9 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::6076:b5ff:fed5:92ca prefixlen 64 scopeid 0x20<link>
ether 62:76:b5:d5:92:ca txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7264 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth3-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.13 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::587b:78ff:feaa:9dd9 prefixlen 64 scopeid 0x20<link>
ether 5a:7b:78:aa:9d:d9 txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7270 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth4-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.17 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::c0cf:3ff:fe99:2c1a prefixlen 64 scopeid 0x20<link>
ether c2:cf:03:99:2c:1a txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 61 bytes 7360 (7.3 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth5-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.21 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::3072:fbff:fe84:a7e0 prefixlen 64 scopeid 0x20<link>
ether 32:72:fb:84:a7:e0 txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7270 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth6-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.25 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::2cf5:9dff:fefb:654f prefixlen 64 scopeid 0x20<link>
ether 2e:f5:9d:fb:65:4f txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7270 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth7-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.29 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::38cc:d8ff:fe46:366a prefixlen 64 scopeid 0x20<link>
ether 3a:cc:d8:46:36:6a txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7270 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth8-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.33 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::3494:6eff:fea8:931d prefixlen 64 scopeid 0x20<link>
ether 36:94:6e:a8:93:1d txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7270 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth9-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.37 netmask 255.255.255.252 broadcast 0.0.0.0
inet6 fe80::f8c3:5bff:fe8f:9410 prefixlen 64 scopeid 0x20<link>
ether fa:c3:5b:8f:94:10 txqueuelen 1000 (Ethernet)
RX packets 14 bytes 1076 (1.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60 bytes 7270 (7.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ifconfig -a (Master VM):
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255
inet6 fe80::9bbc:268f:5cad:ba73 prefixlen 64 scopeid 0x20<link>
ether 08:00:27:8a:97:2a txqueuelen 1000 (Ethernet)
RX packets 729420 bytes 1041691890 (1.0 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 161699 bytes 10474864 (10.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 206981 bytes 32255875 (32.2 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 206981 bytes 32255875 (32.2 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
sudo lsof info: many lines show a process listening on localhost:6443
~/.kube/config (Master VM):
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJME1ESXdNVEUwTURrMU9Wb1hEVE0wTURFeU9URTBNRGsxT1Zvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSks0CmpLTlVTSEIveWtjRGYvZmJWVWZ2Q1I3Sk01Qm12MVY4SmdGdGpMUENucEEzY0FJcmRIMUlpcmhHZkR2a1VyQ0IKdjZGL3pLS0FGaWZubzllSG8vbE1NakxPZkE5Z1ZLYjVDZkZRbzIrNXZtM2Exc0xmZHlDMDBvbkdVRmxZNFhrRgp3RFNCSW0vUk8vZ1NHZGcwaGUweUFQalg3Y2x5S216Wm92M0lYaGZlbEFFWU1iOTJtOFcyU2RVUnJtNXk1K2d1CitnMXB5dGxuNjVOZmprMm1xU0plMlJMWFZwVTdMSHpnTEdtSS9QV0hwM3c4enlES08xUHJrZmlKaHNtb0lqM0sKc2c5MFU2RGhDYVZ4bGlLdXJZVmYzVlllNkk5MnY3QS9UQkV6UGlzZ0dNYlZoaXB3T05BRU5yL001YURETGNMbQpwTEU2RWxxNzNuNytJa1FQVk1FQ0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZNVTVYbitONEtzRStjTXVETlRnNlplQWdKOXRNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBRUV5bkVTckFNcUYyWHpYd1JobgpVbVFKWUVLRVZBOE96Vm1vdUlrcmowbTMrSmhZdGlUZVVJd1JVMDdEVzh4Ui9IQSsrTUlkN0FsMW5VN2tlN29NCkJsMGxQZUxXekcwN1NYNW5ZVDg2WjlubVoyNWEzTDFsaDFiTFN6YW9sanRsTnJ2ZFBNRkZGZmN3RW9QWERUbm0KZUFpaU9RTW1HbXRyZkcvT09hOVZ0NDEyOUo3NEJQMm0vL0lZeHdBeWkxaEpRWGJBQWorS3NPK1FpWmhtSHFrZQpxRFoyQmhUZnF2bHZjR29pNGJhRCtESnVodHZhU2VDcUtDek5Td0NCSEhjT0Z5RWw3Q1dPcWh4czgvVkxVYTNXCjVjVW13YTRsVWJkREFJY3FoWlZoWmhvVmk3UTdGbDVTZEl4Y1dnL0lqemhEVXBTeE8zdUxvM280a0RUaTVpUUcKTDIwPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
server: https://10.0.2.15:6443
name: kubernetes
contexts:
- context:
cluster: kubernetes
user: kubernetes-admin
name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
user:
client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURJVENDQWdtZ0F3SUJBZ0lJS3ZGRVZ6UHRISVV3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBeU1ERXhOREE1TlRsYUZ3MHlOVEF4TXpFeE5ERXdNREphTURReApGekFWQmdOVkJBb1REbk41YzNSbGJUcHRZWE4wWlhKek1Sa3dGd1lEVlFRREV4QnJkV0psY201bGRHVnpMV0ZrCmJXbHVNSUlCSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QU1JSUJDZ0tDQVFFQXgzTE5XcE5PY3dHd3BWODYKWVgzNlNIZUJoTWZZYTgyT25GdWpjdFQ3emJRM2c3Nk55RWtBUlZKeFl1VkVORTRMblZ0clc1bzdoQUxWbW8wNQpFSlRybnk3UFVuZ0Y1Wm9JZmpkdmJJaHk5WlY2VWg0ODByKzVvVlFDVDEzdCtRSW8yWWt1c3VQSWZnK01JRnJQCjNibldRQUhaeG5PMnFacVJES2dIaXBJU1RkRzNIeWFYUlJmTi9LdFhxUk02S3h4c0Y3b2paUWp6T2Vwb09LT0oKNUZqNCt5TzVnVjkweGE3V0hHSUN6YVhtZzdnajZ4UkNTUmRjbDJUeEtxak0xRlJkSEE3L0VvcmNpZU9PQWVBRgpPTW4xZE13R3FQQ2gyYUI5V2xKcnF4SWI3anI5NStsWFlTTzYrTi9ZczM0OXJyS2pPM3Jja2U3dFl0TTZWeW1xCjR2L2hCd0lEQVFBQm8xWXdWREFPQmdOVkhROEJBZjhFQkFNQ0JhQXdFd1lEVlIwbEJBd3dDZ1lJS3dZQkJRVUgKQXdJd0RBWURWUjBUQVFIL0JBSXdBREFmQmdOVkhTTUVHREFXZ0JURk9WNS9qZUNyQlBuRExnelU0T21YZ0lDZgpiVEFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBTU0ydTduVlFaTFZVWTl4QXZ6WG9aSHJVZmRNL1N0L2w3RkxVCnc3aENXN3JpZ1R5L0ovTloyNC9VRjVqMGo2WWM0cTEyRWpYb0gydEZkczN0MFlmdUNrRWU2TVF4MXliNG82M2QKVTNVUDgvVE9BREx5UEJEcERXK1Q2YkhGSDc4TTdYSHl6SU40SXVNY3dOTkhmWXl5R2RmYWZub0RqVEFYZnBJcgpxdENYKzYwQUN1T3AyOUZ5Ui81MElPSmRKSnRrM1gra0NlbFc5V3ZObThqMGZkTkxVbFBvaldTYzF3NERZeHJWCmJHTjBuL1hRTVJPQ2NrRGFmdzRUT1ptZ3FTVGNyV2lTMkhsdHYrK1BYRkxOSW5XNktvbFpMbjIxRTFBdGtGRkQKejJxc3dQY0NnNWsxL2F0b0hWUlgrdjhFVHNJSUU0ZnF2NW5BVXVYWXNyc25nbXdiaEE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBeDNMTldwTk9jd0d3cFY4NllYMzZTSGVCaE1mWWE4Mk9uRnVqY3RUN3piUTNnNzZOCnlFa0FSVkp4WXVWRU5FNExuVnRyVzVvN2hBTFZtbzA1RUpUcm55N1BVbmdGNVpvSWZqZHZiSWh5OVpWNlVoNDgKMHIrNW9WUUNUMTN0K1FJbzJZa3VzdVBJZmcrTUlGclAzYm5XUUFIWnhuTzJxWnFSREtnSGlwSVNUZEczSHlhWApSUmZOL0t0WHFSTTZLeHhzRjdvalpRanpPZXBvT0tPSjVGajQreU81Z1Y5MHhhN1dIR0lDemFYbWc3Z2o2eFJDClNSZGNsMlR4S3FqTTFGUmRIQTcvRW9yY2llT09BZUFGT01uMWRNd0dxUENoMmFCOVdsSnJxeEliN2pyOTUrbFgKWVNPNitOL1lzMzQ5cnJLak8zcmNrZTd0WXRNNlZ5bXE0di9oQndJREFRQUJBb0lCQVFDODE3SWdKSUdPMnZiSwpYZFFGSXlhckhwdi9nTWtscVVkeVBFSVNKQjhXc2FBdW1XbmRUV0Y0UVlzaVBEbkwzR21hNEVoU1AwSkN4L3cvCmpaK09WN0tRMGQxekZEbGhIK3NTdHFKRmZSeDc4c0FTcUphbVpPbjZHblRsZU9ZdGN5SUNkcVZFcysvTmpDTDkKTDM3SlRYL1NzdTNqdlFRaXFqclVaUFJlKzlkZzNaZEFrWTVaZGNzQnZZRGxvQm1waCtRTitCWXQveHJvMTNqKwpuRUhJSnNOb3BZeSs2MFlIaWdRbFZSWlZwcHdjcVYvMnpOaE9Ma1JCeXAyVmhDZGliMDZDR21PNEo1QURmczdmCkJ5cmhXSUZJQno2TXRXajJnRkF5TEY1czYyZkZzY0RMYUc5SG5NemRVTE5qSmpnWEt1T2dtajFuNHE2dWdHTmwKWTM3ME9yUUJBb0dCQU5MMDhOWWVPNVVTZy9HckhNdVp5ellLRkZHUmx2bGQ5Yjh3REdUenpNS0pJelhEYkdPTwp3L1JOeENqOWR5Zy9FNXB2R0JZQldDODBwNjJKTURNV2JKcGdybzVIaGxRQ3czekh0L3BwK3JXSFBJbWJHd1lOCmg4Um9VYTVldTA4aXB1SWxWMWZURVQveHJnbnlpcFd4SzNrb3hhOHhWcUpwMGNzc2FTaXhEbmlIQW9HQkFQSUkKemxiUXF3OWhYVVI2djdKZWF2SzdtUWVkWlJtNXIrZDRPU1ZJMVN2TnhHeU42VWNJaTc5bkN3dy8yWVNsS0pHOApGOVdoZWNJei9qUUpHU29DdWFwM3Z5UEVNRkF0S042b0ZoUnhiUGUrcUtrTW9VR2xSeVlVeVpIZ0Y4N1NDb2o3CjRLL0VKbDBtZGI2aHNKZ1RrQkJseWY2dS96VUExTjdIK1U3YmkvT0JBb0dBRlVvdS9BejFDbWhoOUlQR1ZpM2gKT2tUdUpBVkRiVXMwUCtWRGV2UzMxM0lyb1lObGJ1NjdpKzVGTzdYSXpzRCs0M2tPdnpuSGdvd1gyQVdlWGFtSApzRlROaVFKaTVodVpTd0NFNnJyRFdJcWJhMi9CM0d5RkpTYzZCeFQ4WmxJaThYTy9TdGU4UiszR0dLN25tWS9WCnlWWjZET0kzMGhCSDRlOUxkWlhZMWdVQ2dZRUE3M0Fxd05QYUJuTXAwNDhqaVkvQ2ViT0E1bm1kQk9BZjF2dW0KZk80YWhTVWhCc3MxVmlKc0xjUUF0L09LZXFEeEM0dHFnTnNvR3lsWWQ1M3dtUkR0SUdrcVhIVy8zZkZ2RnladQpBWGRjZDVMVVE3ak01cVpkUnAwVjlBd2ZRV21sSm5NWGlvcWY4Vk1VOUt2OGlkWUFsVmc5aG9rVXpCaXdmbHlTCmxLSzVSd0VDZ1lCRXVUWTFzSnVEZzlIVGR
PZ2w5QnhudDgzYVF2WjY2SDVaeVpBVVFqOEJTdkJhQnFWU2xFbkwKaER5d3pjWkk0YmNSS1NDeDZqemc0aHE1UHdHNm9oTWIxWThodjU2akhickp4VVpsaDFIVk44cjZxUzBHTGZkYgpxZ1pPNWE3elZ0eGNBS2RWTkxCV1NRVzlkTFFWRXE2Q0g0dW1rbUQvR1lqZ0hEayt0NGZqZXc9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=
/etc/kubernetes/admin.conf (Master VM):
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvak>
server: https://10.0.2.15:6443
name: kubernetes
contexts:
- context:
cluster: kubernetes
user: kubernetes-admin
name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
user:
client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURJVENDQ>
client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS>
Worker VM: the .kube folder is not present, and the /etc/kubernetes/admin.conf file is empty.
Hi, @ymc101. The data provided looks fine. The only reason I can think of is that the firewall is in place and/or ports are blocked. Can you check if port 6443 is whitelisted?
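A minimal sketch of how the worker can probe that port without extra tools, using bash's /dev/tcp (the address 10.0.2.15:6443 is taken from the kubeconfig above; substitute your own master IP if it differs):

```shell
# Probe the API server port from the worker VM.
# MASTER/PORT are taken from the master kubeconfig's server address.
MASTER=10.0.2.15
PORT=6443
if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$MASTER/$PORT" 2>/dev/null; then
  echo "port $PORT reachable"
else
  echo "port $PORT blocked or host unreachable"
fi
```

If this reports the port as blocked while the API server is listening on the master, a firewall or hypervisor network setting is the likely culprit.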
Hi @leokondrashov, the ports were apparently blocked due to the VM configuration. After solving that, the worker node was able to join the master node, but there was an error configuring MetalLB. Below are some terminal logs:
Worker node when joining:
tee: /tmp/vhive-logs/kubeadm_join.stdout: No such file or directory
tee: /tmp/vhive-logs/kubeadm_join.stderr: No such file or directory
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
I forgot to create the temporary log files before running, but this does not seem to have affected the node joining.
Master node:
[16:53:29] [Warn] All nodes need to be joined in the cluster. Have you joined all nodes? (y/n): y
[17:00:02] [Success] All nodes successfully joined!(user confirmed)
[17:00:02] [Info] Set up master node
[17:00:02] [Info] Installing pod network >>>>> [17:00:06] [Success]
[17:00:06] [Info] Installing and configuring MetalLB >>>>> [17:03:12] [Error] [exit 1] -> error: timed out waiting for the condition on deployments/controller
[17:03:12] [Error] Failed to install and configure MetalLB!
[17:03:12] [Error] Failed to set up master node!
[17:03:12] [Error] Faild subcommand: create_multinode_cluster!
[17:03:12] [Info] Cleaning up temporary directory >>>>> [17:03:12] [Success]
I had automatic SSH set up for both the master and worker nodes. Do you have an idea what might be causing this error?
It's good that the networking issue was that simple to resolve. We should add it to the troubleshooting guide.
Regarding the MetalLB error, I have seen it once before; I think it's just a sporadic failure. First, check the available pods: MetalLB might be there, just too late to report to the script. Either way, try rerunning ./setup_tool setup_master_node firecracker. It might be exacerbated by a slow network, so also check the connection speed to the VMs.
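For reference, the pod check can be done by hand; a sketch (the metallb-system namespace and the controller deployment are the names used by the metallb-native manifest the script applies):

```shell
# If these show Running/Ready, the install succeeded but missed the script's timeout.
kubectl get pods -n metallb-system
kubectl get deployment controller -n metallb-system
```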
The networking issue was caused by a VM setting in VirtualBox, so nothing to do with vHive itself.
Regarding the MetalLB error, I tried running the setup tool command again and it passed the check; however, this time there is an error deploying the Istio operator:
[20:25:23] [Info] Deploying istio operator >>>>> [20:36:14] [Error] [exit 1] -> ! values.global.jwtPolicy is deprecated; use Values.global.jwtPolicy=third-party-jwt. See https://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for more information instead
- Processing resources for Istio core.
✔ Istio core installed
- Processing resources for Istiod.
- Processing resources for Istiod. Waiting for Deployment/istio-system/i...
✘ Istiod encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
- Processing resources for Ingress gateways.
- Processing resources for Ingress gateways. Waiting for Deployment/isti...
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
Deployment/istio-system/cluster-local-gateway (container failed to start: ContainerCreating: )
Deployment/istio-system/istio-ingressgateway (container failed to start: ContainerCreating: )
- Pruning removed resourcesError: failed to install manifests: errors occurred during operation
[20:36:14] [Error] Failed to deploy istio operator!
@leokondrashov do you have any idea for this one?
Thanks.
Can you please check the pods in the istio-system namespace (kubectl get pods -n istio-system)? It might be the same problem as with MetalLB: the pods are on the way, but too late to fit in the timeout. If they are not ready (but they should be at least listed there in a non-ready state), we can try to check the logs of the pods (kubectl logs <pod name> -n istio-system).
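For pods stuck in ContainerCreating (as in the error output above), the pod events usually name the cause (image pull, CNI, volume mounts), so describe is often more informative than logs; a sketch with a placeholder pod name:

```shell
kubectl get pods -n istio-system
# Events at the bottom of the describe output show why the container
# has not started yet:
kubectl describe pod <pod-name> -n istio-system
kubectl logs <pod-name> -n istio-system
```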
I ran into a different metallb error this time:
[18:41:53] [Error] [exit 1] -> Error from server (InternalError): error when creating "/home/vboxuser/vhive/configs/metallb/metallb-ipaddresspool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
[18:41:53] [Error] Failed to install and configure MetalLB!
[18:41:53] [Error] Failed to set up master node!
[18:41:53] [Error] Faild subcommand: create_multinode_cluster!
When I run the ./setup_tool setup_master_node firecracker script and it fails, I have realised that running it again gives an index out of range error. So what I do is run the cleanup script ./scripts/github_runner/clean_cri_runner.sh, then run all the sudo screen commands (containerd, firecracker-containerd, and vhive for the worker, and containerd for the master), and then run the ./setup_tool setup_master_node firecracker script again. Do you know if I am missing any steps or doing anything wrong?
I'm not very confident in using the clean cri script for a multi-node setup. It's better to start from clean nodes.
Let's figure out the problems that we face. Please run from the start and document the errors that you encounter. For Istio and MetalLB failures, please provide the output of kubectl get pods -n <>-system (substitute istio or metallb).
Most of our problems are timeouts of resources not being ready. Can you check the networking speed of the VMs?
Using speedtest-cli, this is the networking speed from one of my VMs:
Testing download speed................................................................................
Download: 601.03 Mbit/s
Testing upload speed......................................................................................................
Upload: 572.86 Mbit/s
Is this within a sufficient range with respect to the timeouts in the master node setup script?
Additionally, can I ask if there is a script or a method to reset a node after setup failure, or after usage of a node? It is quite time consuming to tear down the VM and set up a new one each time due to the initial OS setup. So far I have tried using the single-node clean cri script as I could not find any other cleanup script from the quickstart guide.
That seems to be a fair speed for the setup.
You are using VirtualBox, right? Can you create a snapshot of the VM right after the boot? That should speed up the process.
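Snapshots can also be taken and restored from the host command line; a sketch (the VM name "vhive-master" and snapshot name are examples):

```shell
# On the VirtualBox host, right after the VM has booted cleanly:
VBoxManage snapshot "vhive-master" take "clean-boot"

# After a failed setup run, roll back and boot again:
VBoxManage controlvm "vhive-master" poweroff
VBoxManage snapshot "vhive-master" restore "clean-boot"
VBoxManage startvm "vhive-master" --type headless
```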
Yes, I am using VirtualBox, I was previously not aware of this feature, thanks for the suggestion.
I tried running it from scratch and got the same MetalLB timeout error, and when I tried to rerun the command I got this index out of range panic:
panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
github.com/vhive-serverless/vHive/scripts/cluster.ExtractMasterNodeInfo()
/home/vboxuser/vhive/scripts/cluster/create_multinode_cluster.go:146 +0x66c
github.com/vhive-serverless/vHive/scripts/cluster.CreateMultinodeCluster({0x7fff55e8a3bd, 0xb})
/home/vboxuser/vhive/scripts/cluster/create_multinode_cluster.go:50 +0x53
main.main()
/home/vboxuser/vhive/scripts/setup.go:151 +0xfb7
If MetalLB was ready but not in time for the script, restoring the VM snapshot might produce the same error. Do you have any ideas or suggestions? Right now I can only think of modifying the code on the master node to increase the timeout threshold for the MetalLB and Istio setup, but I'm not sure what might be causing this, especially since the download and upload speeds don't seem to be the bottleneck here.
I saw the timeout issue previously with network congestion, which is not the case here. However, there might also be the problem of not enough CPU to install it in time. What is the VM size?
The solution with more time would work, although the current limit of 3 minutes should be more than enough. Not sure that it can be done for istio (that also experienced timeouts), so a more permanent solution might include increasing the VM size.
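Before patching the script's timeout, the same readiness condition can be waited on by hand with a longer deadline; a sketch (deployment/controller in metallb-system is the resource named in the error message above):

```shell
# Wait up to 10 minutes instead of the script's 3:
kubectl -n metallb-system wait --for=condition=Available \
  deployment/controller --timeout=10m
kubectl -n metallb-system get pods -o wide
```

If this succeeds after a few extra minutes, the failure is purely a timeout; if it never becomes Available, the pod events or logs should show the real error.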
I allocated 3 CPU cores and 8GB of RAM to this VM. I'll try again with a bigger VM size and see if the same issue occurs. This system has 12 cores and 32GB RAM, so I can give each VM about 4-5 cores at most and about 12GB RAM.
The script encountered the same MetalLB error, even with 5 cores and 12GB RAM, which is the maximum I can allocate to each VM without exceeding the system's hardware resources. May I know what the specs of the nodes you have tested on were?
We commonly use nodes with around 10 cores and 64GB, but your configuration should be enough.
Can you supply the content of the create_multinode_cluster_*.log files in the directory where you ran setup_tool? Maybe even add --v 5 to increase the verbosity of failing commands (https://github.com/vhive-serverless/vHive/blob/main/scripts/cluster/setup_master_node.go#L125-L133).
Below are the contents of the two log files from a run with the verbosity flag:
create_multinode_cluster_common.log:
INFO: 18:21:50 logs.go:88:
INFO: 18:21:50 logs.go:88: Stdout Log -> /home/vboxuser/vhive/create_multinode_cluster_common.log
INFO: 18:21:50 logs.go:88: Stderr Log -> /home/vboxuser/vhive/create_multinode_cluster_error.log
INFO: 18:21:50 system.go:81: Executing shell command: git rev-parse --show-toplevel
INFO: 18:21:50 system.go:82: Stdout from shell:
/home/vboxuser/vhive
INFO: 18:21:50 logs.go:100: vHive repo Path: /home/vboxuser/vhive
INFO: 18:21:50 logs.go:100: Loading config files from /home/vboxuser/vhive/configs/setup >>>>>
INFO: 18:21:50 logs.go:88:
INFO: 18:21:50 logs.go:100: Create multinode cluster
INFO: 18:21:50 logs.go:100: Creating kubelet service >>>>>
INFO: 18:21:50 system.go:81: Executing shell command: sudo mkdir -p /etc/sysconfig
INFO: 18:21:50 system.go:82: Stdout from shell:
INFO: 18:21:50 system.go:81: Executing shell command: sudo sh -c 'cat <<EOF > /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--container-runtime=remote --v=0 --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
EOF'
INFO: 18:21:50 system.go:82: Stdout from shell:
INFO: 18:21:51 system.go:81: Executing shell command: sudo systemctl daemon-reload
INFO: 18:21:51 system.go:82: Stdout from shell:
INFO: 18:21:51 logs.go:88:
INFO: 18:21:51 logs.go:100: Deploying Kubernetes(version 1.25.9) >>>>>
INFO: 18:21:51 system.go:81: Executing shell command: ip route | awk '{print $(NF)}' | awk '/^10\..*/'
INFO: 18:21:51 system.go:82: Stdout from shell:
INFO: 18:25:03 system.go:81: Executing shell command: sudo kubeadm init --v=0 \
--apiserver-advertise-address= \
--cri-socket /run/containerd/containerd.sock \
--kubernetes-version 1.25.9 \
--pod-network-cidr="192.168.0.0/16" | tee /tmp/vHive_tmp1120224528/masterNodeInfo
INFO: 18:25:03 system.go:82: Stdout from shell:
[init] Using Kubernetes version: v1.25.9
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local vhivemaster] and IPs [10.96.0.1 10.100.184.85]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost vhivemaster] and IPs [10.100.184.85 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost vhivemaster] and IPs [10.100.184.85 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[apiclient] All control plane components are healthy after 69.271402 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node vhivemaster as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node vhivemaster as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: wpefqy.ta83yflrwktaqneg
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 10.100.184.85:6443 --token wpefqy.ta83yflrwktaqneg \
--discovery-token-ca-cert-hash sha256:37acf38fa718d3019ea68467411f7b455f0c21bfb6a5e9360913cd54e4e139e1
INFO: 18:25:03 logs.go:88:
INFO: 18:25:03 logs.go:100: Making kubectl work for non-root user >>>>>
INFO: 18:25:03 system.go:81: Executing shell command: mkdir -p /home/vboxuser/.kube && sudo cp -i /etc/kubernetes/admin.conf /home/vboxuser/.kube/config && sudo chown $(id -u):$(id -g) /home/vboxuser/.kube/config
INFO: 18:25:03 system.go:82: Stdout from shell:
INFO: 18:25:03 logs.go:88:
INFO: 18:25:03 logs.go:100: Extracting master node information from logs >>>>>
INFO: 18:25:03 system.go:81: Executing shell command: sed -n '/.*kubeadm join.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*join \(.*\):\(\S*\) --token \(\S*\).*/\1 \2 \3/p'
INFO: 18:25:03 system.go:82: Stdout from shell:
10.100.184.85 6443 wpefqy.ta83yflrwktaqneg
INFO: 18:25:03 system.go:81: Executing shell command: sed -n '/.*sha256:.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*\(sha256:\S*\).*/\1/p'
INFO: 18:25:03 system.go:82: Stdout from shell:
sha256:37acf38fa718d3019ea68467411f7b455f0c21bfb6a5e9360913cd54e4e139e1
INFO: 18:25:03 logs.go:88:
INFO: 18:25:03 logs.go:100: Creating masterKey.yaml with master node information >>>>>
INFO: 18:25:03 logs.go:88:
INFO: 18:25:03 logs.go:88: Join cluster from worker nodes with command: sudo kubeadm join 10.100.184.85:6443 --token wpefqy.ta83yflrwktaqneg --discovery-token-ca-cert-hash sha256:37acf38fa718d3019ea68467411f7b455f0c21bfb6a5e9360913cd54e4e139e1
INFO: 18:25:03 logs.go:76: All nodes need to be joined in the cluster. Have you joined all nodes? (y/n):
INFO: 18:31:33 logs.go:88: All nodes successfully joined!(user confirmed)
INFO: 18:31:33 logs.go:100: Set up master node
INFO: 18:31:33 logs.go:100: Installing pod network >>>>>
INFO: 18:31:52 system.go:81: Executing shell command: kubectl apply -f /home/vboxuser/vhive/configs/calico/canal.yaml
INFO: 18:31:52 system.go:82: Stdout from shell:
poddisruptionbudget.policy/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
serviceaccount/calico-node created
configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/caliconodestatuses.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipreservations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/kubecontrollersconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
deployment.apps/calico-kube-controllers created
INFO: 18:31:52 logs.go:88:
INFO: 18:31:52 logs.go:100: Installing and configuring MetalLB >>>>>
INFO: 18:31:54 system.go:81: Executing shell command: kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system
INFO: 18:31:54 system.go:82: Stdout from shell:
configmap/kube-proxy configured
INFO: 18:32:32 system.go:81: Executing shell command: kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.9/config/manifests/metallb-native.yaml
INFO: 18:32:32 system.go:82: Stdout from shell:
namespace/metallb-system created
customresourcedefinition.apiextensions.k8s.io/addresspools.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bfdprofiles.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bgpadvertisements.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bgppeers.metallb.io created
customresourcedefinition.apiextensions.k8s.io/communities.metallb.io created
customresourcedefinition.apiextensions.k8s.io/ipaddresspools.metallb.io created
customresourcedefinition.apiextensions.k8s.io/l2advertisements.metallb.io created
serviceaccount/controller created
serviceaccount/speaker created
role.rbac.authorization.k8s.io/controller created
role.rbac.authorization.k8s.io/pod-lister created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/controller created
rolebinding.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
secret/webhook-server-cert created
service/webhook-service created
deployment.apps/controller created
daemonset.apps/speaker created
validatingwebhookconfiguration.admissionregistration.k8s.io/metallb-webhook-configuration created
INFO: 18:35:33 system.go:81: Executing shell command: kubectl -n metallb-system wait deploy controller --timeout=180s --for=condition=Available
INFO: 18:35:33 system.go:82: Stdout from shell:
INFO: 18:35:33 logs.go:100: Cleaning up temporary directory >>>>>
INFO: 18:35:33 logs.go:88:
create_multinode_cluster_error.log:
ERROR: 18:21:50 system.go:85: Executing shell command: git rev-parse --show-toplevel
ERROR: 18:21:50 system.go:86: Stderr from shell:
ERROR: 18:21:50 system.go:85: Executing shell command: sudo mkdir -p /etc/sysconfig
ERROR: 18:21:50 system.go:86: Stderr from shell:
ERROR: 18:21:50 system.go:85: Executing shell command: sudo sh -c 'cat <<EOF > /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--container-runtime=remote --v=0 --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
EOF'
ERROR: 18:21:50 system.go:86: Stderr from shell:
ERROR: 18:21:51 system.go:85: Executing shell command: sudo systemctl daemon-reload
ERROR: 18:21:51 system.go:86: Stderr from shell:
ERROR: 18:21:51 system.go:85: Executing shell command: ip route | awk '{print $(NF)}' | awk '/^10\..*/'
ERROR: 18:21:51 system.go:86: Stderr from shell:
ERROR: 18:25:03 system.go:85: Executing shell command: sudo kubeadm init --v=0 \
--apiserver-advertise-address= \
--cri-socket /run/containerd/containerd.sock \
--kubernetes-version 1.25.9 \
--pod-network-cidr="192.168.0.0/16" | tee /tmp/vHive_tmp1120224528/masterNodeInfo
ERROR: 18:25:03 system.go:86: Stderr from shell:
W0216 18:21:51.426910 32124 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
ERROR: 18:25:03 system.go:85: Executing shell command: mkdir -p /home/vboxuser/.kube && sudo cp -i /etc/kubernetes/admin.conf /home/vboxuser/.kube/config && sudo chown $(id -u):$(id -g) /home/vboxuser/.kube/config
ERROR: 18:25:03 system.go:86: Stderr from shell:
ERROR: 18:25:03 system.go:85: Executing shell command: sed -n '/.*kubeadm join.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*join \(.*\):\(\S*\) --token \(\S*\).*/\1 \2 \3/p'
ERROR: 18:25:03 system.go:86: Stderr from shell:
ERROR: 18:25:03 system.go:85: Executing shell command: sed -n '/.*sha256:.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*\(sha256:\S*\).*/\1/p'
ERROR: 18:25:03 system.go:86: Stderr from shell:
ERROR: 18:31:52 system.go:85: Executing shell command: kubectl apply -f /home/vboxuser/vhive/configs/calico/canal.yaml
ERROR: 18:31:52 system.go:86: Stderr from shell:
ERROR: 18:31:54 system.go:85: Executing shell command: kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system
ERROR: 18:31:54 system.go:86: Stderr from shell:
Warning: resource configmaps/kube-proxy is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
ERROR: 18:32:32 system.go:85: Executing shell command: kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.9/config/manifests/metallb-native.yaml
ERROR: 18:32:32 system.go:86: Stderr from shell:
ERROR: 18:35:33 system.go:85: Executing shell command: kubectl -n metallb-system wait deploy controller --timeout=180s --for=condition=Available
ERROR: 18:35:33 system.go:86: Stderr from shell:
error: timed out waiting for the condition on deployments/controller
ERROR: 18:35:33 logs.go:64: [exit 1] -> error: timed out waiting for the condition on deployments/controller
ERROR: 18:35:33 logs.go:64: Failed to install and configure MetalLB!
ERROR: 18:35:33 logs.go:64: Failed to set up master node!
ERROR: 18:35:33 logs.go:64: Faild subcommand: create_multinode_cluster!
Can you provide the output of the command kubectl describe pod controller -n metallb-system after it actually places the pod? At the end of the output there should be events that might explain why the deployment is delayed.
Do I run that command on the worker node right after it joins the cluster, and before I respond to the prompt on the master node (./setup_tool create_multinode_cluster firecracker) confirming that all nodes have joined the cluster?
After it fails to deploy the MetalLB services.
This is the output I got:
Name: controller-844979dcdc-hhk5d
Namespace: metallb-system
Priority: 0
Service Account: controller
Node: vhiveworker/10.100.183.218
Start Time: Mon, 19 Feb 2024 22:17:42 +0800
Labels: app=metallb
component=controller
pod-template-hash=844979dcdc
Annotations: cni.projectcalico.org/containerID: 3ab842fcceec22b99646560f9eed7bb655ff0553beb9c23d23749c7db5e99171
cni.projectcalico.org/podIP: 192.168.104.66/32
cni.projectcalico.org/podIPs: 192.168.104.66/32
prometheus.io/port: 7472
prometheus.io/scrape: true
Status: Running
IP: 192.168.104.66
IPs:
IP: 192.168.104.66
Controlled By: ReplicaSet/controller-844979dcdc
Containers:
controller:
Container ID: containerd://6a218a54ce4a2c49d81d79f8ffdf6ed76ed471381de56e5001196d9a4b97ebf7
Image: quay.io/metallb/controller:v0.13.9
Image ID: quay.io/metallb/controller@sha256:c9ffd7215dcf93ff69b474c9bc5889ac69da395c62bd693110ba3b57fcecc28c
Ports: 7472/TCP, 9443/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--port=7472
--log-level=info
State: Running
Started: Mon, 19 Feb 2024 22:19:55 +0800
Ready: False
Restart Count: 0
Liveness: http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
METALLB_ML_SECRET_NAME: memberlist
METALLB_DEPLOYMENT: controller
Mounts:
/tmp/k8s-webhook-server/serving-certs from cert (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mqp5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cert:
Type: Secret (a volume populated by a Secret)
SecretName: webhook-server-cert
Optional: false
kube-api-access-8mqp5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m56s default-scheduler 0/2 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Normal Scheduled 3m7s default-scheduler Successfully assigned metallb-system/controller-844979dcdc-hhk5d to vhiveworker
Warning FailedCreatePodSandBox 2m39s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "17c4426b244715eef19caaa0a4da5fa0ebad35b4ceabea2cb15c26ebfb5ab0dd": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Normal SandboxChanged 2m2s (x4 over 2m39s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 101s kubelet Pulling image "quay.io/metallb/controller:v0.13.9"
Normal Pulled 58s kubelet Successfully pulled image "quay.io/metallb/controller:v0.13.9" in 22.5950977s (43.359467031s including waiting)
Warning Unhealthy 15s (x3 over 35s) kubelet Liveness probe failed: Get "http://192.168.104.66:7472/metrics": dial tcp 192.168.104.66:7472: connect: connection refused
Normal Killing 15s kubelet Container controller failed liveness probe, will be restarted
Normal Pulled 12s kubelet Container image "quay.io/metallb/controller:v0.13.9" already present on machine
Warning Unhealthy 5s (x5 over 35s) kubelet Readiness probe failed: Get "http://192.168.104.66:7472/metrics": dial tcp 192.168.104.66:7472: connect: connection refused
Normal Created 4s (x2 over 54s) kubelet Created container controller
Normal Started 3s (x2 over 53s) kubelet Started container controller
Does it mention why there is an error with the MetalLB setup? I'm not sure how to interpret this log.
I see several minutes of waiting due to the worker node not being ready (between the first two events). The other delays are not that significant (only the image pull, which took about 40s, but I have no idea how to improve that). Can you also add the output of kubectl describe node and kubectl describe deploy controller -n metallb-system?
I suppose you can try to continue the setup with ./setup_tool setup_master_node firecracker and record similar data for the failed pods (kubectl describe pod cluster-local-gateway -n istio-system and kubectl describe pod istio-ingressgateway -n istio-system) if the output of the Istio deployment step complains about them not being ready.
kubectl describe node:
Name: vhivemaster
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=vhivemaster
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.100.176.138/20
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.148.64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 20 Feb 2024 19:24:14 +0800
Taints: node-role.kubernetes.io/control-plane:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: vhivemaster
AcquireTime: <unset>
RenewTime: Tue, 20 Feb 2024 20:18:37 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 20 Feb 2024 19:58:38 +0800 Tue, 20 Feb 2024 19:58:38 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 20 Feb 2024 20:18:21 +0800 Tue, 20 Feb 2024 19:24:14 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 20 Feb 2024 20:18:21 +0800 Tue, 20 Feb 2024 19:24:14 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 20 Feb 2024 20:18:21 +0800 Tue, 20 Feb 2024 19:24:14 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 20 Feb 2024 20:18:21 +0800 Tue, 20 Feb 2024 19:57:33 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.100.176.138
Hostname: vhivemaster
Capacity:
cpu: 5
ephemeral-storage: 102107096Ki
hugepages-2Mi: 0
memory: 13192552Ki
pods: 110
Allocatable:
cpu: 5
ephemeral-storage: 94101899518
hugepages-2Mi: 0
memory: 13090152Ki
pods: 110
System Info:
Machine ID: fbeb15dcad234a4e9fa40fff05b39056
System UUID: 98bcd0f4-c159-7547-a558-f8569d3a9b4c
Boot ID: e657f05f-a8ec-4c4d-8ee5-ea232a912126
Kernel Version: 5.15.0-94-generic
OS Image: Ubuntu 20.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.18
Kubelet Version: v1.25.9
Kube-Proxy Version: v1.25.9
PodCIDR: 192.168.0.0/24
PodCIDRs: 192.168.0.0/24
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-xzzv7 250m (5%) 0 (0%) 0 (0%) 0 (0%) 23m
kube-system etcd-vhivemaster 100m (2%) 0 (0%) 100Mi (0%) 0 (0%) 53m
kube-system kube-apiserver-vhivemaster 250m (5%) 0 (0%) 0 (0%) 0 (0%) 53m
kube-system kube-controller-manager-vhivemaster 200m (4%) 0 (0%) 0 (0%) 0 (0%) 54m
kube-system kube-proxy-28mhk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 53m
kube-system kube-scheduler-vhivemaster 100m (2%) 0 (0%) 0 (0%) 0 (0%) 53m
metallb-system speaker-qm62w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 900m (18%) 0 (0%)
memory 100Mi (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 53m kube-proxy
Normal Starting 53m kubelet Starting kubelet.
Warning InvalidDiskCapacity 53m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 53m kubelet Node vhivemaster status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 53m kubelet Node vhivemaster status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 53m kubelet Node vhivemaster status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 53m kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 53m node-controller Node vhivemaster event: Registered Node vhivemaster in Controller
Normal NodeReady 21m kubelet Node vhivemaster status is now: NodeReady
Name: vhiveworker
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=vhiveworker
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.100.183.218/20
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.104.64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 20 Feb 2024 19:54:10 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: vhiveworker
AcquireTime: <unset>
RenewTime: Tue, 20 Feb 2024 20:18:39 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 20 Feb 2024 19:58:30 +0800 Tue, 20 Feb 2024 19:58:30 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Tue, 20 Feb 2024 20:17:53 +0800 Tue, 20 Feb 2024 19:54:10 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 20 Feb 2024 20:17:53 +0800 Tue, 20 Feb 2024 19:54:10 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 20 Feb 2024 20:17:53 +0800 Tue, 20 Feb 2024 19:54:10 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 20 Feb 2024 20:17:53 +0800 Tue, 20 Feb 2024 19:57:20 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.100.183.218
Hostname: vhiveworker
Capacity:
cpu: 5
ephemeral-storage: 102107096Ki
hugepages-2Mi: 0
memory: 13125980Ki
pods: 110
Allocatable:
cpu: 5
ephemeral-storage: 94101899518
hugepages-2Mi: 0
memory: 13023580Ki
pods: 110
System Info:
Machine ID: cbca3566c9694b7da50585efbf6f6d3d
System UUID: 30faaeaf-fd56-0a41-9b2e-da93571da3af
Boot ID: 8235a938-4d7c-48a2-823f-c06b12bdf3d9
Kernel Version: 5.15.0-94-generic
OS Image: Ubuntu 20.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.18
Kubelet Version: v1.25.9
Kube-Proxy Version: v1.25.9
PodCIDR: 192.168.1.0/24
PodCIDRs: 192.168.1.0/24
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-kube-controllers-567c56ff98-ppjhv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23m
kube-system calico-node-b6fls 250m (5%) 0 (0%) 0 (0%) 0 (0%) 23m
kube-system coredns-565d847f94-c6wdp 100m (2%) 0 (0%) 70Mi (0%) 170Mi (1%) 53m
kube-system coredns-565d847f94-g6kgg 100m (2%) 0 (0%) 70Mi (0%) 170Mi (1%) 53m
kube-system kube-proxy-5gm46 0 (0%) 0 (0%) 0 (0%) 0 (0%) 24m
metallb-system controller-844979dcdc-zdrmz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22m
metallb-system speaker-rx55c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 450m (9%) 0 (0%)
memory 140Mi (1%) 340Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 23m kube-proxy
Normal NodeHasSufficientMemory 24m (x8 over 24m) kubelet Node vhiveworker status is now: NodeHasSufficientMemory
Normal RegisteredNode 24m node-controller Node vhiveworker event: Registered Node vhiveworker in Controller
kubectl describe deploy controller -n metallb-system:
Name: controller
Namespace: metallb-system
CreationTimestamp: Tue, 20 Feb 2024 19:55:45 +0800
Labels: app=metallb
component=controller
Annotations: deployment.kubernetes.io/revision: 1
Selector: app=metallb,component=controller
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=metallb
component=controller
Annotations: prometheus.io/port: 7472
prometheus.io/scrape: true
Service Account: controller
Containers:
controller:
Image: quay.io/metallb/controller:v0.13.9
Ports: 7472/TCP, 9443/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--port=7472
--log-level=info
Liveness: http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
METALLB_ML_SECRET_NAME: memberlist
METALLB_DEPLOYMENT: controller
Mounts:
/tmp/k8s-webhook-server/serving-certs from cert (ro)
Volumes:
cert:
Type: Secret (a volume populated by a Secret)
SecretName: webhook-server-cert
Optional: false
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: controller-844979dcdc (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 24m deployment-controller Scaled up replica set controller-844979dcdc to 1
May I check what you mean by continuing the setup with ./setup_tool setup_master_node firecracker? If I rerun the setup command after the MetalLB failure, I encounter an index-out-of-range error, and restarting the process from a VM snapshot would just reproduce the MetalLB setup error, so I don't think I can get data on the Istio deployment yet.
As I remember, this comment shows the result of ./setup_tool setup_master_node firecracker after the MetalLB failure. So it worked previously. Rerunning after that had the problem you mentioned.
For now, I don't know what's wrong with the node; it just takes more time. So, possibly, the only solution is to increase the timeout on this line to 600s just to work around this problem.
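The fix suggested above can be sketched as follows; this assumes the setup tool ultimately runs the same kubectl wait command that appears in the log, with only the timeout value changed:

```shell
# Sketch of the suggested workaround: the same readiness wait as in the
# log, with the timeout raised from 180s to 600s to ride out slow image
# pulls and a slow-to-ready worker node. Requires a live cluster.
kubectl -n metallb-system wait deploy controller \
  --timeout=600s --for=condition=Available
```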
OK, I will try increasing the timeout and run it again. As for the comment you referenced where it passed the MetalLB setup: I believe I had run the cleanup script for a single-node cluster and then rerun the command. Since you previously mentioned that the cleanup script is not really meant for multi-node clusters, I have stopped using that method.
It seems that neither the MetalLB error nor the Istio error appeared this time. Is this the expected full output on success?
[21:01:48] [Info] Set up master node
[21:01:48] [Info] Installing pod network >>>>> [21:02:08] [Success]
[21:02:08] [Info] Installing and configuring MetalLB >>>>> [21:07:53] [Success]
[21:07:53] [Info] Downloading istio >>>>> [21:07:57] [Success]
[21:07:57] [Info] Extracting istio >>>>> [21:07:57] [Success]
[21:07:58] [Info] Deploying istio operator >>>>> [21:11:21] [Success]
[21:11:21] [Info] Installing Knative Serving component (firecracker mode) >>>>> [21:12:28] [Success]
[21:12:28] [Info] Installing local cluster registry >>>>> [21:12:43] [Success]
[21:12:43] [Info] Configuring Magic DNS >>>>> [21:12:52] [Success]
[21:12:52] [Info] Deploying istio pods >>>>> [21:13:23] [Success]
[21:13:24] [Info] Installing Knative Eventing component >>>>> [21:15:05] [Success]
[21:15:06] [Info] Installing a default Channel (messaging) layer >>>>> [21:15:37] [Success]
[21:15:37] [Info] Installing a Broker layer >>>>> [21:16:09] [Success]
[21:16:09] [Info] Cleaning up temporary directory >>>>> [21:16:09] [Success]
Every 2.0s: kubectl get pods --all-namespaces vHiveMaster: Wed Feb 21 23:04:57 2024
NAMESPACE NAME READY STATUS RESTARTS AGE
istio-system cluster-local-gateway-76bbc4bf78-jk25v 1/1 Running 0 115m
istio-system istio-ingressgateway-dbcbdd6d5-jpxj5 1/1 Running 0 115m
istio-system istiod-657b54846b-h4vgb 1/1 Running 0 116m
knative-eventing eventing-controller-6697c6d9b6-wh27j 1/1 Running 0 110m
knative-eventing eventing-webhook-6f9cff4954-78x25 1/1 Running 0 110m
knative-eventing imc-controller-7848bc9cdb-dqrk9 1/1 Running 0 109m
knative-eventing imc-dispatcher-6ccc6b7db9-v8zlv 1/1 Running 0 109m
knative-eventing mt-broker-controller-cd9b99bd5-cmfmz 1/1 Running 0 108m
knative-eventing mt-broker-filter-cf84c449c-nwg6w 1/1 Running 0 109m
knative-eventing mt-broker-ingress-58c4fdd87b-lql66 1/1 Running 0 109m
knative-serving activator-64fd97c6bd-d788p 1/1 Running 0 113m
knative-serving autoscaler-78bd654674-cfv2v 1/1 Running 0 113m
knative-serving controller-67fbfcfc76-w9nmx 1/1 Running 0 112m
knative-serving default-domain-dx7zj 0/1 Completed 0 112m
knative-serving domain-mapping-874f6d4d8-nqnmz 1/1 Running 0 112m
knative-serving domainmapping-webhook-67f5d487b7-8d5cr 1/1 Running 0 112m
knative-serving net-istio-controller-7466f95bb6-nhqw4 1/1 Running 0 111m
knative-serving net-istio-webhook-69946ffc7d-746lj 1/1 Running 0 111m
knative-serving webhook-9bbf89ffb-f4sjh 1/1 Running 0 112m
kube-system calico-kube-controllers-567c56ff98-mhhrg 1/1 Running 0 122m
kube-system calico-node-b9n62 1/1 Running 0 122m
kube-system calico-node-pc65c 1/1 Running 0 122m
kube-system coredns-565d847f94-lv4br 1/1 Running 0 125m
kube-system coredns-565d847f94-nqxcc 1/1 Running 0 125m
kube-system etcd-vhivemaster 1/1 Running 0 125m
kube-system kube-apiserver-vhivemaster 1/1 Running 0 125m
kube-system kube-controller-manager-vhivemaster 1/1 Running 0 126m
kube-system kube-proxy-jhh2z 1/1 Running 0 125m
kube-system kube-proxy-p5j74 1/1 Running 0 123m
kube-system kube-scheduler-vhivemaster 1/1 Running 0 126m
metallb-system controller-844979dcdc-m6p4b 1/1 Running 1 (117m ago) 122m
metallb-system speaker-lc8gd 1/1 Running 0 120m
metallb-system speaker-x47pn 1/1 Running 0 120m
registry docker-registry-pod-b4nxs 1/1 Running 0 112m
registry registry-etc-hosts-update-7kssg 1/1 Running 0 112m
Yes, that is the correct setup result. It seems these errors are just flaky; the solution is to increase the MetalLB timeout and hope that Istio is installed in time. I suppose we can close the issue then.
Thanks for all your help. Before we close the issue, I have one more question on function deployment. According to the recorded tutorial session on YouTube, there is supposed to be a deployer directory in ./examples/ to automate deployment, but the directory seems to be missing. Could you confirm whether function deployment is still done from within vHive, or is it meant to be done from the vSwarm repository instead?
Yes, we have moved those tools to the vSwarm repository. You can check our quickstart guide; it has the most up-to-date instructions, including examples of how to use them.
Hi, could I ask a few questions about function deployment for my setup?
When running the deployer client, I get some error messages:
WARN[0602] Failed to deploy function helloworld-0, /home/vboxuser/vhive/configs/knative_workloads/helloworld.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'helloworld-0' in namespace 'default':
2.963s The Route is still working to reflect the latest desired specification.
5.347s Configuration "helloworld-0" is waiting for a Revision to become ready.
Error: timeout: service 'helloworld-0' not ready after 600 seconds
Run 'kn --help' for usage
INFO[0602] Deployed function helloworld-0
WARN[0602] Failed to deploy function pyaes-1, /home/vboxuser/vhive/configs/knative_workloads/pyaes.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'pyaes-1' in namespace 'default':
1.442s The Route is still working to reflect the latest desired specification.
4.234s Configuration "pyaes-1" is waiting for a Revision to become ready.
Error: timeout: service 'pyaes-1' not ready after 600 seconds
Run 'kn --help' for usage
INFO[0602] Deployed function pyaes-1
WARN[0603] Failed to deploy function pyaes-0, /home/vboxuser/vhive/configs/knative_workloads/pyaes.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'pyaes-0' in namespace 'default':
4.206s The Route is still working to reflect the latest desired specification.
5.117s ...
5.621s Configuration "pyaes-0" is waiting for a Revision to become ready.
Error: timeout: service 'pyaes-0' not ready after 600 seconds
Run 'kn --help' for usage
INFO[0603] Deployed function pyaes-0
WARN[0603] Failed to deploy function rnn-serving-1, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-1' in namespace 'default':
0.751s The Route is still working to reflect the latest desired specification.
3.778s Configuration "rnn-serving-1" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-1' not ready after 600 seconds
Run 'kn --help' for usage
INFO[0603] Deployed function rnn-serving-1
WARN[0603] Failed to deploy function rnn-serving-0, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-0' in namespace 'default':
2.567s The Route is still working to reflect the latest desired specification.
4.244s Configuration "rnn-serving-0" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-0' not ready after 600 seconds
Run 'kn --help' for usage
INFO[0603] Deployed function rnn-serving-0
WARN[1207] Failed to deploy function rnn-serving-2, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-2' in namespace 'default':
2.081s The Route is still working to reflect the latest desired specification.
3.126s ...
5.313s Configuration "rnn-serving-2" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-2' not ready after 600 seconds
Run 'kn --help' for usage
Can those errors be ignored, or is there an issue getting in the way of the deployment?
When running the invoker client, I got this error about Go versioning:
go: go.mod file indicates go 1.21, but maximum version supported by tidy is 1.19
Is there a way to fix this?
Thanks.
Errors are definitely bad; it shouldn't time out. Please send over the description of the pods: kubectl describe pod helloworld-0.
The invoker problem with Go is known; we will update the Go version in the next release, so it will be fixed. For now, you can reinstall Go: rm -rf /usr/local/go, change the version in scripts/setup/system.json to 1.21.6, and rerun scripts/install_go.sh.
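Those steps can be sketched as shell commands. The paths come from the comment above; the exact JSON field that pins the Go version is an assumption and should be checked in scripts/setup/system.json:

```shell
# Remove the currently installed Go toolchain
sudo rm -rf /usr/local/go
# Manually edit scripts/setup/system.json and set the pinned Go
# version to 1.21.6 (the exact field name is defined in that file),
# then rerun the installer from the vHive repository root
./scripts/install_go.sh
```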
I got a pod-not-found error; I tried the describe pod command for the other functions as well, but it returns the same error:
vboxuser@vHiveMaster:~/vswarm$ kubectl describe pod helloworld-0
Error from server (NotFound): pods "helloworld-0" not found
vboxuser@vHiveMaster:~/vswarm$ kubectl describe pod pyaes-0
Error from server (NotFound): pods "pyaes-0" not found
vboxuser@vHiveMaster:~/vswarm$ kubectl describe pod pyaes-1
Error from server (NotFound): pods "pyaes-1" not found
Then describe the deployment (kubectl describe deployment helloworld-0).
Name: helloworld-0-00001-deployment
Namespace: default
CreationTimestamp: Tue, 27 Feb 2024 15:14:38 +0800
Labels: app=helloworld-0-00001
service.istio.io/canonical-name=helloworld-0
service.istio.io/canonical-revision=helloworld-0-00001
serving.knative.dev/configuration=helloworld-0
serving.knative.dev/configurationGeneration=1
serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
serving.knative.dev/revision=helloworld-0-00001
serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
serving.knative.dev/service=helloworld-0
serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations: autoscaling.knative.dev/target: 1
deployment.kubernetes.io/revision: 1
serving.knative.dev/creator: kubernetes-admin
Selector: serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
Replicas: 0 desired | 0 updated | 0 total | 0 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 0 max unavailable, 25% max surge
Pod Template:
Labels: app=helloworld-0-00001
service.istio.io/canonical-name=helloworld-0
service.istio.io/canonical-revision=helloworld-0-00001
serving.knative.dev/configuration=helloworld-0
serving.knative.dev/configurationGeneration=1
serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
serving.knative.dev/revision=helloworld-0-00001
serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
serving.knative.dev/service=helloworld-0
serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations: autoscaling.knative.dev/target: 1
serving.knative.dev/creator: kubernetes-admin
Containers:
user-container:
Image: index.docker.io/crccheck/hello-world@sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
Port: 50051/TCP
Host Port: 0/TCP
Environment:
GUEST_PORT: 50051
GUEST_IMAGE: ghcr.io/ease-lab/helloworld:var_workload
PORT: 50051
K_REVISION: helloworld-0-00001
K_CONFIGURATION: helloworld-0
K_SERVICE: helloworld-0
Mounts: <none>
queue-proxy:
Image: ghcr.io/vhive-serverless/queue-39be6f1d08a095bd076a71d288d295b6@sha256:41259c52c99af616fae4e7a44e40c0e90eb8f5593378a4f3de5dbf35ab1df49c
Ports: 8022/TCP, 9090/TCP, 9091/TCP, 8013/TCP, 8112/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Requests:
cpu: 25m
Readiness: http-get http://:8013/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
SERVING_NAMESPACE: default
SERVING_SERVICE: helloworld-0
SERVING_CONFIGURATION: helloworld-0
SERVING_REVISION: helloworld-0-00001
QUEUE_SERVING_PORT: 8013
QUEUE_SERVING_TLS_PORT: 8112
CONTAINER_CONCURRENCY: 0
REVISION_TIMEOUT_SECONDS: 300
REVISION_RESPONSE_START_TIMEOUT_SECONDS: 0
REVISION_IDLE_TIMEOUT_SECONDS: 0
SERVING_POD: (v1:metadata.name)
SERVING_POD_IP: (v1:status.podIP)
SERVING_LOGGING_CONFIG:
SERVING_LOGGING_LEVEL:
SERVING_REQUEST_LOG_TEMPLATE: {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
SERVING_ENABLE_REQUEST_LOG: false
SERVING_REQUEST_METRICS_BACKEND: prometheus
TRACING_CONFIG_BACKEND: none
TRACING_CONFIG_ZIPKIN_ENDPOINT:
TRACING_CONFIG_DEBUG: false
TRACING_CONFIG_SAMPLE_RATE: 0.1
USER_PORT: 50051
SYSTEM_NAMESPACE: knative-serving
METRICS_DOMAIN: knative.dev/internal/serving
SERVING_READINESS_PROBE: {"tcpSocket":{"port":50051,"host":"127.0.0.1"},"successThreshold":1}
ENABLE_PROFILING: false
SERVING_ENABLE_PROBE_REQUEST_LOG: false
METRICS_COLLECTOR_ADDRESS:
CONCURRENCY_STATE_ENDPOINT:
CONCURRENCY_STATE_TOKEN_PATH: /var/run/secrets/tokens/state-token
HOST_IP: (v1:status.hostIP)
ENABLE_HTTP2_AUTO_DETECTION: false
ROOT_CA:
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: helloworld-0-00001-deployment-85b6cd4698 (0/0 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 55m deployment-controller Scaled up replica set helloworld-0-00001-deployment-85b6cd4698 to 1
Normal ScalingReplicaSet 45m deployment-controller Scaled down replica set helloworld-0-00001-deployment-85b6cd4698 to 0 from 1
Weird. It says the deployment was scaled up and then back down. What about revisions? The original error was about a revision not being ready.
Sorry, what do you mean by revisions?
kubectl get revisions and kubectl describe revision <name>
Name: helloworld-0-00001
Namespace: default
Labels: serving.knative.dev/configuration=helloworld-0
serving.knative.dev/configurationGeneration=1
serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
serving.knative.dev/routingState=active
serving.knative.dev/service=helloworld-0
serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations: autoscaling.knative.dev/target: 1
serving.knative.dev/creator: kubernetes-admin
serving.knative.dev/routes: helloworld-0
serving.knative.dev/routingStateModified: 2024-02-27T07:14:33Z
API Version: serving.knative.dev/v1
Kind: Revision
Metadata:
Creation Timestamp: 2024-02-27T07:14:33Z
Generation: 1
Managed Fields:
API Version: serving.knative.dev/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:autoscaling.knative.dev/target:
f:serving.knative.dev/creator:
f:serving.knative.dev/routes:
f:serving.knative.dev/routingStateModified:
f:labels:
.:
f:serving.knative.dev/configuration:
f:serving.knative.dev/configurationGeneration:
f:serving.knative.dev/configurationUID:
f:serving.knative.dev/routingState:
f:serving.knative.dev/service:
f:serving.knative.dev/serviceUID:
f:ownerReferences:
.:
k:{"uid":"36b65317-e523-4ec3-8ea6-8734ebdf4d7b"}:
f:spec:
.:
f:containerConcurrency:
f:containers:
f:enableServiceLinks:
f:timeoutSeconds:
Manager: controller
Operation: Update
Time: 2024-02-27T07:14:33Z
API Version: serving.knative.dev/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:actualReplicas:
f:conditions:
f:containerStatuses:
f:observedGeneration:
Manager: controller
Operation: Update
Subresource: status
Time: 2024-02-27T07:25:29Z
Owner References:
API Version: serving.knative.dev/v1
Block Owner Deletion: true
Controller: true
Kind: Configuration
Name: helloworld-0
UID: 36b65317-e523-4ec3-8ea6-8734ebdf4d7b
Resource Version: 24730
UID: 933839c6-a4fd-4bcf-907b-725a455a2503
Spec:
Container Concurrency: 0
Containers:
Env:
Name: GUEST_PORT
Value: 50051
Name: GUEST_IMAGE
Value: ghcr.io/ease-lab/helloworld:var_workload
Image: crccheck/hello-world:latest
Name: user-container
Ports:
Container Port: 50051
Name: h2c
Protocol: TCP
Readiness Probe:
Success Threshold: 1
Tcp Socket:
Port: 0
Resources:
Enable Service Links: false
Timeout Seconds: 300
Status:
Actual Replicas: 0
Conditions:
Last Transition Time: 2024-02-27T07:25:29Z
Message: The target is not receiving traffic.
Reason: NoTraffic
Severity: Info
Status: False
Type: Active
Last Transition Time: 2024-02-27T07:14:40Z
Reason: Deploying
Status: Unknown
Type: ContainerHealthy
Last Transition Time: 2024-02-27T07:24:50Z
Message: Failed to get/pull image: failed to prepare extraction snapshot "extract-305755493-hrFD sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10": context deadline exceeded
Reason: CreateContainerError
Status: False
Type: Ready
Last Transition Time: 2024-02-27T07:24:50Z
Message: Failed to get/pull image: failed to prepare extraction snapshot "extract-305755493-hrFD sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10": context deadline exceeded
Reason: CreateContainerError
Status: False
Type: ResourcesAvailable
Container Statuses:
Image Digest: index.docker.io/crccheck/hello-world@sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
Name: user-container
Observed Generation: 1
Events: <none>
I've never seen such errors: "Failed to get/pull image: failed to prepare extraction snapshot". Please open a separate issue and attach the Firecracker logs from the worker nodes. It seems to be an issue with Firecracker now.
Is there a command or file location I can use to access the Firecracker logs on the worker node?
Describe the bug: Error connecting the worker node to the Kubernetes cluster when executing the following command:
when following the standard deployment steps in the quickstart guide.
To Reproduce: Set up 1 master and 1 worker node on 2 VMs running on the same computer (using VirtualBox), both on Ubuntu 20.04, and follow the steps in the quickstart guide to "Setup a Serverless (Knative) Cluster" (standard setup, non-stargz).
Expected behaviour: Success message as shown in the quickstart guide:
Logs: Error message after running the above-mentioned command:
stack trace: