@aravindhp Happy to help, if I can.
The `make all` is still running for me, so the section title "Run in WSL from Windows host filesystem: Success" above is not entirely true, as it has not completed with success yet, i.e. the two-node cluster is not running. I'm updating the output with some `Cannot find path` failures, as you can already see for yourself. However, those failures seem unrelated to the actual issue this PR addresses.
I'm seeing the `make all` procedure stuck in a loop trying to resolve this issue - it has repeated the same step at least three times already - but as mentioned, this must be a different issue:
...
cni: calico
calico: 3.25.0; containerd: 1.6.15
==> winw1: Running provisioner: shell...
winw1: Running: sync/windows/0-containerd.ps1 as C:\tmp\vagrant-shell.ps1
winw1: Stopping ContainerD & Kubelet
winw1: Downloading Calico using ContainerD - [calico: 3.25] [containerd: 1.6.15]
winw1: Installing 7Zip
winw1: Getting ContainerD binaries
winw1: Downloading https://github.com/containerd/containerd/releases/download/v1.6.15/containerd-1.6.15-windows-amd64.tar.gz to C:\Program Files\containerd\containerd.tar.gz
winw1: x containerd-shim-runhcs-v1.exe
winw1: x ctr.exe
winw1: x containerd-stress.exe
winw1: x containerd.exe
winw1: Registering ContainerD as a service
winw1: Starting ContainerD service
winw1: time="2023-04-14T13:23:12.295397500-07:00" level=fatal msg="The specified service already exists."
winw1: Done - please remember to add '--cri-socket "npipe:////./pipe/containerd-containerd"' to your kubeadm join command
==> winw1: Running provisioner: shell...
winw1: Running: sync/windows/forked.ps1 as C:\tmp\vagrant-shell.ps1
winw1:
winw1:
winw1: Directory: C:\
winw1:
winw1:
winw1: Mode LastWriteTime Length Name
winw1: ---- ------------- ------ ----
winw1: d----- 1/21/2022 3:44 AM k
winw1:
winw1:
winw1: cp : Cannot find path 'C:\forked\StartKubelet.ps1' because it does not exist.
winw1: At C:\tmp\vagrant-shell.ps1:11 char:5
winw1: + cp C:/forked/StartKubelet.ps1 c:\k\StartKubelet.ps1
winw1: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
winw1: + CategoryInfo : ObjectNotFound: (C:\forked\StartKubelet.ps1:String) [Copy-Item], ItemNotFoundException
winw1: + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.CopyItemCommand
winw1:
==> winw1: Running provisioner: shell...
...
/hold
/lgtm cancel
OK, let's figure out the issue before we get this merged.
@aravindhp After a few attempts and some tweaks, I've managed to run the cluster with success, I think. Next, I am going to try to reproduce this success on a different Windows 11 machine.
Below, I copy the full story in detail from my personal notes at https://github.com/mloskot/sig-windows-dev-tools/wiki/Successful-Run-1
Hyper-V has NOT been disabled
C:\> Get-WindowsOptionalFeature -FeatureName Microsoft-Hyper-V-All -Online
FeatureName : Microsoft-Hyper-V-All
DisplayName : Hyper-V
Description : Provides services and management tools for creating and running virtual machines and their resources.
RestartRequired : Possible
State : Enabled
CustomProperties :
C:\> Get-Service | findstr vm
Running vmcompute Hyper-V Host Compute Service
Stopped vmicguestinterface Hyper-V Guest Service Interface
Stopped vmicheartbeat Hyper-V Heartbeat Service
Stopped vmickvpexchange Hyper-V Data Exchange Service
Stopped vmicrdv Hyper-V Remote Desktop Virtualizati...
Stopped vmicshutdown Hyper-V Guest Shutdown Service
Stopped vmictimesync Hyper-V Time Synchronization Service
Stopped vmicvmsession Hyper-V PowerShell Direct Service
Stopped vmicvss Hyper-V Volume Shadow Copy Requestor
Running vmms Hyper-V Virtual Machine Management
variables.yaml
NOTICE: I don't know if these changes have been helpful or essential for the successful run, but after the initial failures with the Windows node (see above), I took a shot in the dark and bumped the versions.
make all
Inside a WSL terminal (Ubuntu 22.04), run:
export VAGRANT=/mnt/c/HashiCorp/Vagrant/bin/vagrant.exe
cd sig-windows-dev-tools
make all
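As a side note, a quick sanity check that WSL can actually reach the Windows-side Vagrant binary (using the path exported above) before kicking off the long run:
$VAGRANT --version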
Despite the `make all` above terminating with a `3-smoke-test` error, the two nodes of the cluster are Ready:
$ vagrant ssh controlplane
cni: calico
Last login: Fri Apr 14 23:39:48 2023 from 10.0.2.2
vagrant@controlplane:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 23m v1.27.1-1+95feac5269be09
winw1 NotReady <none> 67s v1.27.1-1+95feac5269be09
vagrant@controlplane:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 24m v1.27.1-1+95feac5269be09
winw1 NotReady <none> 2m22s v1.27.1-1+95feac5269be09
vagrant@controlplane:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 29m v1.27.1-1+95feac5269be09
winw1 NotReady <none> 6m55s v1.27.1-1+95feac5269be09
vagrant@controlplane:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 34m v1.27.1-1+95feac5269be09
winw1 NotReady <none> 12m v1.27.1-1+95feac5269be09
vagrant@controlplane:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 35m v1.27.1-1+95feac5269be09
winw1 Ready <none> 13m v1.27.1-1+95feac5269be09
vagrant@controlplane:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 36m v1.27.1-1+95feac5269be09
winw1 Ready <none> 14m v1.27.1-1+95feac5269be09
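Instead of re-running `kubectl get nodes` by hand, the wait can also be scripted; a small sketch (run on the controlplane node, as above):
# Block until the Windows node reports Ready, or the timeout expires
kubectl wait --for=condition=Ready node/winw1 --timeout=30m
# Alternatively, watch the node list for changes
kubectl get nodes -w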
At least one pod on the `winw1` node has status Running, i.e. `calico-node-windows-bjstr`.
vagrant@controlplane:~$ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-apiserver calico-apiserver-8bbdd5967-4dc4v 1/1 Running 0 33m 100.244.49.70 controlplane <none> <none>
calico-apiserver calico-apiserver-8bbdd5967-fvrrm 1/1 Running 0 33m 100.244.49.69 controlplane <none> <none>
calico-system calico-kube-controllers-789dc4c76b-f458q 1/1 Running 0 36m 100.244.49.67 controlplane <none> <none>
calico-system calico-node-2qgph 1/1 Running 0 36m 10.20.30.10 controlplane <none> <none>
calico-system calico-node-windows-bjstr 1/2 Running 1 (60s ago) 15m 10.20.30.11 winw1 <none> <none>
calico-system calico-typha-6fbc8d5c5d-fjh8s 1/1 Running 0 36m 10.20.30.10 controlplane <none> <none>
calico-system csi-node-driver-r4j49 2/2 Running 0 36m 100.244.49.68 controlplane <none> <none>
default netshoot 1/1 Running 0 14m 100.244.49.72 controlplane <none> <none>
default nginx-deployment-7f97bd64fb-h288q 1/1 Running 0 14m 100.244.49.71 controlplane <none> <none>
default whoami-windows-9d46bfd7-4clgt 0/1 ContainerCreating 0 14m <none> winw1 <none> <none>
default whoami-windows-9d46bfd7-krnbd 0/1 ContainerCreating 0 14m <none> winw1 <none> <none>
default whoami-windows-9d46bfd7-ntqg8 0/1 ContainerCreating 0 14m <none> winw1 <none> <none>
kube-system coredns-5d78c9869d-xrpcd 1/1 Running 0 37m 100.244.49.65 controlplane <none> <none>
kube-system coredns-5d78c9869d-zmtft 1/1 Running 0 37m 100.244.49.66 controlplane <none> <none>
kube-system etcd-controlplane 1/1 Running 0 37m 10.20.30.10 controlplane <none> <none>
kube-system kube-apiserver-controlplane 1/1 Running 0 37m 10.20.30.10 controlplane <none> <none>
kube-system kube-controller-manager-controlplane 1/1 Running 0 37m 10.20.30.10 controlplane <none> <none>
kube-system kube-proxy-ms54m 1/1 Running 0 37m 10.20.30.10 controlplane <none> <none>
kube-system kube-scheduler-controlplane 1/1 Running 0 37m 10.20.30.10 controlplane <none> <none>
tigera-operator tigera-operator-549d4f9bdb-g6k97 1/1 Running 0 37m 10.20.30.10 controlplane <none> <none>
kubectl on host
Download kubeconfig from the controlplane node:
vagrant plugin install vagrant-scp
vagrant scp controlplane:~/.kube/config ./.kubeconfig-sig-windows-dev-tools
Access cluster resources:
kubectl get nodes --kubeconfig=./.kubeconfig-sig-windows-dev-tools -o wide
kubectl get pods -A --kubeconfig=./.kubeconfig-sig-windows-dev-tools -o wide
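Alternatively, instead of passing --kubeconfig to every call, the KUBECONFIG environment variable can point at the downloaded file for the current shell:
export KUBECONFIG=$PWD/.kubeconfig-sig-windows-dev-tools
kubectl get nodes -o wide
kubectl get pods -A -o wide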
Great news, thanks! So does that mean we should merge this?
@jayunit100
> so does that mean we should merge this?
It turned out there are two threads here:
1. The PR itself, which proposes just a clarification of the currently documented procedure. It does not propose any substantial change to the procedure itself. This is completed.
2. The test of the overall currently documented procedure on a Windows host, my attempts at which I documented in https://github.com/kubernetes-sigs/sig-windows-dev-tools/pull/245#issuecomment-1509712635. This is still a work in progress - I need to build more understanding of the whole setup. For example, here are some issues I need to address:
The update of versions in my patch to `variables.yaml` above may turn out to be unnecessary, and even insufficient, as the Windows Server 2019 image is already built with hard-wired versions of tools, or with those tools already deployed.
I'm not sure why `crictl.exe` is not being found, as the WS 2019 image should already have it deployed per
https://github.com/kubernetes-sigs/sig-windows-dev-tools/blob/4cfbbba1acea3e3b3dce27da9a81e14b1a8c6a58/experiments/image-builder/overlays/ansible/roles/utilities/tasks/main.yml#L15-L18
but I'm seeing the following (a quick check for this is sketched after these points):
winw1: [preflight] WARNING: Couldn't create the interface used for talking to the container runtime:
crictl is required for container runtime: exec: "crictl": executable file not found in %PATH%
I'm observing that it would be good to increase the Windows boot timeout in the `Vagrantfile` with a patch like this:
--- a/Vagrantfile
+++ b/Vagrantfile
@@ -62,6 +62,8 @@ Vagrant.configure(2) do |config|
winw1.vm.box = "sig-windows-dev-tools/windows-2019"
winw1.vm.box_version = "1.0"
+ winw1.vm.boot_timeout = 600
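As a quick check for the missing `crictl.exe` (second point above), something like this could be run from WSL against the Windows node, assuming a Vagrant version that ships the winrm command (2.2+); this is just a sketch, not a verified step:
# Ask the Windows node whether crictl.exe resolves and what PATH it sees
$VAGRANT winrm winw1 -c "Get-Command crictl.exe"
$VAGRANT winrm winw1 -c "\$env:Path"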
To sum this thread up, I will continue discovering details and will try to discuss and confirm them here and on the #sig-windows Slack channel.
Meanwhile, IMHO, this PR can be merged now, regardless of the results of my tests of the whole procedure.
Following my first attempt in https://github.com/kubernetes-sigs/sig-windows-dev-tools/pull/245#issuecomment-1509229142, I have managed to successfully create the two-node cluster on another machine, using the default `variables.yaml` without any patches. The only patch I had to apply was a larger `winw1.vm.boot_timeout`, as I had been experiencing Vagrant timeouts for the Windows VM.
Hyper-V has NOT been disabled
C:\> Get-WindowsOptionalFeature -FeatureName Microsoft-Hyper-V-All -Online
FeatureName : Microsoft-Hyper-V-All
DisplayName : Hyper-V
Description : Provides services and management tools for creating and running virtual machines and their resources.
RestartRequired : Possible
State : Enabled
CustomProperties :
C:\> Get-Service | findstr vm
Running vmcompute Hyper-V Host Compute Service
Stopped vmicguestinterface Hyper-V Guest Service Interface
Stopped vmicheartbeat Hyper-V Heartbeat Service
Stopped vmickvpexchange Hyper-V Data Exchange Service
Stopped vmicrdv Hyper-V Remote Desktop Virtualizati...
Stopped vmicshutdown Hyper-V Guest Shutdown Service
Stopped vmictimesync Hyper-V Time Synchronization Service
Stopped vmicvmsession Hyper-V PowerShell Direct Service
Stopped vmicvss Hyper-V Volume Shadow Copy Requestor
Running vmms Hyper-V Virtual Machine Management
Unlike in Successful-Run-1, this time there were no CPU and memory updates to `variables.yaml`. The `Vagrantfile` was patched to avoid Windows node boot timeouts:
diff --git a/Vagrantfile b/Vagrantfile
index 1cfc0fe..45b15c7 100644
--- a/Vagrantfile
+++ b/Vagrantfile
@@ -61,6 +61,7 @@ Vagrant.configure(2) do |config|
winw1.vm.host_name = "winw1"
winw1.vm.box = "sig-windows-dev-tools/windows-2019"
winw1.vm.box_version = "1.0"
+ winw1.vm.boot_timeout = 600
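Note that if the winw1 VM already exists when this patch is applied, the longer timeout only takes effect the next time Vagrant boots the machine; reloading the VM (here via the Windows-side vagrant.exe exported as VAGRANT in the steps below) should pick it up:
$VAGRANT reload winw1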
make all
Inside a WSL terminal (Ubuntu 22.04), run:
export VAGRANT=/mnt/c/HashiCorp/Vagrant/bin/vagrant.exe
cd sig-windows-dev-tools
make all
Despite the `make all` above terminating with a `3-smoke-test` error, the two nodes of the cluster are Ready:
▶ ⎈ kubernetes-admin@kubernetes ▶ $ ▶ $ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready control-plane 6h28m v1.26.4-3+2d4a3e29be572e
winw1 Ready <none> 5h45m v1.26.4-3+2d4a3e29be572e
All pods on the `winw1` node in namespace `default` have status Running:
▶ ⎈ kubernetes-admin@kubernetes ▶ default ▶ $ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-7c6949fdf4-9sjkg 1/1 Running 0 47m
whoami-windows-6f7964957-6rdvz 1/1 Running 0 5h45m
whoami-windows-6f7964957-gw9pt 1/1 Running 0 5h45m
whoami-windows-6f7964957-k475s 1/1 Running 0 5h45m
I have managed to deploy and access the dashboard.
I observed significant performance issues with the overall cluster. kubectl often takes a long time to come back with results. Running `kubectl get -A events --watch` seems to decrease the performance even more.
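Before restoring the resource bump, one way to check whether the Linux node is actually starved for memory or CPU is to look inside the VM (a sketch; run from the repository directory):
# Show free memory, CPU count, and load average on the controlplane VM
vagrant ssh controlplane -c "free -m && nproc && uptime"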
I am going to restore the CPU and memory tweaks from the `variables.yaml` patch in Successful-Run-1, that is:
## Linux settings
k8s_linux_kubelet_nodeip: "10.20.30.10"
windows_node_ip: "10.20.30.11"
-windows_ram: 6048
-linux_ram: 4096
-linux_cpus: 2
-windows_cpus: 4
+windows_ram: 8192
+linux_ram: 8192
+linux_cpus: 4
+windows_cpus: 8
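For reference, the bumped values add up to 8192 + 8192 = 16384 MB of RAM and 4 + 8 = 12 vCPUs across the two VMs, so the Windows host needs at least that much capacity on top of its own usage.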
Then I will perform another run to create the whole cluster.
@aravindhp Thanks! Is `/lgtm` supposed to remove the `do-not-merge/hold` label?
/unhold
FYI: I was following the doc to recreate the steps, using the same host environment, but VirtualBox is running version 6.0. I got an error:
==> controlplane: Running 'pre-boot' VM customizations...
==> controlplane: Booting VM...
There was an error while executing `VBoxManage`, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.
Command: ["startvm", "3e926785-4d71-44a8-b942-f2357bf1bf87", "--type", "headless"]
Stderr: VBoxManage.exe: error: Failed to get device handle and/or partition ID for 00000000016c5930 (hPartitionDevice=0000000000000ad9, Last=0xc0000002/1) (VERR_NEM_VM_CREATE_FAILED)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole
make: *** [Makefile:47: 2-vagrant-up] Error 1
Apparently, I can't run VMs on VirtualBox version 6.0 while Hyper-V is enabled, so I had to upgrade to version 7.0, which supports running VMs whether Hyper-V is enabled or disabled.
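A quick way to confirm which VirtualBox version the host is actually running, from WSL (the path below is the default install location and may differ on your machine):
"/mnt/c/Program Files/Oracle/VirtualBox/VBoxManage.exe" --version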
@adeniyistephen Yes, I think VirtualBox 6.0 seems to be the issue. I've recalled in more detail what setups I tried:
And in none of those did I experience the error you did.
Correct @mloskot, I read that the development started from 6.1.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aravindhp, mloskot
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Run the cluster from the Windows host filesystem. Running it from the WSL filesystem is likely to fail.
The current documentation is missing an important requirement for those who want to run the two-node cluster on Windows in a WSL environment.
It is important to run `make all` inside the `sig-windows-dev-tools` repo cloned on the Windows filesystem and not the WSL filesystem, and here is why:
Run in WSL from WSL filesystem: Fail
Go to a WSL terminal, then run the following sequence:
Run in WSL from Windows host filesystem: Success
Go to a WSL terminal, then run the following sequence - notice the F: drive location, not the WSL $HOME:
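For illustration, a minimal sketch of such a sequence, assuming the repository was cloned to F:\code\sig-windows-dev-tools (a hypothetical path on the Windows F: drive):
export VAGRANT=/mnt/c/HashiCorp/Vagrant/bin/vagrant.exe
cd /mnt/f/code/sig-windows-dev-tools   # repo on the Windows filesystem, not under the WSL $HOME
make all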