4.15.8 -> 4.15.22 -> 4.16.4
Let's update to 4.15.22
Discussion in the internal Slack: https://redhat-internal.slack.com/archives/C04J8QF8Y83/p1722500655964189
Live migration of some VMs took ages:
% oc get vmi -A -L kubevirt.io/migrationTargetNodeName -l kubevirt.io/nodeName=ucs57 -o wide
NAMESPACE NAME AGE PHASE IP NODENAME READY LIVE-MIGRATABLE PAUSED MIGRATIONTARGETNODENAME
stormshift-ocp11-infra stormshift-ocp11-sno 48d Running 10.32.111.190 ucs57 True True ucs56
%
% oc logs -n stormshift-ocp11-infra --tail=10 -f virt-launcher-stormshift-ocp11-sno-xkx9m | awk '/Migration info/ {print $10}'
MemoryRemaining:866MiB
MemoryRemaining:1096MiB
MemoryRemaining:1185MiB
MemoryRemaining:983MiB
MemoryRemaining:939MiB
MemoryRemaining:992MiB
MemoryRemaining:631MiB
MemoryRemaining:129MiB
MemoryRemaining:110MiB
MemoryRemaining:478MiB
The memory keeps getting dirtied faster than it can be copied, so the transfer starts over again and again.
Looks like we have to improve the live migration settings, or it might be related to the intermittent network issues (#181).
This migration has been running for around 8+ hours:
% oc get vmim
NAME PHASE VMI
kubevirt-workload-update-fg9vz Running stormshift-ocp11-sno
% oc describe vmim kubevirt-workload-update-fg9vz
Name: kubevirt-workload-update-fg9vz
Namespace: stormshift-ocp11-infra
Labels: kubevirt.io/vmi-name=stormshift-ocp11-sno
Annotations: kubevirt.io/latest-observed-api-version: v1
kubevirt.io/storage-observed-api-version: v1
kubevirt.io/workloadUpdateMigration:
API Version: kubevirt.io/v1
Kind: VirtualMachineInstanceMigration
Metadata:
Creation Timestamp: 2024-08-01T10:54:12Z
Finalizers:
kubevirt.io/migrationJobFinalize
Generate Name: kubevirt-workload-update-
Generation: 1
Resource Version: 1259553963
UID: 64c1c689-6fa8-49b3-b1b8-76ccc1527078
Spec:
Vmi Name: stormshift-ocp11-sno
Status:
Phase: Running
Phase Transition Timestamps:
Phase: Pending
Phase Transition Timestamp: 2024-08-01T10:54:12Z
Phase: Scheduling
Phase Transition Timestamp: 2024-08-01T10:58:32Z
Phase: Scheduled
Phase Transition Timestamp: 2024-08-01T10:58:50Z
Phase: PreparingTarget
Phase Transition Timestamp: 2024-08-01T10:58:50Z
Phase: TargetReady
Phase Transition Timestamp: 2024-08-01T10:58:50Z
Phase: Running
Phase Transition Timestamp: 2024-08-01T10:58:50Z
Events: <none>
%
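The VMIM only reports phase transitions; to see what the copy itself is doing, the migration state reported on the VMI can be inspected. A minimal sketch, assuming the migrationState field layout of the KubeVirt VirtualMachineInstance API:
# Inspect the in-flight migration details reported on the VMI itself
# (field layout assumed from the KubeVirt VirtualMachineInstance API).
oc get vmi stormshift-ocp11-sno -n stormshift-ocp11-infra -o yaml \
  | yq '.status.migrationState'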
Live migration configuration:
% oc get hco -n openshift-cnv kubevirt-hyperconverged -o yaml | yq '.spec.liveMigrationConfig'
allowAutoConverge: false
allowPostCopy: false
completionTimeoutPerGiB: 800
parallelMigrationsPerCluster: 5
parallelOutboundMigrationsPerNode: 5
progressTimeout: 150
% oc explain hco.spec.liveMigrationConfig
GROUP: hco.kubevirt.io
KIND: HyperConverged
VERSION: v1beta1
FIELD: liveMigrationConfig <Object>
DESCRIPTION:
Live migration limits and timeouts are applied so that migration processes
do not overwhelm the cluster.
FIELDS:
allowAutoConverge <boolean>
AllowAutoConverge allows the platform to compromise performance/availability
of VMIs to guarantee successful VMI live migrations. Defaults to false
allowPostCopy <boolean>
AllowPostCopy enables post-copy live migrations. Such migrations allow even
the busiest VMIs to successfully live-migrate. However, events like a
network failure can cause a VMI crash. If set to true, migrations will still
start in pre-copy, but switch to post-copy when CompletionTimeoutPerGiB
triggers. Defaults to false
bandwidthPerMigration <string>
Bandwidth limit of each migration, the value is quantity of bytes per second
(e.g. 2048Mi = 2048MiB/sec)
completionTimeoutPerGiB <integer>
The migration will be canceled if it has not completed in this time, in
seconds per GiB of memory. For example, a virtual machine instance with 6GiB
memory will timeout if it has not completed migration in 4800 seconds. If
the Migration Method is BlockMigration, the size of the migrating disks is
included in the calculation.
network <string>
The migrations will be performed over a dedicated multus network to minimize
disruption to tenant workloads due to network saturation when VM live
migrations are triggered.
parallelMigrationsPerCluster <integer>
Number of migrations running in parallel in the cluster.
parallelOutboundMigrationsPerNode <integer>
Maximum number of outbound migrations per node.
progressTimeout <integer>
The migration will be canceled if memory copy fails to make progress in this
time, in seconds.
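Instead of a per-namespace MigrationPolicy, the same knobs could also be flipped cluster-wide on the HyperConverged CR; a hedged sketch of that alternative (not the route taken here, since it would affect every VM):
# Cluster-wide alternative: enable auto-converge and post-copy for all
# live migrations by patching the HyperConverged CR.
oc patch hco kubevirt-hyperconverged -n openshift-cnv --type merge \
  -p '{"spec":{"liveMigrationConfig":{"allowAutoConverge":true,"allowPostCopy":true}}}'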
Let's create and try a MigrationPolicy:
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
annotations:
description: Migration Policy to allow allowAutoConverge & allowPostCopy
name: mig-pol-allow-auto-conv-allow-post-copy
spec:
allowAutoConverge: true
allowPostCopy: true
selectors:
namespaceSelector:
mig-pol-allow-auto-conv-allow-post-copy: 'true'
oc label namespace/stormshift-ocp11-infra mig-pol-allow-auto-conv-allow-post-copy=true
Cancel the migration; a new one is created with the applied policy:
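The cancel itself isn't shown above; a minimal sketch, assuming that deleting the in-flight VMIM aborts the migration and the workload-update controller then creates a fresh one that picks up the policy:
# Abort the running migration; the workload updater should recreate it.
oc delete vmim kubevirt-workload-update-fg9vz -n stormshift-ocp11-infra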
I applied the policy to a bunch of namespaces:
% oc get namespaces -l mig-pol-allow-auto-conv-allow-post-copy=true
NAME STATUS AGE
demo-cluster-disco Active 227d
rbohne-hcp-rhods Active 242d
rbohne-hcp-sendling-ingress Active 223d
rbohne-sno Active 169d
stormshift-ocp1-infra Active 76d
stormshift-ocp11-infra Active 48d
%
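To confirm the recreated migration really runs under the new policy, the policy name recorded in the VMI's migration state can be checked; a sketch, assuming the migrationPolicyName field is populated there:
# Which MigrationPolicy the current migration matched (field name assumed)
oc get vmi stormshift-ocp11-sno -n stormshift-ocp11-infra \
  -o jsonpath='{.status.migrationState.migrationPolicyName}{"\n"}'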
ODF is stuck...
Cannot evict the OSD pods on inf44...
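When Ceph is unhealthy, Rook keeps the OSD PodDisruptionBudgets at zero allowed disruptions, which is typically what blocks the drain; a quick check (a sketch, not taken from the original log):
# PDBs with ALLOWED DISRUPTIONS = 0 are what keep the OSD pods from being evicted
oc get pdb -n openshift-storage
# Overall Ceph health as seen by the operator
oc get cephcluster -n openshift-storage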
sh-5.1$ ceph -s
cluster:
id: 5545f168-c66b-4a82-9bd1-f932695df518
health: HEALTH_WARN
1 filesystem is degraded
1 MDSs report slow metadata IOs
1/3 mons down, quorum a,b
Reduced data availability: 67 pgs inactive, 67 pgs peering
634 slow ops, oldest one blocked for 50024 sec, daemons [osd.0,osd.2,osd.4,osd.5,mon.a] have slow ops.
services:
mon: 3 daemons, quorum a,b (age 78m), out of quorum: c
mgr: a(active, since 78m), standbys: b
mds: 1/1 daemons up, 1 standby
osd: 6 osds: 6 up (since 13h), 6 in (since 3M)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 0/1 healthy, 1 recovering
pools: 12 pools, 201 pgs
objects: 69.48k objects, 253 GiB
usage: 757 GiB used, 1.9 TiB / 2.6 TiB avail
pgs: 33.333% pgs not active
134 active+clean
67 peering
sh-5.1$ ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 1/3 mons down, quorum a,b; Reduced data availability: 67 pgs inactive, 67 pgs peering; 634 slow ops, oldest one blocked for 50029 sec, daemons [osd.0,osd.2,osd.4,osd.5,mon.a] have slow ops.
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.ocs-storagecluster-cephfilesystem-a(mds.0): 3 slow metadata IOs are blocked > 30 secs, oldest blocked for 2963 secs
[WRN] MON_DOWN: 1/3 mons down, quorum a,b
mon.c (rank 2) addr v2:172.30.122.49:3300/0 is down (out of quorum)
[WRN] PG_AVAILABILITY: Reduced data availability: 67 pgs inactive, 67 pgs peering
pg 1.2 is stuck peering for 13h, current state peering, last acting [2,0,1]
pg 1.d is stuck peering for 13h, current state peering, last acting [2,0,1]
pg 1.1d is stuck peering for 14h, current state peering, last acting [0,1,5]
pg 1.1f is stuck peering for 14h, current state peering, last acting [0,1,2]
pg 1.27 is stuck peering for 13h, current state peering, last acting [5,0,1]
pg 1.28 is stuck peering for 13h, current state peering, last acting [5,0,1]
pg 1.2c is stuck peering for 14h, current state peering, last acting [2,0,4]
pg 1.2f is stuck peering for 14h, current state peering, last acting [0,1,5]
pg 1.30 is stuck peering for 14h, current state peering, last acting [0,5,1]
pg 1.31 is stuck peering for 13h, current state peering, last acting [2,1,0]
pg 1.36 is stuck peering for 14h, current state peering, last acting [0,5,1]
pg 1.39 is stuck peering for 14h, current state peering, last acting [0,5,4]
pg 3.1 is stuck peering for 14h, current state peering, last acting [0,2,4]
pg 3.3 is stuck peering for 13h, current state peering, last acting [2,0,4]
pg 3.6 is stuck peering for 14h, current state peering, last acting [0,2,4]
pg 4.4 is stuck peering for 21h, current state peering, last acting [0,1,2]
pg 4.5 is stuck peering for 14h, current state peering, last acting [5,1,0]
pg 4.7 is stuck peering for 17h, current state peering, last acting [0,1,2]
pg 5.2 is stuck peering for 13h, current state peering, last acting [5,0,1]
pg 5.3 is stuck peering for 21h, current state peering, last acting [0,5,1]
pg 5.5 is stuck peering for 14h, current state peering, last acting [0,2,1]
pg 6.0 is stuck peering for 15h, current state peering, last acting [0,5,1]
pg 6.3 is stuck peering for 13h, current state peering, last acting [5,4,0]
pg 6.5 is stuck peering for 13h, current state peering, last acting [5,1,0]
pg 7.4 is stuck peering for 18h, current state peering, last acting [0,5,4]
pg 7.6 is stuck peering for 21h, current state peering, last acting [0,2,4]
pg 8.1 is stuck peering for 17h, current state peering, last acting [0,4,2]
pg 8.7 is stuck inactive for 13h, current state peering, last acting [0,1,2]
pg 9.0 is stuck peering for 14h, current state peering, last acting [0,2,4]
pg 10.6 is stuck peering for 14h, current state peering, last acting [5,4,0]
pg 10.8 is stuck peering for 21h, current state peering, last acting [0,1,5]
pg 10.9 is stuck peering for 14h, current state peering, last acting [5,0,1]
pg 10.a is stuck peering for 14h, current state peering, last acting [2,4,0]
pg 10.c is stuck peering for 14h, current state peering, last acting [2,4,0]
pg 10.d is stuck peering for 14h, current state peering, last acting [2,0,1]
pg 10.e is stuck peering for 14h, current state peering, last acting [5,0,4]
pg 11.0 is stuck peering for 15h, current state peering, last acting [0,2,1]
pg 11.2 is stuck peering for 16h, current state peering, last acting [0,5,4]
pg 11.6 is stuck peering for 18h, current state peering, last acting [0,5,1]
pg 11.7 is stuck peering for 13h, current state peering, last acting [5,0,1]
pg 11.c is stuck peering for 15h, current state peering, last acting [0,2,4]
pg 11.14 is stuck peering for 13h, current state peering, last acting [2,0,4]
pg 11.15 is stuck peering for 17h, current state peering, last acting [0,2,1]
pg 12.0 is stuck peering for 14h, current state peering, last acting [5,0,4]
pg 12.6 is stuck peering for 19h, current state peering, last acting [0,2,4]
pg 12.b is stuck peering for 19h, current state peering, last acting [0,1,2]
pg 12.c is stuck peering for 21h, current state peering, last acting [0,1,5]
pg 12.d is stuck peering for 14h, current state peering, last acting [5,0,1]
pg 12.e is stuck peering for 21h, current state peering, last acting [0,4,5]
pg 12.f is stuck peering for 14h, current state peering, last acting [5,0,4]
pg 12.10 is stuck peering for 21h, current state peering, last acting [0,2,1]
[WRN] SLOW_OPS: 634 slow ops, oldest one blocked for 50029 sec, daemons [osd.0,osd.2,osd.4,osd.5,mon.a] have slow ops.
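To map the slow daemons back to nodes, the OSD tree can be compared with the pod placement; a sketch using the rook-ceph toolbox plus oc:
# From the rook-ceph toolbox: which host each OSD lives on
ceph osd tree
# From outside: where the mon and osd pods are actually scheduled
oc get pods -n openshift-storage -o wide | grep -E 'rook-ceph-(mon|osd)'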
Let's try adding the new nodes ceph10, ceph11, and ceph12 and rebalancing the storage.
% oc get pv -l storage.openshift.com/owner-name=local-odf-ssd -L kubernetes.io/hostname
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE HOSTNAME
local-pv-308e6d8f 1863Gi RWO Delete Available local-odf-ssd 10m ceph10
local-pv-335dba3d 1863Gi RWO Delete Available local-odf-ssd 14m ceph11
local-pv-3e070be3 445Gi RWO Delete Bound openshift-storage/ocs-deviceset-local-odf-ssd-0-data-1vjqbd local-odf-ssd 258d inf7
local-pv-54987767 1863Gi RWO Delete Available local-odf-ssd 14m ceph11
local-pv-602a8e03 447Gi RWO Delete Bound openshift-storage/ocs-deviceset-local-odf-ssd-0-data-5ktr4x local-odf-ssd 258d inf44
local-pv-71c4b9df 445Gi RWO Delete Bound openshift-storage/ocs-deviceset-local-odf-ssd-0-data-3h744f local-odf-ssd 258d inf8
local-pv-7f7b1f43 1863Gi RWO Delete Available local-odf-ssd 7m46s ceph10
local-pv-7faa5447 445Gi RWO Delete Bound openshift-storage/ocs-deviceset-local-odf-ssd-0-data-4fvvd5 local-odf-ssd 258d inf7
local-pv-9610e3d8 447Gi RWO Delete Bound openshift-storage/ocs-deviceset-local-odf-ssd-0-data-2glq85 local-odf-ssd 258d inf44
local-pv-c0d9ea9c 1863Gi RWO Delete Available local-odf-ssd 7m2s ceph12
local-pv-d51b1626 445Gi RWO Delete Bound openshift-storage/ocs-deviceset-local-odf-ssd-0-data-0xbb26 local-odf-ssd 258d inf8
local-pv-ef42156d 1863Gi RWO Delete Available local-odf-ssd 14s ceph12
%
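For the new Available PVs to become OSDs, the StorageCluster device set has to be scaled up to claim them; a hedged sketch, assuming a single device set at index 0, with the new count below only a hypothetical placeholder that must match the real count/replica layout:
# Inspect the current device set sizing first
oc get storagecluster ocs-storagecluster -n openshift-storage \
  -o jsonpath='{.spec.storageDeviceSets[0].count} x {.spec.storageDeviceSets[0].replica}{"\n"}'
# Then raise the count so the six new PVs get claimed (the value 4 is hypothetical)
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json \
  -p '[{"op":"replace","path":"/spec/storageDeviceSets/0/count","value":4}]'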
% oc get pods -o wide -l app=rook-ceph-osd-prepare
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph-osd-prepare-0504d86b7040a0af248baceaa6d2a70a-mb6np 0/1 CrashLoopBackOff 5 (2m38s ago) 5m56s 10.130.16.16 ceph10 <none> <none>
rook-ceph-osd-prepare-0c7caa1ac0cc8efbb1205c35c8d932ca-8rwcp 0/1 CrashLoopBackOff 5 (2m30s ago) 5m55s 10.128.24.23 ceph12 <none> <none>
rook-ceph-osd-prepare-0cb23090e0e9580b58d82178a9312625-q8kv2 0/1 CrashLoopBackOff 5 (2m25s ago) 5m56s 10.128.24.22 ceph12 <none> <none>
rook-ceph-osd-prepare-0dff7b5649971c7463f6d3d1b0e75ffe-gzp6b 0/1 CrashLoopBackOff 5 (2m31s ago) 5m55s 10.131.16.12 ceph11 <none> <none>
rook-ceph-osd-prepare-4cfc0cdb9cda58ca06d1468537e4a1af-rgwcd 0/1 CrashLoopBackOff 5 (2m19s ago) 5m53s 10.131.16.13 ceph11 <none> <none>
rook-ceph-osd-prepare-9425b6f55cbbb5c6dae8a85e0555cdb0-pmmhk 0/1 CrashLoopBackOff 5 (2m42s ago) 5m54s 10.130.16.17 ceph10 <none> <none>
%
Pods disappear.
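The osd-prepare pods are Job pods and get cleaned up quickly, so any failure reason has to be grabbed while they still exist; a sketch (the provision container name is an assumption):
# Capture the prepare logs before the Job pods are garbage-collected
oc logs -n openshift-storage -l app=rook-ceph-osd-prepare -c provision --tail=50 --prefix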
Grr, I want to finish the update (4.15.xx) today, so let's 🔨: reboot inf44 to force the drain. 😢
inf44 did not come back online:
Aug 02 11:46:25 inf44 crio[13880]: time="2024-08-02 11:46:25.952115702Z" level=warning msg="Error encountered when checking whether cri-o should wipe containers: open /var/run/crio/version: no such file or directory"
Aug 02 11:46:25 inf44 crio[13880]: time="2024-08-02 11:46:25.952336785Z" level=info msg="Registered SIGHUP reload watcher"
Aug 02 11:46:25 inf44 crio[13880]: time="2024-08-02 11:46:25.952493035Z" level=fatal msg="Failed to start streaming server: listen tcp [2620:52:0:2060:77c:8bd7:a07c:5017]:0: bind: cannot assign requested address"
Aug 02 11:46:25 inf44 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 11:46:25 inf44 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 02 11:46:25 inf44 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
CRI-O and IPv6 🤦🏼
[root@inf44 ~]# systemctl cat crio | tail
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"
# /etc/systemd/system/crio.service.d/10-mco-profile-unix-socket.conf
[Service]
Environment="ENABLE_PROFILE_UNIX_SOCKET=true"
# /etc/systemd/system/crio.service.d/20-nodenet.conf
[Service]
Environment="CONTAINER_STREAM_ADDRESS=2620:52:0:2060:77c:8bd7:a07c:5017"
[root@inf44 ~]#
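The drop-in pins the CRI-O streaming server to a fixed IPv6 address; a quick way to confirm that this address no longer exists on the node, which would explain the bind failure (a sketch, with br-ex taken from the nmcli output below):
# If this prints nothing, the pinned IPv6 address is gone and the bind has to fail
ip -6 addr show dev br-ex | grep 2620:52:0:2060:77c:8bd7:a07c:5017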
The first problems were caused by #188.
Disable IPv6:
[root@inf44 etc]# nmcli connection show da1c0030-8c36-3895-9dda-32d13fcf0eaf | grep interface
connection.interface-name: eno1
[root@inf44 etc]# nmcli connection show
NAME UUID TYPE DEVICE
ovs-if-br-ex 5041b789-83ee-41bc-a243-e3e779559b08 ovs-interface br-ex
lo 3862ec1b-6e5a-4805-bebf-74392ffdf1df loopback lo
Wired connection 1 d029ea06-2680-34dc-bc3b-c2cea6ce2836 ethernet eno2
br-ex 7f1ce127-575c-4619-b65f-017d9dce6b62 ovs-bridge br-ex
coe-br-vlan-69 08ac4b8f-46e3-4e10-ab69-1700dc453527 bridge coe-br-vlan-69
coe-bridge c4b1040f-cb5c-440b-92a1-b90c521724b2 bridge coe-bridge
eno2.69 3d5fc333-a714-4de9-8055-462370391fb1 vlan eno2.69
ovs-if-phys0 e749d6cd-a03f-4368-95f7-e6c6afa17115 ethernet eno1
ovs-port-br-ex 962ea91c-1e94-4e4b-a1f2-2775057d2a91 ovs-port br-ex
ovs-port-phys0 e9cb92da-7932-4bbe-b572-9546fb9bef19 ovs-port eno1
Wired connection 2 da1c0030-8c36-3895-9dda-32d13fcf0eaf ethernet --
[root@inf44 etc]# nmcli connection show da1c0030-8c36-3895-9dda-32d13fcf0eaf | grep interface
connection.interface-name: eno1
[root@inf44 etc]# nmcli connection modify da1c0030-8c36-3895-9dda-32d13fcf0eaf ipv6.method "disabled"
[root@inf44 etc]# reboot
[root@inf44 etc]# Connection to 10.32.96.44 closed by remote host.
Started the upgrade to 4.16.4; inf8 got stuck because of the same IPv6 issue as inf44.
Disabled IPv6 on inf8.
Update done, IPv6 disabled.