stormshift / support

This repo should serve as a central source for reporting issues with stormshift

Update ISAR to 4.16 #193

Closed rbo closed 3 months ago

rbo commented 3 months ago

Update process: https://access.redhat.com/labs/ocpupgradegraph/update_path/?channel=stable-4.15&arch=x86_64&is_show_hot_fix=false&current_ocp_version=4.15.8&target_ocp_version=4.16.4

4.15.8 -> 4.15.22 -> 4.16.4

Let's update to 4.15.22
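
A minimal sketch of the corresponding CLI steps (assuming we drive the update from the command line rather than the console; double-check against the current docs):

# stay on stable-4.15 for the intermediate hop
oc adm upgrade channel stable-4.15
# start the 4.15.8 -> 4.15.22 update and watch it
oc adm upgrade --to=4.15.22
oc adm upgrade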

rbo commented 3 months ago

Discussions at the internal slack: https://redhat-internal.slack.com/archives/C04J8QF8Y83/p1722500655964189

Live migration of some VMs took ages:

%  oc get vmi -A -L kubevirt.io/migrationTargetNodeName -l kubevirt.io/nodeName=ucs57 -o wide
NAMESPACE                NAME                   AGE   PHASE     IP              NODENAME   READY   LIVE-MIGRATABLE   PAUSED   MIGRATIONTARGETNODENAME
stormshift-ocp11-infra   stormshift-ocp11-sno   48d   Running   10.32.111.190   ucs57      True    True                       ucs56
%
% oc logs -n stormshift-ocp11-infra --tail=10 -f virt-launcher-stormshift-ocp11-sno-xkx9m | awk '/Migration info/ {print $10}'
MemoryRemaining:866MiB
MemoryRemaining:1096MiB
MemoryRemaining:1185MiB
MemoryRemaining:983MiB
MemoryRemaining:939MiB
MemoryRemaining:992MiB
MemoryRemaining:631MiB
MemoryRemaining:129MiB
MemoryRemaining:110MiB
MemoryRemaining:478MiB

The guest keeps dirtying memory faster than it can be copied, so the remaining memory is transferred again and again and the migration never converges.

Looks like we have to improve the live migration settings:

Or it might be related to the intermittent network issues tracked in #181.
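
One knob we could turn (just a sketch of a cluster-wide alternative, not what I ended up doing; see the per-namespace MigrationPolicy further down) is to allow auto-converge and post-copy on the HyperConverged CR:

# Sketch: cluster-wide alternative to the per-namespace MigrationPolicy used below
oc patch hco kubevirt-hyperconverged -n openshift-cnv --type merge \
  -p '{"spec":{"liveMigrationConfig":{"allowAutoConverge":true,"allowPostCopy":true}}}'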

rbo commented 3 months ago

This migration has been running for more than 8 hours now:

% oc get vmim
NAME                             PHASE     VMI
kubevirt-workload-update-fg9vz   Running   stormshift-ocp11-sno
% oc describe vmim kubevirt-workload-update-fg9vz
Name:         kubevirt-workload-update-fg9vz
Namespace:    stormshift-ocp11-infra
Labels:       kubevirt.io/vmi-name=stormshift-ocp11-sno
Annotations:  kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1
              kubevirt.io/workloadUpdateMigration:
API Version:  kubevirt.io/v1
Kind:         VirtualMachineInstanceMigration
Metadata:
  Creation Timestamp:  2024-08-01T10:54:12Z
  Finalizers:
    kubevirt.io/migrationJobFinalize
  Generate Name:     kubevirt-workload-update-
  Generation:        1
  Resource Version:  1259553963
  UID:               64c1c689-6fa8-49b3-b1b8-76ccc1527078
Spec:
  Vmi Name:  stormshift-ocp11-sno
Status:
  Phase:  Running
  Phase Transition Timestamps:
    Phase:                       Pending
    Phase Transition Timestamp:  2024-08-01T10:54:12Z
    Phase:                       Scheduling
    Phase Transition Timestamp:  2024-08-01T10:58:32Z
    Phase:                       Scheduled
    Phase Transition Timestamp:  2024-08-01T10:58:50Z
    Phase:                       PreparingTarget
    Phase Transition Timestamp:  2024-08-01T10:58:50Z
    Phase:                       TargetReady
    Phase Transition Timestamp:  2024-08-01T10:58:50Z
    Phase:                       Running
    Phase Transition Timestamp:  2024-08-01T10:58:50Z
Events:                          <none>
%
rbo commented 3 months ago

Live migration configuration:

% oc get hco -n openshift-cnv kubevirt-hyperconverged -o yaml | yq '.spec.liveMigrationConfig'
allowAutoConverge: false
allowPostCopy: false
completionTimeoutPerGiB: 800
parallelMigrationsPerCluster: 5
parallelOutboundMigrationsPerNode: 5
progressTimeout: 150
% oc explain hco.spec.liveMigrationConfig
GROUP:      hco.kubevirt.io
KIND:       HyperConverged
VERSION:    v1beta1

FIELD: liveMigrationConfig <Object>

DESCRIPTION:
    Live migration limits and timeouts are applied so that migration processes
    do not overwhelm the cluster.

FIELDS:
  allowAutoConverge <boolean>
    AllowAutoConverge allows the platform to compromise performance/availability
    of VMIs to guarantee successful VMI live migrations. Defaults to false

  allowPostCopy <boolean>
    AllowPostCopy enables post-copy live migrations. Such migrations allow even
    the busiest VMIs to successfully live-migrate. However, events like a
    network failure can cause a VMI crash. If set to true, migrations will still
    start in pre-copy, but switch to post-copy when CompletionTimeoutPerGiB
    triggers. Defaults to false

  bandwidthPerMigration <string>
    Bandwidth limit of each migration, the value is quantity of bytes per second
    (e.g. 2048Mi = 2048MiB/sec)

  completionTimeoutPerGiB   <integer>
    The migration will be canceled if it has not completed in this time, in
    seconds per GiB of memory. For example, a virtual machine instance with 6GiB
    memory will timeout if it has not completed migration in 4800 seconds. If
    the Migration Method is BlockMigration, the size of the migrating disks is
    included in the calculation.

  network   <string>
    The migrations will be performed over a dedicated multus network to minimize
    disruption to tenant workloads due to network saturation when VM live
    migrations are triggered.

  parallelMigrationsPerCluster  <integer>
    Number of migrations running in parallel in the cluster.

  parallelOutboundMigrationsPerNode <integer>
    Maximum number of outbound migrations per node.

  progressTimeout   <integer>
    The migration will be canceled if memory copy fails to make progress in this
    time, in seconds.
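
For reference: with completionTimeoutPerGiB=800, a VM with e.g. 32 GiB of RAM is only cancelled after 32 × 800 = 25600 seconds (roughly 7 hours), and progressTimeout=150 only fires when the memory copy makes no progress at all for 150 seconds. A migration that keeps copying re-dirtied memory can therefore run for many hours before anything times out, which matches what we see above.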

https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/virtualization/live-migration#virt-configuring-a-live-migration-policy_virt-configuring-live-migration

Let's create and try a MigrationPolicy

apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
  annotations:
    description: Migration Policy to allow allowAutoConverge & allowPostCopy
  name: mig-pol-allow-auto-conv-allow-post-copy
spec:
  allowAutoConverge: true
  allowPostCopy: true
  selectors:
    namespaceSelector:
      mig-pol-allow-auto-conv-allow-post-copy: 'true'
oc label namespace/stormshift-ocp11-infra mig-pol-allow-auto-conv-allow-post-copy=true   
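
Rough rollout sketch (the YAML file name is made up; the VMIM name is the one from the describe output above):

# Apply the policy and label the namespace (command above), then cancel the stuck
# migration by deleting its VMIM; the workload updater creates a new migration
# that picks up the policy.
oc apply -f mig-pol-allow-auto-conv-allow-post-copy.yaml
oc delete vmim -n stormshift-ocp11-infra kubevirt-workload-update-fg9vz
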
rbo commented 3 months ago

Cancelled the migration; a new one was created with the policy applied:

[screenshot: new migration running with the applied policy]

rbo commented 3 months ago

I applied the policy to a bunch of namespaces:

% oc get namespaces -l mig-pol-allow-auto-conv-allow-post-copy=true
NAME                          STATUS   AGE
demo-cluster-disco            Active   227d
rbohne-hcp-rhods              Active   242d
rbohne-hcp-sendling-ingress   Active   223d
rbohne-sno                    Active   169d
stormshift-ocp1-infra         Active   76d
stormshift-ocp11-infra        Active   48d
%
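
Labelling them can be done with a simple loop, e.g. (sketch; namespace list taken from the output above):

for ns in demo-cluster-disco rbohne-hcp-rhods rbohne-hcp-sendling-ingress \
          rbohne-sno stormshift-ocp1-infra stormshift-ocp11-infra; do
  oc label namespace "$ns" mig-pol-allow-auto-conv-allow-post-copy=true --overwrite
done
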
rbo commented 3 months ago

ODF is stuck...

Cannot evict the OSD pods on inf44...

sh-5.1$ ceph -s       
  cluster:
    id:     5545f168-c66b-4a82-9bd1-f932695df518
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            1/3 mons down, quorum a,b
            Reduced data availability: 67 pgs inactive, 67 pgs peering
            634 slow ops, oldest one blocked for 50024 sec, daemons [osd.0,osd.2,osd.4,osd.5,mon.a] have slow ops.

  services:
    mon: 3 daemons, quorum a,b (age 78m), out of quorum: c
    mgr: a(active, since 78m), standbys: b
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 13h), 6 in (since 3M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 201 pgs
    objects: 69.48k objects, 253 GiB
    usage:   757 GiB used, 1.9 TiB / 2.6 TiB avail
    pgs:     33.333% pgs not active
             134 active+clean
             67  peering

sh-5.1$ ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 1/3 mons down, quorum a,b; Reduced data availability: 67 pgs inactive, 67 pgs peering; 634 slow ops, oldest one blocked for 50029 sec, daemons [osd.0,osd.2,osd.4,osd.5,mon.a] have slow ops.
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 3 slow metadata IOs are blocked > 30 secs, oldest blocked for 2963 secs
[WRN] MON_DOWN: 1/3 mons down, quorum a,b
    mon.c (rank 2) addr v2:172.30.122.49:3300/0 is down (out of quorum)
[WRN] PG_AVAILABILITY: Reduced data availability: 67 pgs inactive, 67 pgs peering
    pg 1.2 is stuck peering for 13h, current state peering, last acting [2,0,1]
    pg 1.d is stuck peering for 13h, current state peering, last acting [2,0,1]
    pg 1.1d is stuck peering for 14h, current state peering, last acting [0,1,5]
    pg 1.1f is stuck peering for 14h, current state peering, last acting [0,1,2]
    pg 1.27 is stuck peering for 13h, current state peering, last acting [5,0,1]
    pg 1.28 is stuck peering for 13h, current state peering, last acting [5,0,1]
    pg 1.2c is stuck peering for 14h, current state peering, last acting [2,0,4]
    pg 1.2f is stuck peering for 14h, current state peering, last acting [0,1,5]
    pg 1.30 is stuck peering for 14h, current state peering, last acting [0,5,1]
    pg 1.31 is stuck peering for 13h, current state peering, last acting [2,1,0]
    pg 1.36 is stuck peering for 14h, current state peering, last acting [0,5,1]
    pg 1.39 is stuck peering for 14h, current state peering, last acting [0,5,4]
    pg 3.1 is stuck peering for 14h, current state peering, last acting [0,2,4]
    pg 3.3 is stuck peering for 13h, current state peering, last acting [2,0,4]
    pg 3.6 is stuck peering for 14h, current state peering, last acting [0,2,4]
    pg 4.4 is stuck peering for 21h, current state peering, last acting [0,1,2]
    pg 4.5 is stuck peering for 14h, current state peering, last acting [5,1,0]
    pg 4.7 is stuck peering for 17h, current state peering, last acting [0,1,2]
    pg 5.2 is stuck peering for 13h, current state peering, last acting [5,0,1]
    pg 5.3 is stuck peering for 21h, current state peering, last acting [0,5,1]
    pg 5.5 is stuck peering for 14h, current state peering, last acting [0,2,1]
    pg 6.0 is stuck peering for 15h, current state peering, last acting [0,5,1]
    pg 6.3 is stuck peering for 13h, current state peering, last acting [5,4,0]
    pg 6.5 is stuck peering for 13h, current state peering, last acting [5,1,0]
    pg 7.4 is stuck peering for 18h, current state peering, last acting [0,5,4]
    pg 7.6 is stuck peering for 21h, current state peering, last acting [0,2,4]
    pg 8.1 is stuck peering for 17h, current state peering, last acting [0,4,2]
    pg 8.7 is stuck inactive for 13h, current state peering, last acting [0,1,2]
    pg 9.0 is stuck peering for 14h, current state peering, last acting [0,2,4]
    pg 10.6 is stuck peering for 14h, current state peering, last acting [5,4,0]
    pg 10.8 is stuck peering for 21h, current state peering, last acting [0,1,5]
    pg 10.9 is stuck peering for 14h, current state peering, last acting [5,0,1]
    pg 10.a is stuck peering for 14h, current state peering, last acting [2,4,0]
    pg 10.c is stuck peering for 14h, current state peering, last acting [2,4,0]
    pg 10.d is stuck peering for 14h, current state peering, last acting [2,0,1]
    pg 10.e is stuck peering for 14h, current state peering, last acting [5,0,4]
    pg 11.0 is stuck peering for 15h, current state peering, last acting [0,2,1]
    pg 11.2 is stuck peering for 16h, current state peering, last acting [0,5,4]
    pg 11.6 is stuck peering for 18h, current state peering, last acting [0,5,1]
    pg 11.7 is stuck peering for 13h, current state peering, last acting [5,0,1]
    pg 11.c is stuck peering for 15h, current state peering, last acting [0,2,4]
    pg 11.14 is stuck peering for 13h, current state peering, last acting [2,0,4]
    pg 11.15 is stuck peering for 17h, current state peering, last acting [0,2,1]
    pg 12.0 is stuck peering for 14h, current state peering, last acting [5,0,4]
    pg 12.6 is stuck peering for 19h, current state peering, last acting [0,2,4]
    pg 12.b is stuck peering for 19h, current state peering, last acting [0,1,2]
    pg 12.c is stuck peering for 21h, current state peering, last acting [0,1,5]
    pg 12.d is stuck peering for 14h, current state peering, last acting [5,0,1]
    pg 12.e is stuck peering for 21h, current state peering, last acting [0,4,5]
    pg 12.f is stuck peering for 14h, current state peering, last acting [5,0,4]
    pg 12.10 is stuck peering for 21h, current state peering, last acting [0,2,1]
[WRN] SLOW_OPS: 634 slow ops, oldest one blocked for 50029 sec, daemons [osd.0,osd.2,osd.4,osd.5,mon.a] have slow ops.
rbo commented 3 months ago

Let's try adding the new nodes ceph10, ceph11 and ceph12 and rebalance the storage.

https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.16/html/scaling_storage/scaling_storage_of_bare_metal_openshift_data_foundation_cluster#scaling-up-storage-by-adding-capacity-to-openshift-data-foundation-nodes-using-local-storage-devices_bare-metal
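
Rough outline of the capacity add (a sketch based on the linked docs; the exact LocalVolumeSet/StorageCluster details need to be checked on the cluster):

# Make the new nodes eligible for ODF placement
oc label node ceph10 ceph11 ceph12 cluster.ocs.openshift.io/openshift-storage=""
# Then add capacity per the linked docs (extend the LocalVolumeSet / bump the
# storageDeviceSets count on the StorageCluster) and watch the new OSDs come up
oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide -w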

rbo commented 3 months ago
% oc get pods -o wide -l app=rook-ceph-osd-prepare
NAME                                                           READY   STATUS             RESTARTS        AGE     IP             NODE     NOMINATED NODE   READINESS GATES
rook-ceph-osd-prepare-0504d86b7040a0af248baceaa6d2a70a-mb6np   0/1     CrashLoopBackOff   5 (2m38s ago)   5m56s   10.130.16.16   ceph10   <none>           <none>
rook-ceph-osd-prepare-0c7caa1ac0cc8efbb1205c35c8d932ca-8rwcp   0/1     CrashLoopBackOff   5 (2m30s ago)   5m55s   10.128.24.23   ceph12   <none>           <none>
rook-ceph-osd-prepare-0cb23090e0e9580b58d82178a9312625-q8kv2   0/1     CrashLoopBackOff   5 (2m25s ago)   5m56s   10.128.24.22   ceph12   <none>           <none>
rook-ceph-osd-prepare-0dff7b5649971c7463f6d3d1b0e75ffe-gzp6b   0/1     CrashLoopBackOff   5 (2m31s ago)   5m55s   10.131.16.12   ceph11   <none>           <none>
rook-ceph-osd-prepare-4cfc0cdb9cda58ca06d1468537e4a1af-rgwcd   0/1     CrashLoopBackOff   5 (2m19s ago)   5m53s   10.131.16.13   ceph11   <none>           <none>
rook-ceph-osd-prepare-9425b6f55cbbb5c6dae8a85e0555cdb0-pmmhk   0/1     CrashLoopBackOff   5 (2m42s ago)   5m54s   10.130.16.17   ceph10   <none>           <none>
%

The prepare pods keep disappearing.
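
Before they vanish, the crash reason could presumably be pulled from one of the prepare pods, e.g. (sketch; assuming the current project is openshift-storage, as the listing above suggests):

oc logs -n openshift-storage rook-ceph-osd-prepare-0504d86b7040a0af248baceaa6d2a70a-mb6np --previous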

Grr, I want to finish the update today (4.15.xx), so let's 🔨 reboot inf44 to force the drain. 😢

rbo commented 3 months ago

inf44 did not come back online:

Aug 02 11:46:25 inf44 crio[13880]: time="2024-08-02 11:46:25.952115702Z" level=warning msg="Error encountered when checking whether cri-o should wipe containers: open /var/run/crio/version: no such file or directory"
Aug 02 11:46:25 inf44 crio[13880]: time="2024-08-02 11:46:25.952336785Z" level=info msg="Registered SIGHUP reload watcher"
Aug 02 11:46:25 inf44 crio[13880]: time="2024-08-02 11:46:25.952493035Z" level=fatal msg="Failed to start streaming server: listen tcp [2620:52:0:2060:77c:8bd7:a07c:5017]:0: bind: cannot assign requested address"
Aug 02 11:46:25 inf44 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 11:46:25 inf44 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 02 11:46:25 inf44 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).

CRI-O and IPv6 🤦🏼

rbo commented 3 months ago
[root@inf44 ~]# systemctl cat crio | tail
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"

# /etc/systemd/system/crio.service.d/10-mco-profile-unix-socket.conf
[Service]
Environment="ENABLE_PROFILE_UNIX_SOCKET=true"

# /etc/systemd/system/crio.service.d/20-nodenet.conf
[Service]
Environment="CONTAINER_STREAM_ADDRESS=2620:52:0:2060:77c:8bd7:a07c:5017"
[root@inf44 ~]#
rbo commented 3 months ago

First problems caused by #188.

rbo commented 3 months ago

Disable IPv6:

[root@inf44 etc]# nmcli connection show da1c0030-8c36-3895-9dda-32d13fcf0eaf | grep interface
connection.interface-name:              eno1
[root@inf44 etc]# nmcli connection show
NAME                UUID                                  TYPE           DEVICE
ovs-if-br-ex        5041b789-83ee-41bc-a243-e3e779559b08  ovs-interface  br-ex
lo                  3862ec1b-6e5a-4805-bebf-74392ffdf1df  loopback       lo
Wired connection 1  d029ea06-2680-34dc-bc3b-c2cea6ce2836  ethernet       eno2
br-ex               7f1ce127-575c-4619-b65f-017d9dce6b62  ovs-bridge     br-ex
coe-br-vlan-69      08ac4b8f-46e3-4e10-ab69-1700dc453527  bridge         coe-br-vlan-69
coe-bridge          c4b1040f-cb5c-440b-92a1-b90c521724b2  bridge         coe-bridge
eno2.69             3d5fc333-a714-4de9-8055-462370391fb1  vlan           eno2.69
ovs-if-phys0        e749d6cd-a03f-4368-95f7-e6c6afa17115  ethernet       eno1
ovs-port-br-ex      962ea91c-1e94-4e4b-a1f2-2775057d2a91  ovs-port       br-ex
ovs-port-phys0      e9cb92da-7932-4bbe-b572-9546fb9bef19  ovs-port       eno1
Wired connection 2  da1c0030-8c36-3895-9dda-32d13fcf0eaf  ethernet       --
[root@inf44 etc]# nmcli connection show da1c0030-8c36-3895-9dda-32d13fcf0eaf | grep interface
connection.interface-name:              eno1
[root@inf44 etc]# nmcli connection modify da1c0030-8c36-3895-9dda-32d13fcf0eaf ipv6.method "disabled"
[root@inf44 etc]# reboot
[root@inf44 etc]# Connection to 10.32.96.44 closed by remote host.
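
After the reboot, a quick sanity check that IPv6 is really off on that connection (sketch):

nmcli -g ipv6.method connection show da1c0030-8c36-3895-9dda-32d13fcf0eaf   # expect: disabled
ip -6 addr show dev eno1                                                    # expect: no global 2620:52:... address
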
rbo commented 3 months ago

Started the upgrade to 4.16.4; inf8 got stuck because of the same IPv6 issue as inf44.

Disabled IPv6 on inf8.
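
For completeness, the 4.16 hop itself is the same kind of CLI sequence as the 4.15 step earlier (sketch):

oc adm upgrade channel stable-4.16
oc adm upgrade --to=4.16.4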

rbo commented 3 months ago

Disabled IPv6.

rbo commented 3 months ago

Update done, IPv6 disabled.