openshift-kni / performance-addon-operators

Operators related to optimizing OpenShift clusters for applications sensitive to cpu and network latency
Apache License 2.0

RT Kernel is not installed and cluster-wait-for-mcp errors after retries #64

Closed gsr-shanks closed 4 years ago

gsr-shanks commented 4 years ago
  1. Get the latest from this repo. (commit 271c0451c7d327f4cd7868f530201555e9d410a2)
  2. export FULL_REGISTRY_IMAGE=quay.io/openshift-kni/performance-addon-operator-registry:latest
  3. oc label --overwrite node/worker-1 node-role.kubernetes.io/worker-rt=""
  4. make cluster-deploy cluster-wait-for-mcp

Actual: The RT kernel is not installed, and cluster-wait-for-mcp fails after exhausting its retries.
Expected: The RT kernel is installed on the worker-1 node.

Additional info: the performance-operator deployment has "Image: REPLACE_IMAGE", which looks incorrect.


# oc label --overwrite node/worker-1 node-role.kubernetes.io/worker-rt=""
node/worker-1 labeled

# oc get node
NAME       STATUS                     ROLES              AGE    VERSION
master-0   Ready                      master             3d3h   v1.17.1
master-1   Ready                      master             3d3h   v1.17.1
master-2   Ready                      master             3d3h   v1.17.1
worker-0   Ready,SchedulingDisabled   worker             3d3h   v1.17.1
worker-1   Ready                      worker,worker-rt   3d3h   v1.17.1
worker-2   Ready                      worker             3d3h   v1.17.1

# make cluster-deploy cluster-wait-for-mcp
Deploying operator
FULL_REGISTRY_IMAGE=quay.io/openshift-kni/performance-addon-operator-registry:latest hack/deploy.sh
Deploying using image quay.io/openshift-kni/performance-addon-operator-registry:latest.
[INFO] Deploying performance operator and profile.
namespace/openshift-performance-addon created
machineconfigpool.machineconfiguration.openshift.io/worker-rt created
operatorgroup.operators.coreos.com/openshift-performance-addon-operatorgroup created
catalogsource.operators.coreos.com/performance-addon-operator-catalogsource created
subscription.operators.coreos.com/performance-addon-operator-subscription created
performanceprofile.performance.openshift.io/ci created
[INFO] Deployment successful.
Waiting for MCP to be updated
hack/wait-for-mcp.sh
[INFO] Waiting 5 min for letting the operator do its work
[INFO] Unpausing  MCPs
machineconfigpool.machineconfiguration.openshift.io/master patched (no change)
machineconfigpool.machineconfiguration.openshift.io/worker patched (no change)
machineconfigpool.machineconfiguration.openshift.io/worker-rt patched
[INFO] Checking if MCP picked up the performance MC
[INFO] Performace MC not picked up yet. 89 retries left.
[INFO] Unpausing  MCPs
...
machineconfigpool.machineconfiguration.openshift.io/master patched (no change)
machineconfigpool.machineconfiguration.openshift.io/worker patched (no change)
machineconfigpool.machineconfiguration.openshift.io/worker-rt patched (no change)
[INFO] Checking if MCP picked up the performance MC
[INFO] Performace MC not picked up yet. 0 retries left.
[ERROR] MCP failed, giving up.
make: *** [Makefile:111: cluster-wait-for-mcp] Error 1
---

# oc describe mcp worker-rt
...
Status:
  Conditions:
    Last Transition Time:  2020-01-30T10:28:06Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2020-01-30T10:28:11Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2020-01-30T10:28:11Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-01-30T10:34:23Z
    Message:               All nodes are updated with rendered-worker-rt-12698eb75a672ef94808584c61f26689
    Reason:
    Status:                True
    Type:                  Updated
    Last Transition Time:  2020-01-30T10:34:23Z
    Message:
    Reason:
    Status:                False
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-rt-12698eb75a672ef94808584c61f26689
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker-chronyd-custom
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   98-worker-d818316e-b208-46f6-a4eb-cd91fb6cd2c9-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   98-worker-rt-91378b6d-265a-4a1f-a398-f6c93a6a81e1-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-d818316e-b208-46f6-a4eb-cd91fb6cd2c9-registries
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-ssh
  Degraded Machine Count:     0
  Machine Count:              1
  Observed Generation:        3
  Ready Machine Count:        1
  Unavailable Machine Count:  0
  Updated Machine Count:      1
Events:                       <none>
---

# oc describe node worker-1
Name:               worker-1
Roles:              worker,worker-rt
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=worker-1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node-role.kubernetes.io/worker-rt=
                    node.openshift.io/os_id=rhcos
Annotations:        machine.openshift.io/machine: openshift-machine-api/ostest-worker-0-6wrjf
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-rt-12698eb75a672ef94808584c61f26689
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-rt-12698eb75a672ef94808584c61f26689
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 27 Jan 2020 02:20:47 -0500
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  worker-1
  AcquireTime:     <unset>
  RenewTime:       Thu, 30 Jan 2020 05:53:48 -0500
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 30 Jan 2020 05:49:18 -0500   Thu, 30 Jan 2020 05:34:08 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 30 Jan 2020 05:49:18 -0500   Thu, 30 Jan 2020 05:34:08 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 30 Jan 2020 05:49:18 -0500   Thu, 30 Jan 2020 05:34:08 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 30 Jan 2020 05:49:18 -0500   Thu, 30 Jan 2020 05:34:18 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.111.24
  Hostname:    worker-1
Capacity:
  cpu:                4
  ephemeral-storage:  19876Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8163772Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  18757346888
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7549372Ki
  pods:               250
System Info:
  Machine ID:                             e4895d21ba764561a99b3a93fb89b41e
  System UUID:                            e4895d21-ba76-4561-a99b-3a93fb89b41e
  Boot ID:                                1ee90ad2-61a0-4e50-b31b-5361395306af
  Kernel Version:                         4.18.0-147.3.1.el8_1.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 44.81.202001240931.0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.16.2-6.dev.rhaos4.3.git9e3db66.el8
  Kubelet Version:                        v1.17.1
  Kube-Proxy Version:                     v1.17.1
Non-terminated Pods:                      (11 in total)
  Namespace                               Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                               ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-n8q9l                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3d3h
  openshift-dns                           dns-default-wqgkz                  110m (3%)     0 (0%)      70Mi (0%)        512Mi (6%)     3d3h
  openshift-ingress                       router-default-65cdbc7767-946c8    100m (2%)     0 (0%)      256Mi (3%)       0 (0%)         20m
  openshift-kni-infra                     coredns-worker-1                   100m (2%)     0 (0%)      200Mi (2%)       0 (0%)         3d3h
  openshift-kni-infra                     keepalived-worker-1                250m (7%)     0 (0%)      1224Mi (16%)     0 (0%)         3d3h
  openshift-kni-infra                     mdns-publisher-worker-1            100m (2%)     0 (0%)      200Mi (2%)       0 (0%)         3d3h
  openshift-machine-config-operator       machine-config-daemon-g5shv        40m (1%)      0 (0%)      100Mi (1%)       0 (0%)         3d3h
  openshift-monitoring                    node-exporter-kxhtv                112m (3%)     0 (0%)      200Mi (2%)       0 (0%)         3d3h
  openshift-multus                        multus-nzdpz                       10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         3d3h
  openshift-sdn                           ovs-28px4                          200m (5%)     0 (0%)      400Mi (5%)       0 (0%)         3d3h
  openshift-sdn                           sdn-g97xg                          100m (2%)     0 (0%)      200Mi (2%)       0 (0%)         3d3h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1132m (32%)   0 (0%)
  memory             3050Mi (41%)  512Mi (6%)
  ephemeral-storage  0 (0%)        0 (0%)
Events:
  Type    Reason                   Age                From               Message
  ----    ------                   ----               ----               -------
  Normal  NodeNotSchedulable       20m (x2 over 2d)   kubelet, worker-1  Node worker-1 status is now: NodeNotSchedulable
  Normal  Starting                 19m                kubelet, worker-1  Starting kubelet.
  Normal  NodeHasSufficientMemory  19m (x8 over 19m)  kubelet, worker-1  Node worker-1 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    19m (x7 over 19m)  kubelet, worker-1  Node worker-1 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     19m (x8 over 19m)  kubelet, worker-1  Node worker-1 status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  19m                kubelet, worker-1  Updated Node Allocatable limit across pods
---

# oc describe catalogsource performance-addon-operator-catalogsource
...
Spec:
  Display Name:  Openshift Performance Addon Operator
  Icon:
    base64data:
    Mediatype:
  Image:         quay.io/openshift-kni/performance-addon-operator-registry:latest
  Publisher:     Red Hat
  Source Type:   grpc
Status:
  Connection State:
    Address:              performance-addon-operator-catalogsource.openshift-marketplace.svc:50051
    Last Connect:         2020-01-30T10:55:35Z
    Last Observed State:  READY
  Registry Service:
    Created At:         2020-01-30T10:55:13Z
    Port:               50051
    Protocol:           grpc
    Service Name:       performance-addon-operator-catalogsource
    Service Namespace:  openshift-marketplace
Events:                 <none>
---

# oc describe pod performance-operator-77f977d576-zszsd
Name:         performance-operator-77f977d576-zszsd
Namespace:    openshift-performance-addon
Priority:     0
Node:         worker-2/192.168.111.25
Start Time:   Thu, 30 Jan 2020 05:33:07 -0500
Labels:       name=performance-operator
              pod-template-hash=77f977d576
Annotations:  alm-examples:
                [
                  {
                    "apiVersion": "performance.openshift.io/v1alpha1",
                    "kind": "PerformanceProfile",
                    "metadata": {
                      "name": "example-performanceprofile"
                    },
                    "spec": {
                      "cpu": {
                        "isolated": "2-3",
                        "nonIsolated": "0",
                        "reserved": "0-1"
                      },
                      "hugepages": {
                        "defaultHugepagesSize": "1G",
                        "pages": [
                          {
                            "count": 2,
                            "size": "1G"
                          }
                        ]
                      },
                      "nodeSelector": {
                        "node-role.kubernetes.io/performance": ""
                      },
                      "realTimeKernel": {
                        "repoURL": "http://<snip>/rhel-8/nightly/RHEL-8/latest-RHEL-8.1.1/compose/RT/x86_64/os"
                      }
                    }
                  }
                ]
              capabilities: Basic Install
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.16"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.128.2.1"
                    ]
                }]
              olm.operatorGroup: openshift-performance-addon-operatorgroup
              olm.operatorNamespace: openshift-performance-addon
              olm.targetNamespaces: openshift-performance-addon
              openshift.io/scc: restricted
Status:       Pending
IP:           10.128.2.16
IPs:
  IP:           10.128.2.16
Controlled By:  ReplicaSet/performance-operator-77f977d576
Containers:
  performance-operator:
    Container ID:
    Image:         REPLACE_IMAGE
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      performance-operator
    State:          Waiting
      Reason:       InvalidImageName
    Ready:          False
    Restart Count:  0
    Environment:
      WATCH_NAMESPACE:   (v1:metadata.annotations['olm.targetNamespaces'])
      POD_NAME:         performance-operator-77f977d576-zszsd (v1:metadata.name)
      OPERATOR_NAME:    performance-operator
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from performance-operator-token-djznl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  performance-operator-token-djznl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  performance-operator-token-djznl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason         Age                   From               Message
  ----     ------         ----                  ----               -------
  Normal   Scheduled      <unknown>             default-scheduler  Successfully assigned openshift-performance-addon/performance-operator-77f977d576-zszsd to worker-2
  Warning  Failed         14m (x49 over 24m)    kubelet, worker-2  Error: InvalidImageName
  Warning  InspectFailed  4m23s (x95 over 24m)  kubelet, worker-2  Failed to apply default image tag "REPLACE_IMAGE": couldn't parse image reference "REPLACE_IMAGE": invalid reference format: repository name must be lowercase
---

# oc describe subscription performance-addon-operator-subscription
Name:         performance-addon-operator-subscription
Namespace:    openshift-performance-addon
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"annotations":{},"name":"performance-addon-operator-subscr...
API Version:  operators.coreos.com/v1alpha1
Kind:         Subscription
Metadata:
  Creation Timestamp:  2020-01-30T10:28:01Z
  Generation:          1
  Resource Version:    1576552
  Self Link:           /apis/operators.coreos.com/v1alpha1/namespaces/openshift-performance-addon/subscriptions/performance-addon-operator-subscription
  UID:                 51cc44e7-51be-474c-8cda-ef47bb5f3a63
Spec:
  Channel:           alpha
  Name:              performance-addon-operator
  Source:            performance-addon-operator-catalogsource
  Source Namespace:  openshift-marketplace
Status:
  Catalog Health:
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              certified-operators
      Namespace:         openshift-marketplace
      Resource Version:  1557284
      UID:               dc9b971d-b38a-475f-a4cf-b0ed444b6e3c
    Healthy:             true
    Last Updated:        2020-01-30T10:28:37Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              community-operators
      Namespace:         openshift-marketplace
      Resource Version:  1557285
      UID:               a97e3679-2e30-42f2-b210-9c436d6c913d
    Healthy:             true
    Last Updated:        2020-01-30T10:28:37Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              performance-addon-operator-catalogsource
      Namespace:         openshift-marketplace
      Resource Version:  1576461
      UID:               de4e07f1-6e30-49ab-9a09-75bf17c14960
    Healthy:             true
    Last Updated:        2020-01-30T10:28:37Z
    Catalog Source Ref:
      API Version:       operators.coreos.com/v1alpha1
      Kind:              CatalogSource
      Name:              redhat-operators
      Namespace:         openshift-marketplace
      Resource Version:  1557283
      UID:               aad2a0f6-0b1f-4241-a842-c5aaabb75b51
    Healthy:             true
    Last Updated:        2020-01-30T10:28:37Z
  Conditions:
    Last Transition Time:  2020-01-30T10:28:37Z
    Message:               all available catalogsources are healthy
    Reason:                AllCatalogSourcesHealthy
    Status:                False
    Type:                  CatalogSourcesUnhealthy
  Current CSV:             performance-addon-operator.v0.0.1
  Install Plan Ref:
    API Version:       operators.coreos.com/v1alpha1
    Kind:              InstallPlan
    Name:              install-nwh2m
    Namespace:         openshift-performance-addon
    Resource Version:  1576472
    UID:               b20645ad-6605-4e6b-852a-42ee1c248d19
  Installed CSV:       performance-addon-operator.v0.0.1
  Installplan:
    API Version:  operators.coreos.com/v1alpha1
    Kind:         InstallPlan
    Name:         install-nwh2m
    Uuid:         b20645ad-6605-4e6b-852a-42ee1c248d19
  Last Updated:   2020-01-30T10:28:43Z
  State:          AtLatestKnown
Events:           <none>
---

# oc describe deployment.apps/performance-operator
Name:                   performance-operator
Namespace:              openshift-performance-addon
CreationTimestamp:      Thu, 30 Jan 2020 05:28:41 -0500
Labels:                 olm.owner=performance-addon-operator.v0.0.1
                        olm.owner.kind=ClusterServiceVersion
                        olm.owner.namespace=openshift-performance-addon
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               name=performance-operator
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           name=performance-operator
  Annotations:      alm-examples:
                      [
                        {
                          "apiVersion": "performance.openshift.io/v1alpha1",
                          "kind": "PerformanceProfile",
                          "metadata": {
                            "name": "example-performanceprofile"
                          },
                          "spec": {
                            "cpu": {
                              "isolated": "2-3",
                              "nonIsolated": "0",
                              "reserved": "0-1"
                            },
                            "hugepages": {
                              "defaultHugepagesSize": "1G",
                              "pages": [
                                {
                                  "count": 2,
                                  "size": "1G"
                                }
                              ]
                            },
                            "nodeSelector": {
                              "node-role.kubernetes.io/performance": ""
                            },
                            "realTimeKernel": {
                              "repoURL": "http://download-node-02.eng.bos.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.1.1/compose/RT/x86_64/os"
                            }
                          }
                        }
                      ]
                    capabilities: Basic Install
                    olm.operatorGroup: openshift-performance-addon-operatorgroup
                    olm.operatorNamespace: openshift-performance-addon
                    olm.targetNamespaces: openshift-performance-addon
  Service Account:  performance-operator
  Containers:
   performance-operator:
    Image:      REPLACE_IMAGE
    Port:       <none>
    Host Port:  <none>
    Command:
      performance-operator
    Environment:
      WATCH_NAMESPACE:   (v1:metadata.annotations['olm.targetNamespaces'])
      POD_NAME:          (v1:metadata.name)
      OPERATOR_NAME:    performance-operator
    Mounts:             <none>
  Volumes:              <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded
OldReplicaSets:  <none>
NewReplicaSet:   performance-operator-77f977d576 (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  46m   deployment-controller  Scaled up replica set performance-operator-77f977d576 to 1

cynepco3hahue commented 4 years ago

OK, the problem is with the image:

Containers:
  performance-operator:
    Container ID:
    Image:         REPLACE_IMAGE
Events:
  Type     Reason         Age                   From               Message
  ----     ------         ----                  ----               -------
  Normal   Scheduled      <unknown>             default-scheduler  Successfully assigned openshift-performance-addon/performance-operator-77f977d576-zszsd to worker-2
  Warning  Failed         14m (x49 over 24m)    kubelet, worker-2  Error: InvalidImageName
  Warning  InspectFailed  4m23s (x95 over 24m)  kubelet, worker-2  Failed to apply default image tag "REPLACE_IMAGE": couldn't parse image reference "REPLACE_IMAGE": invalid reference format: repository name must be lowercase

@slintes When should it be replaced with the real image?

slintes commented 4 years ago

That is strange. It was fixed already, and when I look into the registry image, it looks fine :thinking:

$ docker run -it --entrypoint /bin/bash quay.io/openshift-kni/performance-addon-operator-registry:latest
bash-4.2$ cat performance-addon-operator-catalog/performance-addon-operator/0.0.1/performance-addon-operator.v0.0.1.clusterserviceversion.yaml | grep image
                image: quay.io/openshift-kni/performance-addon-operator:latest

it is replaced here: https://github.com/openshift-kni/performance-addon-operators/blob/master/openshift-ci/Dockerfile.registry.upstream.dev#L6

@gsr-shanks is that an old cluster? Maybe an old registry image was cached somewhere?
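
The thread does not show the contents of the Dockerfile step linked above, so the following is only a minimal sketch of what such a placeholder substitution could look like, assuming a plain text replacement in the CSV file. The function names `replace_image_placeholder` and `has_image_placeholder` are hypothetical and are not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from the repository.

# Replace every REPLACE_IMAGE placeholder in a manifest with a real image reference.
replace_image_placeholder() {
  local manifest="$1" image="$2" tmp
  tmp="$(mktemp)"
  # '|' is the sed delimiter, so the slashes in the image reference need no escaping.
  sed "s|REPLACE_IMAGE|${image}|g" "$manifest" > "$tmp" && mv "$tmp" "$manifest"
}

# Succeed (exit 0) if the manifest still contains the placeholder.
has_image_placeholder() {
  grep -q 'REPLACE_IMAGE' "$1"
}
```

Running a check like `has_image_placeholder` against the CSV before building the registry image would surface a leftover placeholder at build time instead of as an `InvalidImageName` pod error.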

slintes commented 4 years ago

and it works on my cluster...

this is how you can check for the correct image version:

  1. check latest in quay.io:
    $ skopeo inspect --tls-verify=false docker://quay.io/openshift-kni/performance-addon-operator-registry | grep Digest
    "Digest": "sha256:9fa623b37cd308f733522ebd181c6fb719401aed30fae0eb90ff476f984a542c",
  2. check digest in cluster:

    a) find the right node where the catalogsource runs:

    $ k -n openshift-marketplace get pod performance-addon-operator-catalogsource-zgzps -o=wide
    NAME                                             READY   STATUS    RESTARTS   AGE   IP           NODE                                                    NOMINATED NODE   READINESS GATES
    performance-addon-operator-catalogsource-zgzps   1/1     Running   3          27m   10.131.0.3   slinte-gwx5r-w-a-5k7ds.c.openshift-gce-devel.internal   <none>           <none>

    b) get into that node and check digest of registry image:

    $ oc debug node/slinte-gwx5r-w-a-5k7ds.c.openshift-gce-devel.internal
    Starting pod/slinte-gwx5r-w-a-5k7dscopenshift-gce-develinternal-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.0.32.2
    If you don't see a command prompt, try pressing enter.
    sh-4.2# chroot /host
    sh-4.4# crictl images --digests | grep perf
    quay.io/openshift-kni/performance-addon-operator-registry   latest              9fa623b37cd30       fa6a82e5a07bd       440MB
    quay.io/openshift-kni/performance-addon-operator            latest              7928c9dc2850c       d0ef957373102       249MB

    c) in this case, the digest 9fa623b37cd30 matches
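
The comparison in the steps above can be scripted. This is a rough sketch under a few assumptions: the helper names are made up for illustration, `registry_digest` expects `skopeo inspect` output on stdin and takes the first "Digest" entry it finds, and `cached_digest` expects the column layout shown in the `crictl images --digests` output above (image, tag, digest, image ID, size).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing the registry digest with the cached one.

# Read `skopeo inspect` output on stdin and print the first digest
# without its "sha256:" prefix.
registry_digest() {
  sed -n 's/.*"Digest": "sha256:\([0-9a-f]*\)".*/\1/p' | head -n 1
}

# Read `crictl images --digests` output on stdin and print the shortened
# digest column for the given repository and tag.
cached_digest() {
  awk -v repo="$1" -v tag="$2" '$1 == repo && $2 == tag { print $3 }'
}

# Succeed if the shortened digest is a non-empty prefix of the full digest.
digest_matches() {
  local full="$1" short="$2"
  [[ -n "$short" && "$full" == "$short"* ]]
}
```

On a node, the pieces would be fed from the real commands, for example `skopeo inspect ... | registry_digest` and `crictl images --digests | cached_digest quay.io/openshift-kni/performance-addon-operator-registry latest`.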

edit: this is the big disadvantage of using the latest tag: you never know for sure what you get. But I have no better idea yet that avoids changing the tag in the deploy repo every time. I will check whether it is possible to let the CatalogSource always pull the image.

slintes commented 4 years ago

@gsr-shanks you can try this, but I'm not 100% sure if it works:

k -n openshift-marketplace edit catalogsource performance-addon-operator-catalogsource

and add this to the spec:

  updateStrategy:
    registryPoll:
      interval: 1m

edit: I think you need to remove it once the performance-addon-operator-catalogsource-xxx pod has been restarted, otherwise it will be restarted every minute
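
As a non-interactive alternative to `edit`, the same spec change could probably be applied with `oc patch`. The sketch below only builds the JSON merge patch, and the cluster commands are left as comments. The function name `registry_poll_patch` and the 30m default are illustrative assumptions, and (as noted above) it is not confirmed that this setting forces a fresh pull.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a JSON merge patch that sets
# spec.updateStrategy.registryPoll.interval on a CatalogSource.
registry_poll_patch() {
  local interval="${1:-30m}"   # 30m is an arbitrary illustrative default
  printf '{"spec":{"updateStrategy":{"registryPoll":{"interval":"%s"}}}}\n' "$interval"
}

# Usage against a live cluster (not executed here):
#   oc -n openshift-marketplace patch catalogsource \
#     performance-addon-operator-catalogsource \
#     --type merge -p "$(registry_poll_patch 1m)"
#
# To drop the setting again once the catalogsource pod has restarted:
#   oc -n openshift-marketplace patch catalogsource \
#     performance-addon-operator-catalogsource \
#     --type json -p '[{"op":"remove","path":"/spec/updateStrategy"}]'
```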

gsr-shanks commented 4 years ago

@slintes Is this how we are recommending customers to update their image?

slintes commented 4 years ago

no, customers won't use latest tag

gsr-shanks commented 4 years ago

Ok, I guess then we are using latest since we don't have tags yet.

Also, shouldn't removing the worker-rt label or running make cluster-clean remove these images?

slintes commented 4 years ago

> Ok, I guess then we are using latest since we don't have tags yet.

No clue whether we will ever have tags upstream. We are mainly using the images for CI in the deploy repo, but we do not want to update the tag used over there for every new operator version. And because CI always spins up a new cluster, caching is no issue there.

> Also, should not removing worker-rt label

No. The images are cached on the node by cri-o.

> or make cluster-clean remove these images?

The clean script would have to oc debug node/.... into every node and delete the image using crictl... not sure if it's easy to implement (scripting oc debug) and if we want that.
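
For illustration only, scripting `oc debug` for this could look roughly like the sketch below. It is not part of the repository's clean script; the function name `remove_cached_image` and the `DRY_RUN` switch are assumptions made for this example. Node names are read from stdin in the `node/<name>` form, and by default the commands are only printed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: remove a cached image from every node listed on stdin.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
remove_cached_image() {
  local image="$1" node
  while read -r node; do
    [[ -z "$node" ]] && continue
    local cmd=(oc debug "$node" -- chroot /host crictl rmi "$image")
    if [[ "${DRY_RUN:-1}" == "1" ]]; then
      printf '%s\n' "${cmd[*]}"
    else
      # Redirect stdin so oc does not consume the remaining node names.
      "${cmd[@]}" </dev/null
    fi
  done
}

# Usage against a live cluster (not executed here):
#   oc get nodes -l node-role.kubernetes.io/worker -o name |
#     DRY_RUN=0 remove_cached_image quay.io/openshift-kni/performance-addon-operator-registry:latest
```

Keeping the dry run as the default makes it easy to review the generated commands before anything is deleted on the nodes.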

slintes commented 4 years ago

An idea: can you clean up the cluster, please? And then patch the catalogsource before deploying it, as described above, but with a long interval, not 1 minute? I think that will already trigger a pull of the image when it is deployed. If that works, we can add it to our manifests.

gsr-shanks commented 4 years ago

I actually deleted the images from the worker nodes and am retrying now. I will try your idea if I can reproduce it.

gsr-shanks commented 4 years ago

I deleted all the performance-addon-operator images from the worker nodes using crictl and ran make cluster-deploy cluster-wait-for-mcp. The latest image got pulled all right; however, I still see:

[INFO] Performace MC not picked up yet.
[ERROR] MCP failed, giving up.
make: *** [Makefile:118: cluster-wait-for-mcp] Error 1

I tried your method to verify the image digests, and they look good.

sh-4.4# skopeo inspect --tls-verify=false docker://quay.io/openshift-kni/performance-addon-operator-registry | grep Digest
    "Digest": "sha256:9c1ce47aa434393f2707ce8784a81a0c00e229c79419326b7b58477de3b5b815",
sh-4.4#
sh-4.4# crictl images --digests | grep perf
quay.io/openshift-kni/performance-addon-operator-registry   latest              9c1ce47aa4343       fa6a82e5a07bd       440MB

I looked at https://quay.io/repository/openshift-kni/performance-addon-operator-registry?tag=latest&tab=tags and the SHA looks correct.

What is not looking correct is the image ID SHA in performance-operator pod:

#oc describe pod performance-operator-f475cf684-wnpkk
...
Status:       Running
IP:           10.128.2.22
IPs:
  IP:           10.128.2.22
Controlled By:  ReplicaSet/performance-operator-f475cf684
Containers:
  performance-operator:
    Container ID:  cri-o://c680e116f82ce1e6f7589a9d8c3dd48ad1606e23685192ae3cc380a992602e62
    Image:         quay.io/openshift-kni/performance-addon-operator:latest
    Image ID:      quay.io/openshift-kni/performance-addon-operator@sha256:7928c9dc2850c63456a14af38fe6f31dcda28bf8f3a8c0d0fcc516967e870003
    Port:          <none>
    Host Port:     <none>
    Command:
      performance-operator
    State:          Running
      Started:      Thu, 30 Jan 2020 10:47:50 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      WATCH_NAMESPACE:   (v1:metadata.annotations['olm.targetNamespaces'])
      POD_NAME:         performance-operator-f475cf684-wnpkk (v1:metadata.name)
      OPERATOR_NAME:    performance-operator
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from performance-operator-token-w2bsj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  performance-operator-token-w2bsj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  performance-operator-token-w2bsj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

The performance operator has a different sha

Image ID:      quay.io/openshift-kni/performance-addon-operator@sha256:7928c9dc2850c63456a14af38fe6f31dcda28bf8f3a8c0d0fcc516967e870003

gsr-shanks commented 4 years ago

Another thing that is not clear: why is the image pulled on the worker-0 node when worker-rt is enabled for the worker-1 node?

# oc -n openshift-marketplace get pod performance-addon-operator-catalogsource-h6gkt -o wide
NAME                                             READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
performance-addon-operator-catalogsource-h6gkt   1/1     Running   0          37s   10.131.0.75   worker-0   <none>           <none>
# oc get node
NAME       STATUS                     ROLES              AGE     VERSION
master-0   Ready                      master             4d      v1.17.1
master-1   Ready                      master             4d      v1.17.1
master-2   Ready                      master             4d      v1.17.1
worker-0   Ready,SchedulingDisabled   worker             3d23h   v1.17.1
worker-1   Ready,SchedulingDisabled   worker,worker-rt   3d23h   v1.17.1
worker-2   Ready                      worker             3d23h   v1.17.1

Something looks wrong here.

slintes commented 4 years ago

> What is not looking correct is the image ID SHA in performance-operator pod:

Look at the start time: it's an old pod. Cleanup did not work correctly. You can just delete the pod; it will be recreated automatically.

> why is the image pulled in worker-0 node while the worker-rt is enabled for worker-1 node.

The catalogsource and the operator itself can run on any node; it does not matter. Only the MachineConfigs etc. created by the operator have to be applied on the right node.

gsr-shanks commented 4 years ago

> What is not looking correct is the image ID SHA in performance-operator pod:
>
> look at the start time: it's an old pod. Cleanup did not work correctly. You can just delete the pod, it will be recreated automatically.

Ok, yeah, in our make cluster-clean we do not verify that the pods are removed.

Anyway, I am rebuilding the cluster to deploy downstream builds. If I find something not working there, I will file it in Bugzilla.

> why is the image pulled in worker-0 node while the worker-rt is enabled for worker-1 node.
>
> The catalogsource and the operator itself can run on any node, it does not matter. Only the MachineConfigs etc. created by the operator have to be applied on the right node.

Ok, got it.

gsr-shanks commented 4 years ago

Closing this issue. Thanks.