Ok, the problem is with the image:
Containers:
  performance-operator:
    Container ID:
    Image:  REPLACE_IMAGE
Events:
  Type     Reason         Age                   From               Message
  ----     ------         ----                  ----               -------
  Normal   Scheduled      <unknown>             default-scheduler  Successfully assigned openshift-performance-addon/performance-operator-77f977d576-zszsd to worker-2
  Warning  Failed         14m (x49 over 24m)    kubelet, worker-2  Error: InvalidImageName
  Warning  InspectFailed  4m23s (x95 over 24m)  kubelet, worker-2  Failed to apply default image tag "REPLACE_IMAGE": couldn't parse image reference "REPLACE_IMAGE": invalid reference format: repository name must be lowercase
@slintes When should it be replaced with the real image?
that is strange, it was fixed already, and when I look into the registry image, it looks fine :thinking:
$ docker run -it --entrypoint /bin/bash quay.io/openshift-kni/performance-addon-operator-registry:latest
bash-4.2$ cat performance-addon-operator-catalog/performance-addon-operator/0.0.1/performance-addon-operator.v0.0.1.clusterserviceversion.yaml | grep image
image: quay.io/openshift-kni/performance-addon-operator:latest
it is replaced here: https://github.com/openshift-kni/performance-addon-operators/blob/master/openshift-ci/Dockerfile.registry.upstream.dev#L6
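For reference, the substitution in that Dockerfile amounts to a sed over the CSV at registry build time; a minimal sketch of the idea (the real line is in the linked file; the path and image below are taken from the container listing above):
# sketch: bake the real operator image into the CSV while building the registry image
RUN sed -i "s|REPLACE_IMAGE|quay.io/openshift-kni/performance-addon-operator:latest|g" \
    performance-addon-operator-catalog/performance-addon-operator/0.0.1/performance-addon-operator.v0.0.1.clusterserviceversion.yaml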
@gsr-shanks is that an old cluster? Maybe an old registry image was cached somewhere?
and it works on my cluster...
this is how you can check for the correct image version:
a) check the digest of the registry image in quay, and find the node the catalogsource pod is running on:
$ skopeo inspect --tls-verify=false docker://quay.io/openshift-kni/performance-addon-operator-registry | grep Digest
"Digest": "sha256:9fa623b37cd308f733522ebd181c6fb719401aed30fae0eb90ff476f984a542c",
$ k -n openshift-marketplace get pod performance-addon-operator-catalogsource-zgzps -o=wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
performance-addon-operator-catalogsource-zgzps 1/1 Running 3 27m 10.131.0.3 slinte-gwx5r-w-a-5k7ds.c.openshift-gce-devel.internal <none> <none>
b) get into that node and check digest of registry image:
$ oc debug node/slinte-gwx5r-w-a-5k7ds.c.openshift-gce-devel.internal
Starting pod/slinte-gwx5r-w-a-5k7dscopenshift-gce-develinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.2
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# crictl images --digests | grep perf
quay.io/openshift-kni/performance-addon-operator-registry latest 9fa623b37cd30 fa6a82e5a07bd 440MB
quay.io/openshift-kni/performance-addon-operator latest 7928c9dc2850c d0ef957373102 249MB
c) in this case, the digest 9fa623b37cd30 matches.
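If you do this often, the comparison can be scripted; a rough sketch, assuming jq is available where skopeo runs (the crictl line still has to run inside the node's debug shell):
# print the registry digest, then compare it against what cri-o has cached on the node
REGISTRY_DIGEST=$(skopeo inspect --tls-verify=false docker://quay.io/openshift-kni/performance-addon-operator-registry:latest | jq -r .Digest)
echo "registry: ${REGISTRY_DIGEST}"
# on the node (via oc debug + chroot /host):
crictl images --digests | grep performance-addon-operator-registry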
edit: this is the huge disadvantage of using the latest tag: you never know for sure what you get. But I have no better idea yet without having to change the tag in the deploy repo every time. Will check if it is possible to let the CatalogSource always pull the image.
@gsr-shanks you can try this, but I'm not 100% sure if it works:
k -n openshift-marketplace edit catalogsource performance-addon-operator-catalogsource
and add this to the spec:
updateStrategy:
  registryPoll:
    interval: 1m
edit: I think you need to remove it once the performance-addon-operator-catalogsource-xxx pod has been restarted, or it will be restarted every minute.
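If you prefer a one-shot command over oc edit, a merge patch should set the same field (a sketch, not verified end to end):
# add the registry poll interval via a merge patch instead of interactive editing
oc -n openshift-marketplace patch catalogsource performance-addon-operator-catalogsource \
    --type merge -p '{"spec":{"updateStrategy":{"registryPoll":{"interval":"1m"}}}}'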
@slintes Is this how we are recommending customers to update their image?
no, customers won't use the latest tag
Ok, I guess then we are using latest since we don't have tags yet.
Also, shouldn't removing the worker-rt label, or make cluster-clean, remove these images?
Ok, I guess then we are using latest since we don't have tags yet.
No clue if we will ever have tags upstream. We are mainly using the images for CI in the deploy repo. But we do not want to update the used tag over there for every new operator version. And because CI always spins up a new cluster, caching is no issue there.
Also, shouldn't removing the worker-rt label
No. The images are cached on the node by cri-o.
or make cluster-clean remove these images?
The clean script would have to oc debug node/.... into every node and delete the image using crictl ... not sure if it's easy to implement (scripting oc debug) and if we want that.
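A rough sketch of what such a loop could look like, if we ever wanted it (the worker label selector and the non-interactive oc debug usage are assumptions, not tested):
# delete the cached operator image from every worker node
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    oc debug "$node" -- chroot /host sh -c \
        'crictl images -q quay.io/openshift-kni/performance-addon-operator | xargs -r crictl rmi'
done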
An idea: can you clean up the cluster please, and then patch the catalogsource before deploying it, as described above, but with a long interval, not 1 minute? I think that will already trigger a pull of the image when it is deployed. If that works, we can add it to our manifests.
I actually deleted the images from the worker nodes and am retrying now. I will try your idea if I can reproduce it.
I deleted all the performance-addon-operator images from the worker nodes using crictl and did make cluster-deploy cluster-wait-for-mcp. The latest image got pulled alright; however, I see
[INFO] Performace MC not picked up yet.
[ERROR] MCP failed, giving up.
make: *** [Makefile:118: cluster-wait-for-mcp] Error 1
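(Not from the thread, but when cluster-wait-for-mcp gives up, the MachineConfigPool status usually says why; a sketch of standard MCO debugging commands, where the worker-rt pool name is an assumption based on the label used in this thread:)
# inspect the pool and look for the performance MachineConfig
oc get machineconfigpool
oc describe machineconfigpool worker-rt
oc get machineconfig | grep performance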
I tried your method to verify the performance operator image and they look good.
sh-4.4# skopeo inspect --tls-verify=false docker://quay.io/openshift-kni/performance-addon-operator-registry | grep Digest
"Digest": "sha256:9c1ce47aa434393f2707ce8784a81a0c00e229c79419326b7b58477de3b5b815",
sh-4.4#
sh-4.4# crictl images --digests | grep perf
quay.io/openshift-kni/performance-addon-operator-registry latest 9c1ce47aa4343 fa6a82e5a07bd 440MB
I looked at https://quay.io/repository/openshift-kni/performance-addon-operator-registry?tag=latest&tab=tags and the SHA looks correct.
What is not looking correct is the image ID SHA in performance-operator pod:
# oc describe pod performance-operator-f475cf684-wnpkk
...
Status:          Running
IP:              10.128.2.22
IPs:
  IP:  10.128.2.22
Controlled By:   ReplicaSet/performance-operator-f475cf684
Containers:
  performance-operator:
    Container ID:   cri-o://c680e116f82ce1e6f7589a9d8c3dd48ad1606e23685192ae3cc380a992602e62
    Image:          quay.io/openshift-kni/performance-addon-operator:latest
    Image ID:       quay.io/openshift-kni/performance-addon-operator@sha256:7928c9dc2850c63456a14af38fe6f31dcda28bf8f3a8c0d0fcc516967e870003
    Port:           <none>
    Host Port:      <none>
    Command:
      performance-operator
    State:          Running
      Started:      Thu, 30 Jan 2020 10:47:50 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      WATCH_NAMESPACE:  (v1:metadata.annotations['olm.targetNamespaces'])
      POD_NAME:         performance-operator-f475cf684-wnpkk (v1:metadata.name)
      OPERATOR_NAME:    performance-operator
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from performance-operator-token-w2bsj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  performance-operator-token-w2bsj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  performance-operator-token-w2bsj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
The performance operator has a different SHA:
Image ID: quay.io/openshift-kni/performance-addon-operator@sha256:7928c9dc2850c63456a14af38fe6f31dcda28bf8f3a8c0d0fcc516967e870003
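A quick way to read just that field instead of scanning the whole describe output (standard jsonpath usage; pod name taken from above):
# print only the image ID the running pod was started from
oc get pod performance-operator-f475cf684-wnpkk \
    -o jsonpath='{.status.containerStatuses[0].imageID}{"\n"}'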
Another thing that is not clear: why is the image pulled on the worker-0 node while the worker-rt label is enabled on the worker-1 node?
# oc -n openshift-marketplace get pod performance-addon-operator-catalogsource-h6gkt -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
performance-addon-operator-catalogsource-h6gkt 1/1 Running 0 37s 10.131.0.75 worker-0 <none> <none>
# oc get node
NAME STATUS ROLES AGE VERSION
master-0 Ready master 4d v1.17.1
master-1 Ready master 4d v1.17.1
master-2 Ready master 4d v1.17.1
worker-0 Ready,SchedulingDisabled worker 3d23h v1.17.1
worker-1 Ready,SchedulingDisabled worker,worker-rt 3d23h v1.17.1
worker-2 Ready worker 3d23h v1.17.1
Something looks wrong here.
What is not looking correct is the image ID SHA in performance-operator pod:
look at the start time: it's an old pod. Cleanup did not work correctly. You can just delete the pod; it will be recreated automatically.
why is the image pulled on the worker-0 node while the worker-rt label is enabled on the worker-1 node?
The catalogsource and the operator itself can run on any node, it does not matter. Only the MachineConfigs etc. created by the operator have to be applied on the right node.
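To see where those MachineConfigs will land, you can inspect the pool's selectors (a sketch; the worker-rt pool name is inferred from the label in this thread):
# show which nodes the pool selects and which MachineConfigs it picks up
oc get machineconfigpool worker-rt \
    -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.machineConfigSelector}{"\n"}'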
What is not looking correct is the image ID SHA in performance-operator pod:
look at the start time: it's an old pod. Cleanup did not work correctly. You can just delete the pod; it will be recreated automatically.
Ok, yeah, in our make cluster-clean we do not verify that the pods are removed.
Anyway, I am rebuilding the cluster to deploy downstream builds. If I find something not working there, I will file it in Bugzilla.
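(If we wanted cluster-clean to verify that, something like oc wait could do it; a sketch, where the label selector is an assumption:)
# block until the operator pod is actually gone before finishing cleanup
oc -n openshift-performance-addon wait --for=delete \
    pod -l name=performance-operator --timeout=120s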
why is the image pulled on the worker-0 node while the worker-rt label is enabled on the worker-1 node?
The catalogsource and the operator itself can run on any node, it does not matter. Only the MachineConfigs etc. created by the operator have to be applied on the right node.
Ok, got it.
Closing this issue. Thanks.
Actual: the RT kernel is not installed, and cluster-wait-for-mcp errors out after retries. Expected: the RT kernel is installed on the worker-1 worker node.
Additional info: the performance-operator deployment has "Image: REPLACE_IMAGE", which looks incorrect.