danmillwood opened this issue 2 years ago (status: Open)
In order to support different architectures in our release builds, OLM is fed information from the environment when choosing which image to use in the unpacker pod; something seems to be going wrong here. Could you please share:
Hello,
I installed OLM via the operator-sdk, using the following set of commands:

```shell
apt-get install operator-sdk
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
export OPERATOR_SDK_DL_URL=https://github.com/operator-framework/operator-sdk/releases/download/v1.23.0
curl -LO ${OPERATOR_SDK_DL_URL}/operator-sdk_${OS}_${ARCH}
gpg --keyserver keyserver.ubuntu.com --recv-keys 052996E2A20B5C7E
curl -LO ${OPERATOR_SDK_DL_URL}/checksums.txt
curl -LO ${OPERATOR_SDK_DL_URL}/checksums.txt.asc
gpg -u "Operator SDK (release) <cncf-operator-sdk@cncf.io>" --verify checksums.txt.asc
grep operator-sdk_${OS}_${ARCH} checksums.txt | sha256sum -c -
chmod +x operator-sdk_${OS}_${ARCH} && sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk
operator-sdk
operator-sdk olm -h
operator-sdk olm install
```
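The `export ARCH=...` line above can be hard to parse at a glance. The same mapping, written as a small helper function (the `map_arch` name is mine, not part of the operator-sdk docs), makes explicit that s390x is passed through unchanged:

```shell
# Hypothetical helper reproducing the ARCH mapping from the install
# commands: translate `uname -m` machine names into the GOARCH-style
# names used in the operator-sdk release asset filenames. Anything
# unrecognized (s390x, ppc64le, ...) passes through unchanged.
map_arch() {
  case "$1" in
    x86_64)  echo -n amd64 ;;
    aarch64) echo -n arm64 ;;
    *)       echo -n "$1" ;;
  esac
}

echo "$(map_arch x86_64)"   # -> amd64
echo "$(map_arch aarch64)"  # -> arm64
echo "$(map_arch s390x)"    # -> s390x (pass-through)
```

On the reporter's machine this yields `s390x`, consistent with the `GOARCH: "s390x"` in the version output below.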
The operator-sdk version is `operator-sdk version: "v1.23.0", commit: "1eaeb5adb56be05fe8cc6dd70517e441696846a4", kubernetes version: "1.24.2", go version: "go1.18.5", GOOS: "linux", GOARCH: "s390x"`, and from the sha of the operator-framework/olm image, I think it's v0.22.
The full Deployment YAML is:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  creationTimestamp: "2022-09-08T12:50:25Z"
  generation: 2
  labels:
    app: olm-operator
  name: olm-operator
  namespace: olm
  resourceVersion: "23091916"
  uid: 70eff54d-6b69-47ca-bbef-8b6886972605
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: olm-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: olm-operator
    spec:
      containers:
      - args:
        - --namespace
        - $(OPERATOR_NAMESPACE)
        - --writeStatusName
        - ""
        command:
        - /bin/olm
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: OPERATOR_NAME
          value: olm-operator
        image: quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: olm-operator
        ports:
        - containerPort: 8080
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 10m
            memory: 160Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: olm-operator-serviceaccount
      serviceAccountName: olm-operator-serviceaccount
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-09-08T12:50:25Z"
    lastUpdateTime: "2022-09-08T12:57:57Z"
    message: ReplicaSet "olm-operator-64b58958bb" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-14T15:39:12Z"
    lastUpdateTime: "2022-09-14T15:39:12Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
```
I did have to edit the YAML to add a toleration so the deployment could run on my control plane node.
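As a side note, since pods can land on any node, it can be worth confirming which architectures the cluster actually advertises. A sketch (not from the original report; assumes a configured `kubectl`):

```shell
# Print the architecture/OS labels the kubelet sets on every node; the
# unpacker job's image must match the arch of whichever node its pod
# lands on. Falls back gracefully when kubectl or a cluster is missing.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -L kubernetes.io/arch -L kubernetes.io/os \
    || echo "no cluster reachable"
else
  echo "kubectl not found; skipping node check"
fi
```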
In case it helps with debugging, I tried an experiment directly with the image, using Docker on the linux/s390x system:

```console
# docker pull quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b: Pulling from operator-framework/olm
Digest: sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
Status: Image is up to date for quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
```
Running the default entrypoint works: the binary executes, failing only because no kubeconfig is available:

```console
# docker run quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
time="2022-09-20T10:30:42Z" level=info msg="log level info"
{"level":"error","ts":1663669842.7247128,"logger":"controller-runtime.client.config","msg":"unable to get kubeconfig","error":"invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable","errorCauses":[{"error":"no configuration has been provided, try setting KUBERNETES_MASTER environment variable"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/client/config.GetConfigOrDie\n\t/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/sigs.k8s.io/controller-runtime/pkg/client/config/config.go:153\nmain.Manager\n\t/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/cmd/olm/manager.go:50\nmain.main\n\t/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/cmd/olm/main.go:135\nruntime.main\n\t/opt/hostedtoolcache/go/1.18.5/x64/src/runtime/proc.go:250"}
```
Now run `/bin/cp` as the entrypoint, which fails:

```console
# docker run --entrypoint /bin/cp quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
standard_init_linux.go:228: exec user process caused: exec format error
```
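One way to confirm which architecture `/bin/cp` in the image was actually built for, without needing `file` inside the container, is to copy the binary out and read the `e_machine` field of its ELF header. The `elf_machine` helper below is my own sketch, and the commented `docker create`/`docker cp` lines are illustrative:

```shell
# e_machine lives at bytes 18-19 (0-indexed) of an ELF header: a
# little-endian x86-64 binary stores 3e 00 there, while a big-endian
# s390x binary stores 00 16. This hypothetical helper prints the two
# bytes as hex.
elf_machine() {
  od -An -tx1 -j18 -N2 "$1" | tr -d ' \n'
}

# Demo on a fabricated 20-byte "header": ELF magic, 14 filler bytes to
# reach offset 18, then the x86-64 e_machine bytes.
tmp=$(mktemp)
printf '\x7fELF' > "$tmp"
printf 'AAAAAAAAAAAAAA' >> "$tmp"
printf '\x3e\x00' >> "$tmp"
elf_machine "$tmp"   # -> 3e00 (x86-64)
rm -f "$tmp"

# Against the real image (requires Docker; -L follows the busybox symlink):
# id=$(docker create quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b)
# docker cp -L "$id":/bin/cp ./cp-from-image && docker rm "$id"
# elf_machine ./cp-from-image   # 3e00 => x86-64, 0016 => s390x
```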
I also tried deliberately pulling the s390x image down to an amd64 system and re-ran the same commands. In this case `/bin/cp` executed successfully, so my guess is that the s390x image is being built from an amd64 base layer.
```console
# docker run quay.io/operator-framework/olm@sha256:14afcf5c38f7055cb5a45a053da10791469e58de264bc449bef24f54b8bb6be2
WARNING: The requested image's platform (linux/s390x) does not match the detected host platform (linux/amd64) and no specific platform was requested
exec /bin/olm: exec format error
# docker run --entrypoint /bin/cp quay.io/operator-framework/olm@sha256:14afcf5c38f7055cb5a45a053da10791469e58de264bc449bef24f54b8bb6be2
WARNING: The requested image's platform (linux/s390x) does not match the detected host platform (linux/amd64) and no specific platform was requested
BusyBox v1.34.1 (2022-04-13 00:26:55 UTC) multi-call binary.

Usage: cp [-arPLHpfinlsTu] SOURCE DEST
  or: cp [-arPLHpfinlsu] SOURCE... { -t DIRECTORY | DIRECTORY }
```
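What the registry serves per architecture can be checked against the manifest list, e.g. with `docker manifest inspect quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b` or `skopeo inspect --raw`. The JSON below is a placeholder with fake digests just to show the shape; the grep/cut pipeline is a rough, jq-free way to pair each platform with its digest:

```shell
# Placeholder manifest list (real output comes from `docker manifest
# inspect`); sha256:aaa / sha256:bbb are fake digests for illustration.
cat > manifest.json <<'EOF'
{"manifests":[
 {"digest":"sha256:aaa","platform":{"architecture":"amd64","os":"linux"}},
 {"digest":"sha256:bbb","platform":{"architecture":"s390x","os":"linux"}}
]}
EOF

# Pair each architecture with its digest; relies on every manifest entry
# carrying exactly one "digest" and one "architecture" field, in order.
grep -o '"architecture":"[^"]*"' manifest.json | cut -d'"' -f4 > arch.txt
grep -o '"digest":"[^"]*"' manifest.json | cut -d'"' -f4 > digest.txt
paste arch.txt digest.txt
rm -f manifest.json arch.txt digest.txt
```

With the real manifest list, the s390x row gives the digest that should be pulled on the s390x node; if pulling that digest still yields an amd64 `/bin/cp`, the image itself was built from an amd64 base layer, as guessed above.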
## Bug Report

**What did you do?**
I installed OLM on linux/s390x and connected to a catalog of operators. I then attempted to install an operator. This resulted in an `InstallPlan` resource and a `Job` resource being created in the OLM namespace. The job failed, and looking at the corresponding pod, I could see it failed to run an initContainer called `util`.

Googling for this error suggests that the image is for the wrong architecture, and this is where I'm confused as to how this is happening. From the pod yaml, this is the image it's trying to retrieve:

This appears to be exactly the same image as the one being run for the olm-operator pod, which is running successfully.

So I'm confused how an image can run successfully in one container and fail when used in another container, unless that image really does have a binary in it for the wrong architecture?
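A sketch for confirming which image the failing unpacker pod's initContainers reference (not from the original report; assumes `kubectl` access to the olm namespace):

```shell
# List every pod in the olm namespace with its initContainer images; the
# unpacker pod's row shows exactly which image the `util` initContainer
# pulls, for comparison against the olm-operator deployment's image.
if command -v kubectl >/dev/null 2>&1; then
  kubectl -n olm get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[*].image}{"\n"}{end}' \
    || echo "no cluster reachable"
else
  echo "kubectl not found; skipping"
fi
```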
**What did you expect to see?**

The job to execute to completion.
**What did you see instead? Under which circumstances?**

The pod for the job failed at the first initContainer.
**Environment**

`sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b`
**Possible Solution**
**Additional context**