operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

standard_init_linux.go:228: exec user process caused: exec format error #2861

danmillwood commented 2 years ago

Bug Report

What did you do?

I installed OLM on linux/s390x and connected it to a catalog of operators. I then attempted to install an operator, which resulted in an InstallPlan resource and a Job resource being created in the olm namespace.

The job failed, and looking at the corresponding pod, I could see that it had failed to run an initContainer called util:

kubectl logs 648e77c9a7adc015c096cf0e2667326e167263a93266a3c6f52b4a62adlgnf2 util -n olm
standard_init_linux.go:228: exec user process caused: exec format error

Googling for this error suggests that the image is for the wrong architecture, and this is where I'm confused as to how this is happening. From the pod yaml, this is the image it's trying to retrieve:

  initContainers:
  - command:
    - /bin/cp
    - -Rv
    - /bin/cpb
    - /util/cpb
    image: quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
    imagePullPolicy: IfNotPresent
    name: util
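
One way to double-check what the local copy of this image reports, assuming the node runs Docker and the image has already been pulled, is a one-liner like:

# docker image inspect -f '{{.Os}}/{{.Architecture}}' quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b

This prints the platform recorded in the image config, which should match the node if the right variant was pulled.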

This appears to be exactly the same image as the one used for the olm-operator pod, which is running successfully:

  containers:
  - args:
    - --namespace
    - $(OPERATOR_NAMESPACE)
    - --writeStatusName
    - ""
    command:
    - /bin/olm
    env:
    - name: OPERATOR_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: OPERATOR_NAME
      value: olm-operator
    image: quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
    imagePullPolicy: IfNotPresent

kubectl get pods -n olm
NAME                                                              READY   STATUS       RESTARTS   AGE
648e77c9a7adc015c096cf0e2667326e167263a93266a3c6f52b4a62adlgnf2   0/1     Init:Error   0          2d1h
648e77c9a7adc015c096cf0e2667326e167263a93266a3c6f52b4a62admm8n6   0/1     Init:Error   0          2d1h
catalog-operator-8d9d97478-8v5mx                                  1/1     Running      0          6d21h
ibm-operator-catalog-x9gln                                        1/1     Running      0          2d2h
olm-operator-64b58958bb-pprtx                                     1/1     Running      0          21h
packageserver-545b4f5db8-nf42n                                    1/1     Running      0          6d23h
packageserver-545b4f5db8-pkp9h                                    1/1     Running      0          6d21h
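
One way to see which platforms the published manifest actually advertises, assuming a Docker client with manifest support, is:

# docker manifest inspect quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b

If the digest points at a manifest list, the output enumerates one platform entry per architecture, including (or missing) linux/s390x.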

So I'm confused as to how an image can run successfully in one container yet fail in another, unless the image really does contain a binary for the wrong architecture.

What did you expect to see?

The job to execute to completion.

What did you see instead? Under which circumstances?

The pod for the job failed at the first initContainer.

Environment

olm version (image digest): sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/s390x"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.2", GitCommit:"9d142434e3af351a628bffee3939e64c681afa4d", GitTreeState:"clean", BuildDate:"2022-01-19T17:29:16Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/s390x"}


awgreene commented 2 years ago

In order to support different architectures in our release builds, OLM is fed information from the environment when choosing which image to use in the unpacker pod; something seems to be going wrong here. Could you please share:

- how you installed OLM, and which version
- the full olm-operator deployment yaml

danmillwood commented 2 years ago

Hello,

I installed OLM via the operator-sdk, using the following set of commands:

apt-get install operator-sdk
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
export OPERATOR_SDK_DL_URL=https://github.com/operator-framework/operator-sdk/releases/download/v1.23.0
curl -LO ${OPERATOR_SDK_DL_URL}/operator-sdk_${OS}_${ARCH}
gpg --keyserver keyserver.ubuntu.com --recv-keys 052996E2A20B5C7E
curl -LO ${OPERATOR_SDK_DL_URL}/checksums.txt
curl -LO ${OPERATOR_SDK_DL_URL}/checksums.txt.asc
gpg -u "Operator SDK (release) <cncf-operator-sdk@cncf.io>" --verify checksums.txt.asc
grep operator-sdk_${OS}_${ARCH} checksums.txt | sha256sum -c -
chmod +x operator-sdk_${OS}_${ARCH} && sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk
operator-sdk 
operator-sdk olm -h
operator-sdk olm install
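
After the install, the state of the OLM components can be verified with the sdk's status subcommand:

operator-sdk olm status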

The operator-sdk version is operator-sdk version: "v1.23.0", commit: "1eaeb5adb56be05fe8cc6dd70517e441696846a4", kubernetes version: "1.24.2", go version: "go1.18.5", GOOS: "linux", GOARCH: "s390x", and from the sha of the operator-framework/olm image, I think it's v0.22.
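
For completeness, the architecture the cluster node itself reports, which is the kind of environment information the unpacker image choice would depend on, can be checked with something like:

kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'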

The full deployment yaml is:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  creationTimestamp: "2022-09-08T12:50:25Z"
  generation: 2
  labels:
    app: olm-operator
  name: olm-operator
  namespace: olm
  resourceVersion: "23091916"
  uid: 70eff54d-6b69-47ca-bbef-8b6886972605
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: olm-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: olm-operator
    spec:
      containers:
      - args:
        - --namespace
        - $(OPERATOR_NAMESPACE)
        - --writeStatusName
        - ""
        command:
        - /bin/olm
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: OPERATOR_NAME
          value: olm-operator
        image: quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: olm-operator
        ports:
        - containerPort: 8080
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 10m
            memory: 160Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: olm-operator-serviceaccount
      serviceAccountName: olm-operator-serviceaccount
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-09-08T12:50:25Z"
    lastUpdateTime: "2022-09-08T12:57:57Z"
    message: ReplicaSet "olm-operator-64b58958bb" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-14T15:39:12Z"
    lastUpdateTime: "2022-09-14T15:39:12Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

I did have to edit the yaml to add a toleration so the deployment could run on my control plane node. An equivalent one-off patch is sketched below.
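
For reference, a patch along these lines, assuming tolerations was absent from the original spec, would add the same toleration:

kubectl patch deployment olm-operator -n olm --type=json -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]}]'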

danmillwood commented 2 years ago

In case it helps with debugging, I tried an experiment directly with the image using Docker on the linux/s390x system.

# docker pull quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b: Pulling from operator-framework/olm
Digest: sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
Status: Image is up to date for quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b

Running the default entrypoint works; the olm binary starts and fails only because no kubeconfig is available:

# docker run quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
time="2022-09-20T10:30:42Z" level=info msg="log level info"
{"level":"error","ts":1663669842.7247128,"logger":"controller-runtime.client.config","msg":"unable to get kubeconfig","error":"invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable","errorCauses":[{"error":"no configuration has been provided, try setting KUBERNETES_MASTER environment variable"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/client/config.GetConfigOrDie\n\t/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/sigs.k8s.io/controller-runtime/pkg/client/config/config.go:153\nmain.Manager\n\t/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/cmd/olm/manager.go:50\nmain.main\n\t/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/cmd/olm/main.go:135\nruntime.main\n\t/opt/hostedtoolcache/go/1.18.5/x64/src/runtime/proc.go:250"}

Now running /bin/cp as the entrypoint fails:

# docker run --entrypoint /bin/cp  quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b
standard_init_linux.go:228: exec user process caused: exec format error
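
A way to confirm the binary's architecture without executing it is to copy it out of the image and inspect it on the host, for example (assuming file is available on the host):

# id=$(docker create quay.io/operator-framework/olm@sha256:2b4fee73c05069d9d2c537c7d3072241097914748abfb938b5b08c969b2f544b)
# docker cp $id:/bin/cp ./cp-from-image && docker rm $id
# file ./cp-from-image

On a genuine s390x image this should report an IBM S/390 ELF binary; an x86-64 result would confirm the wrong-architecture theory.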

danmillwood commented 2 years ago

I also tried deliberately pulling the s390x image down to an amd64 system and reran the same commands. In this case /bin/cp executed, so my guess is that the s390x image is being built from an amd64 base layer:

# docker run quay.io/operator-framework/olm@sha256:14afcf5c38f7055cb5a45a053da10791469e58de264bc449bef24f54b8bb6be2
WARNING: The requested image's platform (linux/s390x) does not match the detected host platform (linux/amd64) and no specific platform was requested
exec /bin/olm: exec format error

# docker run --entrypoint /bin/cp quay.io/operator-framework/olm@sha256:14afcf5c38f7055cb5a45a053da10791469e58de264bc449bef24f54b8bb6be2
WARNING: The requested image's platform (linux/s390x) does not match the detected host platform (linux/amd64) and no specific platform was requested
BusyBox v1.34.1 (2022-04-13 00:26:55 UTC) multi-call binary.

Usage: cp [-arPLHpfinlsTu] SOURCE DEST
or: cp [-arPLHpfinlsu] SOURCE... { -t DIRECTORY | DIRECTORY }
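
If the wrong-base-layer theory is right, the two variants should share their lowest layer. With each architecture's digest pulled locally (the amd64 child digest can be read from the manifest list), comparing the recorded layer digests would confirm it:

# docker image inspect -f '{{json .RootFS.Layers}}' quay.io/operator-framework/olm@sha256:14afcf5c38f7055cb5a45a053da10791469e58de264bc449bef24f54b8bb6be2

A matching first-layer digest across the amd64 and s390x images would support the guess that the s390x build pulled in an amd64 busybox layer.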