mnlipp / VM-Operator

Run Qemu VMs in Kubernetes Pods
https://jdrupes.org/vm-operator/
GNU Affero General Public License v3.0

Operator fails to start with `[Too many errors, abort]` after a while #33

Closed meschbach closed 3 months ago

meschbach commented 3 months ago

This project looks interesting! When following the deployment steps below, I received a cryptic error message. I tried to track down where it might originate within the code, without any luck. The vm-operator deployment does not fail either its health or its readiness checks.

Any ideas on how to start diagnosing this?

Steps to deploy

kubectl apply -f https://github.com/mnlipp/VM-Operator/raw/main/deploy/crds/vms-crd.yaml
kubectl create namespace vmop-demo
kubectl apply -k .

The kustomization.yaml in that directory:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
# Again, I recommend to use the deploy directory from a
# release branch for anything but test environments.
- https://github.com/mnlipp/VM-Operator/deploy

namespace: vmop-demo

patches:
- patch: |-
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: vmop-image-repository
    spec:
      # Default is ReadOnlyMany
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          # Default is 100Gi
          storage: 10Gi
      # Default is to use the default storage class
      storageClassName: local-path

- patch: |-
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: vm-operator
    data:
      config.yaml: |
        "/Manager":
          # "/GuiHttpServer":
            # See section about the GUI
          "/Controller":
            "/Reconciler":
              runnerDataPvc:
                # Default is to use the default storage class
                storageClassName: local-path

Target Cluster

Rancher Desktop 1.14.1 on MacOS 14.5 (Sonoma) on an M1. Single Node.

~:> kubectl version --output=json |sed -e 's/^/> /'
{
  "clientVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.6",
    "gitCommit": "be3af46a4654bdf05b4838fe94e95ec8c165660c",
    "gitTreeState": "clean",
    "buildDate": "2024-01-17T13:49:03Z",
    "goVersion": "go1.20.13",
    "compiler": "gc",
    "platform": "darwin/arm64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
  "serverVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.6+k3s2",
    "gitCommit": "c9f49a3b06cd7ebe793f8cc1dcd0293168e743d9",
    "gitTreeState": "clean",
    "buildDate": "2024-02-06T01:25:17Z",
    "goVersion": "go1.20.13",
    "compiler": "gc",
    "platform": "linux/arm64"
  }
}

Storage Class local-path exists.

Log Message

Jul 26 17:03:29 CONFIG Version: 3.1.2 (built from manager-app-3.1.2)
Jul 26 17:03:30 CONFIG running on OpenJDK 64-Bit Server VM (21.0.3+9-LTS) from Eclipse Adoptium
Jul 26 17:03:32 CONFIG Using configuration from: /etc/opt/vmoperator/config.yaml

[Too many errors, abort]

mnlipp commented 3 months ago

The message "[Too many errors, abort]" does not come from the VM-Operator. The operator does all of its logging through java.util.logging, so any message from it would at least carry a timestamp. I found the exact message in the JRE sources, but that doesn't help much.

I can't verify this at the moment, but my assumption is that the VM-Operator won't work without a minimal GUI configuration.

I'll have a closer look at this next week.

mnlipp commented 3 months ago

Here's another idea: I've only just noticed that your node reports an ARM architecture. The provided images use a JRE built for x86_64, so you should actually have got a message from your OS that the JRE cannot be executed. Nevertheless, you got some initial output, which is extremely strange.

As mentioned, I'll test the unmodified configuration files again next week. But there may be some inconsistency in your environment as well.

mnlipp commented 3 months ago

I've tested the configuration from your initial comment on an x86_64 node and it works without problems.

I did a bit of research, and apparently some emulation library allows you to start the x86_64 image on your ARM node. This emulation is, however, not perfect and evidently fails to fully support the Java Runtime Environment (or the Alpine base used in the image). This is a problem that you should report to the project that provides the emulation. It is not caused by or related to this project.
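
A mismatch of this kind can be spotted before deploying by comparing the architecture the node reports with the architecture the image was built for. The following is only a minimal sketch, assuming a POSIX shell; `normalize_arch` and `arch_matches` are hypothetical helper names, not part of the VM-Operator, and the two example values are taken from this thread rather than queried from a cluster.

```shell
#!/bin/sh
# Map common machine names (as printed by `uname -m`) to the
# architecture names used by Kubernetes and container images.
# Unknown names are passed through unchanged.
normalize_arch() {
  case "$1" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *)             echo "$1" ;;
  esac
}

# Succeeds (exit status 0) only if both names denote the same architecture.
arch_matches() {
  [ "$(normalize_arch "$1")" = "$(normalize_arch "$2")" ]
}

# Example values from this thread (assumptions, not queried live):
node_arch="arm64"     # serverVersion platform is linux/arm64
image_arch="x86_64"   # the provided images use a JRE built for x86_64

if arch_matches "$node_arch" "$image_arch"; then
  echo "architectures match"
else
  echo "mismatch: node is $(normalize_arch "$node_arch"), image is $(normalize_arch "$image_arch")"
fi
```

On a live cluster, the node value could be read with `kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'` instead of being hard-coded. How to determine the image's architecture depends on the registry tooling at hand and is left out here.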

meschbach commented 3 months ago

Thank you for taking the time to look into the issue!

Definitely interesting behavior, as an incorrect CPU architecture typically manifests at container start with a very different error message, both on plain k8s and on arm64. That said, I cannot rule out something breaking in the JVM or QEMU layers as a result of actions performed by the application. If I get a chance to investigate further, I will. Thank you again!