k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Cassandra 4.1 process does not start with ZGC enabled #1368

Closed iAlex97 closed 2 months ago

iAlex97 commented 4 months ago

What happened?

I just got started with the K8ssandra operator and cannot wait to migrate our on-premise cluster to it. Having previously run that cluster (version 3.11) with Shenandoah GC and seen the latency improvements, I made enabling ZGC one of the first things I tried. However, after checking out 4.0-jdk11-G1, the Cassandra pods never fully initialised, because the Cassandra process exited immediately on startup.

Did you expect to see something different?

I would expect the cluster to come up normally using the test fixture.

How to reproduce it (as minimally and precisely as possible):

  1. Install k8ssandra operator using Helm
  2. kubectl apply -f manifest.yaml
  3. Readiness probe will always return 500

Environment

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: prod
  namespace: k8ssandra-operator
spec:
  cassandra:
    serverVersion: "4.1.5"

    datacenters:
      - metadata:
          name: fsn1

        size: 3

        resources:
          requests:
            cpu: 24
            memory: 64Gi
            hugepages-2Mi: 5Gi
          limits:
            hugepages-2Mi: 5Gi

        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: topolvm-cassandra
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 300Gi

        config:
          jvmOptions:
            heap_initial_size: 4G
            heap_max_size: 4G
            gc: ZGC
            additionalOptions: []
              # - -XX:ConcGCThreads=1
              # - -XX:ParallelGCThreads=2 # must be >= ConcGCThreads

        networking:
          hostNetwork: false

not relevant

Anything else we need to know?:

My debugging process involved exec'ing into one pod and manually starting the Cassandra process like this:

export JAVA_VERSION=11
source /opt/cassandra/conf/cassandra-env.sh
/opt/cassandra/bin/cassandra

This results in the following output:

Error: VM option 'UseZGC' is experimental and must be enabled via -XX:+UnlockExperimentalVMOptions.
Error: The unlock option must precede 'UseZGC'.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Checking the contents of /opt/cassandra/conf/jvm11-server.options:

-Djdk.attach.allowAttachSelf=true
--add-exports java.base/jdk.internal.misc=ALL-UNNAMED
--add-exports java.base/jdk.internal.ref=ALL-UNNAMED
--add-exports java.base/sun.nio.ch=ALL-UNNAMED
--add-exports java.management.rmi/com.sun.jmx.remote.internal.rmi=ALL-UNNAMED
--add-exports java.rmi/sun.rmi.registry=ALL-UNNAMED
--add-exports java.rmi/sun.rmi.server=ALL-UNNAMED
--add-exports java.sql/java.sql=ALL-UNNAMED
--add-opens java.base/java.lang.module=ALL-UNNAMED
--add-opens java.base/jdk.internal.loader=ALL-UNNAMED
--add-opens java.base/jdk.internal.ref=ALL-UNNAMED
--add-opens java.base/jdk.internal.reflect=ALL-UNNAMED
--add-opens java.base/jdk.internal.math=ALL-UNNAMED
--add-opens java.base/jdk.internal.module=ALL-UNNAMED
--add-opens java.base/jdk.internal.util.jar=ALL-UNNAMED
--add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED
-Dio.netty.tryReflectionSetAccessible=true
-XX:+UseZGC
-XX:+UnlockExperimentalVMOptions

which indeed shows the -XX:+UseZGC flag before -XX:+UnlockExperimentalVMOptions.
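
The ordering problem described above can be checked mechanically. The following is a minimal sketch, not part of the operator: `unlock_precedes_zgc` is a hypothetical helper name, and it assumes the options file lists one JVM flag per line, as in the dump above. It only looks at the first uncommented occurrence of each flag.

```shell
#!/bin/sh
# Hypothetical helper (not part of k8ssandra-operator or Cassandra).
# Returns 0 if the file does not enable ZGC, or if
# -XX:+UnlockExperimentalVMOptions appears on an earlier line than
# -XX:+UseZGC. Returns 1 otherwise.
unlock_precedes_zgc() {
    file="$1"

    # grep -n prints "LINE:match"; cut keeps the line number of the first hit.
    # The '^' anchor skips commented-out flags such as "#-XX:+UseZGC".
    unlock_line=$(grep -n -e '^-XX:+UnlockExperimentalVMOptions' "$file" | head -n 1 | cut -d: -f1)
    zgc_line=$(grep -n -e '^-XX:+UseZGC' "$file" | head -n 1 | cut -d: -f1)

    # ZGC is not enabled in this file, so there is nothing to check.
    if [ -z "$zgc_line" ]; then
        return 0
    fi

    # ZGC is enabled but the unlock flag is missing entirely.
    if [ -z "$unlock_line" ]; then
        echo "UseZGC found on line $zgc_line, but no unlock flag" >&2
        return 1
    fi

    if [ "$unlock_line" -lt "$zgc_line" ]; then
        return 0
    fi

    echo "UseZGC (line $zgc_line) precedes the unlock flag (line $unlock_line)" >&2
    return 1
}
```

Inside a pod, running `unlock_precedes_zgc /opt/cassandra/conf/jvm11-server.options` against the file shown above would report the misordering. The check only matters on JDK versions where ZGC is still experimental (it became a production feature in JDK 15).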

My workaround was setting -XX:+UnlockExperimentalVMOptions in JVM_OPTS like this:

export JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
/opt/cassandra/bin/cassandra
# cassandra starts normally

Finally, I would also like to mention that using ZGC should be backed by enabling hugepages on the nodes. Missing hugepages was my first guess as to why the Java process refused to start.
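
For anyone wanting to rule out the hugepages guess first, here is a minimal sketch of a node-side check. `hugepages_total` is a hypothetical helper name; it assumes a Linux node and simply reads the `HugePages_Total` counter from `/proc/meminfo` (or from a file passed as an argument).

```shell
#!/bin/sh
# Hypothetical helper: print the number of pre-allocated huge pages.
# Reads /proc/meminfo by default; a different file can be passed for testing.
hugepages_total() {
    meminfo="${1:-/proc/meminfo}"
    # The line looks like "HugePages_Total:    2560"; print the second field.
    awk '/^HugePages_Total:/ { print $2 }' "$meminfo"
}
```

With a 2 MiB huge page size, the `hugepages-2Mi: 5Gi` request in the manifest above corresponds to 2560 pages, so a node would need `hugepages_total` to report at least that many before the pod's request could be satisfied.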

Issue is synchronized with this Jira Story by Unito. Issue Number: K8OP-10

iAlex97 commented 4 months ago

Finally got it to work using Custom GC like this:

        config:
          jvmOptions:
            heap_initial_size: 4G
            heap_max_size: 4G
            gc: Custom
            additionalOptions:
              - -XX:+UnlockExperimentalVMOptions
              - -XX:+UseLargePages
              - -XX:+UseZGC

burmanm commented 2 months ago

This was fixed and works in 1.19.0.