k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra
https://docs.datastax.com/en/cass-operator/doc/cass-operator/cassOperatorGettingStarted.html
Apache License 2.0

Latest cass-operator does not support server version of Cassandra 5.0.2 #725

Open kos-team opened 6 days ago

kos-team commented 6 days ago

What happened?

cass-operator 1.22.4, the latest release, cannot deploy Cassandra 5.0.2 correctly. According to the https://github.com/k8ssandra/management-api-for-apache-cassandra repo, 5.0.2 is supported. The Cassandra process crashes with the following error message: ERROR [COMMIT-LOG-ALLOCATOR] 2024-11-07 21:35:29,362 JVMStabilityInspector.java:201 - Exiting due to error while processing commit log during initialization.

What did you expect to happen?

cass-operator should be able to deploy Cassandra with 5.0.2.

How can we reproduce it (as minimally and precisely as possible)?

To reproduce the bug, first deploy cass-operator, then deploy this CR with serverVersion set to 5.0.2:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: test-cluster
spec:
  clusterName: development
  config:
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      num_tokens: 16
      role_manager: CassandraRoleManager
      transfer_hints_on_decommission: false
  managementApiAuth:
    insecure: {}
  racks:
  - name: rack1
  - name: rack2
  - name: rack3
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
  serverType: cassandra
  serverVersion: 5.0.2
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: standard

cass-operator version

1.22.4

Kubernetes version

1.29.1

Method of installation

Helm

Anything else we need to know?

Error log from the server-system-logger container, which contains the log output of Cassandra itself:

ERROR [COMMIT-LOG-ALLOCATOR] 2024-11-07 18:16:46,295 JVMStabilityInspector.java:201 - Exiting due to error while processing commit log during initialization.                     
org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: /opt/cassandra/data/commitlog/CommitLog-8-1731003406282.log: Invalid argument                            
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:165)                                                                                       
    at org.apache.cassandra.db.commitlog.DirectIOSegment.<init>(DirectIOSegment.java:57)                                                                                          
    at org.apache.cassandra.db.commitlog.DirectIOSegment$DirectIOSegmentBuilder.build(DirectIOSegment.java:179)                                                                   
    at org.apache.cassandra.db.commitlog.DirectIOSegment$DirectIOSegmentBuilder.build(DirectIOSegment.java:160)                                                                   
    at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.createSegment(AbstractCommitLogSegmentManager.java:277)                                                  
    at org.apache.cassandra.db.commitlog.CommitLogSegmentManagerStandard.createSegment(CommitLogSegmentManagerStandard.java:65)                                                   
    at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager$AllocatorRunnable.run(AbstractCommitLogSegmentManager.java:189)                                          
    at org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:121)                                                                                   
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)                                                                                      
    at java.base/java.lang.Thread.run(Thread.java:829)                                                                                                                            
Caused by: java.nio.file.FileSystemException: /opt/cassandra/data/commitlog/CommitLog-8-1731003406282.log: Invalid argument                                                       
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)                                                                                          
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)                                                                                            
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)                                                                                            
    at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)                                                                                
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)                                                                                                         
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)                                                                                                         
    at org.apache.cassandra.db.commitlog.DirectIOSegment$DirectIOSegmentBuilder.lambda$build$0(DirectIOSegment.java:180)                                                          
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:161)                                                                                       
    ... 9 common frames omitted 

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: CASS-77

burmanm commented 3 days ago

These errors suggest that something is wrong in your Kubernetes environment. "Invalid argument" is returned when the filesystem is unable to do something (in this case, allocate a segment on the disk). This isn't directly related to cass-operator or management-api, as these functions depend on your StorageClass / CSI driver / Kubernetes / Linux / filesystem / etc.

Perhaps it is something as simple as running out of disk space or a defective disk?

I tested 5.0.2 on multiple systems and they all worked fine.

kos-team commented 1 day ago

After some debugging, we found that the root cause is the file system we were running on. We reproduced the problem on a Kind Kubernetes cluster with the default local-storage CSI driver. The host OS is Linux, but we were running everything on a tmpfs file system. When we switched Kind to a regular ext4 file system, 5.0.2 worked fine.

We are curious what changed in Cassandra 5.0.2 that made it incompatible with the tmpfs file system.
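
One hint is in the stack trace above: the failing call goes through DirectIOSegment and ends in FileChannel.open with "Invalid argument". That is consistent with the file system rejecting a direct I/O open, which tmpfs does on many kernels. This is an assumption drawn from the trace, not a confirmed diagnosis. The Python sketch below is one way to probe whether a directory accepts files opened with O_DIRECT; the helper name supports_direct_io is hypothetical, and it assumes a Linux host where os.O_DIRECT is defined.

```python
import errno
import os
import tempfile


def supports_direct_io(directory):
    """Return True if a file in `directory` can be opened with O_DIRECT.

    Hypothetical helper for illustration only. Treating EINVAL as
    "direct I/O not supported" is an assumption based on the
    "Invalid argument" error in the stack trace.
    """
    o_direct = getattr(os, "O_DIRECT", None)
    if o_direct is None:
        raise RuntimeError("os.O_DIRECT is not available on this platform")

    # Create a scratch file first, then reopen it with O_DIRECT so that
    # only the direct I/O open itself is being tested.
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        fd = os.open(path, os.O_RDWR | o_direct)
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return False
        raise  # any other error (permissions, full disk) is a different problem
    else:
        os.close(fd)
        return True
    finally:
        os.unlink(path)
```

Calling supports_direct_io("/opt/cassandra/data/commitlog") inside the Cassandra container, or on the host directory that backs the volume, would show whether the commit log location accepts direct I/O. A result of False on tmpfs and True on ext4 would match the behavior described above.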