k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra
https://docs.datastax.com/en/cass-operator/doc/cass-operator/cassOperatorGettingStarted.html
Apache License 2.0

Latest cass-operator does not support server version of Cassandra 5.0.2 #725

Open kos-team opened 2 weeks ago

kos-team commented 2 weeks ago

What happened?

The latest cass-operator (version 1.22.4) cannot deploy Cassandra 5.0.2 correctly, even though the https://github.com/k8ssandra/management-api-for-apache-cassandra repo lists 5.0.2 as supported. The Cassandra process crashes with this error message: ERROR [COMMIT-LOG-ALLOCATOR] 2024-11-07 21:35:29,362 JVMStabilityInspector.java:201 - Exiting due to error while processing commit log during initialization.

What did you expect to happen?

cass-operator should be able to deploy Cassandra 5.0.2.

How can we reproduce it (as minimally and precisely as possible)?

To reproduce the bug, first deploy cass-operator.

Then deploy this CR with serverVersion set to 5.0.2:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: test-cluster
spec:
  clusterName: development
  config:
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      num_tokens: 16
      role_manager: CassandraRoleManager
      transfer_hints_on_decommission: false
  managementApiAuth:
    insecure: {}
  racks:
  - name: rack1
  - name: rack2
  - name: rack3
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
  serverType: cassandra
  serverVersion: 5.0.2
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: standard

cass-operator version

1.22.4

Kubernetes version

1.29.1

Method of installation

Helm

Anything else we need to know?

Error log from the server-system-logger container, which carries the log output of Cassandra itself:

ERROR [COMMIT-LOG-ALLOCATOR] 2024-11-07 18:16:46,295 JVMStabilityInspector.java:201 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: /opt/cassandra/data/commitlog/CommitLog-8-1731003406282.log: Invalid argument
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:165)
    at org.apache.cassandra.db.commitlog.DirectIOSegment.<init>(DirectIOSegment.java:57)
    at org.apache.cassandra.db.commitlog.DirectIOSegment$DirectIOSegmentBuilder.build(DirectIOSegment.java:179)
    at org.apache.cassandra.db.commitlog.DirectIOSegment$DirectIOSegmentBuilder.build(DirectIOSegment.java:160)
    at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager.createSegment(AbstractCommitLogSegmentManager.java:277)
    at org.apache.cassandra.db.commitlog.CommitLogSegmentManagerStandard.createSegment(CommitLogSegmentManagerStandard.java:65)
    at org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager$AllocatorRunnable.run(AbstractCommitLogSegmentManager.java:189)
    at org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:121)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.nio.file.FileSystemException: /opt/cassandra/data/commitlog/CommitLog-8-1731003406282.log: Invalid argument
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
    at org.apache.cassandra.db.commitlog.DirectIOSegment$DirectIOSegmentBuilder.lambda$build$0(DirectIOSegment.java:180)
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:161)
    ... 9 common frames omitted


burmanm commented 2 weeks ago

Those errors look like something is wrong in your Kubernetes environment. "Invalid argument" is returned when the filesystem is unable to do something (in this case, allocate a segment on the disk). This isn't directly related to cass-operator or management-api, as these functions depend on your StorageClass / CSI driver / Kubernetes / Linux / filesystem / etc.

Perhaps it is something as simple as running out of disk space, or a defective disk?

I tested 5.0.2 on multiple systems and they all worked fine.

kos-team commented 1 week ago

After some debugging, we found that the root cause is the file system we were running on. We reproduced the problem on a Kind Kubernetes cluster with the default local-storage CSI driver. The host OS is Linux, but we were running everything on a tmpfs filesystem. When we switched Kind to a normal ext4 file system, 5.0.2 worked fine.

We are curious what changed in Cassandra 5.0.2 that made it incompatible with the tmpfs file system.

burmanm commented 4 days ago

I do not know, but I can make a guess. In 5.0, they introduced DIRECT_IO as the default access mode for the commit log, replacing mmap, whenever direct I/O is available for the target disk.

https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L1485

I don't think the logic works correctly for tmpfs in this case, as it only checks the available blockSize by creating a stub file. tmpfs probably returns a value in the accepted range (> 0), but tmpfs itself does not support DIRECT_IO, so the real writes would fail when using that method.

As far as I understand tmpfs, it already lives in the page cache, and DIRECT_IO means bypassing the page cache. So in that sense, I wonder where the writes would end up.

You might get tmpfs working if you manually set the commitlog disk access mode to mmap or standard (with performance caveats, of course).
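As a rough, untested sketch of that workaround, the override could go into the cassandra-yaml block of the CR from the report. This assumes the Cassandra 5.0 setting is named commitlog_disk_access_mode and that cass-operator passes the key through to cassandra.yaml unchanged; I have not verified either against tmpfs. Only the added key is shown, the rest of the spec stays as in the original CR.

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: test-cluster
spec:
  config:
    cassandra-yaml:
      # Assumed key name for the Cassandra 5.0 commit log access mode.
      # mmap is the pre-5.0 behaviour for an uncompressed, unencrypted commit log,
      # so it avoids opening segments with direct I/O.
      commitlog_disk_access_mode: mmap
```

I would try mmap first. As far as I recall, the standard mode is only meant for a compressed or encrypted commit log, so it may be rejected with this configuration.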