Open JBOClara opened 1 month ago
We have also encountered the same issue when enabling `autoScheduling` for Reaper. After further checking the logs from mgmt-api, I think this is due to Reaper having an invalid combination of default parameters (this only happens for Cassandra 4.x) when setting up automatic schedules. The error that led me to think this:
INFO [epollEventLoopGroup-5-3] 2024-07-31 08:44:16,274 RpcMethod41x.java:138 - Failed to execute method NodeOps.repair
java.lang.reflect.InvocationTargetException: null
at jdk.internal.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at com.datastax.mgmtapi.rpc.RpcMethod41x.execute(RpcMethod41x.java:130)
at com.datastax.mgmtapi.rpc.RpcMethod41x.execute(RpcMethod41x.java:33)
at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.lambda$handle$1(QueryHandlerInterceptor.java:120)
at com.datastax.mgmtapi.shims.CassandraAPI.handleRpcResult(CassandraAPI.java:73)
at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.handle(QueryHandlerInterceptor.java:120)
at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.intercept(QueryHandlerInterceptor.java:80)
at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java)
at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:116)
at org.apache.cassandra.transport.Message$Request.execute(Message.java:255)
<redacted>
Caused by: java.io.IOException: Invalid repair combination. Incremental repair if Parallelism is not set
at com.datastax.mgmtapi.NodeOpsProvider.repair(NodeOpsProvider.java:824)
... 43 common frames omitted
The `K8ssandraCluster` CRD has `autoScheduling.repairType` set to `AUTO`, which for Cassandra 4.x behaves as `INCREMENTAL` and sets up the schedules accordingly.
From the Reaper docs we understand that for an incremental repair the only allowed value for `repairParallelism` is `PARALLEL`:
Sets the default repair type unless specifically defined for each run. Note that this is only supported with the PARALLEL repairParallelism setting. For more details in incremental repair, please refer to the following article. http://www.datastax.com/dev/blog/more-efficient-repairs
This is checked by the management-api here which indeed throws the error that I'm seeing.
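The rule described above can be sketched roughly as follows. This is only an illustration in Python, not the management-api's actual Java code: the function and exception names are hypothetical, and the assumption that `AUTO` resolves to `ADAPTIVE` on Cassandra 3.x is not shown in this issue.

```python
class InvalidRepairCombination(ValueError):
    """Raised when an incremental repair is requested without PARALLEL parallelism."""


def resolve_repair_type(repair_type: str, cassandra_major: int) -> str:
    """Resolve AUTO to a concrete repair type (hypothetical helper)."""
    if repair_type != "AUTO":
        return repair_type
    # Assumption: AUTO picks INCREMENTAL on Cassandra 4.x and ADAPTIVE on 3.x.
    return "INCREMENTAL" if cassandra_major >= 4 else "ADAPTIVE"


def check_repair_combination(repair_type: str, parallelism: str) -> None:
    """Reject the combination that produces the error in the stack trace."""
    if repair_type == "INCREMENTAL" and parallelism != "PARALLEL":
        raise InvalidRepairCombination(
            f"incremental repair requires PARALLEL, got {parallelism}"
        )


if __name__ == "__main__":
    effective = resolve_repair_type("AUTO", cassandra_major=4)
    try:
        check_repair_combination(effective, "DATACENTER_AWARE")
    except InvalidRepairCombination as exc:
        print(f"rejected: {exc}")
```

With `AUTO` on Cassandra 4.x and a default parallelism of `DATACENTER_AWARE`, the check fails, which matches the behaviour seen in the logs.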
Exec-ing into a Reaper pod to check its configuration, we see that `/etc/cassandra-reaper/config/cassandra-reaper.yml` sets `repairParallelism` to the value of an env variable called `REAPER_REPAIR_PARALELLISM`. The value of that variable is:
`REAPER_REPAIR_PARALELLISM=DATACENTER_AWARE`
We can further check this by looking at the reaper tables inside cassandra:
prod-superuser@cqlsh> use reaper_db;
prod-superuser@cqlsh:reaper_db> select * from repair_schedule_v1;
id | adaptive | creation_time | days_between | intensity | last_run | next_activation | owner | pause_time | percent_unrepaired_threshold | repair_parallelism | repair_unit_id | run_history | segment_count | segment_count_per_node | state
--------------------------------------+----------+---------------------------------+--------------+-----------+--------------------------------------+---------------------------------+-----------------+---------------------------------+------------------------------+--------------------+--------------------------------------+-------------+---------------+------------------------+--------
b3cb2180-4e7c-11ef-9f1c-4d0488525d6c | False | 2024-07-30 14:04:57.112000+0000 | 7 | 0.9 | 2db1ccf0-4e97-11ef-92a3-c328b392dd6d | 2024-08-06 17:08:38.033000+0000 | auto-scheduling | 2024-07-30 14:10:43.136000+0000 | 10 | dc_parallel | b3c9e900-4e7c-11ef-9f1c-4d0488525d6c | null | null | 64 | ACTIVE
b3d533a0-4e7c-11ef-9f1c-4d0488525d6c | False | 2024-07-30 14:04:57.178000+0000 | 7 | 0.9 | null | 2024-07-30 20:09:57.150000+0000 | auto-scheduling | null | 10 | dc_parallel | b3d337d0-4e7c-11ef-9f1c-4d0488525d6c | null | null | 64 | ACTIVE
which confirms that the default parallelism was set to `dc_parallel`, i.e. `DATACENTER_AWARE`.
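To run the same check across many schedules, the cqlsh output can be parsed with a small script. The sketch below is only an illustration: the helper names are hypothetical, and the sample is an abbreviated copy of the table above (most columns dropped). It flags every schedule whose `repair_parallelism` is not `parallel`; it does not know whether a schedule is incremental, since that is not a column of this table.

```python
# Abbreviated copy of the cqlsh output above (only four columns kept).
SAMPLE = """
 id                                   | owner           | repair_parallelism | state
--------------------------------------+-----------------+--------------------+--------
 b3cb2180-4e7c-11ef-9f1c-4d0488525d6c | auto-scheduling | dc_parallel        | ACTIVE
 b3d533a0-4e7c-11ef-9f1c-4d0488525d6c | auto-scheduling | dc_parallel        | ACTIVE
"""


def parse_cqlsh_table(text: str) -> list[dict[str, str]]:
    """Turn pipe-delimited cqlsh output into a list of row dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [col.strip() for col in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        if set(line.strip()) <= set("-+"):
            continue  # skip the ----+---- separator row
        values = [cell.strip() for cell in line.split("|")]
        if len(values) == len(header):
            rows.append(dict(zip(header, values)))
    return rows


def non_parallel_schedules(rows: list[dict[str, str]]) -> list[str]:
    """Return the ids of schedules whose parallelism is not 'parallel'."""
    return [r["id"] for r in rows if r.get("repair_parallelism") != "parallel"]


if __name__ == "__main__":
    for schedule_id in non_parallel_schedules(parse_cqlsh_table(SAMPLE)):
        print(schedule_id)
```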
My confusion comes from where this variable is set. From my limited research, it is not specified in the Reaper deployment, it is not inside the Dockerfile, and it cannot be configured from the CRD.
For possible workarounds, I see the following:
- Set `autoScheduling.repairType` inside the CRD to `REGULAR`, because `ADAPTIVE` is only recommended for Cassandra 3.x
- `repair_parallelism` to `parallel`
- `REAPER_REPAIR_PARALELLISM=PARALLEL`
- `reaper_db` keyspace

@adejanovski what do you think?
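For the first workaround (switching `repairType` to `REGULAR`), a merge patch against the `K8ssandraCluster` could be generated like this. It is a sketch under assumptions: the helper names are hypothetical, the resource name `cassandra` and namespace `k8ssandra-operator` are taken from the manifests in this issue, and whether already-created schedules get recreated after the change is not confirmed here.

```python
import json
import shlex


def repair_type_patch(repair_type: str = "REGULAR") -> dict:
    """Build a JSON merge patch for spec.reaper.autoScheduling.repairType."""
    return {"spec": {"reaper": {"autoScheduling": {"repairType": repair_type}}}}


def kubectl_patch_command(name: str, namespace: str, patch: dict) -> str:
    """Render a shell-safe `kubectl patch --type merge` command string."""
    return " ".join([
        "kubectl", "patch", "k8ssandracluster", shlex.quote(name),
        "-n", shlex.quote(namespace),
        "--type", "merge",
        "-p", shlex.quote(json.dumps(patch)),
    ])


if __name__ == "__main__":
    # Names below come from the manifests in this issue.
    print(kubectl_patch_command("cassandra", "k8ssandra-operator", repair_type_patch()))
```

The script only prints the command; it does not touch the cluster, so the output can be reviewed before running it.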
What happened?
Cassandra container shows the following error in the logs:
Did you expect to see something different?
`/api/v2/repairs status=500 Internal Server Error` should return with a 200.
How to reproduce it (as minimally and precisely as possible):
Visible in the cassandra logs
Environment
this error is visible with:
and
Image hash
```
k describe po -n k8ssandra-operator | grep "Image ID:" | sort -u
Image ID: cr.k8ssandra.io/k8ssandra/cass-management-api@sha256:e606bae0bd49e794dffdb508bd461e6734e8bba415ac30f2f58742f647fab38c
Image ID: cr.k8ssandra.io/k8ssandra/system-logger@sha256:a25251eb74ca08dc87d5ceb3d22bfcb7ac93c1ec7b673c3ce2f8c7bc32769c1f
Image ID: docker.io/k8ssandra/medusa@sha256:1a8e63b9dd49744cf13678584f9558c6452ed1b160de17c149174d6035e053d7
Image ID: docker.io/k8ssandra/medusa@sha256:4f2991f88c92441bd6ed5034c4a0cdab94b52e37590183753b2b5786eb25abd9
Image ID: docker.io/thelastpickle/cassandra-reaper@sha256:9e84f87108994d63bc76cec25b2cdd2e1f02072585f825fd2ca493b09371fc38
Image ID: docker.io/timberio/vector@sha256:13779856a8afe8240a1549208040dec12a50cd9b9d98b577d9327d2c212499d8
Image ID: cr.k8ssandra.io/k8ssandra/cass-management-api@sha256:e606bae0bd49e794dffdb508bd461e6734e8bba415ac30f2f58742f647fab38c
Image ID: cr.k8ssandra.io/k8ssandra/cass-operator@sha256:d851410079654d6f0acd55d220f647f042d7691dd28a6b3866efcc120c34aeae
Image ID: cr.k8ssandra.io/k8ssandra/k8ssandra-client@sha256:4cd4f97e74ea4ce256cb55aa166039471b977c5c4f75e92971d012579146b050
Image ID: cr.k8ssandra.io/k8ssandra/k8ssandra-operator@sha256:00cd1e0bab61aba16df7edcfbcdab5aa5c9d6c29d3656d1e467aca312090890d
Image ID: docker.io/bitnami/kubectl@sha256:f5fc0d561d9ef931f9ecb2e8b65d93eb92767c57f64897c56a100bfe28102c74
Image ID: docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
Image ID: docker.io/thelastpickle/cassandra-reaper@sha256:9e84f87108994d63bc76cec25b2cdd2e1f02072585f825fd2ca493b09371fc38
```
Kubernetes version information:
kubectl version
And:
EKS
Manifests
```
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  annotations:
    eks.amazonaws.com/skip-containers: cassandra,server-system-logger,server-config-init
  finalizers:
  - finalizer.cassandra.datastax.com
  generation: 1
  labels:
    app.kubernetes.io/component: cassandra
    app.kubernetes.io/name: k8ssandra-operator
    app.kubernetes.io/part-of: k8ssandra
    k8ssandra.io/cleaned-up-by: k8ssandracluster-controller
    k8ssandra.io/cluster-name: cassandra
    k8ssandra.io/cluster-namespace: k8ssandra-operator
  name: us-east
  namespace: k8ssandra-operator
spec:
  additionalServiceConfig:
    additionalSeedService: {}
    allpodsService: {}
    dcService: {}
    nodePortService: {}
    seedService: {}
  clusterName: cassandra
  config:
    cassandra-env-sh:
      additional-jvm-opts:
      - -Dcassandra.allow_alter_rf_during_range_movement=true
      - -Dcassandra.system_distributed_replication=us-east:3
      - -Dcassandra.jmx.authorizer=org.apache.cassandra.auth.jmx.AuthorizationProxy
      - -Djava.security.auth.login.config=$CASSANDRA_HOME/conf/cassandra-jaas.config
      - -Dcassandra.jmx.remote.login.config=CassandraLogin
      - -Dcom.sun.management.jmxremote.authenticate=true
      - -Djavax.net.ssl.trustStore=/mnt/client-truststore/truststore
      - -Djavax.net.ssl.keyStore=/mnt/client-keystore/keystore
      - -Djavax.net.debug=ssl
      - -Dcom.sun.management.jmxremote.registry.ssl=true
      - -Dcassandra.consistent.rangemovement=false
      - -Dcom.sun.management.jmxremote.ssl.need.client.auth=true
      - -Dcom.sun.management.jmxremote.registry.ssl=true
      - -Dcom.sun.management.jmxremote.ssl=true
      - -Dcassandra.allow_new_old_config_keys=true
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      auto_bootstrap: true
      auto_snapshot: true
      batch_size_fail_threshold: 1500KiB
      batch_size_warn_threshold: 10KiB
      client_encryption_options:
        enabled: true
        keystore: /mnt/client-keystore/keystore
        keystore_password: READACTED
        optional: false
        require_client_auth: false
        truststore: /mnt/client-truststore/truststore
        truststore_password: READACTED
      concurrent_counter_writes: 64
      concurrent_materialized_view_writes: 64
      concurrent_reads: 64
      concurrent_writes: 64
      counter_cache_size: 50MiB
      materialized_views_enabled: true
      native_transport_port: 9042
      num_tokens: 256
      range_request_timeout: 10000ms
      read_request_timeout: 15000ms
      request_timeout: 20000ms
      role_manager: CassandraRoleManager
      server_encryption_options:
        internode_encryption: all
        keystore: /mnt/server-keystore/keystore
        keystore_password: READACTED
        require_client_auth: false
        truststore: /mnt/server-truststore/truststore
        truststore_password: READACTED
      write_request_timeout: 2000ms
    jvm-server-options:
      initial_heap_size: 4294967296
      jmx-connection-type: local-no-auth
      jmx-port: 7199
      jmx-remote-ssl: true
      max_heap_size: 4294967296
    jvm11-server-options:
      garbage_collector: G1GC
  configBuilderResources: {}
  managementApiAuth: {}
  networking: {}
  podTemplateSpec:
    metadata: {}
    spec:
      containers:
      - env:
        - name: LOCAL_JMX
          value: "no"
        - name: MANAGEMENT_API_HEAP_SIZE
          value: "128000000"
        - name: MGMT_API_DISABLE_MCAC
          value: "true"
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/liveness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 230
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 10
        name: cassandra
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/readiness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 270
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources: {}
        volumeMounts:
        - mountPath: /crypto
          name: certs
        - mountPath: /home/cassandra/.cassandra/cqlshrc
          name: cqlsh-config
          subPath: cqlshrc
        - mountPath: /home/cassandra/.cassandra/nodetool-ssl.properties
          name: nodetool-config
          subPath: nodetool-ssl.properties
        - mountPath: /mnt/client-keystore
          name: client-keystore
        - mountPath: /mnt/client-truststore
          name: client-truststore
        - mountPath: /mnt/server-keystore
          name: server-keystore
        - mountPath: /mnt/server-truststore
          name: server-truststore
      - name: server-system-logger
        resources: {}
      - env:
        - name: MEDUSA_MODE
          value: GRPC
        - name: MEDUSA_TMP_DIR
          value: /var/lib/cassandra
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CQL_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: cassandra-medusa
        - name: CQL_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: cassandra-medusa
        image: docker.io/k8ssandra/medusa:0.21.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - --addr=:50051
          failureThreshold: 10
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: medusa
        ports:
        - containerPort: 50051
          name: grpc
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - --addr=:50051
          failureThreshold: 10
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 10m
            memory: 116Mi
        volumeMounts:
        - mountPath: /etc/cassandra
          name: server-config
        - mountPath: /var/lib/cassandra
          name: server-data
        - mountPath: /etc/medusa
          name: cassandra-medusa
        - mountPath: /etc/podinfo
          name: podinfo
        - mountPath: /etc/certificates
          name: certificates
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=1048575
        image: busybox:1.28
        name: sysctl
        resources: {}
        securityContext:
          privileged: true
      - name: server-config-init
        resources: {}
      - env:
        - name: MEDUSA_MODE
          value: RESTORE
        - name: MEDUSA_TMP_DIR
          value: /var/lib/cassandra
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CQL_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: cassandra-medusa
        - name: CQL_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: cassandra-medusa
        image: docker.io/k8ssandra/medusa:0.21.0
        imagePullPolicy: IfNotPresent
        name: medusa-restore
        resources:
          limits:
            memory: 8Gi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /etc/cassandra
          name: server-config
        - mountPath: /var/lib/cassandra
          name: server-data
        - mountPath: /etc/medusa
          name: cassandra-medusa
        - mountPath: /etc/podinfo
          name: podinfo
        - mountPath: /etc/certificates
          name: certificates
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
      - configMap:
          name: cqlsh-config
        name: cqlsh-config
      - configMap:
          name: nodetool-config
        name: nodetool-config
      - name: client-keystore
        secret:
          items:
          - key: keystore.jks
            path: keystore
          secretName: cassandra-jks-keystore
      - name: client-truststore
        secret:
          items:
          - key: truststore.jks
            path: truststore
          secretName: cassandra-jks-keystore
      - name: server-keystore
        secret:
          items:
          - key: keystore.jks
            path: keystore
          secretName: cassandra-jks-keystore
      - name: server-truststore
        secret:
          items:
          - key: truststore.jks
            path: truststore
          secretName: cassandra-jks-keystore
      - configMap:
          name: cassandra-medusa
        name: cassandra-medusa
      - downwardAPI:
          items:
          - fieldRef:
              fieldPath: metadata.labels
            path: labels
        name: podinfo
      - name: certificates
        secret:
          secretName: medusa-certificates
  racks:
  - name: 1a
    nodeAffinityLabels:
      topology.kubernetes.io/zone: us-east-1a
  - name: 1d
    nodeAffinityLabels:
      topology.kubernetes.io/zone: us-east-1b
  - name: 1c
    nodeAffinityLabels:
      topology.kubernetes.io/zone: us-east-1c
  resources:
    limits:
      memory: 9Gi
    requests:
      cpu: "1"
      memory: 9Gi
  serverType: cassandra
  serverVersion: 4.1.4
  size: 3
  storageConfig:
    additionalVolumes:
    - mountPath: /etc/vector
      name: vector-config
      volumeSource:
        configMap:
          name: cassandra-us-east-cass-vector
    - mountPath: /opt/management-api/configs
      name: metrics-agent-config
      volumeSource:
        configMap:
          items:
          - key: metrics-collector.yaml
            path: metrics-collector.yaml
          name: cassandra-us-east-metrics-agent-config
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 300Gi
      storageClassName: ebs-xfs-sc
  superuserSecretName: cassandra-superuser
  systemLoggerResources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
  users:
  - secretName: cassandra-reaper
    superuser: true
  - secretName: cassandra-medusa
    superuser: true
```

```
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  annotations:
    config.kubernetes.io/origin: |
      path: ../../base/k8ssandra-encrypted.yaml
    k8ssandra.io/initial-system-replication: '{"us-east":3}'
  finalizers:
  - k8ssandracluster.k8ssandra.io/finalizer
  generation: 5
  name: cassandra
  namespace: k8ssandra-operator
spec:
  auth: true
  cassandra:
    clientEncryptionStores:
      keystorePasswordSecretRef:
        name: jks-password
      keystoreSecretRef:
        key: keystore.jks
        name: cassandra-jks-keystore
      truststorePasswordSecretRef:
        name: jks-password
      truststoreSecretRef:
        key: truststore.jks
        name: cassandra-jks-keystore
    config:
      cassandraYaml:
        authenticator: PasswordAuthenticator
        authorizer: CassandraAuthorizer
        auto_bootstrap: true
        auto_snapshot: true
        batch_size_fail_threshold: 1500KiB
        batch_size_warn_threshold: 10KiB
        client_encryption_options:
          enabled: true
          optional: false
          require_client_auth: false
        concurrent_counter_writes: 64
        concurrent_materialized_view_writes: 64
        concurrent_reads: 64
        concurrent_writes: 64
        counter_cache_size: 50MiB
        materialized_views_enabled: true
        native_transport_port: 9042
        num_tokens: 256
        range_request_timeout: 10000ms
        read_request_timeout: 15000ms
        request_timeout: 20000ms
        server_encryption_options:
          internode_encryption: all
          require_client_auth: false
        write_request_timeout: 2000ms
      jvmOptions:
        additionalOptions:
        - -Djavax.net.debug=ssl
        - -Dcom.sun.management.jmxremote.registry.ssl=true
        - -Dcassandra.consistent.rangemovement=false
        - -Dcom.sun.management.jmxremote.ssl.need.client.auth=true
        - -Dcom.sun.management.jmxremote.registry.ssl=true
        - -Dcom.sun.management.jmxremote.ssl=true
        - -Dcassandra.allow_new_old_config_keys=true
        gc: G1GC
        heap_initial_size: 4Gi
        heap_max_size: 4Gi
        jmx_connection_type: local-no-auth
        jmx_port: 7199
        jmx_remote_ssl: true
    containers:
    - livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /api/v0/probes/liveness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 230
        periodSeconds: 15
        successThreshold: 1
        timeoutSeconds: 10
      name: cassandra
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /api/v0/probes/readiness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 270
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 10
      volumeMounts:
      - mountPath: /crypto
        name: certs
      - mountPath: /home/cassandra/.cassandra/cqlshrc
        name: cqlsh-config
        subPath: cqlshrc
      - mountPath: /home/cassandra/.cassandra/nodetool-ssl.properties
        name: nodetool-config
        subPath: nodetool-ssl.properties
    datacenters:
    - initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=1048575
        image: busybox:1.28
        name: sysctl
        securityContext:
          privileged: true
      metadata:
        name: us-east
      perNodeConfigInitContainerImage: mikefarah/yq:4
      racks:
      - name: 1a
        nodeAffinityLabels:
          topology.kubernetes.io/zone: us-east-1a
      - name: 1d
        nodeAffinityLabels:
          topology.kubernetes.io/zone: us-east-1b
      - name: 1c
        nodeAffinityLabels:
          topology.kubernetes.io/zone: us-east-1c
      resources:
        limits:
          memory: 9Gi
        requests:
          cpu: 1
          memory: 9Gi
      size: 3
      stopped: false
    extraVolumes:
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
      - configMap:
          name: cqlsh-config
        name: cqlsh-config
      - configMap:
          name: nodetool-config
        name: nodetool-config
    metadata:
      annotations:
        eks.amazonaws.com/skip-containers: cassandra,server-system-logger,server-config-init
    mgmtAPIHeap: 128M
    networking:
      hostNetwork: false
    perNodeConfigInitContainerImage: mikefarah/yq:4
    serverEncryptionStores:
      keystorePasswordSecretRef:
        name: jks-password
      keystoreSecretRef:
        key: keystore.jks
        name: cassandra-jks-keystore
      truststorePasswordSecretRef:
        name: jks-password
      truststoreSecretRef:
        key: truststore.jks
        name: cassandra-jks-keystore
    serverType: cassandra
    serverVersion: 4.1.4
    softPodAntiAffinity: false
    storageConfig:
      cassandraDataVolumeClaimSpec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 300Gi
        storageClassName: ebs-xfs-sc
    telemetry:
      mcac:
        enabled: false
      prometheus:
        enabled: true
      vector:
        components:
          sinks:
          - config: |
              target = "stdout"
              [sinks.console_output.encoding]
              codec = "json"
            inputs:
            - cassandra_metrics
            name: console_output
            type: console
        enabled: true
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
        scrapeInterval: 30s
  medusa:
    certificatesSecretRef:
      name: medusa-certificates
    containerImage:
      name: medusa
      registry: docker.io
      repository: k8ssandra
      tag: 0.21.0
    containerResources:
      limits:
        memory: 512Mi
      requests:
        cpu: 10m
        memory: 116Mi
    storageProperties:
      bucketName: dow-backups
      concurrentTransfers: 10
      credentialsType: role-based
      maxBackupAge: 0
      maxBackupCount: 0
      multiPartUploadThreshold: 104857600
      prefix: cassandra-tests
      region: us-east-1
      secure: true
      storageProvider: s3
      storageSecretRef:
        name: ""
      transferMaxBandwidth: 90MB/s
  reaper:
    ServiceAccountName: default
    autoScheduling:
      enabled: true
      initialDelayPeriod: PT15S
      percentUnrepairedThreshold: 10
      periodBetweenPolls: PT10M
      repairType: AUTO
      scheduleSpreadPeriod: PT6H
      timeBeforeFirstSchedule: PT5M
    containerImage:
      name: cassandra-reaper
      repository: thelastpickle
      tag: 3.6.0
    deploymentMode: SINGLE
    heapSize: 2Gi
    httpManagement:
      enabled: true
    keyspace: reaper_db
    secretsProvider: internal
    telemetry:
      cassandra:
        endpoint:
          address: 0.0.0.0
      mcac:
        enabled: false
      prometheus:
        enabled: true
      vector:
        enabled: true
        resources:
          limits:
            cpu: 100m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
  secretsProvider: internal
```
Anything else we need to know?:
No