k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Failed to execute method NodeOps.repair #1370

Open JBOClara opened 1 month ago

JBOClara commented 1 month ago

What happened?

The Cassandra container shows the following error in its logs:

com.datastax.oss.driver.api.core.servererrors.ServerError: Failed to execute method NodeOps.repair

Did you expect to see something different?

Requests to /api/v2/repairs should return 200 OK instead of 500 Internal Server Error.

How to reproduce it (as minimally and precisely as possible):

The error is visible in the Cassandra container logs.
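For reference, a minimal sketch of exercising the failing endpoint by hand from inside a Cassandra pod (hedged: the pod name is environment-specific, and the request body fields follow the management-api v2 repair request shape as an assumption):

```
# Sketch: call the management-api repair endpoint directly and print the HTTP status.
# Pod name is hypothetical; body fields are assumptions, not confirmed from the source.
kubectl exec -n k8ssandra-operator cassandra-us-east-1a-sts-0 -c cassandra -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  -X PUT http://localhost:8080/api/v2/repairs \
  -H 'Content-Type: application/json' \
  -d '{"keyspace_name": "reaper_db", "full_repair": false}'
# returns 500 when the repair request is rejected
```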

Environment

This error is visible with:

```
helm ls -A -a -d | grep k8ss
k8ssandra-operator          k8ssandra-operator  1           2024-05-22 17:13:25.033002 +0200 CEST       deployed    k8ssandra-operator-1.16.0                   1.16.0
```

and

```
k8ssandra-operator      k8ssandra-operator  29          2024-07-13 17:55:28.039314 +0200 CEST   deployed    k8ssandra-operator-1.17.0           1.17.0
```

```
k describe po -n k8ssandra-operator | grep "Image:" | sort -u
    Image:          cr.k8ssandra.io/k8ssandra/cass-management-api:4.1.4
    Image:          cr.k8ssandra.io/k8ssandra/system-logger:v1.21.0
    Image:          docker.io/k8ssandra/medusa:0.19.1
    Image:          docker.io/k8ssandra/medusa:0.21.0
    Image:          docker.io/thelastpickle/cassandra-reaper:3.5.0
    Image:          timberio/vector:0.26.0-alpine
    Image:         bitnami/kubectl:1.29.3
    Image:         busybox:1.28
    Image:         cr.k8ssandra.io/k8ssandra/cass-management-api:4.1.4
    Image:         cr.k8ssandra.io/k8ssandra/cass-operator:v1.21.0
    Image:         cr.k8ssandra.io/k8ssandra/k8ssandra-client:v0.4.0
    Image:         cr.k8ssandra.io/k8ssandra/k8ssandra-operator:v1.17.0
    Image:         docker.io/thelastpickle/cassandra-reaper:3.5.0
```
Image hash:

```
k describe po -n k8ssandra-operator | grep "Image ID:" | sort -u
    Image ID:      cr.k8ssandra.io/k8ssandra/cass-management-api@sha256:e606bae0bd49e794dffdb508bd461e6734e8bba415ac30f2f58742f647fab38c
    Image ID:      cr.k8ssandra.io/k8ssandra/system-logger@sha256:a25251eb74ca08dc87d5ceb3d22bfcb7ac93c1ec7b673c3ce2f8c7bc32769c1f
    Image ID:      docker.io/k8ssandra/medusa@sha256:1a8e63b9dd49744cf13678584f9558c6452ed1b160de17c149174d6035e053d7
    Image ID:      docker.io/k8ssandra/medusa@sha256:4f2991f88c92441bd6ed5034c4a0cdab94b52e37590183753b2b5786eb25abd9
    Image ID:      docker.io/thelastpickle/cassandra-reaper@sha256:9e84f87108994d63bc76cec25b2cdd2e1f02072585f825fd2ca493b09371fc38
    Image ID:      docker.io/timberio/vector@sha256:13779856a8afe8240a1549208040dec12a50cd9b9d98b577d9327d2c212499d8
    Image ID:      cr.k8ssandra.io/k8ssandra/cass-management-api@sha256:e606bae0bd49e794dffdb508bd461e6734e8bba415ac30f2f58742f647fab38c
    Image ID:      cr.k8ssandra.io/k8ssandra/cass-operator@sha256:d851410079654d6f0acd55d220f647f042d7691dd28a6b3866efcc120c34aeae
    Image ID:      cr.k8ssandra.io/k8ssandra/k8ssandra-client@sha256:4cd4f97e74ea4ce256cb55aa166039471b977c5c4f75e92971d012579146b050
    Image ID:      cr.k8ssandra.io/k8ssandra/k8ssandra-operator@sha256:00cd1e0bab61aba16df7edcfbcdab5aa5c9d6c29d3656d1e467aca312090890d
    Image ID:      docker.io/bitnami/kubectl@sha256:f5fc0d561d9ef931f9ecb2e8b65d93eb92767c57f64897c56a100bfe28102c74
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
    Image ID:      docker.io/thelastpickle/cassandra-reaper@sha256:9e84f87108994d63bc76cec25b2cdd2e1f02072585f825fd2ca493b09371fc38
```

And:

```
kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b
```

EKS

Manifests:

```
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  annotations:
    eks.amazonaws.com/skip-containers: cassandra,server-system-logger,server-config-init
  finalizers:
  - finalizer.cassandra.datastax.com
  generation: 1
  labels:
    app.kubernetes.io/component: cassandra
    app.kubernetes.io/name: k8ssandra-operator
    app.kubernetes.io/part-of: k8ssandra
    k8ssandra.io/cleaned-up-by: k8ssandracluster-controller
    k8ssandra.io/cluster-name: cassandra
    k8ssandra.io/cluster-namespace: k8ssandra-operator
  name: us-east
  namespace: k8ssandra-operator
spec:
  additionalServiceConfig:
    additionalSeedService: {}
    allpodsService: {}
    dcService: {}
    nodePortService: {}
    seedService: {}
  clusterName: cassandra
  config:
    cassandra-env-sh:
      additional-jvm-opts:
      - -Dcassandra.allow_alter_rf_during_range_movement=true
      - -Dcassandra.system_distributed_replication=us-east:3
      - -Dcassandra.jmx.authorizer=org.apache.cassandra.auth.jmx.AuthorizationProxy
      - -Djava.security.auth.login.config=$CASSANDRA_HOME/conf/cassandra-jaas.config
      - -Dcassandra.jmx.remote.login.config=CassandraLogin
      - -Dcom.sun.management.jmxremote.authenticate=true
      - -Djavax.net.ssl.trustStore=/mnt/client-truststore/truststore
      - -Djavax.net.ssl.keyStore=/mnt/client-keystore/keystore
      - -Djavax.net.debug=ssl
      - -Dcom.sun.management.jmxremote.registry.ssl=true
      - -Dcassandra.consistent.rangemovement=false
      - -Dcom.sun.management.jmxremote.ssl.need.client.auth=true
      - -Dcom.sun.management.jmxremote.registry.ssl=true
      - -Dcom.sun.management.jmxremote.ssl=true
      - -Dcassandra.allow_new_old_config_keys=true
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      auto_bootstrap: true
      auto_snapshot: true
      batch_size_fail_threshold: 1500KiB
      batch_size_warn_threshold: 10KiB
      client_encryption_options:
        enabled: true
        keystore: /mnt/client-keystore/keystore
        keystore_password: REDACTED
        optional: false
        require_client_auth: false
        truststore: /mnt/client-truststore/truststore
        truststore_password: REDACTED
      concurrent_counter_writes: 64
      concurrent_materialized_view_writes: 64
      concurrent_reads: 64
      concurrent_writes: 64
      counter_cache_size: 50MiB
      materialized_views_enabled: true
      native_transport_port: 9042
      num_tokens: 256
      range_request_timeout: 10000ms
      read_request_timeout: 15000ms
      request_timeout: 20000ms
      role_manager: CassandraRoleManager
      server_encryption_options:
        internode_encryption: all
        keystore: /mnt/server-keystore/keystore
        keystore_password: REDACTED
        require_client_auth: false
        truststore: /mnt/server-truststore/truststore
        truststore_password: REDACTED
      write_request_timeout: 2000ms
    jvm-server-options:
      initial_heap_size: 4294967296
      jmx-connection-type: local-no-auth
      jmx-port: 7199
      jmx-remote-ssl: true
      max_heap_size: 4294967296
    jvm11-server-options:
      garbage_collector: G1GC
  configBuilderResources: {}
  managementApiAuth: {}
  networking: {}
  podTemplateSpec:
    metadata: {}
    spec:
      containers:
      - env:
        - name: LOCAL_JMX
          value: "no"
        - name: MANAGEMENT_API_HEAP_SIZE
          value: "128000000"
        - name: MGMT_API_DISABLE_MCAC
          value: "true"
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/liveness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 230
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 10
        name: cassandra
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/readiness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 270
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources: {}
        volumeMounts:
        - mountPath: /crypto
          name: certs
        - mountPath: /home/cassandra/.cassandra/cqlshrc
          name: cqlsh-config
          subPath: cqlshrc
        - mountPath: /home/cassandra/.cassandra/nodetool-ssl.properties
          name: nodetool-config
          subPath: nodetool-ssl.properties
        - mountPath: /mnt/client-keystore
          name: client-keystore
        - mountPath: /mnt/client-truststore
          name: client-truststore
        - mountPath: /mnt/server-keystore
          name: server-keystore
        - mountPath: /mnt/server-truststore
          name: server-truststore
      - name: server-system-logger
        resources: {}
      - env:
        - name: MEDUSA_MODE
          value: GRPC
        - name: MEDUSA_TMP_DIR
          value: /var/lib/cassandra
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CQL_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: cassandra-medusa
        - name: CQL_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: cassandra-medusa
        image: docker.io/k8ssandra/medusa:0.21.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - --addr=:50051
          failureThreshold: 10
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: medusa
        ports:
        - containerPort: 50051
          name: grpc
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - --addr=:50051
          failureThreshold: 10
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 10m
            memory: 116Mi
        volumeMounts:
        - mountPath: /etc/cassandra
          name: server-config
        - mountPath: /var/lib/cassandra
          name: server-data
        - mountPath: /etc/medusa
          name: cassandra-medusa
        - mountPath: /etc/podinfo
          name: podinfo
        - mountPath: /etc/certificates
          name: certificates
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=1048575
        image: busybox:1.28
        name: sysctl
        resources: {}
        securityContext:
          privileged: true
      - name: server-config-init
        resources: {}
      - env:
        - name: MEDUSA_MODE
          value: RESTORE
        - name: MEDUSA_TMP_DIR
          value: /var/lib/cassandra
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CQL_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: cassandra-medusa
        - name: CQL_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: cassandra-medusa
        image: docker.io/k8ssandra/medusa:0.21.0
        imagePullPolicy: IfNotPresent
        name: medusa-restore
        resources:
          limits:
            memory: 8Gi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /etc/cassandra
          name: server-config
        - mountPath: /var/lib/cassandra
          name: server-data
        - mountPath: /etc/medusa
          name: cassandra-medusa
        - mountPath: /etc/podinfo
          name: podinfo
        - mountPath: /etc/certificates
          name: certificates
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
      - configMap:
          name: cqlsh-config
        name: cqlsh-config
      - configMap:
          name: nodetool-config
        name: nodetool-config
      - name: client-keystore
        secret:
          items:
          - key: keystore.jks
            path: keystore
          secretName: cassandra-jks-keystore
      - name: client-truststore
        secret:
          items:
          - key: truststore.jks
            path: truststore
          secretName: cassandra-jks-keystore
      - name: server-keystore
        secret:
          items:
          - key: keystore.jks
            path: keystore
          secretName: cassandra-jks-keystore
      - name: server-truststore
        secret:
          items:
          - key: truststore.jks
            path: truststore
          secretName: cassandra-jks-keystore
      - configMap:
          name: cassandra-medusa
        name: cassandra-medusa
      - downwardAPI:
          items:
          - fieldRef:
              fieldPath: metadata.labels
            path: labels
        name: podinfo
      - name: certificates
        secret:
          secretName: medusa-certificates
  racks:
  - name: 1a
    nodeAffinityLabels:
      topology.kubernetes.io/zone: us-east-1a
  - name: 1d
    nodeAffinityLabels:
      topology.kubernetes.io/zone: us-east-1b
  - name: 1c
    nodeAffinityLabels:
      topology.kubernetes.io/zone: us-east-1c
  resources:
    limits:
      memory: 9Gi
    requests:
      cpu: "1"
      memory: 9Gi
  serverType: cassandra
  serverVersion: 4.1.4
  size: 3
  storageConfig:
    additionalVolumes:
    - mountPath: /etc/vector
      name: vector-config
      volumeSource:
        configMap:
          name: cassandra-us-east-cass-vector
    - mountPath: /opt/management-api/configs
      name: metrics-agent-config
      volumeSource:
        configMap:
          items:
          - key: metrics-collector.yaml
            path: metrics-collector.yaml
          name: cassandra-us-east-metrics-agent-config
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 300Gi
      storageClassName: ebs-xfs-sc
  superuserSecretName: cassandra-superuser
  systemLoggerResources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
  users:
  - secretName: cassandra-reaper
    superuser: true
  - secretName: cassandra-medusa
    superuser: true
```

```
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  annotations:
    config.kubernetes.io/origin: |
      path: ../../base/k8ssandra-encrypted.yaml
    k8ssandra.io/initial-system-replication: '{"us-east":3}'
  finalizers:
  - k8ssandracluster.k8ssandra.io/finalizer
  generation: 5
  name: cassandra
  namespace: k8ssandra-operator
spec:
  auth: true
  cassandra:
    clientEncryptionStores:
      keystorePasswordSecretRef:
        name: jks-password
      keystoreSecretRef:
        key: keystore.jks
        name: cassandra-jks-keystore
      truststorePasswordSecretRef:
        name: jks-password
      truststoreSecretRef:
        key: truststore.jks
        name: cassandra-jks-keystore
    config:
      cassandraYaml:
        authenticator: PasswordAuthenticator
        authorizer: CassandraAuthorizer
        auto_bootstrap: true
        auto_snapshot: true
        batch_size_fail_threshold: 1500KiB
        batch_size_warn_threshold: 10KiB
        client_encryption_options:
          enabled: true
          optional: false
          require_client_auth: false
        concurrent_counter_writes: 64
        concurrent_materialized_view_writes: 64
        concurrent_reads: 64
        concurrent_writes: 64
        counter_cache_size: 50MiB
        materialized_views_enabled: true
        native_transport_port: 9042
        num_tokens: 256
        range_request_timeout: 10000ms
        read_request_timeout: 15000ms
        request_timeout: 20000ms
        server_encryption_options:
          internode_encryption: all
          require_client_auth: false
        write_request_timeout: 2000ms
      jvmOptions:
        additionalOptions:
        - -Djavax.net.debug=ssl
        - -Dcom.sun.management.jmxremote.registry.ssl=true
        - -Dcassandra.consistent.rangemovement=false
        - -Dcom.sun.management.jmxremote.ssl.need.client.auth=true
        - -Dcom.sun.management.jmxremote.registry.ssl=true
        - -Dcom.sun.management.jmxremote.ssl=true
        - -Dcassandra.allow_new_old_config_keys=true
        gc: G1GC
        heap_initial_size: 4Gi
        heap_max_size: 4Gi
        jmx_connection_type: local-no-auth
        jmx_port: 7199
        jmx_remote_ssl: true
    containers:
    - livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /api/v0/probes/liveness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 230
        periodSeconds: 15
        successThreshold: 1
        timeoutSeconds: 10
      name: cassandra
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /api/v0/probes/readiness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 270
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 10
      volumeMounts:
      - mountPath: /crypto
        name: certs
      - mountPath: /home/cassandra/.cassandra/cqlshrc
        name: cqlsh-config
        subPath: cqlshrc
      - mountPath: /home/cassandra/.cassandra/nodetool-ssl.properties
        name: nodetool-config
        subPath: nodetool-ssl.properties
    datacenters:
    - initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=1048575
        image: busybox:1.28
        name: sysctl
        securityContext:
          privileged: true
      metadata:
        name: us-east
      perNodeConfigInitContainerImage: mikefarah/yq:4
      racks:
      - name: 1a
        nodeAffinityLabels:
          topology.kubernetes.io/zone: us-east-1a
      - name: 1d
        nodeAffinityLabels:
          topology.kubernetes.io/zone: us-east-1b
      - name: 1c
        nodeAffinityLabels:
          topology.kubernetes.io/zone: us-east-1c
      resources:
        limits:
          memory: 9Gi
        requests:
          cpu: 1
          memory: 9Gi
      size: 3
      stopped: false
    extraVolumes:
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
      - configMap:
          name: cqlsh-config
        name: cqlsh-config
      - configMap:
          name: nodetool-config
        name: nodetool-config
    metadata:
      annotations:
        eks.amazonaws.com/skip-containers: cassandra,server-system-logger,server-config-init
    mgmtAPIHeap: 128M
    networking:
      hostNetwork: false
    perNodeConfigInitContainerImage: mikefarah/yq:4
    serverEncryptionStores:
      keystorePasswordSecretRef:
        name: jks-password
      keystoreSecretRef:
        key: keystore.jks
        name: cassandra-jks-keystore
      truststorePasswordSecretRef:
        name: jks-password
      truststoreSecretRef:
        key: truststore.jks
        name: cassandra-jks-keystore
    serverType: cassandra
    serverVersion: 4.1.4
    softPodAntiAffinity: false
    storageConfig:
      cassandraDataVolumeClaimSpec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 300Gi
        storageClassName: ebs-xfs-sc
    telemetry:
      mcac:
        enabled: false
      prometheus:
        enabled: true
      vector:
        components:
          sinks:
          - config: |
              target = "stdout"
              [sinks.console_output.encoding]
              codec = "json"
            inputs:
            - cassandra_metrics
            name: console_output
            type: console
        enabled: true
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
        scrapeInterval: 30s
  medusa:
    certificatesSecretRef:
      name: medusa-certificates
    containerImage:
      name: medusa
      registry: docker.io
      repository: k8ssandra
      tag: 0.21.0
    containerResources:
      limits:
        memory: 512Mi
      requests:
        cpu: 10m
        memory: 116Mi
    storageProperties:
      bucketName: dow-backups
      concurrentTransfers: 10
      credentialsType: role-based
      maxBackupAge: 0
      maxBackupCount: 0
      multiPartUploadThreshold: 104857600
      prefix: cassandra-tests
      region: us-east-1
      secure: true
      storageProvider: s3
      storageSecretRef:
        name: ""
      transferMaxBandwidth: 90MB/s
  reaper:
    ServiceAccountName: default
    autoScheduling:
      enabled: true
      initialDelayPeriod: PT15S
      percentUnrepairedThreshold: 10
      periodBetweenPolls: PT10M
      repairType: AUTO
      scheduleSpreadPeriod: PT6H
      timeBeforeFirstSchedule: PT5M
    containerImage:
      name: cassandra-reaper
      repository: thelastpickle
      tag: 3.6.0
    deploymentMode: SINGLE
    heapSize: 2Gi
    httpManagement:
      enabled: true
    keyspace: reaper_db
    secretsProvider: internal
    telemetry:
      cassandra:
        endpoint:
          address: 0.0.0.0
      mcac:
        enabled: false
      prometheus:
        enabled: true
      vector:
        enabled: true
        resources:
          limits:
            cpu: 100m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
  secretsProvider: internal
```

```
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:31:35,347 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v2/repairs status=500 Internal Server Error
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:31:38,541 Cli.java:663 - address=/10.210.20.219:56784 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:31:43,538 Cli.java:663 - address=/10.210.20.219:51656 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:31:48,540 Cli.java:663 - address=/10.210.20.219:51666 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:31:58,539 Cli.java:663 - address=/10.210.20.219:48066 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:31:58,540 Cli.java:663 - address=/10.210.20.219:48068 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:02,818 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:02,820 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:02,909 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:05,371 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:05,373 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:05,466 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:08,541 Cli.java:663 - address=/10.210.20.219:55514 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:32:13,538 Cli.java:663 - address=/10.210.20.219:58392 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:18,540 Cli.java:663 - address=/10.210.20.219:58402 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:32:28,539 Cli.java:663 - address=/10.210.20.219:52776 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:28,540 Cli.java:663 - address=/10.210.20.219:52790 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:32:38,541 Cli.java:663 - address=/10.210.20.219:39932 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:40,870 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:40,873 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:40,989 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:41,561 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:41,564 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:41,657 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:43,539 Cli.java:663 - address=/10.210.20.219:44100 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:32:48,540 Cli.java:663 - address=/10.210.20.219:44112 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:32:58,538 Cli.java:663 - address=/10.210.20.219:36508 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:32:58,541 Cli.java:663 - address=/10.210.20.219:36520 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:08,541 Cli.java:663 - address=/10.210.20.219:52446 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2024-07-16 12:33:13,538 Cli.java:663 - address=/10.210.20.219:52002 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,148 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,150 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,152 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,161 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,162 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,254 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,537 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,539 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,540 Cli.java:663 - address=/10.210.20.219:52018 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,643 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:26,184 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO  [nioEventLoopGroup-2-2] 2024-07-16 12:33:26,186 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
com.datastax.oss.driver.api.core.servererrors.ServerError: Failed to execute method NodeOps.repair
    at com.datastax.oss.driver.api.core.servererrors.ServerError.copy(ServerError.java:54)
    at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
    at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
    at com.datastax.mgmtapi.CqlService.executePreparedStatement(CqlService.java:57)
    at com.datastax.mgmtapi.resources.v2.RepairResourcesV2.lambda$repair$0(RepairResourcesV2.java:80)
    at com.datastax.mgmtapi.resources.common.BaseResources.handle(BaseResources.java:67)
    at com.datastax.mgmtapi.resources.v2.RepairResourcesV2.repair(RepairResourcesV2.java:71)
    at jdk.internal.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
    at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
    at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
    at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
    at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
```

Anything else we need to know?:

No


iAlex97 commented 1 month ago

We have also encountered this issue when enabling autoScheduling for Reaper. After checking the mgmt-api logs further, I think it is caused by Reaper using an invalid combination of default parameters when setting up automatic schedules (this only happens with Cassandra 4.x). The error that led me to this conclusion:

```
INFO  [epollEventLoopGroup-5-3] 2024-07-31 08:44:16,274 RpcMethod41x.java:138 - Failed to execute method NodeOps.repair
java.lang.reflect.InvocationTargetException: null
    at jdk.internal.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at com.datastax.mgmtapi.rpc.RpcMethod41x.execute(RpcMethod41x.java:130)
    at com.datastax.mgmtapi.rpc.RpcMethod41x.execute(RpcMethod41x.java:33)
    at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.lambda$handle$1(QueryHandlerInterceptor.java:120)
    at com.datastax.mgmtapi.shims.CassandraAPI.handleRpcResult(CassandraAPI.java:73)
    at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.handle(QueryHandlerInterceptor.java:120)
    at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.intercept(QueryHandlerInterceptor.java:80)
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java)
    at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:116)
    at org.apache.cassandra.transport.Message$Request.execute(Message.java:255)
    <redacted>
Caused by: java.io.IOException: Invalid repair combination. Incremental repair if Parallelism is not set
    at com.datastax.mgmtapi.NodeOpsProvider.repair(NodeOpsProvider.java:824)
    ... 43 common frames omitted
```

The K8ssandraCluster CRD has autoScheduling.repairType set to AUTO, which for Cassandra 4.x behaves as INCREMENTAL and sets up the schedules accordingly.
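For reference, this is the relevant slice of the K8ssandraCluster manifest above:

```
reaper:
  autoScheduling:
    enabled: true
    repairType: AUTO   # behaves as INCREMENTAL on Cassandra 4.x
```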

From the Reaper docs, we understand that for an incremental repair the only allowed value for repairParallelism is PARALLEL:

Sets the default repair type unless specifically defined for each run. Note that this is only supported with the PARALLEL repairParallelism setting. For more details in incremental repair, please refer to the following article: http://www.datastax.com/dev/blog/more-efficient-repairs
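Concretely, the clash looks roughly like this in Reaper's terms (an illustrative sketch; incrementalRepair and repairParallelism are cassandra-reaper.yml settings, and whether auto-scheduling reads exactly these keys is an assumption):

```
# cassandra-reaper.yml (illustrative excerpt, not a working config)
incrementalRepair: true               # what repairType: AUTO resolves to on Cassandra 4.x
repairParallelism: DATACENTER_AWARE   # invalid with incremental repair; must be PARALLEL
```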

This is checked by the management-api here, which indeed throws the error I'm seeing.
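The rejected combination reduces to a guard like the following (a paraphrased sketch of the linked NodeOpsProvider.repair() check, not the verbatim source):

```
// Sketch: an incremental (non-full) repair is only accepted with PARALLEL parallelism.
static void validateRepairRequest(boolean fullRepair, RepairParallelism parallelism) throws IOException {
  if (!fullRepair && parallelism != RepairParallelism.PARALLEL) {
    throw new IOException("Invalid repair combination. Incremental repair if Parallelism is not set");
  }
}
```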

Running exec into a Reaper pod to check its configuration, we see that /etc/cassandra-reaper/config/cassandra-reaper.yml sets repairParallelism to the value of an environment variable called REAPER_REPAIR_PARALELLISM. The value of that variable is:

```
REAPER_REPAIR_PARALELLISM=DATACENTER_AWARE
```
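One quick way to confirm the effective value inside the pod (a sketch; the Deployment name is hypothetical and environment-specific):

```
kubectl exec -n k8ssandra-operator deploy/cassandra-us-east-reaper -- \
  printenv REAPER_REPAIR_PARALELLISM
# expected output: DATACENTER_AWARE
```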

We can further check this by looking at the reaper tables inside cassandra:

```
prod-superuser@cqlsh> use reaper_db;
prod-superuser@cqlsh:reaper_db> select * from repair_schedule_v1;

 id                                   | adaptive | creation_time                   | days_between | intensity | last_run                             | next_activation                 | owner           | pause_time                      | percent_unrepaired_threshold | repair_parallelism | repair_unit_id                       | run_history | segment_count | segment_count_per_node | state
--------------------------------------+----------+---------------------------------+--------------+-----------+--------------------------------------+---------------------------------+-----------------+---------------------------------+------------------------------+--------------------+--------------------------------------+-------------+---------------+------------------------+--------
 b3cb2180-4e7c-11ef-9f1c-4d0488525d6c |    False | 2024-07-30 14:04:57.112000+0000 |            7 |       0.9 | 2db1ccf0-4e97-11ef-92a3-c328b392dd6d | 2024-08-06 17:08:38.033000+0000 | auto-scheduling | 2024-07-30 14:10:43.136000+0000 |                           10 |        dc_parallel | b3c9e900-4e7c-11ef-9f1c-4d0488525d6c |        null |          null |                     64 | ACTIVE
 b3d533a0-4e7c-11ef-9f1c-4d0488525d6c |    False | 2024-07-30 14:04:57.178000+0000 |            7 |       0.9 |                                 null | 2024-07-30 20:09:57.150000+0000 | auto-scheduling |                            null |                           10 |        dc_parallel | b3d337d0-4e7c-11ef-9f1c-4d0488525d6c |        null |          null |                     64 | ACTIVE
```

This confirms that the default parallelism was set to dc_parallel, i.e. DATACENTER_AWARE.
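For completeness, a hedged sketch of re-pointing the already-created schedules directly in Reaper's backend tables (assuming 'parallel' is the stored encoding for PARALLEL, by analogy with 'dc_parallel' above; mutating these tables by hand is generally discouraged, and Reaper should be paused first):

```
-- Hypothetical fix-up; repeat per schedule id from repair_schedule_v1.
UPDATE reaper_db.repair_schedule_v1
   SET repair_parallelism = 'parallel'
 WHERE id = b3cb2180-4e7c-11ef-9f1c-4d0488525d6c;
```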

My confusion is about where this variable is set. From my limited research, it is not specified in the Reaper deployment, it is not set in the Dockerfile, and it cannot be configured from the CRD.

For possible workarounds, I see the following:

@adejanovski what do you think?