k8ssandra / management-api-for-apache-cassandra

RESTful / Secure Management Sidecar for Apache Cassandra
Apache License 2.0

Time out on taking snapshot #292

Open shaotingcheng opened 1 year ago

shaotingcheng commented 1 year ago

When calling the management API endpoint /api/v0/ops/node/snapshots to take a snapshot, I encounter a 30s timeout error: Query timed out after PT30S

curl -k -L -XPOST -H "Content-Type:application/json" 'http://localhost:8080/api/v0/ops/node/snapshots' -d '{"snapshot_name": "test124"}'
Query timed out after PT30S

From the container logs:

INFO  [nioEventLoopGroup-2-1] 2023-04-24 04:32:38,254 Cli.java:617 - address=/127.0.0.1:44180 url=/api/v0/ops/node/snapshots status=500 Internal Server Error
INFO  [nioEventLoopGroup-2-1] 2023-04-24 04:32:38,270 Cli.java:617 - address=/10.100.42.232:42336 url=/api/v0/probes/liveness status=200 OK
ERROR [nioEventLoopGroup-3-3] 2023-04-24 04:32:38,336 NodeOpsResources.java:562 - Error when executing request
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT30S
    at com.datastax.oss.driver.api.core.DriverTimeoutException.copy(DriverTimeoutException.java:34)
    at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
    at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78)
    at com.datastax.mgmtapi.CqlService.executeCql(CqlService.java:35)
    at com.datastax.mgmtapi.resources.K8OperatorResources.lambda$checkReadiness$0(K8OperatorResources.java:87)
    at com.datastax.mgmtapi.resources.NodeOpsResources.handle(NodeOpsResources.java:554)
    at com.datastax.mgmtapi.resources.K8OperatorResources.checkReadiness(K8OperatorResources.java:85)
    at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
    at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
    at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
    at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
    at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)

I am using k8ssandra-operator V1.5.2 with k8ssandra/cass-management-api:4.0.1

Has anyone encountered this issue? In my experience, nodetool snapshot sometimes does take over 30s. Or could it be related to the "mgmtAPIHeap" setting?

Is there any way to increase the 30s timeout?

Regards, ShaoTing


burmanm commented 1 year ago

I don't think that error is actually from the snapshot call. It seems to be related to your Cassandra node not being available when the readiness call was made after the liveness check.

The takeSnapshot operation does not use the CQL driver for the query. The reason for the 500 status code is not listed in that log, so I don't know why it failed. Maybe it was related to the cluster itself not "working"?

shaotingcheng commented 1 year ago

@burmanm thanks for your reply. Indeed, I see other exceptions, such as HeartbeatException, in the management-api log:

ERROR [nioEventLoopGroup-3-15] 2023-05-03 09:06:31,264 NodeOpsResources.java:562 - Error when executing request
com.datastax.oss.driver.api.core.connection.HeartbeatException: null
    at com.datastax.oss.driver.api.core.connection.HeartbeatException.copy(HeartbeatException.java:58)
    at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
    at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78)
    at com.datastax.mgmtapi.CqlService.executeCql(CqlService.java:35)
    at com.datastax.mgmtapi.resources.K8OperatorResources.lambda$checkReadiness$0(K8OperatorResources.java:87)
    at com.datastax.mgmtapi.resources.NodeOpsResources.handle(NodeOpsResources.java:554)
    at com.datastax.mgmtapi.resources.K8OperatorResources.checkReadiness(K8OperatorResources.java:85)
    at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
    at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
    at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
    at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
    at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Heartbeat request: timed out after 5000 ms
    at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:108)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:394)
    ... 4 common frames omitted

and

ERROR [nioEventLoopGroup-3-2] 2023-05-03 09:06:38,115 NodeOpsResources.java:562 - Error when executing request
com.datastax.oss.driver.api.core.AllNodesFailedException: All 1 node(s) tried for the query failed (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/tmp/cassandra.sock, hostId=02508e17-3a7f-4f09-b65a-e538fadbdc41, hashCode=228fec7d): [com.datastax.oss.driver.api.core.NodeUnavailableException: No connection was available to Node(endPoint=/tmp/cassandra.sock, hostId=02508e17-3a7f-4f09-b65a-e538fadbdc41, hashCode=228fec7d)]
    at com.datastax.oss.driver.api.core.AllNodesFailedException.copy(AllNodesFailedException.java:141)
    at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
    at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
    at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
    at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78)
    at com.datastax.mgmtapi.CqlService.executeCql(CqlService.java:35)
    at com.datastax.mgmtapi.resources.K8OperatorResources.lambda$checkReadiness$0(K8OperatorResources.java:87)
    at com.datastax.mgmtapi.resources.NodeOpsResources.handle(NodeOpsResources.java:554)
    at com.datastax.mgmtapi.resources.K8OperatorResources.checkReadiness(K8OperatorResources.java:85)
    at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
    at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
    at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
    at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
    at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
    at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
    at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
    at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
    at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
    at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
    at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
    Suppressed: com.datastax.oss.driver.api.core.NodeUnavailableException: No connection was available to Node(endPoint=/tmp/cassandra.sock, hostId=02508e17-3a7f-4f09-b65a-e538fadbdc41, hashCode=228fec7d)
        at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.sendRequest(CqlRequestHandler.java:256)
        at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.onThrottleReady(CqlRequestHandler.java:195)
        at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52)
        at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.<init>(CqlRequestHandler.java:172)
        at com.datastax.oss.driver.internal.core.cql.CqlRequestAsyncProcessor.process(CqlRequestAsyncProcessor.java:44)
        at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:54)
        ... 41 common frames omitted

And in the server-system-logger logs, I found these entries when the timeout occurred:

WARN  [epollEventLoopGroup-7-6] 2023-05-03 09:06:53,162 K8SeedProvider4x.java:58 - Seed provider couldn't lookup host cassandra-dc1-additional-seed-service
INFO  [epollEventLoopGroup-7-6] 2023-05-03 09:06:53,244 Keyspace.java:386 - Creating replication strategy none params KeyspaceParams{durable_writes=true, replication=ReplicationParams{class=org.apache.cassandra.locator.NetworkTopologyStrategy, dc1=3}}
INFO  [epollEventLoopGroup-7-6] 2023-05-03 09:06:53,244 NetworkTopologyStrategy.java:88 - Configured datacenter replicas are dc1:rf(3)

This doesn't only happen when taking snapshots but also when deleting them.

I suspect the Cassandra pod is under pressure when taking/deleting snapshots. Maybe I configured the K8ssandraCluster wrong? Too little memory? Probes that are too aggressive? Here is the YAML:

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  labels:
    app: cassandra
  name: cassandra
  namespace: stage1
spec:
  auth: false
  cassandra:
    datacenters:
    - config:
        cassandraYaml:
          cas_contention_timeout_in_ms: 1000
          column_index_size_in_kb: 64
          commitlog_segment_size_in_mb: 384
          commitlog_sync: periodic
          commitlog_sync_period_in_ms: 10000
          compaction_throughput_mb_per_sec: 64
          concurrent_counter_writes: 96
          concurrent_reads: 96
          concurrent_writes: 128
          counter_write_request_timeout_in_ms: 5000
          dynamic_snitch_badness_threshold: "0.1"
          dynamic_snitch_reset_interval_in_ms: 600000
          dynamic_snitch_update_interval_in_ms: 100
          file_cache_size_in_mb: 3072
          internode_application_receive_queue_reserve_endpoint_capacity_in_bytes: 536870912
          internode_application_receive_queue_reserve_global_capacity_in_bytes: 1073741824
          internode_application_send_queue_reserve_endpoint_capacity_in_bytes: 536870912
          internode_application_send_queue_reserve_global_capacity_in_bytes: 1073741824
          memtable_allocation_type: offheap_objects
          num_tokens: 8
          range_request_timeout_in_ms: 60000
          read_request_timeout_in_ms: 60000
          request_timeout_in_ms: 60000
          tombstone_failure_threshold: 300000
          trickle_fsync: false
          truncate_request_timeout_in_ms: 60000
          write_request_timeout_in_ms: 60000
        jvmOptions:
          heap_initial_size: 24G
          heap_max_size: 24G
          jmx_connection_type: remote-no-auth
      metadata:
        annotations:
          kustomize.synopsys.com/pool-selector: stage1
          kustomize.synopsys.com/role-selector: db
        name: dc1
      racks:
      - name: r1
        nodeAffinityLabels:
          kubernetes.synopsys.com/pool: stage1
          kubernetes.synopsys.com/role: db
      - name: r2
        nodeAffinityLabels:
          kubernetes.synopsys.com/pool: stage1
          kubernetes.synopsys.com/role: db
      - name: r3
        nodeAffinityLabels:
          kubernetes.synopsys.com/pool: stage1
          kubernetes.synopsys.com/role: db
      resources:
        limits:
          memory: 48G
        requests:
          cpu: 12
          memory: 24G
      size: 5
      storageConfig:
        cassandraDataVolumeClaimSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
          storageClassName: local-path-cassandra
    managementApiAuth:
      insecure: {}
    metadata:
      commonLabels:
        app: cassandra
    serverVersion: 4.0.1
    telemetry:
      mcac:
        metricFilters:
        - deny:org.apache.cassandra.metrics.table
        - allow:org.apache.cassandra.metrics.table.live_ss_table_count
        - allow:org.apache.cassandra.metrics.table.live_disk_space_used
        - allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
        - allow:org.apache.cassandra.metrics.table.read
        - allow:org.apache.cassandra.metrics.table.write
        - allow:org.apache.cassandra.metrics.table.range
        - allow:org.apache.cassandra.metrics.table.coordinator
        - allow:org.apache.cassandra.metrics.table.dropped_mutations
        - deny:system.*
        - deny:reaper_db
  medusa:
    storageProperties:
      maxBackupAge: 2
      maxBackupCount: 10
      podStorage:
        accessModes:
        - ReadWriteOnce
        size: 300Gi
        storageClassName: csi-cephfs-slow
      storageProvider: local

I would be grateful for any feedback.

shaotingcheng commented 1 year ago

The cause is that the "flush" operation takes too long. I saw some flushes exceed 1 GiB:

"May 5, 2023 @ 23:18:03.940","Completed flushing /var/lib/cassandra/data/demo1/product_data-60a74550712f11ed9db1679d9bc4cfab/nb-55917-big-Data.db (1017.550MiB) for commitlog position CommitLogPosition(segmentId=1683277730848, position=292404316)"

However, sometimes even listing or deleting snapshots via the management API hits the PT30S timeout. There is no such issue with nodetool, though it took over 2 minutes:

cassandra@cassandra-dc1-r2-sts-1:/$ date;nodetool snapshot -t auto-20230517-0004 --skip-flush;date
Wed 17 May 2023 09:25:15 AM UTC
Requested creating snapshot(s) for [all keyspaces] with snapshot name [auto-20230517-0004] and options {skipFlush=true}
Snapshot directory: auto-20230517-0004
Wed 17 May 2023 09:27:34 AM UTC

I suspect something in the management API is the restriction. Besides, K8ssandraCluster does not allow updating the livenessProbe and readinessProbe: https://github.com/k8ssandra/k8ssandra-operator/issues/26

@burmanm is it possible to increase the default timeout for the management API?

shaotingcheng commented 1 year ago

I got the same PT30S timeout when executing the Medusa CLI in the Medusa pod:

cassandra@cassandra-dc1-r1-sts-1:~$ medusa backup --backup-name medusa-test --mode differential
[2023-05-18 09:46:07,631] WARNING: The CQL_USERNAME environment variable is deprecated and has been replaced by the MEDUSA_CQL_USERNAME variable
[2023-05-18 09:46:07,631] WARNING: The CQL_PASSWORD environment variable is deprecated and has been replaced by the MEDUSA_CQL_PASSWORD variable
[2023-05-18 09:46:07,632] INFO: Resolving ip address
[2023-05-18 09:46:07,633] INFO: ip address to resolve 192.168.37.46
[2023-05-18 09:46:07,643] INFO: Registered backup id medusa-test
[2023-05-18 09:46:07,643] INFO: Monitoring provider is noop
[2023-05-18 09:46:07,739] INFO: Starting backup using Stagger: None Mode: differential Name: medusa-test
[2023-05-18 09:46:07,739] INFO: Updated from existing status: -1 to new status: 0 for backup id: medusa-test
[2023-05-18 09:46:07,741] INFO: Saving tokenmap and schema
[2023-05-18 09:46:08,227] INFO: Resolving ip address 192.168.37.46
[2023-05-18 09:46:08,227] INFO: ip address to resolve 192.168.37.46
[2023-05-18 09:46:08,234] INFO: Resolving ip address 192.168.130.113
[2023-05-18 09:46:08,235] INFO: ip address to resolve 192.168.130.113
[2023-05-18 09:46:08,274] INFO: Resolving ip address 192.168.187.241
[2023-05-18 09:46:08,274] INFO: ip address to resolve 192.168.187.241
[2023-05-18 09:46:08,280] INFO: Resolving ip address 192.168.206.213
[2023-05-18 09:46:08,280] INFO: ip address to resolve 192.168.206.213
[2023-05-18 09:46:08,289] INFO: Resolving ip address 192.168.37.46
[2023-05-18 09:46:08,289] INFO: ip address to resolve 192.168.37.46
[2023-05-18 09:46:08,295] INFO: Resolving ip address 192.168.7.244
[2023-05-18 09:46:08,296] INFO: ip address to resolve 192.168.7.244
[2023-05-18 09:46:08,567] INFO: Node cassandra-dc1-r1-sts-1 does not have latest backup
[2023-05-18 09:46:08,567] INFO: Creating snapshot
[2023-05-18 09:46:38,708] ERROR: Issue occurred inside handle_backup Name: medusa-test Error: failed to create snapshot: Query timed out after PT30S
[2023-05-18 09:46:38,709] INFO: Updated from existing status: 0 to new status: 2 for backup id: medusa-test
Traceback (most recent call last):
  File "/home/cassandra/.local/bin/medusa", line 8, in <module>
    sys.exit(cli())
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/medusacli.py", line 133, in backup
    return backup_node.handle_backup(medusaconfig, backup_name, stagger_time, enable_md5_checks, mode)
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 214, in handle_backup
    config
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/utils.py", line 35, in handle_exception
    raise exception
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 197, in handle_backup
    enable_md5_checks_flag, backup_name, config, monitoring)
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 250, in start_backup
    cassandra, node_backup, storage, differential_mode, enable_md5, config, backup_name)
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 294, in do_backup
    with cassandra.create_snapshot(backup_name) as snapshot:
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/cassandra_utils.py", line 449, in create_snapshot
    self.snapshot_service.create_snapshot(tag=tag)
  File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/service/snapshot/management_api_snapshot_service.py", line 35, in create_snapshot
    raise Exception(err_msg)
Exception: failed to create snapshot: Query timed out after PT30S

shaotingcheng commented 1 year ago

I built my own Docker image to increase the REQUEST_TIMEOUT to 120s: https://github.com/k8ssandra/management-api-for-apache-cassandra/blob/master/management-api-server/src/main/java/com/datastax/mgmtapi/UnixSocketCQLAccess.java#L131

No more PT30S request timeouts, but sometimes Cassandra returns ERROR DriverTimeoutException: Heartbeat request: timed out after 5000 ms

I noticed that the readiness endpoint (/probes/readiness) of the management API couldn't get a response while a snapshot (/snapshots) was being taken. The readiness request waits until it is killed by the driver (HEARTBEAT_INTERVAL + HEARTBEAT_TIMEOUT); that is the 500 return code I posted in the previous comments.
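
To make this concrete, below is a minimal sketch of where these knobs live in the DataStax Java driver 4.x configuration. This is not the Management API's actual UnixSocketCQLAccess code; the 120s request timeout mirrors the change described above, and the heartbeat values here are only example assumptions.

import java.time.Duration;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class DriverTimeoutSketch {
    public static void main(String[] args) {
        // Programmatic driver config: raise the per-request timeout well past the
        // 30s that surfaces as "Query timed out after PT30S", and give the
        // connection heartbeat more headroom as well.
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(120))
                .withDuration(DefaultDriverOption.HEARTBEAT_INTERVAL, Duration.ofSeconds(30))
                .withDuration(DefaultDriverOption.HEARTBEAT_TIMEOUT, Duration.ofSeconds(30))
                .build();

        // The contact point defaults to localhost here; the Management API itself
        // connects over a Unix socket instead.
        try (CqlSession session = CqlSession.builder()
                .withConfigLoader(loader)
                .build()) {
            // Long-running calls now get up to 120s before DriverTimeoutException.
            session.execute("SELECT release_version FROM system.local");
        }
    }
}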

shaotingcheng commented 1 year ago

For those APIs that require the dbUnixSocketFile, the management-api can only handle one request at a time. Is that statement correct? @burmanm

Here is an example:

  1. Take a snapshot first:
    cassandra@cassandra-dc1-r1-sts-0:/$ date;curl -k -L -XPOST -H "Content-Type:application/json" 'http://localhost:8080/api/v0/ops/node/snapshots' -d '{"snapshot_name": "auto-20230605-0004"}';date
    Mon 05 Jun 2023 04:06:00 AM UTC
    Mon 05 Jun 2023 04:06:33 AM UTC
  2. The readiness probe doesn't get a response until the snapshot finishes:
    cassandra@cassandra-dc1-r1-sts-0:/$ date; curl -XGET 'http://localhost:8080/api/v0/probes/readiness'; date
    Mon 05 Jun 2023 04:06:03 AM UTC
    Mon 05 Jun 2023 04:06:33 AM UTC

emerkle826 commented 1 year ago

For those APIs that require the dbUnixSocketFile, the management-api can only handle one request at a time. Is that statement correct?

That is not correct. The problem here is that the snapshots endpoint implementation is blocking on the operation (i.e. not async) so the management API process isn't allowed to do anything else until the snapshot is taken.

We might have two issues here: one is that the snapshot operation is not async and may take time; the second is that snapshot operations may take longer than the driver timeout.
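
To illustrate the first point, here is a minimal sketch of the generic JAX-RS pattern for moving a blocking operation off the request thread. This is not the Management API's actual resource code, and takeSnapshotBlocking is a hypothetical stand-in for the real snapshot call.

import java.util.concurrent.CompletableFuture;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.Suspended;
import javax.ws.rs.core.Response;

@Path("/ops/node/snapshots")
public class AsyncSnapshotResourceSketch {

    // Hypothetical stand-in for the real snapshot call, which may run for minutes.
    private void takeSnapshotBlocking(String snapshotName) {
        // flush + hard-link SSTables ...
    }

    @POST
    public void takeSnapshot(@Suspended AsyncResponse asyncResponse) {
        // Run the blocking work off the request thread so the server can keep
        // answering other endpoints (e.g. /probes/readiness) in the meantime.
        CompletableFuture.runAsync(() -> takeSnapshotBlocking("example"))
            .whenComplete((ignored, error) -> {
                if (error != null) {
                    asyncResponse.resume(Response.serverError().entity(error.getMessage()).build());
                } else {
                    asyncResponse.resume(Response.ok("snapshot requested").build());
                }
            });
    }
}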

shaotingcheng commented 1 year ago

@emerkle826 thanks for the feedback. Indeed, from the DataStax Java driver documentation:

The only place where the driver blocks is when using the synchronous API (methods declared in SyncCqlSession), and when calling other synchronous wrapper methods in the public API
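
To make that distinction concrete, here is a minimal sketch using the driver's public API (the CQL statement is only a placeholder): execute blocks the calling thread, while executeAsync returns immediately.

import java.util.concurrent.CompletionStage;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;

public class SyncVsAsyncSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Synchronous API (SyncCqlSession): blocks until the result arrives
            // or a DriverTimeoutException is thrown.
            session.execute("SELECT release_version FROM system.local");

            // Asynchronous API: returns immediately and frees the calling thread
            // while the request is in flight.
            CompletionStage<AsyncResultSet> future =
                    session.executeAsync("SELECT release_version FROM system.local");
            future.thenAccept(rs -> System.out.println(rs.one().getString("release_version")))
                    .toCompletableFuture()
                    .join(); // join only so the sketch doesn't close the session mid-flight
        }
    }
}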

My workaround is:

  1. build a customized management-api image with a larger timeout
  2. change the readinessProbe in the K8ssandraCluster