shaotingcheng opened 1 year ago
I don't think that error is actually the snapshot one. It seems to be related to your Cassandra not being available when the readiness call was made after liveness.
The takeSnapshot operation does not use the CQL driver to do the query. The reason for the 500 code is not listed in that log, so I don't know why it failed. Maybe it was related to the fact that the cluster itself was not "working"?
@burmanm thanks for your reply. Indeed, I see other HeartbeatException exceptions in the management-api log:
ERROR [nioEventLoopGroup-3-15] 2023-05-03 09:06:31,264 NodeOpsResources.java:562 - Error when executing request
com.datastax.oss.driver.api.core.connection.HeartbeatException: null
at com.datastax.oss.driver.api.core.connection.HeartbeatException.copy(HeartbeatException.java:58)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78)
at com.datastax.mgmtapi.CqlService.executeCql(CqlService.java:35)
at com.datastax.mgmtapi.resources.K8OperatorResources.lambda$checkReadiness$0(K8OperatorResources.java:87)
at com.datastax.mgmtapi.resources.NodeOpsResources.handle(NodeOpsResources.java:554)
at com.datastax.mgmtapi.resources.K8OperatorResources.checkReadiness(K8OperatorResources.java:85)
at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Heartbeat request: timed out after 5000 ms
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:108)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:394)
... 4 common frames omitted
and
ERROR [nioEventLoopGroup-3-2] 2023-05-03 09:06:38,115 NodeOpsResources.java:562 - Error when executing request
com.datastax.oss.driver.api.core.AllNodesFailedException: All 1 node(s) tried for the query failed (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/tmp/cassandra.sock, hostId=02508e17-3a7f-4f09-b65a-e538fadbdc41, hashCode=228fec7d): [com.datastax.oss.driver.api.core.NodeUnavailableException: No connection was available to Node(endPoint=/tmp/cassandra.sock, hostId=02508e17-3a7f-4f09-b65a-e538fadbdc41, hashCode=228fec7d)]
at com.datastax.oss.driver.api.core.AllNodesFailedException.copy(AllNodesFailedException.java:141)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78)
at com.datastax.mgmtapi.CqlService.executeCql(CqlService.java:35)
at com.datastax.mgmtapi.resources.K8OperatorResources.lambda$checkReadiness$0(K8OperatorResources.java:87)
at com.datastax.mgmtapi.resources.NodeOpsResources.handle(NodeOpsResources.java:554)
at com.datastax.mgmtapi.resources.K8OperatorResources.checkReadiness(K8OperatorResources.java:85)
at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Suppressed: com.datastax.oss.driver.api.core.NodeUnavailableException: No connection was available to Node(endPoint=/tmp/cassandra.sock, hostId=02508e17-3a7f-4f09-b65a-e538fadbdc41, hashCode=228fec7d)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.sendRequest(CqlRequestHandler.java:256)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.onThrottleReady(CqlRequestHandler.java:195)
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.<init>(CqlRequestHandler.java:172)
at com.datastax.oss.driver.internal.core.cql.CqlRequestAsyncProcessor.process(CqlRequestAsyncProcessor.java:44)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:54)
... 41 common frames omitted
And in the server-system-logger logs, I found these when the timeout occurred:
WARN [epollEventLoopGroup-7-6] 2023-05-03 09:06:53,162 K8SeedProvider4x.java:58 - Seed provider couldn't lookup host cassandra-dc1-additional-seed-service
INFO [epollEventLoopGroup-7-6] 2023-05-03 09:06:53,244 Keyspace.java:386 - Creating replication strategy none params KeyspaceParams{durable_writes=true, replication=ReplicationParams{class=org.apache.cassandra.locator.NetworkTopologyStrategy, dc1=3}}
INFO [epollEventLoopGroup-7-6] 2023-05-03 09:06:53,244 NetworkTopologyStrategy.java:88 - Configured datacenter replicas are dc1:rf(3)
This happens not only when taking snapshots but also when deleting them.
I suspect the Cassandra pod is under pressure when taking/deleting snapshots. Maybe I configured the K8ssandraCluster wrong? Too little memory? Too tight a probe? Here is the YAML:
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  labels:
    app: cassandra
  name: cassandra
  namespace: stage1
spec:
  auth: false
  cassandra:
    datacenters:
      - config:
          cassandraYaml:
            cas_contention_timeout_in_ms: 1000
            column_index_size_in_kb: 64
            commitlog_segment_size_in_mb: 384
            commitlog_sync: periodic
            commitlog_sync_period_in_ms: 10000
            compaction_throughput_mb_per_sec: 64
            concurrent_counter_writes: 96
            concurrent_reads: 96
            concurrent_writes: 128
            counter_write_request_timeout_in_ms: 5000
            dynamic_snitch_badness_threshold: "0.1"
            dynamic_snitch_reset_interval_in_ms: 600000
            dynamic_snitch_update_interval_in_ms: 100
            file_cache_size_in_mb: 3072
            internode_application_receive_queue_reserve_endpoint_capacity_in_bytes: 536870912
            internode_application_receive_queue_reserve_global_capacity_in_bytes: 1073741824
            internode_application_send_queue_reserve_endpoint_capacity_in_bytes: 536870912
            internode_application_send_queue_reserve_global_capacity_in_bytes: 1073741824
            memtable_allocation_type: offheap_objects
            num_tokens: 8
            range_request_timeout_in_ms: 60000
            read_request_timeout_in_ms: 60000
            request_timeout_in_ms: 60000
            tombstone_failure_threshold: 300000
            trickle_fsync: false
            truncate_request_timeout_in_ms: 60000
            write_request_timeout_in_ms: 60000
          jvmOptions:
            heap_initial_size: 24G
            heap_max_size: 24G
            jmx_connection_type: remote-no-auth
        metadata:
          annotations:
            kustomize.synopsys.com/pool-selector: stage1
            kustomize.synopsys.com/role-selector: db
        name: dc1
        racks:
          - name: r1
            nodeAffinityLabels:
              kubernetes.synopsys.com/pool: stage1
              kubernetes.synopsys.com/role: db
          - name: r2
            nodeAffinityLabels:
              kubernetes.synopsys.com/pool: stage1
              kubernetes.synopsys.com/role: db
          - name: r3
            nodeAffinityLabels:
              kubernetes.synopsys.com/pool: stage1
              kubernetes.synopsys.com/role: db
        resources:
          limits:
            memory: 48G
          requests:
            cpu: 12
            memory: 24G
        size: 5
        storageConfig:
          cassandraDataVolumeClaimSpec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Ti
            storageClassName: local-path-cassandra
    managementApiAuth:
      insecure: {}
    metadata:
      commonLabels:
        app: cassandra
    serverVersion: 4.0.1
    telemetry:
      mcac:
        metricFilters:
          - deny:org.apache.cassandra.metrics.table
          - allow:org.apache.cassandra.metrics.table.live_ss_table_count
          - allow:org.apache.cassandra.metrics.table.live_disk_space_used
          - allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
          - allow:org.apache.cassandra.metrics.table.read
          - allow:org.apache.cassandra.metrics.table.write
          - allow:org.apache.cassandra.metrics.table.range
          - allow:org.apache.cassandra.metrics.table.coordinator
          - allow:org.apache.cassandra.metrics.table.dropped_mutations
          - deny:system.*
          - deny:reaper_db
  medusa:
    storageProperties:
      maxBackupAge: 2
      maxBackupCount: 10
      podStorage:
        accessModes:
          - ReadWriteOnce
        size: 300Gi
        storageClassName: csi-cephfs-slow
      storageProvider: local
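On the "too little memory?" question, a quick back-of-the-envelope check of the figures in the spec above (plain arithmetic, approximating the spec's G/Gi units as GiB; a sketch, not operator behavior): the 24G heap and 3 GiB chunk cache leave roughly 21 GiB of the 48G limit for off-heap memtables, the management API JVM, and the page cache.

```python
GIB = 1024 ** 3  # treating the spec's G/Gi values as GiB for this rough check

heap = 24 * GIB        # jvmOptions heap_initial_size / heap_max_size
file_cache = 3 * GIB   # cassandraYaml file_cache_size_in_mb: 3072
limit = 48 * GIB       # resources.limits.memory

# Off-heap memtables (memtable_allocation_type: offheap_objects), the
# management API JVM, and the OS page cache all share what is left:
headroom = limit - heap - file_cache
print(headroom // GIB)  # 21
```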
I would be grateful for any feedback.
The cause is that the "flush" operation takes too long. I saw some commit logs are > 1Gi:
"May 5, 2023 @ 23:18:03.940","Completed flushing /var/lib/cassandra/data/demo1/product_data-60a74550712f11ed9db1679d9bc4cfab/nb-55917-big-Data.db (1017.550MiB) for commitlog position CommitLogPosition(segmentId=1683277730848, position=292404316)"
However, sometimes even listing or deleting snapshots via the management API hits the PT30S timeout. There is no issue with nodetool, though it took over 2m:
cassandra@cassandra-dc1-r2-sts-1:/$ date;nodetool snapshot -t auto-20230517-0004 --skip-flush;date
Wed 17 May 2023 09:25:15 AM UTC
Requested creating snapshot(s) for [all keyspaces] with snapshot name [auto-20230517-0004] and options {skipFlush=true}
Snapshot directory: auto-20230517-0004
Wed 17 May 2023 09:27:34 AM UTC
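As an aside, the PT30S in those errors is an ISO-8601 duration string (what Java's Duration.toString() produces), i.e. 30 seconds. A minimal parser (a sketch, not anything from the management API) makes the comparison with the ~2m19s nodetool run above explicit:

```python
import re

def iso_duration_seconds(s: str) -> float:
    """Parse a simple ISO-8601 time duration like 'PT30S' or 'PT2M19S'."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", s)
    if not m:
        raise ValueError(f"unsupported duration: {s}")
    h, mins, secs = (float(g) if g else 0.0 for g in m.groups())
    return h * 3600 + mins * 60 + secs

print(iso_duration_seconds("PT30S"))    # 30.0 -> the management-api timeout
print(iso_duration_seconds("PT2M19S"))  # 139.0 -> roughly the nodetool run above
```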
I suspect some restriction in the management API. Besides, K8ssandraCluster does not allow updating the livenessProbe and readinessProbe: https://github.com/k8ssandra/k8ssandra-operator/issues/26
@burmanm, is it possible to increase the default timeout for the management API?
I got the same PT30S timeout by executing the medusa CLI in the Medusa pod:
cassandra@cassandra-dc1-r1-sts-1:~$ medusa backup --backup-name medusa-test --mode differential
[2023-05-18 09:46:07,631] WARNING: The CQL_USERNAME environment variable is deprecated and has been replaced by the MEDUSA_CQL_USERNAME variable
[2023-05-18 09:46:07,631] WARNING: The CQL_PASSWORD environment variable is deprecated and has been replaced by the MEDUSA_CQL_PASSWORD variable
[2023-05-18 09:46:07,632] INFO: Resolving ip address
[2023-05-18 09:46:07,633] INFO: ip address to resolve 192.168.37.46
[2023-05-18 09:46:07,643] INFO: Registered backup id medusa-test
[2023-05-18 09:46:07,643] INFO: Monitoring provider is noop
[2023-05-18 09:46:07,739] INFO: Starting backup using Stagger: None Mode: differential Name: medusa-test
[2023-05-18 09:46:07,739] INFO: Updated from existing status: -1 to new status: 0 for backup id: medusa-test
[2023-05-18 09:46:07,741] INFO: Saving tokenmap and schema
[2023-05-18 09:46:08,227] INFO: Resolving ip address 192.168.37.46
[2023-05-18 09:46:08,227] INFO: ip address to resolve 192.168.37.46
[2023-05-18 09:46:08,234] INFO: Resolving ip address 192.168.130.113
[2023-05-18 09:46:08,235] INFO: ip address to resolve 192.168.130.113
[2023-05-18 09:46:08,274] INFO: Resolving ip address 192.168.187.241
[2023-05-18 09:46:08,274] INFO: ip address to resolve 192.168.187.241
[2023-05-18 09:46:08,280] INFO: Resolving ip address 192.168.206.213
[2023-05-18 09:46:08,280] INFO: ip address to resolve 192.168.206.213
[2023-05-18 09:46:08,289] INFO: Resolving ip address 192.168.37.46
[2023-05-18 09:46:08,289] INFO: ip address to resolve 192.168.37.46
[2023-05-18 09:46:08,295] INFO: Resolving ip address 192.168.7.244
[2023-05-18 09:46:08,296] INFO: ip address to resolve 192.168.7.244
[2023-05-18 09:46:08,567] INFO: Node cassandra-dc1-r1-sts-1 does not have latest backup
[2023-05-18 09:46:08,567] INFO: Creating snapshot
[2023-05-18 09:46:38,708] ERROR: Issue occurred inside handle_backup Name: medusa-test Error: failed to create snapshot: Query timed out after PT30S
[2023-05-18 09:46:38,709] INFO: Updated from existing status: 0 to new status: 2 for backup id: medusa-test
Traceback (most recent call last):
File "/home/cassandra/.local/bin/medusa", line 8, in <module>
sys.exit(cli())
File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/cassandra/.local/lib/python3.6/site-packages/click/decorators.py", line 84, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/home/cassandra/.local/lib/python3.6/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/medusacli.py", line 133, in backup
return backup_node.handle_backup(medusaconfig, backup_name, stagger_time, enable_md5_checks, mode)
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 214, in handle_backup
config
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/utils.py", line 35, in handle_exception
raise exception
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 197, in handle_backup
enable_md5_checks_flag, backup_name, config, monitoring)
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 250, in start_backup
cassandra, node_backup, storage, differential_mode, enable_md5, config, backup_name)
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/backup_node.py", line 294, in do_backup
with cassandra.create_snapshot(backup_name) as snapshot:
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/cassandra_utils.py", line 449, in create_snapshot
self.snapshot_service.create_snapshot(tag=tag)
File "/home/cassandra/.local/lib/python3.6/site-packages/medusa/service/snapshot/management_api_snapshot_service.py", line 35, in create_snapshot
raise Exception(err_msg)
Exception: failed to create snapshot: Query timed out after PT30S
I built my own Docker image to increase the REQUEST_TIMEOUT to 120s: https://github.com/k8ssandra/management-api-for-apache-cassandra/blob/master/management-api-server/src/main/java/com/datastax/mgmtapi/UnixSocketCQLAccess.java#L131
No more PT30S request timeout, but sometimes Cassandra returns an error: DriverTimeoutException: Heartbeat request: timed out after 5000 ms
I noticed that the readiness probe (/probes/readiness) from the management API couldn't get a response while a snapshot (/snapshots) was being taken. The readiness call waits until it is killed by the driver (HEARTBEAT_INTERVAL + HEARTBEAT_TIMEOUT) => that's the return code 500 I posted in the previous comments.
For those APIs that require the dbUnixSocketFile, the management-api can only handle one request at a time. Is that statement correct? @burmanm
Here is an example:
cassandra@cassandra-dc1-r1-sts-0:/$ date;curl -k -L -XPOST -H "Content-Type:application/json" 'http://localhost:8080/api/v0/ops/node/snapshots' -d '{"snapshot_name": "auto-20230605-0004"}';date
Mon 05 Jun 2023 04:06:00 AM UTC
Mon 05 Jun 2023 04:06:33 AM UTC
cassandra@cassandra-dc1-r1-sts-0:/$ date; curl -XGET 'http://localhost:8080/api/v0/probes/readiness'; date
Mon 05 Jun 2023 04:06:03 AM UTC
Mon 05 Jun 2023 04:06:33 AM UTC
For those APIs that require the dbUnixSocketFile, the management-api can only handle one request at a time. Is the statement correct?
That is not correct. The problem here is that the snapshots endpoint implementation blocks on the operation (i.e. it is not async), so the management API process isn't allowed to do anything else until the snapshot is taken.
We might have two issues here: first, the snapshot operation is not async and may take time; second, snapshot operations may take longer than the driver timeout.
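The first issue can be illustrated with a minimal, self-contained simulation in plain Python (hypothetical snapshot/readiness stand-ins, not management-api code): when a blocking handler and the readiness check share a single worker, readiness cannot respond until the snapshot finishes.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def snapshot():   # stands in for the blocking /snapshots handler
    time.sleep(0.5)
    return "snapshot done"

def readiness():  # stands in for /probes/readiness
    return "ready"

# With a single worker (the blocking case), readiness queues behind snapshot.
with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.monotonic()
    pool.submit(snapshot)
    ready = pool.submit(readiness)
    ready.result()                         # only returns once snapshot is done
    blocked_for = time.monotonic() - start

print(f"readiness waited ~{blocked_for:.1f}s behind the snapshot")
```

With an async snapshot implementation (or more workers), the readiness future would complete immediately instead of queuing for the full snapshot duration.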
@emerkle826 thanks for the feedback. Indeed, from the DataStax Java driver documentation:
The only place where the driver blocks is when using the synchronous API (methods declared in SyncCqlSession), and when calling other synchronous wrapper methods in the public API
My workaround is
When calling the management API /api/v0/ops/node/snapshots to take the snapshot, I encounter the 30s timeout error: Query timed out after PT30S
From the container logs:
I am using k8ssandra-operator V1.5.2 with k8ssandra/cass-management-api:4.0.1
Has anyone encountered such an issue? In my experience, nodetool snapshot sometimes indeed takes over 30s. Or maybe is it something about the "mgmtAPIHeap"? Is there any way to update the 30s timeout?
Regards, ShaoTing