thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Reaper is failing after topology change of the existing cassandra cluster in SIDECAR mode. #993

Closed lavaraja closed 3 years ago

lavaraja commented 4 years ago

We have a Cassandra cluster and are running Reaper in SIDECAR mode. We recently re-hydrated the cluster, removing the old nodes and adding new ones. When we start a repair on the cluster, we see the error message below in the Reaper logs and the repair does not progress.

Reaper-config: ###################################

Cassandra Reaper Configuration Example.

See a bit more complete example in:

src/server/src/test/resources/cassandra-reaper.yaml

segmentCountPerNode: 64
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 15
hangingRepairTimeoutMins: 60
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
blacklistTwcsTables: true
enableDynamicSeedList: true
repairManagerSchedulingIntervalSeconds: 10
activateQueryLogger: false
jmxConnectionTimeoutInSeconds: 10
useAddressTranslator: false

purgeRecordsAfterInDays: 30

numberOfRunsToKeepPerUnit: 10

enableConcurrentMigrations: false

datacenterAvailability has four possible values: ALL | LOCAL | EACH | SIDECAR

the correct value to use depends on whether jmx ports to C* nodes in remote datacenters are accessible

If the reaper has access to all node jmx ports, across all datacenters, then configure to ALL.

If jmx access is only available to nodes in the same datacenter as reaper is running in, then configure to LOCAL.

If there's a reaper instance running in every datacenter, and it's important that nodes under duress are not involved in repairs,

then configure to EACH.

If jmx access is restricted to localhost, then configure to SIDECAR.

The default is ALL

datacenterAvailability: SIDECAR

jmxAuth:

username: myUsername

password: myPassword

logging:
  level: WARN
  loggers:
    com.datastax.driver.core.QueryLogger.NORMAL:
      level: WARN
      additive: false
      appenders:

server:
  type: default
  applicationConnectors:

cassandra:
  clusterName: "XXXXX"
  contactPoints: ["XXXXXX"]
  keyspace: reaper_db
  loadBalancingPolicy:
    type: tokenAware
    shuffleReplicas: true
    subPolicy:
      type: dcAwareRoundRobin
      localDC:
      usedHostsPerRemoteDC: 0
      allowRemoteDCsForLocalConsistencyLevel: false
  authProvider:
    type: plainText
    username: XXXX
    password: XXXXX

ssl:

##type: jdk

autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT6H
  excludedKeyspaces:

Uncomment the following to enable dropwizard metrics

Configure to the reporter of your choice

Reaper also provides prometheus metrics on the admin port at /prometheusMetrics

metrics:

frequency: 1 minute

reporters:

- type: log

logger: metrics

Authentication is enabled by default

accessControl:
  sessionTimeout: PT10M
  shiro:
    iniConfigs: ["classpath:shiro.ini"]

################################################

Error:

java.lang.IllegalArgumentException: Trying to add/update cluster using an existing name: poc_cassandra_aws_cluster. No nodes overlap between 10.24.78.217,10.24.78.249,10.24.78.63,10.24.79.19,10.24.79.227,10.24.79.99 and 10.24.76.205,10.24.78.93,10.24.78.189,10.24.76.132,10.24.79.214,10.24.79.119
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:412)
    at io.cassandrareaper.storage.CassandraStorage.addClusterAssertions(CassandraStorage.java:640)
    at io.cassandrareaper.storage.CassandraStorage.addCluster(CassandraStorage.java:602)
    at io.cassandrareaper.storage.CassandraStorage.updateCluster(CassandraStorage.java:621)
    at io.cassandrareaper.service.RepairRunner.updateClusterNodeList(RepairRunner.java:306)
    at io.cassandrareaper.service.RepairRunner.run(RepairRunner.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:117)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:38)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
    at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

The reaper.log file is flooded with the above message and has grown to 15 GB of the same error. To fix the issue, we stopped Reaper on all the nodes, dropped the reaper_db keyspace, and restarted the Reaper service. We see this issue specifically when the cluster topology changes, i.e. when nodes are added or removed.

Any solution for this?

adejanovski commented 3 years ago

This happens when all nodes have changed in the cluster without Reaper being updated with the new cluster definition. My recommendation here would be to re-register the cluster (through the REST API for example) when changing the topology, which will update the list of nodes in the cluster definition in reaper_db.
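As a rough illustration of re-registering through the REST API, here is a minimal Python sketch that only builds the request and does not send it. It assumes the PUT /cluster/{name}?seedHost=...&jmxPort=... form used in the curl command later in this thread. The function name, base URL, and seed host are hypothetical placeholders. Actually sending the request may also require an authenticated session, since access control is enabled by default in the config above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_reregister_request(reaper_url, cluster_name, seed_host, jmx_port=None):
    """Build (but do not send) a PUT /cluster/{name} request.

    Hypothetical helper: the endpoint shape is taken from the curl
    command quoted in this thread, not from a verified API reference.
    """
    params = {"seedHost": seed_host}
    if jmx_port is not None:
        params["jmxPort"] = str(jmx_port)
    # Percent-encode the cluster name so it is safe as a path segment.
    path = f"/cluster/{quote(cluster_name, safe='')}"
    url = f"{reaper_url.rstrip('/')}{path}?{urlencode(params)}"
    return Request(url, method="PUT")


# Placeholder values; replace with your Reaper address and a live node.
req = build_reregister_request("http://localhost:8080", "myCluster", "10.0.0.1", 7199)
print(req.get_method(), req.full_url)
```

Passing the resulting object to urllib.request.urlopen would perform the call. The seed host should be a node that is currently part of the cluster.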

lavaraja commented 3 years ago

Thank you. Could you please share the command to re-register the cluster via the REST API, or to update the new node information in reaper_db? We are running Reaper in SIDECAR mode.

adejanovski commented 3 years ago

From the docs:

[Screenshot, 2021-01-19 at 10:49:31]

You can also use the spreaper add-cluster command, or even re-register the cluster through the UI; I think that should work as well.

Plenty of options ;)

lavaraja commented 3 years ago

Thank you. It worked.

miguelmduarte commented 3 years ago

Hey @adejanovski, we are trying to re-register the cluster via the HTTP API, but when we perform the PUT request we get the same error:

curl --location --request PUT 'myHostName:8080/cluster/myCluster?seedHost=myHostName&jmxPort=9999'

And we get "There was an error processing your request. It has been logged (ID 9210ff6a668661a0).", with the error below in the logs:

PUT /cluster/myHostName?seedHost=myHostName&jmxPort=9999] i.d.j.e.LoggingExceptionMapper - Error handling a request: 9210ff6a668661a0
java.lang.IllegalArgumentException: Trying to add/update cluster using an existing name: collectors-data-store. No nodes overlap between 10.112.189.130,10.112.189.133,10.112.189.137 and 10.112.189.135,10.112.189.136,10.112.189.140

adejanovski commented 3 years ago

That's because all the nodes changed IP, it seems. We have a mechanism to prevent Reaper from mixing up clusters that have the same name (it sadly happens...), so if you try to register a cluster that already exists but has no nodes in common with the stored definition, we consider it a different cluster and prevent its registration. You then need to delete the cluster and recreate it.
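To make the overlap rule concrete, here is a minimal Python sketch of the idea. It is not Reaper's actual implementation (that lives in CassandraStorage.addClusterAssertions, per the stack trace above). The function names are hypothetical, and the replaced_some list is an invented example; the other addresses come from the error message in this thread.

```python
def overlapping_nodes(stored, discovered):
    """Return the addresses present in both node lists."""
    return set(stored) & set(discovered)


def can_update_cluster(stored, discovered):
    """Hypothetical stand-in for the guard: allow an update under an
    existing cluster name only if at least one node is shared."""
    return bool(overlapping_nodes(stored, discovered))


# Node list stored in reaper_db vs. the list discovered from the seed host.
stored = ["10.112.189.130", "10.112.189.133", "10.112.189.137"]
replaced_all = ["10.112.189.135", "10.112.189.136", "10.112.189.140"]
# Invented example: one original node is still present.
replaced_some = ["10.112.189.130", "10.112.189.136", "10.112.189.140"]

print(can_update_cluster(stored, replaced_all))   # False: treated as a different cluster
print(can_update_cluster(stored, replaced_some))  # True: update is accepted
```

Under this rule, re-registering while at least one original node is still part of the cluster lets the stored node list be updated. Once every node has been replaced, only deleting and recreating the cluster works.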