thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Running migration fails if a query times out #126

Closed ostefano closed 7 years ago

ostefano commented 7 years ago

DDL statements (when using Cassandra as storage) can be quite expensive, so they might take longer than usual. The current 'master' fails the migration if any of the migration scripts times out.


Exception in thread "main" org.cognitor.cassandra.migration.MigrationException: Error during migration of script 003_switch_to_uuids.cql while executing 'DROP TABLE IF EXISTS repair_run_by_cluster'
    at org.cognitor.cassandra.migration.Database.execute(Database.java:159)
    at java.util.ArrayList.forEach(ArrayList.java:1249)
    at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:52)
    at com.spotify.reaper.storage.CassandraStorage.<init>(CassandraStorage.java:96)
    at com.spotify.reaper.ReaperApplication.initializeStorage(ReaperApplication.java:202)
    at com.spotify.reaper.ReaperApplication.run(ReaperApplication.java:129)
    at com.spotify.reaper.ReaperApplication.run(ReaperApplication.java:64)
    at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:43)
    at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:85)
    at io.dropwizard.cli.Cli.run(Cli.java:75)
    at io.dropwizard.Application.run(Application.java:79)
    at com.spotify.reaper.ReaperApplication.main(ReaperApplication.java:84)
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/10.12.81.4:9042] Timed out waiting for server response
    at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
    at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
    at org.cognitor.cassandra.migration.Database.executeStatement(Database.java:167)
    at org.cognitor.cassandra.migration.Database.execute(Database.java:151)
    ... 11 more
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/10.12.81.4:9042] Timed out waiting for server response
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:766)
    at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1267)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
    at java.lang.Thread.run(Thread.java:748)

adejanovski commented 7 years ago

Hi @ostefano,

we need all migrations to be performed before running Reaper; otherwise it cannot work properly. What you can do is perform the migrations yourself and update the schema_migration table by hand: https://github.com/thelastpickle/cassandra-reaper/tree/master/src/main/resources/db/cassandra

You need to match the version numbers and have applied_successful set to True.

ostefano commented 7 years ago

Hi @adejanovski,

yes, makes sense. I was wondering if we could increase the timeout so that the query can succeed on busy clusters.

adejanovski commented 7 years ago

You can raise the timeout on the client side by changing the read timeout in the SocketOptions:

cassandra:
  clusterName: "test"
  contactPoints: ["127.0.0.1"]
  keyspace: reaper_db
  queryOptions:
    consistencyLevel: LOCAL_QUORUM
    serialConsistencyLevel: SERIAL
  socketOptions:
    readTimeoutMillis: 20000

You'll then be limited by the server-side timeouts, which are governed by the write_request_timeout_in_ms setting in the cassandra.yaml file. A schema migration is a set of mutations, so it times out if any of those mutations times out.

The stack trace you pasted shows a client-side timeout though, so you should not need to change the nodes' configuration.
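
For reference, the server-side limit mentioned above is a per-node setting. Below is a minimal sketch of the relevant cassandra.yaml line, assuming a stock configuration where the default is 2000 ms. The value 10000 is purely illustrative and is not a recommendation made in this thread:

```yaml
# cassandra.yaml, on each Cassandra node (this is not the Reaper config file).
# Assumption: 10000 is an illustrative value; the stock default is 2000.
write_request_timeout_in_ms: 10000
```

Edits to cassandra.yaml are normally picked up only when the node restarts. For the client-side timeout reported in this issue, raising readTimeoutMillis in the Reaper configuration should be sufficient on its own.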

ostefano commented 7 years ago

Thanks, I missed that option. Nice! Closing the issue accordingly.