thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0
490 stars 218 forks source link

(Re-)Deployment fails with "com.datastax.driver.core.exceptions.AlreadyExistsException: Table reaper_db.running_repairs already exists" exception #1476

Closed FieteO closed 8 months ago

FieteO commented 9 months ago

Project board link

I observed the following stack trace for a reaper deployment:

WARN [2024-02-28 07:05:19,800] com.datastax.driver.core.utils.UUIDs: PID returned through native call was 0, JNR versions incompatible? Falling back to JMX.
INFO [2024-02-28 07:05:21,728] [main] o.e.j.u.log - Logging initialized @2703ms to org.eclipse.jetty.util.log.Slf4jLog
INFO [2024-02-28 07:05:21,861] [main] i.c.s.InitializeStorage - Initializing the database and performing schema migrations
INFO [2024-02-28 07:05:21,879] [main] c.d.d.core - DataStax Java driver 3.11.0 for Apache Cassandra
INFO [2024-02-28 07:05:21,883] [main] c.d.d.c.GuavaCompatibility - Detected Guava >= 19 in the classpath, using modern compatibility layer
INFO [2024-02-28 07:05:22,035] [main] c.d.d.c.ClockFactory - Using native clock to generate timestamps.
INFO [2024-02-28 07:05:22,237] [main] c.d.d.c.NettyUtil - Did not find Netty's native epoll transport in the classpath, defaulting to NIO.
INFO [2024-02-28 07:05:23,083] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using provided data-center name 'dc1' for DCAwareRoundRobinPolicy
INFO [2024-02-28 07:05:23,085] [main] c.d.d.c.Cluster - New Cassandra host cassandra-dc1-service/123.456.789.1:9042 added
INFO [2024-02-28 07:05:23,086] [main] c.d.d.c.Cluster - New Cassandra host cassandra-dc1-service/123.456.789.2:9042 added
INFO [2024-02-28 07:05:23,086] [main] c.d.d.c.Cluster - New Cassandra host cassandra-dc1-service/123.456.789.3:9042 added
INFO [2024-02-28 07:05:24,166] [main] o.c.c.m.MigrationRepository - Found 17 migration scripts
WARN [2024-02-28 07:05:24,166] [main] i.c.s.c.MigrationManager - Starting db migration from 26 to 32…
WARN [2024-02-28 07:05:24,178] [main] i.c.s.c.MigrationManager - Database migration is happenning with other reaper instances possibly running. Found []
INFO [2024-02-28 07:05:24,404] [main] o.c.c.m.MigrationRepository - Found 17 migration scripts
WARN [2024-02-28 07:05:24,602] [clustername-worker-3] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='insert into schema_migration(applied_successful, version, script_name, script, executed_at) values(?, ?, ?, ?, ?)'
WARN [2024-02-28 07:05:24,619] [clustername-worker-4] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='INSERT INTO schema_migration_leader (keyspace_name, leader, took_lead_at, leader_hostname) VALUES (?, ?, dateOf(now()), ?) IF NOT EXISTS USING TTL 300'
WARN [2024-02-28 07:05:24,630] [clustername-worker-5] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='DELETE FROM schema_migration_leader where keyspace_name = ? IF leader = ?'
org.cognitor.cassandra.migration.MigrationException: Error during migration of script 027_concurrent_repairs_part2.cql while executing 'CREATE TABLE running_repairs( repair_id uuid, node text, reaper_instance_host text, reaper_instance_id uuid, segment_id uuid, PRIMARY KEY(repair_id, node)) WITH CLUSTERING ORDER BY (node ASC) AND default_time_to_live = 300 AND gc_grace_seconds = 300 AND compaction = {'class':'LeveledCompactionStrategy'};'
at org.cognitor.cassandra.migration.Database.execute(Database.java:269)
at java.base/java.util.Collections$SingletonList.forEach(Collections.java:4856)
at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)
at io.cassandrareaper.storage.cassandra.MigrationManager.migrate(MigrationManager.java:171)
at io.cassandrareaper.storage.cassandra.MigrationManager.initializeCassandraSchema(MigrationManager.java:101)
at io.cassandrareaper.storage.cassandra.MigrationManager.initializeAndUpgradeSchema(MigrationManager.java:59)
at io.cassandrareaper.storage.cassandra.CassandraStorageFacade.<init>(CassandraStorageFacade.java:144)
at io.cassandrareaper.storage.InitializeStorage.initializeStorageBackend(InitializeStorage.java:66)
at io.cassandrareaper.ReaperDbMigrationCommand.run(ReaperDbMigrationCommand.java:53)
at io.cassandrareaper.ReaperDbMigrationCommand.run(ReaperDbMigrationCommand.java:30)
at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
at io.dropwizard.cli.Cli.run(Cli.java:78)
at io.dropwizard.Application.run(Application.java:94)
at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:99)
Caused by: com.datastax.driver.core.exceptions.AlreadyExistsException: Table reaper_db.running_repairs already exists
at com.datastax.driver.core.exceptions.AlreadyExistsException.copy(AlreadyExistsException.java:107)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
at org.cognitor.cassandra.migration.Database.executeStatement(Database.java:277)
at org.cognitor.cassandra.migration.Database.execute(Database.java:261)
... 13 more
Caused by: com.datastax.driver.core.exceptions.AlreadyExistsException: Table reaper_db.running_repairs already exists
at com.datastax.driver.core.exceptions.AlreadyExistsException.copy(AlreadyExistsException.java:128)
at com.datastax.driver.core.Responses$Error.asException(Responses.java:185)
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:215)
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:235)
at com.datastax.driver.core.RequestHandler.access$2600(RequestHandler.java:61)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:1011)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:814)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1290)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1208)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
at com.datastax.driver.core.InboundTrafficMeter.channelRead(InboundTrafficMeter.java:38)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.datastax.driver.core.exceptions.AlreadyExistsException: Table reaper_db.running_repairs already exists
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:128)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:65)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:297)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:268)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
... 30 more

To me it looks like the schema creation job is not idempotent. If it would help and already fix the issue, I could prepare a PR to change the CREATE TABLE in 027_concurrent_repairs_part2.cql to CREATE TABLE IF NOT EXISTS.

Shivam0609 commented 8 months ago

Faced Similar issue

Stacktrace :

INFO   [2024-03-08 10:11:22,462] [main] o.e.j.u.log - Logging initialized @1081ms to org.eclipse.jetty.util.log.Slf4jLog
INFO   [2024-03-08 10:11:22,499] [main] i.c.s.InitializeStorage - Initializing the database and performing schema migrations
INFO   [2024-03-08 10:11:22,515] [main] c.d.d.core - DataStax Java driver 3.11.0 for Apache Cassandra
INFO   [2024-03-08 10:11:22,516] [main] c.d.d.c.GuavaCompatibility - Detected Guava >= 19 in the classpath, using modern compatibility layer
INFO   [2024-03-08 10:11:22,532] [main] c.d.d.c.ClockFactory - Using native clock to generate timestamps.
INFO   [2024-03-08 10:11:22,602] [main] c.d.d.c.NettyUtil - Did not find Netty's native epoll transport in the classpath, defaulting to NIO.
INFO   [2024-03-08 10:11:22,705] [main] c.d.d.c.Cluster - Cannot connect with protocol version V5, trying with V4
INFO   [2024-03-08 10:11:22,807] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using data-center name 'datacenter' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
INFO   [2024-03-08 10:11:22,808] [main] c.d.d.c.Cluster - New Cassandra host cassandra-datacenter-service/10.1.8.161:9042 added
INFO   [2024-03-08 10:11:22,808] [main] c.d.d.c.Cluster - New Cassandra host cassandra-datacenter-service/10.1.7.221:9042 added
INFO   [2024-03-08 10:11:22,808] [main] c.d.d.c.Cluster - New Cassandra host cassandra-datacenter-service/10.1.9.78:9042 added
INFO   [2024-03-08 10:11:22,904] [main] o.c.c.m.MigrationRepository - Found 16 migration scripts
WARN   [2024-03-08 10:11:22,905] [main] i.c.s.CassandraStorage - Starting db migration from 28 to 31…
WARN   [2024-03-08 10:11:22,908] [main] i.c.s.CassandraStorage - Database migration is happenning with other reaper instances possibly running. Found []
INFO   [2024-03-08 10:11:22,939] [main] o.c.c.m.MigrationRepository - Found 16 migration scripts
WARN   [2024-03-08 10:11:22,946] [clustername-worker-0] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='insert into schema_migration(applied_successful, version, script_name, script, executed_at) values(?, ?, ?, ?, ?)'
WARN   [2024-03-08 10:11:22,949] [clustername-worker-0] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='INSERT INTO schema_migration_leader (keyspace_name, leader, took_lead_at, leader_hostname) VALUES (?, ?, dateOf(now()), ?) IF NOT EXISTS USING TTL 300'
WARN   [2024-03-08 10:11:22,951] [clustername-worker-0] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='DELETE FROM schema_migration_leader where keyspace_name = ? IF leader = ?'
org.cognitor.cassandra.migration.MigrationException: Error during migration of script 029_adaptive_repairs.cql while executing 'ALTER TABLE repair_unit_v1 ADD timeout int;'
        at org.cognitor.cassandra.migration.Database.execute(Database.java:269)
        at java.base/java.util.Collections$SingletonList.forEach(Collections.java:4856)
        at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)
        at io.cassandrareaper.storage.CassandraStorage.migrate(CassandraStorage.java:376)
        at io.cassandrareaper.storage.CassandraStorage.initializeCassandraSchema(CassandraStorage.java:307)
        at io.cassandrareaper.storage.CassandraStorage.initializeAndUpgradeSchema(CassandraStorage.java:265)
        at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:252)
        at io.cassandrareaper.storage.InitializeStorage.initializeStorageBackend(InitializeStorage.java:65)
        at io.cassandrareaper.ReaperDbMigrationCommand.run(ReaperDbMigrationCommand.java:53)
        at io.cassandrareaper.ReaperDbMigrationCommand.run(ReaperDbMigrationCommand.java:30)
        at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
        at io.dropwizard.cli.Cli.run(Cli.java:78)
        at io.dropwizard.Application.run(Application.java:94)
        at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:105)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name timeout because it conflicts with an existing column
        at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
        at org.cognitor.cassandra.migration.Database.executeStatement(Database.java:277)
        at org.cognitor.cassandra.migration.Database.execute(Database.java:261)
        ... 13 more
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name timeout because it conflicts with an existing column
        at com.datastax.driver.core.Responses$Error.asException(Responses.java:181)
        at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:215)
        at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:235)
        at com.datastax.driver.core.RequestHandler.access$2600(RequestHandler.java:61)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:1011)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:814)
        at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1290)
        at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1208)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
        at com.datastax.driver.core.InboundTrafficMeter.channelRead(InboundTrafficMeter.java:38)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Shivam0609 commented 8 months ago

@FieteO , are you planning to raise a PR ?

I faced similar issue but in 029_adaptive_repairs.cql. So there might be similar issue in other queries that also needs to be checked.

FieteO commented 8 months ago

@Shivam0609 Yes, I will give it a shot. I am not sure how to reproducably test it though

Shivam0609 commented 8 months ago

@Shivam0609 Yes, I will give it a shot. I am not sure how to reproducably test it though

Maybe someone from maintainers can run E2E tests.

adejanovski commented 8 months ago

Hi folks, there are some calls in the migration scripts which we cannot make idempotent anyway. All the ALTER TABLE calls which add a new column are such calls. If a file containing such a migration fails mid course, it's going to get retried on the next run and fail with the kind of error you're seeing. No other possibility there than dropping the keyspace and starting from scratch. Adding the IF NOT EXISTS to the reported CREATE TABLE call is definitely something we should do though as it's one less statement that's not idempotent.