thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Reaper reported the schema validation errors in a very large cluster #1134

Open tongjixianing opened 2 years ago

tongjixianing commented 2 years ago

The cluster has 100+ nodes and is under heavy client load.

Reaper failed to start and reported schema validation errors.

In the Reaper log, it first reported a schema agreement timeout while creating tables. maxSchemaAgreementWaitSeconds is a driver option that defaults to 10 seconds, but there is no way to configure this timeout in cassandra-reaper.yaml.

ERROR [2021-10-28 03:22:06,481] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...

org.cognitor.cassandra.migration.MigrationException: Error during migration of script 016_init_reaper_db.cql while executing 'CREATE TABLE IF NOT EXISTS running_reapers

Caused by: org.cognitor.cassandra.migration.MigrationException: Schema agreement could not be reached. You might consider increasing 'maxSchemaAgreementWaitSeconds'.
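
For readers unfamiliar with the timeout involved: after a DDL statement, the client waits until every node reports the same schema version, and gives up once the configured wait has elapsed. The following is only a minimal sketch of that idea, assuming a hypothetical `versions` supplier in place of real queries against the cluster. It is not the driver's or the migration library's actual implementation. As far as I know, the DataStax Java driver 3.x exposes the real setting through `Cluster.Builder#withMaxSchemaAgreementWaitSeconds`.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.UUID;
import java.util.function.Supplier;

public class SchemaAgreementWait {

    /**
     * Polls the schema versions reported by the nodes until they all match
     * or maxWait has elapsed. Returns true if agreement was reached in time.
     * The supplier is a hypothetical stand-in for querying each node.
     */
    static boolean waitForAgreement(Supplier<Collection<UUID>> versions,
                                    Duration maxWait,
                                    Duration pollInterval) throws InterruptedException {
        long deadline = System.nanoTime() + maxWait.toNanos();
        while (true) {
            // Agreement means every node reports the same schema version.
            if (new HashSet<>(versions.get()).size() <= 1) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        UUID oldVersion = UUID.randomUUID();
        UUID newVersion = UUID.randomUUID();
        // Simulated cluster: one node still reports the old version for the first three polls.
        int[] polls = {0};
        Supplier<Collection<UUID>> cluster = () ->
            polls[0]++ < 3
                ? Arrays.asList(newVersion, oldVersion)
                : Arrays.asList(newVersion, newVersion);

        boolean agreed = waitForAgreement(cluster, Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println("agreement reached: " + agreed);
    }
}
```

On a 100+ node cluster under load, schema propagation can plausibly take longer than a 10 second wait, which is why a configurable `maxWait` matters here.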

It then tried to alter the cluster table by adding a new column, and reported the same timeout.

ERROR [2021-10-28 03:26:47,523] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...

org.cognitor.cassandra.migration.MigrationException: Error during migration of script 017_add_custom_jmx_port.cql while executing 'ALTER TABLE cluster ADD properties text;'

   at org.cognitor.cassandra.migration.Database.execute(Database.java:269)

   at java.util.Collections$SingletonList.forEach(Collections.java:4822)

   at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)

   at io.cassandrareaper.storage.CassandraStorage.migrate(CassandraStorage.java:362)

   at io.cassandrareaper.storage.CassandraStorage.initializeCassandraSchema(CassandraStorage.java:293)

   at io.cassandrareaper.storage.CassandraStorage.initializeAndUpgradeSchema(CassandraStorage.java:250)

   at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:238)

   at io.cassandrareaper.ReaperApplication.initializeStorage(ReaperApplication.java:480)

   at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:303)

   at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:181)

   at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:98)

   at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:43)

   at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)

   at io.dropwizard.cli.Cli.run(Cli.java:78)

   at io.dropwizard.Application.run(Application.java:93)

   at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:117)

Caused by: org.cognitor.cassandra.migration.MigrationException: Schema agreement could not be reached. You might consider increasing 'maxSchemaAgreementWaitSeconds'.

After that, it tried to add the properties column to the cluster table again, but the column already existed.

It keeps looping, trying to add the existing column, and Reaper never seems to recover from this error.


ERROR [2021-10-28 03:27:02,641] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...

org.cognitor.cassandra.migration.MigrationException: Error during migration of script 017_add_custom_jmx_port.cql while executing 'ALTER TABLE cluster ADD properties text;'

   at org.cognitor.cassandra.migration.Database.execute(Database.java:269)

   at java.util.Collections$SingletonList.forEach(Collections.java:4822)

   at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)

   at io.cassandrareaper.storage.CassandraStorage.migrate(CassandraStorage.java:362)

   at io.cassandrareaper.storage.CassandraStorage.initializeCassandraSchema(CassandraStorage.java:293)

   at io.cassandrareaper.storage.CassandraStorage.initializeAndUpgradeSchema(CassandraStorage.java:250)

   at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:238)

   at io.cassandrareaper.ReaperApplication.initializeStorage(ReaperApplication.java:480)

   at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:303)

   at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:181)

   at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:98)

   at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:43)

   at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)

   at io.dropwizard.cli.Cli.run(Cli.java:78)

   at io.dropwizard.Application.run(Application.java:93)

   at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:117)

Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name properties because it conflicts with an existing column
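
The loop above happens because `ALTER TABLE ... ADD` is not idempotent: the first attempt applied the change but timed out waiting for schema agreement, so every retry hits the "conflicts with an existing column" error. One possible way around this, sketched below with a hypothetical in-memory schema rather than Reaper's or cassandra-migration's real code, is to check whether the column exists before adding it, so that a retry becomes a no-op.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class IdempotentAddColumn {

    /** Hypothetical in-memory stand-in for a keyspace: table name -> column names. */
    private final Map<String, Set<String>> tables = new HashMap<>();

    /** Mimics ALTER TABLE ... ADD: fails if the column is already present. */
    void addColumn(String table, String column) {
        Set<String> columns = tables.computeIfAbsent(table, t -> new LinkedHashSet<>());
        if (!columns.add(column)) {
            throw new IllegalStateException(
                "Invalid column name " + column + " because it conflicts with an existing column");
        }
    }

    /** Retry-safe variant: returns true if the schema changed, false if the column was already there. */
    boolean addColumnIfMissing(String table, String column) {
        Set<String> columns = tables.get(table);
        if (columns != null && columns.contains(column)) {
            return false;
        }
        addColumn(table, column);
        return true;
    }

    public static void main(String[] args) {
        IdempotentAddColumn schema = new IdempotentAddColumn();
        // The first attempt applies the change, as in the first run of the migration script.
        System.out.println(schema.addColumnIfMissing("cluster", "properties"));
        // A retry after a schema agreement timeout is now a harmless no-op.
        System.out.println(schema.addColumnIfMissing("cluster", "properties"));
    }
}
```

Against a real cluster, the existence check would have to read the table metadata. That part is left out here because it depends on the driver version in use.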

tongjixianing commented 2 years ago

I checked the schema of the cluster table, and the properties column already exists.

CREATE TABLE reaper_db.cluster (

  name text PRIMARY KEY,

  partitioner text,

  properties text,

  seed_hosts set<text>

) WITH bloom_filter_fp_chance = 0.01

  AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}

  AND comment = ''

  AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}

  AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}

  AND crc_check_chance = 1.0

  AND dclocal_read_repair_chance = 0.1

  AND default_time_to_live = 0

  AND gc_grace_seconds = 864000

  AND max_index_interval = 2048

  AND memtable_flush_period_in_ms = 0

  AND min_index_interval = 128

  AND read_repair_chance = 0.0

  AND speculative_retry = '99PERCENTILE';

Here is the list of tables in the reaper_db keyspace; it looks like all the tables have been created.

describe tables;

running_reapers                          cluster                  leader
repair_unit_v1                           schema_migration_leader
repair_schedule_by_cluster_and_keyspace  snapshot
repair_run_by_cluster                    node_metrics_v1
repair_schedule_v1                       repair_run
schema_migration                         repair_run_by_unit

jdonenine commented 2 years ago

@adejanovski this should be unblocked now, I finally got around to making the changes and they just got merged: https://github.com/dropwizard/dropwizard-cassandra/pull/106

kb-elmo commented 10 months ago

Any news on this?

It seems that there is a "maxSchemaAgreementWait" parameter mentioned in the Cassandra backend specific configuration, but changing its value doesn't seem to have any effect: the migration still times out after just 10 seconds.

adejanovski commented 10 months ago

Not for now. We'd need to move to the dropwizard-cassandra bundle to benefit from this option. We'll soon attempt to upgrade to Java 17 or 21, which will probably be the right time to upgrade all of our dependencies and switch to this bundle as well.

adejanovski commented 10 months ago

I've created #1437 to track this.