thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Cassandra Reaper 3.6.0 - Noticed bug(tombstone issue) & downgraded to 3.2.0 #1508

Open anumod1234 opened 3 weeks ago

anumod1234 commented 3 weeks ago


Hi Team,

We upgraded Cassandra Reaper from 3.2.0 to 3.6.0.

However, we noticed heavy tombstone generation on multiple tables in the Reaper keyspace and downgraded back to 3.2.0, so I thought I should bring it to your notice.

This is the first time I'm opening a ticket in this community, so please let me know if anything here is inaccurate or if you need more details.

I found that the queries against the repair_run table (DDL below) filter on the partition key column plus a non-primary-key column using ALLOW FILTERING, which is an anti-pattern for Cassandra. It degrades Cassandra cluster performance, and it also produces a lot of log messages that get pushed to Kafka, which may cause network performance issues.

Once we downgraded to 3.2.0, the issue went away, so it looks like a bug in the newer Reaper version 3.6 (I have not checked the versions between 3.2 and 3.6).

I tried reducing gc_grace_seconds to 1 day and running a garbage collect, and I also tried adding a TTL, but none of that helped.
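
For anyone who wants to reproduce the mitigation attempt, here is a rough sketch of the statements involved, built as strings in Python so the 1 day to seconds conversion is explicit. The helper names are made up for illustration; `gc_grace_seconds` is the standard table option and `nodetool garbagecollect` is the standard nodetool subcommand. The TTL value is not stated above, so it is left out. Note that tombstones younger than gc_grace_seconds are still kept, which may be one reason the attempt had limited effect.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gc_grace_statement(keyspace, table, days):
    """Build the CQL statement that sets gc_grace_seconds to `days` days."""
    return (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH gc_grace_seconds = {days * SECONDS_PER_DAY};"
    )


def garbagecollect_command(keyspace, table):
    """Build the nodetool command that garbage-collects one table."""
    return f"nodetool garbagecollect {keyspace} {table}"


if __name__ == "__main__":
    # Keyspace and table taken from the log messages in this issue.
    print(gc_grace_statement("cassandra_reaper", "repair_run", 1))
    print(garbagecollect_command("cassandra_reaper", "repair_run"))
```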

Some of the tombstone warnings found in the logs are listed below:

{"msg":"WARN [ReadStage-4] ReadCommand.java:536 - Read 2529 live rows and 2547 tombstone cells for query SELECT segment_state FROM cassandra_reaper.repair_run WHERE id = 52103b80-22e7-11ef-83b1-195f6f679993 AND segment_state = 2 LIMIT 5000; token 3000425142431644441 (see tombstone_warn_threshold)","pid":1780832,"fields":{"stream":"stdout"}}}

cassandra_reaper.repair_run

SELECT repair_unit_id, coordinator_host, end_token, fail_count, host_id, replicas, segment_end_time, segment_start_time, segment_state, start_token, token_ranges FROM cassandra_reaper.repair_run WHERE id = 18eddad0-193f-11ef-8f3e-eb3c8eea5a87 AND segment_state = 3 LIMIT 5000;

WARN [ReadStage-3] ReadCommand.java:536 - Read 1151 live rows and 1174 tombstone cells for query SELECT repair_unit_id, coordinator_host, end_token, fail_count, host_id, replicas, segment_end_time, segment_start_time, segment_state, start_token, token_ranges FROM cassandra_reaper.repair_run WHERE id = 18eddad0-193f-11ef-8f3e-eb3c8eea5a87 AND segment_state = 3 LIMIT 5000; token -4168918588338023632 (see tombstone_warn_threshold)
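
To get a feel for how widespread the warnings are, a small script along these lines can total them per table. This is only a sketch: the regular expression is written against the warning format quoted in this issue and may need adjusting for other Cassandra versions or log wrappers, and the names `TOMBSTONE_WARN` and `summarize` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the tombstone warning as it appears in the log lines quoted above.
TOMBSTONE_WARN = re.compile(
    r"Read (?P<live>\d+) live rows and (?P<tombstones>\d+) tombstone cells "
    r"for query SELECT .*? FROM (?P<table>[\w.]+) "
)

# Two of the warnings from this issue, without the JSON wrapper.
SAMPLE = [
    "WARN [ReadStage-4] ReadCommand.java:536 - Read 2529 live rows and 2547 "
    "tombstone cells for query SELECT segment_state FROM "
    "cassandra_reaper.repair_run WHERE id = 52103b80-22e7-11ef-83b1-195f6f679993 "
    "AND segment_state = 2 LIMIT 5000; token 3000425142431644441 "
    "(see tombstone_warn_threshold)",
    "WARN [ReadStage-3] ReadCommand.java:536 - Read 1151 live rows and 1174 "
    "tombstone cells for query SELECT repair_unit_id, coordinator_host FROM "
    "cassandra_reaper.repair_run WHERE id = 18eddad0-193f-11ef-8f3e-eb3c8eea5a87 "
    "AND segment_state = 3 LIMIT 5000; token -4168918588338023632 "
    "(see tombstone_warn_threshold)",
]


def summarize(lines):
    """Total warnings, live rows and tombstone cells per table."""
    totals = defaultdict(lambda: {"warnings": 0, "live": 0, "tombstones": 0})
    for line in lines:
        match = TOMBSTONE_WARN.search(line)
        if match is None:
            continue  # not a tombstone warning
        entry = totals[match.group("table")]
        entry["warnings"] += 1
        entry["live"] += int(match.group("live"))
        entry["tombstones"] += int(match.group("tombstones"))
    return dict(totals)


if __name__ == "__main__":
    for table, stats in summarize(SAMPLE).items():
        print(table, stats)
```

In practice the lines would come from the Cassandra system log (or the Kafka topic the logs are shipped to) rather than from the hard-coded `SAMPLE` list; the second sample line has its column list shortened for readability.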

Table DDL -

CREATE TABLE cassandra_reaper.repair_run (
    id timeuuid,
    segment_id timeuuid,
    adaptive_schedule boolean static,
    cause text static,
    cluster_name text static,
    creation_time timestamp static,
    end_time timestamp static,
    intensity double static,
    last_event text static,
    owner text static,
    pause_time timestamp static,
    repair_parallelism text static,
    repair_unit_id timeuuid static,
    segment_count int static,
    start_time timestamp static,
    state text static,
    tables set<text> static,
    coordinator_host text,
    end_token varint,
    fail_count int,
    host_id uuid,
    replicas frozen<map<text, text>>,
    segment_end_time timestamp,
    segment_start_time timestamp,
    segment_state int,
    start_token varint,
    token_ranges text,
    PRIMARY KEY (id, segment_id)
);

Thanks