thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Failed creating a merkle tree for [repair #c0c157f0-e14a-11ee-b320-bdc4e5fd08de on reaper_current/running_repairs, [(-8896895687978311387,-8888847627918162438], (-881535601694419919,-867088063190097011], (1356177826732174702,1357186491253880239], (6263469266031338450,6279809997892748801]]], /<IP>:7000 #1486

Open kapilgit123 opened 7 months ago

kapilgit123 commented 7 months ago

[ERROR] [ValidationExecutor:4] 2024-03-13 10:02:47,558 ValidationManager.java:173 - Validation failed.
java.lang.RuntimeException: Parent repair session with id = c0bb8b90-e14a-11ee-b320-bdc4e5fd08de has failed.
    at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:690)
    at org.apache.cassandra.db.repair.CassandraValidationIterator.getSSTablesToValidate(CassandraValidationIterator.java:116)
    at org.apache.cassandra.db.repair.CassandraValidationIterator.<init>(CassandraValidationIterator.java:203)
    at org.apache.cassandra.db.repair.CassandraTableRepairManager.getValidationIterator(CassandraTableRepairManager.java:51)
    at org.apache.cassandra.repair.ValidationManager.getValidationIterator(ValidationManager.java:89)
    at org.apache.cassandra.repair.ValidationManager.doValidation(ValidationManager.java:112)
    at org.apache.cassandra.repair.ValidationManager.access$000(ValidationManager.java:41)
    at org.apache.cassandra.repair.ValidationManager$1.call(ValidationManager.java:162)
    at java.util.concurrent.FutureTask.run(FutureTask.java:277)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:826)
[ERROR] [ValidationExecutor:4] 2024-03-13 10:02:47,558 CassandraDaemon.java:581 - Exception in thread Thread[ValidationExecutor:4,1,main]
java.lang.RuntimeException: Parent repair session with id = c0bb8b90-e14a-11ee-b320-bdc4e5fd08de has failed.
    at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:690)
    at org.apache.cassandra.db.repair.CassandraValidationIterator.getSSTablesToValidate(CassandraValidationIterator.java:116)
    at org.apache.cassandra.db.repair.CassandraValidationIterator.<init>(CassandraValidationIterator.java:203)
    at org.apache.cassandra.db.repair.CassandraTableRepairManager.getValidationIterator(CassandraTableRepairManager.java:51)
    at org.apache.cassandra.repair.ValidationManager.getValidationIterator(ValidationManager.java:89)
    at org.apache.cassandra.repair.ValidationManager.doValidation(ValidationManager.java:112)
    at org.apache.cassandra.repair.ValidationManager.access$000(ValidationManager.java:41)
    at org.apache.cassandra.repair.ValidationManager$1.call(ValidationManager.java:162)
    at java.util.concurrent.FutureTask.run(FutureTask.java:277)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:826)

Please note that we ran the ./nodetool scrub command to see whether it would resolve the issue, but we get the same errors on all 6 Cassandra nodes. The issue occurs for all keyspaces and tables on each Cassandra node.

Cassandra version: 3.11.6
Reaper version: 1.1.0

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: REAP-12

kapilgit123 commented 7 months ago

@adejanovski

Please let me know if any other details are required for this issue

adejanovski commented 7 months ago

@kapilgit123, I sure hope you're not using Reaper 1.1.0 😅

These stack traces don't show why validation failed. It could be that the segment hit its timeout, so you should check the Reaper logs to see how long this segment had been running. If that's the case, the adaptive nature of the repairs should extend the timeout on subsequent attempts (assuming you're running a recent version of Reaper). Otherwise, you can change the segment timeout for this repair explicitly (or change the default timeout globally).

That's just an assumption and should be verified by checking the logs more thoroughly in both Reaper and Cassandra.
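As a starting point for that log check, here is a minimal Python sketch that pulls the failed parent repair session ids out of log lines and then collects every line mentioning a given id. It assumes the message format shown in the stack trace above ("Parent repair session with id = ... has failed"); the function names and the log path in the usage note are hypothetical, not part of Reaper or Cassandra.

```python
import re
from typing import Iterable, List

# Assumption: session ids appear in messages formatted like the one in this issue:
# "Parent repair session with id = <uuid> has failed."
PARENT_SESSION_RE = re.compile(
    r"Parent repair session with id = "
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def failed_parent_sessions(lines: Iterable[str]) -> List[str]:
    """Return the distinct parent session ids found, in first-seen order."""
    seen: List[str] = []
    for line in lines:
        for session_id in PARENT_SESSION_RE.findall(line):
            if session_id not in seen:
                seen.append(session_id)
    return seen


def lines_mentioning(lines: Iterable[str], session_id: str) -> List[str]:
    """Return every line that contains the given session id, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if session_id in line]
```

To use it, read a node's log into a list (for example `open("system.log").readlines()`, where the path is a placeholder for wherever your logs live), call `failed_parent_sessions` to get the ids, then run `lines_mentioning` over the Cassandra and Reaper logs for each id to see what happened to that session before validation failed.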

kapilgit123 commented 7 months ago

@adejanovski

I just confirmed that the versions are as follows: Cassandra 4.0.10 and Reaper 3.3.1.