Closed karol-kokoszka closed 1 month ago
It is probably caused by some issue with SM DB, because the following error happens a lot in the logs:
2024-07-15T08:33:31.6064887Z 08:33:31.605 ERROR repair.progress Update repair progress {"key": {"Host":"192.168.200.23","Keyspace":"test_repair_2","Table":"test_table_1"}, "error": "Operation failed for test_scylla_manager.repair_run_progress - received 0 responses and 1 failures from 1 CL=QUORUM."}
...
2024-07-15T08:33:32.9520388Z 08:33:32.951 ERROR repair.progress Update repair state {"key": {}, "error": "Operation failed for test_scylla_manager.repair_run_state - received 0 responses and 1 failures from 1 CL=QUORUM."}
Also, a small side issue: in the Update repair state {"key": {} log, the key is printed as an empty struct because it's not exported.
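The empty key can be reproduced in isolation. The following is a minimal sketch, assuming the logger serializes the key with encoding/json semantics (not confirmed in this thread). The type names unexportedKey and exportedKey and the helper render are hypothetical and are not taken from the Scylla Manager code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unexportedKey is a hypothetical stand-in for a state key whose fields are
// all unexported, so encoding/json cannot see them.
type unexportedKey struct {
	keyspace string
	table    string
}

// exportedKey carries the same data in exported fields.
type exportedKey struct {
	Keyspace string
	Table    string
}

// render marshals v to JSON and returns the result as a string.
func render(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "marshal error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Unexported fields are skipped, which leaves an empty JSON object.
	fmt.Println(render(unexportedKey{keyspace: "test_repair_2", table: "test_table_1"}))
	// Exported fields are serialized under their field names.
	fmt.Println(render(exportedKey{Keyspace: "test_repair_2", Table: "test_table_1"}))
}
```

If this is indeed the cause, exporting the key's fields (or giving the type its own marshalling method) would make the key visible in the log.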
My suspicion is that too many ranges are inserted into the SM DB (relative to our small docker env), which overwhelms it.
Could you please elaborate on it? What ranges do we insert to the DB? Is it a part of the progress calculation? Maybe we have too much historical data that is basically useless.
What ranges do we insert to the DB?
We make 2 updates after each repair job:
For each table with X ranges, we will have a row (in repair state) with 2X int64 values in the DB. The progress rows are constant in size, but there are #tables * #hosts of them.
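To get a feel for the volumes involved, here is a rough back-of-the-envelope sketch. It counts only the raw int64 payload (8 bytes per value) and ignores any serialization or storage overhead. The function names and the sample numbers in main are hypothetical and are not measurements from the failing run.

```go
package main

import "fmt"

// stateRowBytes estimates the raw payload of one repair state row,
// assuming 2 int64 values (8 bytes each) per range.
func stateRowBytes(ranges int) int {
	return 2 * ranges * 8
}

// progressRowCount returns the number of progress rows,
// assuming one constant-size row per (table, host) pair.
func progressRowCount(tables, hosts int) int {
	return tables * hosts
}

func main() {
	// Hypothetical cluster shape, chosen only for illustration.
	const tables, hosts, rangesPerTable = 10, 6, 1536

	fmt.Printf("state row payload per table: %d bytes\n", stateRowBytes(rangesPerTable))
	fmt.Printf("progress rows: %d\n", progressRowCount(tables, hosts))
}
```

If the state row is rewritten in full after each repair job (an assumption, not stated above), the write volume grows with the number of jobs times this row size, which may matter more than the size of any single row.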
Possibly related issue https://github.com/scylladb/scylla-enterprise/issues/4285
We need to revisit how the repair progress is reported to Scylla Manager's backend DB.
After a recent merge, it appears that the repair integration tests can take up to 69 minutes. See https://github.com/scylladb/scylla-manager/actions/runs/9935863712/job/27442900589
It happened on 6.0.1
integration-tests-6.0.1-IPV4-tablets / Test repair (push) Successful in 69m
integration-tests-6.0.1-IPV6-tablets / Test repair (push) Successful in 57m
The other envs look better and more realistic, although performance still seems to be degraded:
integration-tests-2024.1.5-IPV4 / Test repair (push) Successful in 17m
integration-tests-2024.1.5-IPV4-tablets / Test repair (push) Successful in 17m
integration-tests-2024.1.5-IPV6-tablets / Test repair (push) Successful in 16m
integration-tests-6.0.1-IPV4 / Test repair (push) Successful in 11m
It must be checked and understood.