scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Repair integration tests can take even up to 69 minutes (!) #3929

Closed. karol-kokoszka closed this issue 1 month ago.

karol-kokoszka commented 1 month ago

After a recent merge, repair integration tests can take up to 69 minutes. See https://github.com/scylladb/scylla-manager/actions/runs/9935863712/job/27442900589

It happened on 6.0.1

integration-tests-6.0.1-IPV4-tablets / Test repair (push) Successful in 69m Details

integration-tests-6.0.1-IPV6-tablets / Test repair (push) Successful in 57m Details

The other envs look more realistic, although performance still seems degraded:

integration-tests-2024.1.5-IPV4 / Test repair (push) Successful in 17m Details

integration-tests-2024.1.5-IPV4-tablets / Test repair (push) Successful in 17m Details

integration-tests-2024.1.5-IPV6-tablets / Test repair (push) Successful in 16m Details

integration-tests-6.0.1-IPV4 / Test repair (push) Successful in 11m Details

This needs to be investigated and understood.

Michal-Leszczynski commented 1 month ago

It is probably caused by some issue with the SM DB, because the following errors appear frequently in the logs:

2024-07-15T08:33:31.6064887Z 08:33:31.605   ERROR  repair.progress Update repair progress  {"key": {"Host":"192.168.200.23","Keyspace":"test_repair_2","Table":"test_table_1"}, "error": "Operation failed for test_scylla_manager.repair_run_progress - received 0 responses and 1 failures from 1 CL=QUORUM."}
...
2024-07-15T08:33:32.9520388Z 08:33:32.951   ERROR  repair.progress Update repair state {"key": {}, "error": "Operation failed for test_scylla_manager.repair_run_state - received 0 responses and 1 failures from 1 CL=QUORUM."}

Also, a small side issue: in the Update repair state {"key": {}} log line, the key is printed as an empty struct because its fields are not exported.
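For context, this is standard encoding/json behavior in Go: the marshaler silently skips unexported struct fields, so a key struct with only unexported fields prints as {}. A minimal standalone sketch (the type names below are made up for illustration, not SM's actual types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unexportedKey mimics a key whose fields are not exported:
// encoding/json skips them, so the struct marshals to "{}".
type unexportedKey struct {
	host     string
	keyspace string
}

// ExportedKey has exported fields, so they appear in the output.
type ExportedKey struct {
	Host     string
	Keyspace string
}

// marshal is a small helper that ignores the marshaling error,
// which cannot occur for these plain string structs.
func marshal(v any) string {
	b, _ := json.Marshal(v)
	return string(b)
}

func main() {
	fmt.Println(marshal(unexportedKey{host: "192.168.200.23", keyspace: "test_repair_2"}))
	// -> {}
	fmt.Println(marshal(ExportedKey{Host: "192.168.200.23", Keyspace: "test_repair_2"}))
	// -> {"Host":"192.168.200.23","Keyspace":"test_repair_2"}
}
```

Exporting the key's fields (or giving the type a String/MarshalJSON method) would make the log entry useful again.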

My suspicion is that too many ranges are inserted into the SM DB (for our small docker env) and overwhelm it.

karol-kokoszka commented 1 month ago

My suspicion is that too many ranges are inserted into the SM DB (for our small docker env) and overwhelm it.

Could you please elaborate? What ranges do we insert into the DB? Are they part of the progress calculation? Maybe we simply keep too much historical data that is basically useless.

Michal-Leszczynski commented 1 month ago

What ranges do we insert into the DB?

We make two updates after each repair job: one to the repair run state and one to the repair run progress (the repair_run_state and repair_run_progress tables seen in the errors above).

For each table with X ranges, we will have a row (in repair state) with 2X int64 values in the DB. The progress rows are constant in size, but there are #tables × #hosts of them.
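To see why this can balloon on a tablets cluster, here is a back-of-the-envelope sketch of the state-row size as a function of range count (the range counts below are made-up illustrations, not measured values; the only input from the discussion is the "2X int64 values per table with X ranges" figure):

```go
package main

import "fmt"

// estimateStateRowBytes gives a rough lower bound on the payload of one
// repair state row for a table with the given number of ranges: 2X int64
// values, 8 bytes each, ignoring serialization and row overhead.
func estimateStateRowBytes(ranges int) int {
	return 2 * ranges * 8
}

func main() {
	// Illustrative range counts only; tablets can produce far more
	// ranges per table than vnodes, inflating every state rewrite.
	for _, ranges := range []int{256, 4096, 65536} {
		fmt.Printf("ranges=%d -> ~%d KiB per state row rewrite\n",
			ranges, estimateStateRowBytes(ranges)/1024)
	}
}
```

Since this row is rewritten after each repair job, the write volume into the SM DB grows with both the range count and the job count, which would match the QUORUM write failures seen above.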

karol-kokoszka commented 1 month ago

Possibly related issue https://github.com/scylladb/scylla-enterprise/issues/4285

We need to revisit how repair progress is reported to Scylla Manager's backend DB.
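One possible direction (an entirely hypothetical sketch, not SM code or an agreed design) is to stop writing to the DB once per repair job and instead accumulate progress deltas in memory, flushing them as a single batched write per interval:

```go
package main

import (
	"fmt"
	"sync"
)

// progressAggregator batches per-key progress deltas in memory so the
// DB sees one write per flush interval instead of one per repair job.
// Hypothetical illustration only.
type progressAggregator struct {
	mu      sync.Mutex
	pending map[string]int64 // key -> accumulated repaired ranges
}

func newProgressAggregator() *progressAggregator {
	return &progressAggregator{pending: make(map[string]int64)}
}

// Record is called from the repair hot path; it only touches memory.
func (a *progressAggregator) Record(key string, ranges int64) {
	a.mu.Lock()
	a.pending[key] += ranges
	a.mu.Unlock()
}

// Flush drains the pending deltas; the caller would persist the
// returned batch with a single DB round trip.
func (a *progressAggregator) Flush() map[string]int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.pending
	a.pending = make(map[string]int64)
	return out
}

func main() {
	agg := newProgressAggregator()
	for i := 0; i < 1000; i++ {
		agg.Record("192.168.200.23/test_repair_2/test_table_1", 1)
	}
	batch := agg.Flush()
	fmt.Println(len(batch), batch["192.168.200.23/test_repair_2/test_table_1"])
}
```

The trade-off is that progress shown to the user lags by up to one flush interval, and an SM crash loses the un-flushed deltas, so this would only suit data that can be recomputed or tolerates small gaps.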