scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Add separate parallel/intensity control for tablets #3792

Closed. Michal-Leszczynski closed this issue 3 weeks ago.

Michal-Leszczynski commented 2 months ago

As per tablet repair design, SM should introduce separate flags for controlling repair parallelism/intensity for tablet tables. Those new flags should also be supported by the sctool repair control command.

As per the mentioned doc, the default for --tablet-intensity should be the number of shards (even regardless of max_repair_ranges_in_parallel). I'm not aware of any reasonable limit for this value, so I suppose that SM won't cap it at any point.

I'm wondering whether it makes sense to introduce a --tablet-parallel flag (@tgrabiec please advise). If there is a need for this flag, then I believe that it should behave in the same way as the --parallel flag. This means that it has the same default (1) and the same cap.

cc: @karol-kokoszka @tzach @tgrabiec @asias

Michal-Leszczynski commented 2 months ago

Also, it might be confusing for the user that the default value of --intensity is 1, meaning that we want to repair with the lowest intensity by default. On the other hand, the suggested default value of --tablet-intensity is the number of shards, which sounds like something that utilizes the repaired node much more.

asias commented 2 months ago

We already have enough parameters for the parallel intensity. I think it is better to use the existing ones but adjust them automatically and internally for the tablet tables when scheduling. It would be a nightmare for users to figure out the exact meaning of --tablet-intensity, --tablet-parallel, --intensity and --parallel.

E.g., for tablets, if --intensity is 1, we make sure SM sends 1 range per shard when scheduling.

Michal-Leszczynski commented 2 months ago

@asias In theory I like this approach, but what would happen when setting --intensity 0 (max supported intensity)? Should SM send max_repair_ranges_in_parallel * number_of_shards token ranges for a single job?

@tzach what do you think?

tzach commented 2 months ago

We already have enough parameters for the parallel intensity.

Since Tablets are new, we will likely need to update the tablet default in the field on short notice while keeping the vNode intensity as is (or vice versa).

So, I suggest having different APIs for Tablets and vNodes as long as we have a mixed cluster with both. It does add complexity, but one does not have to set any of these values.

The suggested default above makes sense.

asias commented 2 months ago

We already have enough parameters for the parallel intensity. I think it is better to use the existing ones but adjust them automatically and internally for the tablet tables when scheduling. It would be a nightmare for users to figure out the exact meaning of --tablet-intensity, --tablet-parallel, --intensity and --parallel.

E.g., for tablets, if --intensity is 1, we make sure SM sends 1 range per shard when scheduling.

@asias In theory I like this approach, but what would happen when setting --intensity 0 (max supported intensity)? Should SM send max_repair_ranges_in_parallel * number_of_shards token ranges for a single job?

With --intensity 0, SM still sends max_repair_ranges_in_parallel ranges per shard, resulting in max_repair_ranges_in_parallel * number_of_shards ranges in total. We know which shard owns the range for tablet, so SM could select the correct ranges for each shard easily.
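
For illustration, a minimal sketch of the arithmetic in this exchange, assuming intensity 0 means "max" and intensity is capped at max_repair_ranges_in_parallel; the names `maxRanges` and `shards` are placeholders, not actual SM identifiers:

```go
package main

import "fmt"

// rangesPerJob sketches the per-shard interpretation discussed above:
// the effective intensity is capped at max_repair_ranges_in_parallel
// (0 means "use the maximum"), and the job carries that many ranges
// for each shard.
func rangesPerJob(intensity, maxRanges, shards int) int {
	if intensity == 0 || intensity > maxRanges { // 0 means "max supported intensity"
		intensity = maxRanges
	}
	return intensity * shards
}

func main() {
	// E.g. 8 shards with max_repair_ranges_in_parallel = 16.
	fmt.Println(rangesPerJob(0, 16, 8)) // 128: max_repair_ranges_in_parallel ranges on each shard
	fmt.Println(rangesPerJob(1, 16, 8)) // 8: one range per shard
}
```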

@tzach what do you think?

asias commented 2 months ago

We already have enough parameters for the parallel intensity.

Since Tablets are new, we will likely need to update the tablet default in the field on short notice while keeping the vNode intensity as is (or vice versa).

So, I suggest having different APIs for Tablets and vNodes as long as we have a mixed cluster with both. It does add complexity, but one does not have to set any of these values.

If we expose the new options, users will start to use them. It is confusing to set values differently for vnode and tablet tables.

In the long term, scylla core is going to schedule and control the intensity internally once we have a built-in repair scheduler which works better with tablet migration, e.g., no need to disable migration.

The suggested default above makes sense.

tzach commented 2 months ago

If we expose the new options, users will start to use them. It is confusing to set values differently for vnode and tablet tables.

The reality is these are two different parameters with two different defaults. Coupling them under one API will break the first time a user needs to update only one of the two (probably soon).

In the long term, scylla core is going to schedule and control the intensity internally once we have a built-in repair scheduler which works better with tablet migration, e.g., no need to disable migration.

+1

asias commented 2 months ago

If we expose the new options, users will start to use them. It is confusing to set values differently for vnode and tablet tables.

The reality is these are two different parameters with two different defaults. Coupling them under one API will break the first time a user needs to update only one of the two (probably soon).

No. The defaults are the same for both vnode and tablet tables. The meaning of the intensity is also the same, which controls the number of ranges repaired per shard.

E.g.,

Intensity specifies how many token ranges can be repaired in a Scylla node at any given time. The default intensity is one; you can change that using the [sctool repair --intensity flag](https://manager.docs.scylladb.com/stable/sctool/repair.html#sctool-repair).

The default --intensity 1 still makes sense for tablets. There is no need to change it. The only difference is internal to SM, which sends one range per shard to scylla core (i.e., number_of_shards ranges to scylla core), but this does not change the promise that one range per shard is repaired, which is exactly what we do for vnode tables today.

In the long term, scylla core is going to schedule and control the intensity internally once we have a built-in repair scheduler which works better with tablet migration, e.g., no need to disable migration.

+1

Michal-Leszczynski commented 2 months ago

With --intensity 0, SM still sends max_repair_ranges_in_parallel ranges per shard, resulting in max_repair_ranges_in_parallel * number_of_shards ranges in total. We know which shard owns the range for tablet, so SM could select the correct ranges for each shard easily.

@asias Just to clarify: Right now, when intensity is set to 1, SM sends just 1 token range per Scylla repair job - SM does not multiply intensity by shard_cnt for vnode tables. Is this correct behavior from core POV? So you suggest that the only difference between vnode and tablet intensity is to multiply it by the max_repair_ranges_in_parallel?

Is max_repair_ranges_in_parallel a per shard limit or per node limit? I remember that when we were discussing changes in SM 3.2 repair, it was said that SM shouldn't really take shards into consideration (except for calculating max_repair_ranges_in_parallel or choosing repair

We know which shard owns the range for tablet, so SM could select the correct ranges for each shard easily.

Also, can you elaborate on this? Does it mean that when choosing ranges for a repair job, SM should do it so that each shard owns max_repair_ranges_in_parallel of the selected ranges?

mykaul commented 2 months ago

Keeping in mind that for some time we'll have both tablets and vnodes in the same cluster, what does it actually mean to run with different intensity? How do we prevent one hurting the other?

asias commented 2 months ago

With --intensity 0, SM still sends max_repair_ranges_in_parallel ranges per shard, resulting in max_repair_ranges_in_parallel * number_of_shards ranges in total. We know which shard owns the range for tablet, so SM could select the correct ranges for each shard easily.

@asias Just to clarify: Right now, when intensity is set to 1, SM sends just 1 token range per Scylla repair job - SM does not multiply intensity by shard_cnt for vnode tables. Is this correct behavior from core POV?

This is correct. The requested range will be worked on in parallel by all shards. It does what --intensity 1 is supposed to control.

So you suggest that the only difference between vnode and tablet intensity is to multiply it by the max_repair_ranges_in_parallel?

Not multiply by max_repair_ranges_in_parallel. With --intensity 1, SM will find ranges for each shard and send one such range per shard. With tablet, each range will always be owned by one shard.

E.g., there are two shards and 8 tablets (8 ranges):

shard 0: r1, r2, r3, r4
shard 1: r5, r6, r7, r8

With --intensity 1, SM will find that r1, r2, r3, r4 belong to shard 0 and the rest belong to shard 1. SM sends one range (out of r1 to r4) to shard 0 and one range (out of r5 to r8) to shard 1 independently.

This ensures that, at any point in time, Scylla will repair at most one range per shard. This is exactly what --intensity 1 does for vnodes.

Does this make sense now?
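
A minimal Go sketch of this selection, assuming SM had a range-to-shard mapping available; `TokenRange` and `nextJobRanges` are illustrative names, not actual SM code:

```go
package main

import "fmt"

// TokenRange is a simplified stand-in for a tablet's token range.
type TokenRange struct{ Name string }

// nextJobRanges sketches the selection described above: given a
// shard -> owned-ranges mapping (which SM would have to obtain from
// scylla core), pick at most `intensity` not-yet-repaired ranges per
// shard for the next repair job.
func nextJobRanges(ownership map[int][]TokenRange, intensity int) []TokenRange {
	var job []TokenRange
	for shard, ranges := range ownership {
		n := intensity
		if n > len(ranges) {
			n = len(ranges)
		}
		job = append(job, ranges[:n]...)
		ownership[shard] = ranges[n:] // remaining ranges go into later jobs
	}
	return job
}

func main() {
	// The example from the comment: r1-r4 on shard 0, r5-r8 on shard 1.
	ownership := map[int][]TokenRange{
		0: {{"r1"}, {"r2"}, {"r3"}, {"r4"}},
		1: {{"r5"}, {"r6"}, {"r7"}, {"r8"}},
	}
	// With --intensity 1 each job repairs at most one range per shard.
	fmt.Println(nextJobRanges(ownership, 1)) // e.g. [{r1} {r5}]
}
```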

Is max_repair_ranges_in_parallel a per shard limit or per node limit? I remember that when we were discussing changes in SM 3.2 repair, it was said that SM shouldn't really take shards into consideration (except for calculating max_repair_ranges_in_parallel or choosing repair

max_repair_ranges_in_parallel is a per-SHARD limit.

We know which shard owns the range for tablet, so SM could select the correct ranges for each shard easily.

Also, can you elaborate on this? Does it mean that when choosing ranges for a repair job, SM should do it so that each shard owns max_repair_ranges_in_parallel of the selected ranges?

See the example above.

asias commented 2 months ago

Keeping in mind that for some time we'll have both tablets and vnodes in the same cluster, what does it actually mean to run with different intensity? How do we prevent one hurting the other?

I am actually suggesting not to run different intensities for vnode and tablet tables. SM repairs table after table, so they will not hurt each other.

Michal-Leszczynski commented 2 months ago

Not multiply by max_repair_ranges_in_parallel. With --intensity 1, SM will find ranges for each shard and send one such range per shard. With tablet, each range will always be owned by one shard.

@asias, after reading the "integration with SM" section of the tablet repair design doc, I wasn't aware that SM is supposed to calculate range ownership and treat intensity as owned ranges per shard. Right now it's not done, but it can be added. Do I understand correctly that it is important to support this?

Also, is this range-to-shard ownership calculated based on Murmur3Partitioner arithmetic (like here)?

Michal-Leszczynski commented 2 months ago

But after this explanation, perhaps it's indeed OK to use the same intensity flag for both vnode and tablet tables (but send intensity-many owned ranges per shard for tablet tables).

@tzach @mykaul @karol-kokoszka

asias commented 2 months ago

Not multiply by max_repair_ranges_in_parallel. With --intensity 1, SM will find ranges for each shard and send one such range per shard. With tablet, each range will always be owned by one shard.

@asias, after reading the "integration with SM" section of the tablet repair design doc, I wasn't aware that SM is supposed to calculate range ownership and treat intensity as owned ranges per shard. Right now it's not done, but it can be added. Do I understand correctly that it is important to support this?

Also, is this range-to-shard ownership calculated based on Murmur3Partitioner arithmetic (like here)?

No, the tablet token-range-to-shard ownership does not use the vnode algorithms at all. We would need scylla core to expose this information if needed.

However, I have an idea to avoid asking SM to be aware of this shard mapping, which simplifies the request logic on the SM side significantly. We could use the ranges_parallelism option we introduced to control the parallelism:

                  {
                     "name":"ranges_parallelism",
                     "description":"An integer specifying the number of ranges to repair in parallel by user request. If this number is bigger than the max_repair_ranges_in_parallel calculated by Scylla core, the smaller one will be used.",
                     "required":false,
                     "allowMultiple":false,
                     "type":"string",
                     "paramType":"query"
                  },

Without knowing which ranges belong to which shard, SM sends all ranges (or a big portion of them) that need to be repaired in a single repair job, while setting ranges_parallelism to 1. This ensures all shards have work to do and each shard will only work on 1 range at a time. This avoids asking SM to send only 1 token range per repair job in order to implement --intensity 1.
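
A hedged sketch of that request shape: many ranges in one job, with ranges_parallelism capping per-shard concurrency. The endpoint path and the "start:end" range encoding are assumptions for illustration, not SM's actual request-building code:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// buildRepairQuery sketches the idea above: send a whole batch of ranges in
// one repair job and cap per-shard concurrency with ranges_parallelism.
func buildRepairQuery(keyspace, table string, ranges []string, rangesParallelism int) string {
	q := url.Values{}
	q.Set("columnFamilies", table)
	q.Set("ranges", strings.Join(ranges, ","))
	// ranges_parallelism=1: each shard repairs at most one range at a time,
	// preserving the --intensity 1 promise even though many ranges are sent.
	q.Set("ranges_parallelism", strconv.Itoa(rangesParallelism))
	return "/storage_service/repair_async/" + keyspace + "?" + q.Encode()
}

func main() {
	ranges := []string{"0:100", "100:200", "200:300"}
	fmt.Println(buildRepairQuery("ks1", "t1", ranges, 1))
}
```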

Michal-Leszczynski commented 2 months ago

I tried to put your idea into an issue (#3789) and, from my understanding, it can also be really useful for vnode-based tables. (There is also another repair improvement issue, #3790, that could perhaps also improve tablet table repair.)

asias commented 2 months ago

I tried to put your idea into an issue (#3789) and, from my understanding, it can also be really useful for vnode-based tables. (There is also another repair improvement issue, #3790, that could perhaps also improve tablet table repair.)

Yes, the option was introduced exactly for vnode tables in the past. I also suggested back then to use ranges_parallelism to implement range parallelism while sending more ranges in a batch.

asias commented 2 months ago

@Michal-Leszczynski FYI. This PR adds ranges_parallelism for tablets. https://github.com/scylladb/scylladb/pull/18385

Michal-Leszczynski commented 2 months ago

@asias btw, can we expect small_table_optimization to be implemented for tablets in Scylla 6.0?

asias commented 2 months ago

No. It is not implemented. We do not need small table optimization for tablet tables.


asias commented 1 month ago

@Michal-Leszczynski FYI, the https://github.com/scylladb/scylladb/pull/18385 is now in master.

Michal-Leszczynski commented 1 month ago

Always sending all (or a lot of) the ranges belonging to a given replica set in a single repair job hurts task granularity. AFAIK there is no way to tell which ranges were successfully repaired in a failed repair job, so even if only a single range fails, all of them will be retried. The same goes for situations when repair is paused by the user or because of the specified --window flag.

Is re-repairing a given token range as expensive as the first repair of that token range? Or is it way cheaper and can be ignored (because it just streams hashes and not actual data)?

If re-repairing is usually as expensive as the first one, then we would need to somehow control the number of ranges sent in a single repair job. I see 3 options:

* Introduce yet another flag (e.g. `--batch-size`). The downside is that it would make repair control more complicated for the user and might be ignored.

* Introduce a new field in `scylla-manager.yaml` config (e.g. `repair: batch-size: x`) with a big default value (e.g. 256). This would indicate that it's best not to touch it (most users won't even notice it), but it would still be possible to disable it if it does not play well with a given cluster and task setup.

* Make it so that the first execution of a repair task sends all (or a lot of) ranges in a single repair job. All subsequent executions (retries after errors, resuming after pause, etc.) will fall back to the current behavior. This could prevent situations where some ranges are re-repaired many times because some other ranges failed or because of pausing.

@asias what do you think?

asias commented 1 month ago

Always sending all (or a lot of) the ranges belonging to a given replica set in a single repair job hurts task granularity. AFAIK there is no way to tell which ranges were successfully repaired in a failed repair job, so even if only a single range fails, all of them will be retried. The same goes for situations when repair is paused by the user or because of the specified --window flag.

Is re-repairing a given token range as expensive as the first repair of that token range? Or is it way cheaper and can be ignored (because it just streams hashes and not actual data)?

It should be much cheaper because the difference between nodes is much smaller than it was. When there is no difference, repair only reads data and calculates a checksum. A granularity of 10% of the ranges looks good enough, i.e., sending 10% of the ranges in a single job. If the job with 10% of the ranges fails, we retry it. It is not a big deal. Also, failed repairs are supposed to be rare compared to successful repairs.

If re-repairing is usually as expensive as the first one, then we would need to somehow control the number of ranges sent in a single repair job. I see 3 options:

* Introduce yet another flag (e.g. `--batch-size`). The downside is that it would make repair control more complicated for the user and might be ignored.

* Introduce a new field in `scylla-manager.yaml` config (e.g. `repair: batch-size: x`) with a big default value (e.g. 256). This would indicate that it's best not to touch it (most users won't even notice it), but it would still be possible to disable it if it does not play well with a given cluster and task setup.

It is best to avoid exposing the --batch-size config to users. We have enough options to control it.

* Make it so that the first execution of a repair task sends all (or a lot of) ranges in a single repair job. All subsequent executions (retries after errors, resuming after pause, etc.) will fall back to the current behavior. This could prevent situations where some ranges are re-repaired many times because some other ranges failed or because of pausing.

If some 10% of the ranges fail, we could continue to repair the next 10% of the ranges. This prevents some ranges from blocking the other ranges indefinitely. We could also add some randomness to how we group them into 10% batches.
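
A minimal sketch of that batching idea (shuffle, then split into roughly 10% chunks); the function and names are illustrative, not SM's actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
)

// batchRanges sketches "10% of ranges per job, with some randomness":
// shuffle the ranges and split them into batches of roughly `percent`
// of the total.
func batchRanges(ranges []string, percent int) [][]string {
	rand.Shuffle(len(ranges), func(i, j int) { ranges[i], ranges[j] = ranges[j], ranges[i] })
	size := len(ranges) * percent / 100
	if size < 1 {
		size = 1
	}
	var batches [][]string
	for start := 0; start < len(ranges); start += size {
		end := start + size
		if end > len(ranges) {
			end = len(ranges)
		}
		batches = append(batches, ranges[start:end])
	}
	return batches
}

func main() {
	ranges := make([]string, 40)
	for i := range ranges {
		ranges[i] = fmt.Sprintf("r%d", i)
	}
	for _, b := range batchRanges(ranges, 10) {
		fmt.Println(b) // 10 batches of 4 ranges each (shuffled)
	}
}
```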

@asias what do you think?

In the long run, we could provide an API to return the successfully repaired ranges.

Michal-Leszczynski commented 1 month ago

It is best to avoid exposing the --batch-size config to users. We have enough options to control it.

I agree.

If some 10% of the ranges fail, we could continue to repair the next 10% of the ranges. This prevents some ranges from blocking the other ranges indefinitely. We could also add some randomness to how we group them into 10% batches.

By default, repair does not end task execution on the first encountered error, but just goes through all ranges, repairs them, and then retries any failed jobs after some backoff. So when I wrote the following:

Make it so that the first execution of a repair task sends all (or a lot of) ranges in a single repair job. All subsequent executions (retries after errors, resuming after pause, etc.) will fall back to the current behavior. This could prevent situations where some ranges are re-repaired many times because some other ranges failed or because of pausing.

I meant that perhaps SM should drop this 10% batching (return to the current behavior) when retrying failed ranges after backoff. So all ranges will first be repaired with 10% batching, and only in case of an error will they be retried without batching.

The same could be applied to the situation when the repair task needs to be interrupted because it goes outside of --window. I'm worried about a situation where:

* `--window` is specified as many small windows during which SM can perform repair

* the repaired table is really big and fully replicated

There is some slight possibility that applying this 10% batching could result in SM not being able to make any progress before running out of a given work window.

So this approach (dropping batching on retry) would be a general safety valve for this implementation that could perhaps be removed in the future.
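
A rough sketch of this safety-valve flow (first pass batched, retries one range per job); `sendJob` and the batch shapes are placeholders, not real SM code:

```go
package main

import (
	"errors"
	"fmt"
)

// repairWithFallback sketches the safety valve discussed above: the first
// execution sends ranges in batches (e.g. 10% each), while the retry after
// backoff falls back to one range per job, so a single bad range cannot
// force re-repairing a whole batch again. sendJob stands in for the real
// repair job call.
func repairWithFallback(batches [][]string, sendJob func([]string) error) (stillFailed []string) {
	var failed []string
	for _, batch := range batches {
		if err := sendJob(batch); err != nil {
			failed = append(failed, batch...) // the whole batch will be retried
		}
	}
	for _, r := range failed { // retry pass: no batching
		if err := sendJob([]string{r}); err != nil {
			stillFailed = append(stillFailed, r)
		}
	}
	return stillFailed
}

func main() {
	batches := [][]string{{"r1", "r2"}, {"r3", "r4"}}
	// Simulate a job failure for the batch starting with r3.
	sendJob := func(rs []string) error {
		if len(rs) > 1 && rs[0] == "r3" {
			return errors.New("job failed")
		}
		return nil
	}
	fmt.Println(repairWithFallback(batches, sendJob)) // [] - retries succeeded
}
```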

@asias @karol-kokoszka @tzach

karol-kokoszka commented 1 month ago

@Michal-Leszczynski I labelled this issue so that we can go through it on Monday's grooming. Let's sum up your discussion with @asias there and prepare the summary for Wednesday's planning and triage it with expected SM version.

karol-kokoszka commented 1 month ago

GROOMING NOTES

The initial idea was to introduce completely new flags to control intensity and parallelism for tablet-based tables.

@asias suggested that we should use the same flag as we do for regular tables.

Intensity = 1 means that only one range is sent in a single repair job, and the job is going to be handled by all the shards. For vnodes, we don't care about the range-to-shard mapping.

For tablet-based repair, ranges are mapped to shards, which means that a single range will be repaired by the "owner" shard only. Scylla Manager (SM) doesn't know the mapping and is not aware of which shard owns which ranges. This makes it impossible to find the correct set of ranges to be sent in a single repair job.

Due to the facts described above, the proposed solution includes changing the internal meaning of the intensity flag for tablet-based tables. The intensity flag currently has the following meaning: https://manager.docs.scylladb.com/stable/repair/#repair-intensity

Intensity specifies how many token ranges can be repaired in a Scylla node at any given time. The default intensity is one; you can change that using the sctool repair --intensity flag.

This is interpreted by SM code as "how many token ranges to send in a single repair job." The proposal is to change the interpretation on the SM side to send, for every single repair job, 10% of the ranges owned by the current replica set. The intensity specifies how many token ranges can be repaired in a Scylla node at any given time, which will be guaranteed by setting the ranges_parallelism parameter in the Scylla API call: https://github.com/scylladb/scylla-manager/blob/4d169fecc8d3d8b4de38952946dca2ffe9d8de04/swagger/scylla_v1.json#L10407-L10412

By implementing this, we improve the CPU/shard utilization during the repair process, which should lead to improved performance.
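
For illustration, a small sketch of how a per-job plan could be derived under this proposal; the type and field names are made up, and handling of intensity 0 ("max") is omitted:

```go
package main

import "fmt"

// jobPlan sketches the grooming-notes proposal for tablet tables: each repair
// job carries ~10% of the ranges owned by the current replica set, and the
// user-facing intensity is enforced by Scylla via ranges_parallelism instead
// of by the number of ranges in the job.
type jobPlan struct {
	RangesPerJob      int // ~10% of the replica set's ranges
	RangesParallelism int // passed to Scylla; keeps "ranges repaired at a time" == intensity
}

func planTabletJob(totalRanges, intensity int) jobPlan {
	perJob := totalRanges / 10
	if perJob < 1 {
		perJob = 1
	}
	return jobPlan{RangesPerJob: perJob, RangesParallelism: intensity}
}

func main() {
	// E.g. a replica set owning 120 tablet ranges, repaired with --intensity 1.
	fmt.Printf("%+v\n", planTabletJob(120, 1)) // {RangesPerJob:12 RangesParallelism:1}
}
```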

The following design document must be updated to include the chosen approach: https://docs.google.com/document/d/1mBaufbomXU6hDO_25KkhC7CO65AbFgKpHVpB5Mei8xA/edit#heading=h.o646d4880ftd

The proposal from @Michal-Leszczynski is to create a fallback for the approach with 10% of ranges sent in a single repair job. If one range of these 10% fails, then SM would need to repair all of these ranges once again. To avoid a loop of retried repairs, we should use 10% only in the first repair job execution; the fallback should proceed with the old approach (intensity = number of ranges in the repair job).

To evaluate if the given approach is correct, we need to have Scylla Cluster Test (SCT) allowing us to execute the repair on a large cluster (1TB?) and compare the metrics. (cc: @mikliapko)

cc: @mykaul @tzach @vladzcloudius -> We need to decide if we want to change the repair now and include @asias's suggestions, or if we agree to underutilize the nodes during the repair and send the intensity number of token ranges in a single repair job, maintaining the 3.2 approach.

Let's bring this to the planning.

asias commented 1 month ago

It is best to avoid exposing the --batch-size config to users. We have enough options to control it.

I agree.

If some 10% of the ranges fail, we could continue to repair the next 10% of the ranges. This prevents some ranges from blocking the other ranges indefinitely. We could also add some randomness to how we group them into 10% batches.

By default, repair does not end task execution on the first encountered error, but just goes through all ranges, repairs them, and then retries any failed jobs after some backoff. So when I wrote the following:

Make it so that the first execution of a repair task sends all (or a lot of) ranges in a single repair job. All subsequent executions (retries after errors, resuming after pause, etc.) will fall back to the current behavior. This could prevent situations where some ranges are re-repaired many times because some other ranges failed or because of pausing.

I meant that perhaps SM should drop this 10% batching (return to the current behavior) when retrying failed ranges after backoff. So all ranges will first be repaired with 10% batching, and only in case of an error will they be retried without batching.

The same could be applied to the situation when the repair task needs to be interrupted because it goes outside of --window. I'm worried about a situation where:

* `--window` is specified as many small windows during which SM can perform repair

* the repaired table is really big and fully replicated

There is some slight possibility that applying this 10% batching could result in SM not being able to make any progress before running out of a given work window.

So this approach (dropping batching on retry) would be a general safety valve for this implementation that could perhaps be removed in the future.

Retrying after failure and resuming after a pause with fewer than 10% of the ranges per job sounds reasonable. It is possible that 10% of the ranges would take longer than the specified window. However, it is also possible that even a single range would exceed the window, which should be rare.

Consider:

10% of ranges -> failure -> 1 range at a time until all ranges in the first 10% are repaired

How about the next 10% of ranges: would we use 10% or 1 range at a time?

@asias @karol-kokoszka @tzach

asias commented 1 month ago

GROOMING NOTES

The initial idea was to introduce completely new flags to control intensity and parallelism for tablet-based tables.

@asias suggested that we should use the same flag as we do for regular tables.

Intensity = 1 means that only one range is sent in a single repair job, and the job is going to be handled by all the shards. For vnodes, we don't care about the range-to-shard mapping.

For tablet-based repair, ranges are mapped to shards, which means that a single range will be repaired by the "owner" shard only. Scylla Manager (SM) doesn't know the mapping and is not aware of which shard owns which ranges. This makes it impossible to find the correct set of ranges to be sent in a single repair job.

Due to the facts described above, the proposed solution includes changing the internal meaning of the intensity flag for tablet-based tables. The intensity flag currently has the following meaning: https://manager.docs.scylladb.com/stable/repair/#repair-intensity

Intensity specifies how many token ranges can be repaired in a Scylla node at any given time. The default intensity is one; you can change that using the sctool repair --intensity flag.

This is interpreted by SM code as "how many token ranges to send in a single repair job." The proposal is to change the interpretation on the SM side to send, for every single repair job, 10% of the ranges owned by the current replica set. The intensity specifies how many token ranges can be repaired in a Scylla node at any given time, which will be guaranteed by setting the ranges_parallelism parameter in the Scylla API call:

https://github.com/scylladb/scylla-manager/blob/4d169fecc8d3d8b4de38952946dca2ffe9d8de04/swagger/scylla_v1.json#L10407-L10412

By implementing this, we improve the CPU/shard utilization during the repair process, which should lead to improved performance.

The following design document must be updated to include the chosen approach: https://docs.google.com/document/d/1mBaufbomXU6hDO_25KkhC7CO65AbFgKpHVpB5Mei8xA/edit#heading=h.o646d4880ftd

The proposal from @Michal-Leszczynski is to create a fallback for the approach with 10% of ranges sent in a single repair job. If one range of these 10% fails, then SM would need to repair all of these ranges once again. To avoid a loop of retried repairs, we should use 10% only in the first repair job execution; the fallback should proceed with the old approach (intensity = number of ranges in the repair job).

To evaluate if the given approach is correct, we need to have Scylla Cluster Test (SCT) allowing us to execute the repair on a large cluster (1TB?) and compare the metrics. (cc: @mikliapko)

cc: @mykaul @tzach @vladzcloudius -> We need to decide if we want to change the repair now and include @asias's suggestions, or if we agree to underutilize the nodes during the repair and send the intensity number of token ranges in a single repair job, maintaining the 3.2 approach.

Let's bring this to the planning.

I am fine with either sending some percentage of the ranges per job (+ retrying with 1 range at a time) or sending one range at a time (under-utilizing the cluster for repair) for tablets. But let's not add new options for tablet intensity control.

Another point: I am already working on the in-core automatic repair scheduler that schedules repair jobs inside scylla core, which could cope better with tablet migration and topology changes.

karol-kokoszka commented 1 month ago

The following issue is expected to help with measuring/validating the upgraded repair on tablets https://github.com/scylladb/scylla-manager/issues/3853