pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.17k stars 5.84k forks source link

br/restore: remove the `--ddl-batch-size` flag, or make it size-based instead of length-based #55639

Closed YuJuncen closed 2 weeks ago

YuJuncen commented 2 months ago

Enhancement

The argument --ddl-batch-size for restoring was initially added at https://github.com/pingcap/tidb/pull/29380 for controlling the size of the ddl job CreateTabels: once it grows up to the max size of a raft command, we cannot create any table, eventually we fail.

The problem is that this controls the number of tables of a batch but not the total size of table infos. So with complex tables, a batch may be too huge to be committed (We made https://github.com/pingcap/tidb/issues/39829 for fixing this...). With simple tables, a batch may be too small to be effective enough.

We can either:

YuJuncen commented 2 months ago

A patch that shows how we need to modify the code for batching.

Patch ```diff diff --git a/br/pkg/restore/client.go b/br/pkg/restore/client.go index 56b8a4e96c..908b293eb0 100644 --- a/br/pkg/restore/client.go +++ b/br/pkg/restore/client.go @@ -15,6 +15,7 @@ import ( "sync" "time" + "github.com/docker/go-units" "github.com/fatih/color" "github.com/opentracing/opentracing-go" "github.com/pingcap/errors" @@ -1024,20 +1025,50 @@ func (rc *Client) createTablesWithDBPool(ctx context.Context, return eg.Wait() } +type DiscardCount int + +func (dc *DiscardCount) Write(p []byte) (n int, err error) { + *(*int)(dc) += len(p) + return len(p), nil +} + +func tableSize(t *metautil.Table) int { + // Note: maybe directly read while unmarshing metautil.Table. + dc := new(DiscardCount) + enc := json.NewEncoder(dc) + if err := enc.Encode(t.Info); err != nil { + log.Panic("Unreachable: marshaling a trivial object failed.", zap.Error(err)) + } + + return int(*dc) +} + +func (rc *Client) batchOfCreateTables(tables []*metautil.Table) (batch, rem []*metautil.Table) { + total := 7 * units.MiB // Leave 1 MiB for extra costs during marshaling. + + len := 0 + for i := range tables { + thisLen := tableSize(tables[i]) + if thisLen+len > total { + return tables[:i], tables[i:] + } + } + return tables, nil +} + func (rc *Client) createTablesInWorkerPool(ctx context.Context, dom *domain.Domain, tables []*metautil.Table, newTS uint64, outCh chan<- CreatedTable) error { eg, ectx := errgroup.WithContext(ctx) rater := logutil.TraceRateOver(logutil.MetricTableCreatedCounter) workers := utils.NewWorkerPool(uint(len(rc.dbPool)), "Create Tables Worker") - numOfTables := len(tables) - - for lastSent := 0; lastSent < numOfTables; lastSent += int(rc.batchDdlSize) { - end := mathutil.Min(lastSent+int(rc.batchDdlSize), len(tables)) - log.Info("create tables", zap.Int("table start", lastSent), zap.Int("table end", end)) + rem := tables + for len(rem) > 0 { + var thisBatch []*metautil.Table + thisBatch, rem = rc.batchOfCreateTables(tables) - tableSlice := tables[lastSent:end] + log.Info("create tables", zap.Int("remaining", len(rem)), zap.Int("batch", len(thisBatch))) workers.ApplyWithIDInErrorGroup(eg, func(id uint64) error { db := rc.dbPool[id%uint64(len(rc.dbPool))] - cts, err := rc.createTables(ectx, db, dom, tableSlice, newTS) // ddl job for [lastSent:i) + cts, err := rc.createTables(ectx, db, dom, thisBatch, newTS) // ddl job for [lastSent:i) failpoint.Inject("restore-createtables-error", func(val failpoint.Value) { if val.(bool) { err = errors.New("sample error without extra message") ```
benmeadowcroft commented 2 months ago

I think the option to "Remove the --ddl-batch-size parameter directly, then fetch the max raft entry size from TiKV, then use this as the target batch size." seems like the simplest from an operational perspective and reduces complexity by avoiding potentially conflicting configuration settings across components.

YuJuncen commented 2 weeks ago

It seems now ddl-batch-size doesn't impact the speed of creating tables. Probably because the optimization of DDL in the master branch. Close it for now.