tikv / migration

Migration tools for TiKV, e.g. online bulk load.
Apache License 2.0
36 stars 26 forks source link

[Help Wanted] cdc: why my replication_gap so large? #402

Open Smityz opened 7 months ago

Smityz commented 7 months ago

image

./cdc cli  changefeed statistics
{
  "ops": 0,
  "count": 0,
  "sink_gap": "650ms",
  "replication_gap": "7250ms"
}
{
  "ops": 0,
  "count": 0,
  "sink_gap": "450ms",
  "replication_gap": "7334ms"
}
{
  "ops": 0,
  "count": 0,
  "sink_gap": "1000ms",
  "replication_gap": "7700ms"
}

I used this command to monitor my cdc job, and found the latency is large than I expected. replication_gap is big but sink_gap is small, could this be due to slow data pulling? How can I increase its speed?

Smityz commented 7 months ago

I noticed that there are many hard-coded parameters. After modifying some of them, I found that the latency decreased from 8 seconds to 4 seconds. Would it be possible to make these parameters configurable? @pingyu

pingyu commented 7 months ago

I think yes. Which parameters, and why they affect the latency in your case ?

Smityz commented 7 months ago
https://github.com/tikv/tikv/blob/v6.5.8/components/causal_ts/src/tso.rs#L70

/// Maximum available interval of TSO cache.
/// It means the duration that TSO we cache would be available despite failure
/// of PD. The longer of the value can provide better "High-Availability"
/// against PD failure, but more overhead of `TsoBatchList` & pressure to TSO
/// service.
pub(crate) const DEFAULT_TSO_BATCH_ALLOC_AHEAD_BUFFER_MS: u64 = 500;

This parameter in TiKV can cause a timestamp discrepancy with physical time, resulting in incorrect calculations, but seems no real negative effect.

https://github.com/tikv/migration/blob/main/cdc/pkg/config/kvclient.go#L27

// the safe interval to move backward resolved ts
ResolvedTsSafeInterval time.Duration `toml:"resolved-ts-safe-interval" json:"resolved-ts-safe-interval"`

This parameter have a real negative effect, but it seems related to a server side bug.

I also tried other parameters, like sink default concurrency and flush ticker in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go, but no use

pingyu commented 7 months ago

This parameter in TiKV can cause a timestamp discrepancy with physical time, resulting in incorrect calculations, but seems no real negative effect.

This parameter is configurable, See https://docs.pingcap.com/tidb/v6.5/tikv-configuration-file#alloc-ahead-buffer-new-in-v640.

It's indeed an issue, the cached TSO should not affect the calculation of replication gap. But currently I don't find a good way to fix it. (Maybe carry the information about the cache in resolved ts message ?)

This parameter have a real negative effect, but it seems related to a server side bug.

Emmm... I think the issue has been fixed by https://github.com/tikv/tikv/pull/13520. The parameter is a work around (#196) but we forget to remove it.

I will remove it and do some tests.

Smityz commented 7 months ago

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

zeminzhou commented 7 months ago

You are right, the ResolvedTsSafeInterval is a way to workaround a bug, the perfect fix depends on this PR. If you can be sure that the tikv version you are using is greater than or equal to 6.5, you can set ResolvedTsSafeInterval to 0.

pingyu commented 7 months ago

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

En ? You mean these parameters should be constant or configurable ?

Smityz commented 7 months ago

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

En ? You mean there parameters should be constant or configurable ?

There are many hard-coded numbers in the code now, which is not very friendly for some optimization operations because developers often do not realize the impact these numbers may have. If we can give these numbers some names, it will make it easier for developers to optimize them instead of reading through all the code. When we find parameters that really have a huge impact on performance, we can consider making them configurable. Before that, we can consider giving them constant names and corresponding comments to facilitate retrieval. For example: https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go#L47 https://github.com/tikv/migration/blob/main/cdc/cdc/kv/client.go#L1164

pingyu commented 7 months ago

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

En ? You mean there parameters should be constant or configurable ?

There are many hard-coded numbers in the code now, which is not very friendly for some optimization operations because developers often do not realize the impact these numbers may have. If we can give these numbers some names, it will make it easier for developers to optimize them instead of reading through all the code. When we find parameters that really have a huge impact on performance, we can consider making them configurable. Before that, we can consider giving them constant names and corresponding comments to facilitate retrieval. For example: https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go#L47 https://github.com/tikv/migration/blob/main/cdc/cdc/kv/client.go#L1164

Get it.

BTW, the concurrency is configurable by concurrency in sink URL of changefeed. See https://github.com/tikv/migration/blob/1b33b2a92709b631cfab2b7d546ae0c230083353/cdc/cdc/sink/tikv.go#L99.