pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

CDC initial scan is not parallelized well #10117

Open fubinzh opened 10 months ago

fubinzh commented 10 months ago

### What did you do?

  1. Deploy a TiDB cluster with 2 CDC nodes. TiKV configuration: cdc.incremental-fetch-speed-limit = 300MiB, cdc.incremental-scan-speed-limit = 96MiB.
  2. Start cdc changefeed and pause it.
    
    [root@tc-ticdc-0 /]# /cdc cli changefeed --server http://127.0.0.1:8301 query -c test1
    {
      "upstream_id": 7301604617795285762,
      "namespace": "default",
      "id": "test1",
      "sink_uri": "blackhole:",
      "config": {
        "memory_quota": 1073741824,
        "case_sensitive": true,
        "force_replicate": false,
        "ignore_ineligible_table": false,
        "check_gc_safe_point": true,
        "filter": {
          "rules": [
            "*.*"
          ]
        },
        "mounter": {
          "worker_num": 16
        },
        "sink": {
          "protocol": "",
          "transaction_atomicity": "",
          "terminator": "\r\n",
          "delete_only_output_handle_key_columns": null
        },
        "scheduler": {
          "enable_table_across_nodes": false,
          "region_threshold": 100000,
          "write_key_threshold": 0
        },
        "integrity": {
          "integrity_check_level": "none",
          "corruption_handle_level": "warn"
        },
        "changefeed_error_stuck_duration": 1800000000000,
        "sql_mode": "ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
      },
      "create_time": "2023-11-15 18:14:51.806",
      "start_ts": 445718200320000000,
      "resolved_ts": 445768313518424347,
      "target_ts": 0,
      "checkpoint_tso": 445733430529884442,
      "checkpoint_time": "2023-11-19 04:08:18.640",
      "state": "normal",
      "creator_version": "v6.5.3",
      "task_status": [
        {
          "capture_id": "11512494-c874-43ed-b111-831962ed47ca",
          "table_ids": [
            84,
            88,
            91,
            94
          ]
        },
        {
          "capture_id": "f5cf9c55-47ec-4c4c-8527-010eccf9a8e5",
          "table_ids": [
            90,
            93,
            96,
            86
          ]
        }
      ]
    }

  3. Run a TPC-C workload for a long period (e.g. 24h) to generate a large volume of change logs for CDC to scan.
  4. Resume the cdc changefeed.
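
As a reading aid for the query output in step 2, here is a minimal Python sketch that decodes the TSO fields and counts tables per capture. It assumes the usual TiDB TSO layout (physical Unix milliseconds in the high bits, an 18-bit logical counter in the low bits). The helper names (`tso_physical_ms`, `tso_to_utc`, `tables_per_capture`, `lag_hours`) are hypothetical and not part of the cdc CLI. All values are copied from the output above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: standard TiDB TSO layout, low 18 bits are the logical counter.
LOGICAL_BITS = 18
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Values copied from the `changefeed query` output in step 2.
START_TS = 445718200320000000
CHECKPOINT_TSO = 445733430529884442
RESOLVED_TS = 445768313518424347
TASK_STATUS = [
    {"capture_id": "11512494-c874-43ed-b111-831962ed47ca",
     "table_ids": [84, 88, 91, 94]},
    {"capture_id": "f5cf9c55-47ec-4c4c-8527-010eccf9a8e5",
     "table_ids": [90, 93, 96, 86]},
]


def tso_physical_ms(tso: int) -> int:
    """Return the physical part of a TSO as Unix milliseconds."""
    return tso >> LOGICAL_BITS


def tso_to_utc(tso: int) -> datetime:
    """Convert a TSO to a timezone-aware UTC datetime (integer math, no float rounding)."""
    return EPOCH + timedelta(milliseconds=tso_physical_ms(tso))


def tables_per_capture(task_status: list) -> dict:
    """Count how many tables each capture (CDC node) owns."""
    return {t["capture_id"]: len(t["table_ids"]) for t in task_status}


def lag_hours(newer_tso: int, older_tso: int) -> float:
    """Wall-clock distance between two TSOs, in hours."""
    return (tso_physical_ms(newer_tso) - tso_physical_ms(older_tso)) / 3_600_000


if __name__ == "__main__":
    for name, tso in [("start_ts", START_TS),
                      ("checkpoint_tso", CHECKPOINT_TSO),
                      ("resolved_ts", RESOLVED_TS)]:
        print(f"{name:15s} {tso_to_utc(tso).isoformat()}")
    print("resolved - checkpoint (h):", round(lag_hours(RESOLVED_TS, CHECKPOINT_TSO), 2))
    print("tables per capture:", tables_per_capture(TASK_STATUS))
```

Decoding `checkpoint_tso` this way gives 2023-11-18 20:08:18.640 UTC, which is 8 hours behind the printed `checkpoint_time`, so the CLI appears to have printed a UTC+8 local time. The table count is 4 per capture, so the table counts alone would not account for the uneven scan load reported below.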

### What did you expect to see?

Both CDC nodes should carry initial scan workload at the same time.

### What did you see instead?

During the first half of the initial scan, cdc-0 carries most of the workload; during the second half, cdc-1 does. As a result, the initial scan takes a long time.

![image](https://github.com/pingcap/tiflow/assets/7403864/4eb9e047-f254-47f7-b6a6-c46b6c0e10a9)

![image](https://github.com/pingcap/tiflow/assets/7403864/22702a0d-1b3e-4fff-8a90-37f11f922d4c)

### Versions of the cluster

    [root@tc-ticdc-0 /]# /cdc version
    Release Version: v7.5.0
    Git Commit Hash: 99c1f8fdffe72f2a9dbce6d0b58a52a162ce72b7
    Git Branch: heads/refs/tags/v7.5.0
    UTC Build Time: 2023-11-16 10:33:24
    Go Version: go version go1.21.3 linux/amd64
    Failpoint Build: false

fubinzh commented 10 months ago

/severity moderate

asddongmen commented 4 months ago

It should not be considered a bug.