pingcap / br

A command-line tool for distributed backup and restoration of the TiDB cluster data
https://pingcap.com/docs/dev/how-to/maintain/backup-and-restore/br/
Apache License 2.0

sysbench oltp_point_select test latency downgraded after br backup and restore #1214

Open fubinzh opened 3 years ago

fubinzh commented 3 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

    1. Use TiDB Lightning to import 9 TB of data into TiDB (1 database, 3 tables, sized 3.75 TB / 2.5 TB / 2.5 TB).
    2. Use br to back up the database while running the sysbench oltp_point_select test.
    3. Drop the tables, then use br to restore the backup to the original TiDB cluster, again while running the sysbench oltp_point_select test.
    4. After the restore finishes, run the sysbench oltp_point_select test once more.
    5. Compare the sysbench results from the three phases. (A rough sketch of the br invocations for steps 2-3 appears after this issue template, below.)
  2. What did you expect to see? There should be no significant performance degradation after br backup/restore.

  3. What did you see instead? Compared to the sysbench results during backup and during restore, the avg and 95th percentile latency after the br restore degraded significantly: avg latency nearly doubled (0.28 ms → 0.53 ms) and throughput dropped by about 46% (3530 → 1897 QPS).

    | Latency (ms)    | backup | restore | after restore |
    |-----------------|--------|---------|---------------|
    | avg             | 0.28   | 0.28    | 0.53          |
    | max             | 320.15 | 200.79  | 268706.60     |
    | 95th percentile | 0.38   | 0.36    | 0.63          |

=== sysbench result during backup ===

```
sysbench --mysql-host=172.16.6.6 --mysql-port=4000 --mysql-user=root --config-file=config oltp_point_select --tables=10 --table-size=1000000 --time=36000 run

SQL statistics:
    queries performed:
        read:    127086779
        write:   0
        other:   0
        total:   127086779
    transactions: 127086779 (3530.19 per sec.)
    queries:      127086779 (3530.19 per sec.)
    ignored errors: 0 (0.00 per sec.)
    reconnects:     0 (0.00 per sec.)

General statistics:
    total time:             36000.0005s
    total number of events: 127086779

Latency (ms):
    min:             0.18
    avg:             0.28
    max:             320.15
    95th percentile: 0.38
    sum:             35927114.30

Threads fairness:
    events (avg/stddev):         127086779.0000/0.00
    execution time (avg/stddev): 35927.1143/0.00
```

=== sysbench result during restore ===

```
sysbench --mysql-host=172.16.6.6 --mysql-port=4000 --mysql-user=root --config-file=config oltp_point_select --tables=10 --table-size=1000000 --time=21600 run

config: No such file or directory
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:    76510850
        write:   0
        other:   0
        total:   76510850
    transactions: 76510850 (3542.17 per sec.)
    queries:      76510850 (3542.17 per sec.)
    ignored errors: 0 (0.00 per sec.)
    reconnects:     0 (0.00 per sec.)

General statistics:
    total time:             21600.0007s
    total number of events: 76510850

Latency (ms):
    min:             0.17
    avg:             0.28
    max:             200.79
    95th percentile: 0.36
    sum:             21551168.55

Threads fairness:
    events (avg/stddev):         76510850.0000/0.00
    execution time (avg/stddev): 21551.1685/0.00
```

=== sysbench result after restore ===

```
sysbench --mysql-host=172.16.6.6 --mysql-port=4000 --mysql-user=root --config-file=config oltp_point_select --tables=10 --table-size=1000000 --time=7200 run

SQL statistics:
    queries performed:
        read:    13656748
        write:   0
        other:   0
        total:   13656748
    transactions: 13656748 (1896.77 per sec.)
    queries:      13656748 (1896.77 per sec.)
    ignored errors: 0 (0.00 per sec.)
    reconnects:     0 (0.00 per sec.)

General statistics:
    total time:             7200.0009s
    total number of events: 13656748

Latency (ms):
    min:             0.19
    avg:             0.53
    max:             268706.60
    95th percentile: 0.63
    sum:             7191942.64

Threads fairness:
    events (avg/stddev):         13656748.0000/0.00
    execution time (avg/stddev): 7191.9426/0.00
```

  4. What version of BR and TiDB/TiKV/PD are you using?

br: v5.1.0-20210611
TiDB: v5.1.0-20210608
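For completeness, these versions can be re-confirmed on the cluster itself; a minimal check, assuming the br binary and a MySQL client are on the path (the host/port match the sysbench commands above):

```
# Print the BR build version.
br --version

# Ask the TiDB server for its version string.
mysql -h 172.16.6.6 -P 4000 -u root -e "SELECT tidb_version();"
```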

  5. Operation logs

    • Please upload br.log for BR if possible
    • Please upload tidb-lightning.log for TiDB-Lightning if possible
    • Please upload tikv-importer.log from TiKV-Importer if possible
    • Other interesting logs
  6. Configuration of the cluster and the task

    • tidb-lightning.toml for TiDB-Lightning if possible
    • tikv-importer.toml for TiKV-Importer if possible
    • topology.yml if deployed by TiUP
  7. Screenshot/exported-PDF of Grafana dashboard or metrics' graph in Prometheus if possible
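The exact br invocations were not included above; as a rough sketch, steps 2-3 of the reproduction would use commands of the following shape (the database name, PD address, and storage URL are placeholders, not the actual values from this cluster):

```
# Step 2: back up the database while the sysbench oltp_point_select test runs.
br backup db --db <db-name> --pd "<pd-addr>:2379" --storage "s3://<bucket>/<prefix>"

# Step 3: after dropping the tables, restore the same backup to the cluster.
br restore db --db <db-name> --pd "<pd-addr>:2379" --storage "s3://<bucket>/<prefix>"
```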

ZipFast commented 3 years ago

We suspect that the checksum step in the restore process flushes the block cache on the TiKV servers, which causes the oltp_point_select latency to increase. Further tests are needed.
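One way to test this hypothesis, as a sketch: restore with the post-restore checksum disabled and rerun the same workload. This assumes the build in use supports BR's `--checksum` flag (check `br restore db --help`); the db/PD/storage values are placeholders. If latency stays near the 0.28 ms baseline without the checksum pass, the checksum is the likely culprit.

```
# Restore the backup but skip the post-restore checksum verification.
br restore db --db <db-name> --pd "<pd-addr>:2379" \
    --storage "s3://<bucket>/<prefix>" --checksum=false

# Rerun the identical point-select workload and compare latency to the table above.
sysbench --mysql-host=172.16.6.6 --mysql-port=4000 --mysql-user=root \
    --config-file=config oltp_point_select --tables=10 --table-size=1000000 \
    --time=7200 run
```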