tidb-challenge-program / bug-hunting-issue


P2-[4.0 bug hunting]-[BR]-Backup/Restore to S3 consistently fails #72

Closed wwar closed 4 years ago

wwar commented 4 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. What did you do?

2. What did you expect to see?

It should copy without errors, retrying individual failed requests as necessary.

When trying to reproduce this issue, please try a large TiKV instance (10GiB+) copying from outside AWS.

3. What did you see instead?

The compressed backup size is ~16G, which would take a couple of hours to upload on my internet connection. It consistently fails. After 5 attempts I gave up:

mysql> BACKUP DATABASE `ontime` TO 's3://wwartmp/ontime';
ERROR 8124 (HY000): Backup failed: msg:"Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"
mysql> BACKUP DATABASE `ontime` TO 's3://wwartmp/ontime2';
ERROR 8124 (HY000): Backup failed: msg:"Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"

However, if I back up locally first and then use aws s3 cp, it always works:

mysql> BACKUP DATABASE `ontime` TO 'local:///mnt/evo860/tmp/ontime';
shell> aws s3 cp --recursive /mnt/evo860/tmp/ontime s3://wwartmp/ 

Restores also fail (but work consistently after copying down with aws s3 cp):

mysql> RESTORE DATABASE `ontime` FROM 's3://wwartmp/ontime';
ERROR 8125 (HY000): Restore failed: Cannot read s3://wwartmpontime/1_2020_55_3238459751c5023457ca60f42e34336ddf53a1394308f28b036f8f50c76ea445_default.sst: download sst failed

(Note the error message: the / between wwartmp and ontime is missing.)
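The dropped separator looks like a plain string-concatenation bug when the prefix and the object name are joined. A minimal sketch of the failure mode and a fix (illustrative only; `join_naive` and `join_fixed` are hypothetical names, not BR's actual code):

```rust
// Illustrative sketch of the missing-slash bug, not the actual BR code.
// Plain concatenation silently drops the separator, producing paths
// like "wwartmpontime/..." as seen in the restore error above.
fn join_naive(prefix: &str, name: &str) -> String {
    format!("{}{}", prefix, name)
}

// A fix: insert '/' unless the prefix already ends with one.
fn join_fixed(prefix: &str, name: &str) -> String {
    if prefix.ends_with('/') {
        format!("{}{}", prefix, name)
    } else {
        format!("{}/{}", prefix, name)
    }
}

fn main() {
    assert_eq!(join_naive("wwartmp", "ontime"), "wwartmpontime");
    assert_eq!(join_fixed("wwartmp", "ontime"), "wwartmp/ontime");
    assert_eq!(join_fixed("wwartmp/", "ontime"), "wwartmp/ontime");
    println!("ok");
}
```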

I took a look into this issue, and I think it is due to the rusoto library never retrying requests, even though the AWS guidelines state that clients should retry transient failures. See: https://github.com/rusoto/rusoto/issues/234
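The client-side retry the AWS guidelines describe is essentially a wrapper with exponential backoff around each request. A minimal sketch (the function name and parameters here are illustrative assumptions, not rusoto's API; `op` stands in for a put_object call):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hedged sketch of client-side retry with exponential backoff, as the
// AWS guidelines recommend. `op` stands in for a fallible S3 request;
// this is not rusoto's actual API.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base_delay_ms: u64,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(e);
                }
                // Exponential backoff: base * 2^(attempt - 1).
                sleep(Duration::from_millis(base_delay_ms << (attempt - 1)));
            }
        }
    }
}

fn main() {
    // Simulate a request that hits a transient error (e.g. a DNS lookup
    // failure) twice before succeeding on the third attempt.
    let mut calls = 0;
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("dns error") } else { Ok("put object ok") }
        },
        5,
        1,
    );
    assert_eq!(result, Ok("put object ok"));
    assert_eq!(calls, 3);
}
```

Production retry logic would also distinguish retryable errors (timeouts, DNS failures, 5xx) from permanent ones and add jitter, but the point is that a single transient DNS failure should not abort a multi-hour backup.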

4. What version of TiDB are you using? (tidb-server -V or run select tidb_version(); on TiDB)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-391-g43764a59b
Git Commit Hash: 43764a59b7dcb846dc1e9754e8f125818c69a96f
Git Branch: master
UTC Build Time: 2020-04-30 11:38:13
GoVersion: go1.13.8
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

IANTHEREAL commented 4 years ago

thanks @wwar, your issue is really timely. (。•̀ᴗ-)✧

shuke987 commented 4 years ago

/bug P1

wwar commented 4 years ago

I retried against nightly builds today. I can still reproduce:

mysql> BACKUP DATABASE `ontime` TO 's3://wwartmp/ontimenew';
ERROR 8124 (HY000): Backup failed: msg:"Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"

mysql> RESTORE DATABASE `ontime` FROM 's3://wwartmp/ontime';
ERROR 8125 (HY000): Restore failed: Cannot read s3://wwartmpontime/1_1260_58_263e789adc07fdc44244ac975c4aff550101495350b7a84d4734b9a1a2e6d51a_write.sst: failed to get object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known: download sst failed

mysql> RESTORE DATABASE `ontime` FROM 's3://wwartmp/ontime';
ERROR 8125 (HY000): Restore failed: Cannot read s3://wwartmpontime/1_2052_31_1ed2a90e32c3e217caa2d6742961d48dd499098321429ff7169099cecc5727e9_default.sst: failed to get object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known: download sst failed

The TiKV logs are not any more descriptive:

[2020/05/27 19:20:54.766 -06:00] [INFO] [endpoint.rs:698] ["run backup task"] [task="BackupTask { start_ts: TimeStamp(0), end_ts: TimeStamp(416973810489622529), start_key: \"74800000000000002F5F720000000000000000\", end_key: \"74800000000000002F5F72FFFFFFFFFFFFFFFF00\", is_raw_kv: false, cf: \"default\" }"]
[2020/05/27 19:21:42.415 -06:00] [INFO] [util.rs:407] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/05/27 19:21:42.415 -06:00] [INFO] [util.rs:407] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/05/27 19:21:42.416 -06:00] [INFO] [util.rs:472] ["connected to PD leader"] [endpoints=http://127.0.0.1:2379]
[2020/05/27 19:21:42.416 -06:00] [INFO] [util.rs:183] ["heartbeat sender and receiver are stale, refreshing ..."]
[2020/05/27 19:21:42.417 -06:00] [WARN] [util.rs:202] ["updating PD client done"] [spend=2.029447ms]
[2020/05/27 19:21:42.417 -06:00] [INFO] [client.rs:402] ["cancel region heartbeat sender"]
[2020/05/27 19:27:33.705 -06:00] [ERROR] [endpoint.rs:284] ["backup save file failed"] [error="Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"]
[2020/05/27 19:27:33.705 -06:00] [ERROR] [endpoint.rs:669] ["backup region failed"] [error="Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"] [end_key=] [start_key=] [region="id: 1062 start_key: 7480000000000000FF2F5F728000000001FFCC555B0000000000FA end_key: 7480000000000000FF2F5F728000000001FFCE8DF20000000000FA region_epoch { conf_ver: 1 version: 34 } peers { id: 1063 store_id: 1 }"]
[2020/05/27 19:27:33.706 -06:00] [ERROR] [service.rs:86] ["backup canceled"] [error=RemoteStopped]
[2020/05/27 19:27:38.879 -06:00] [ERROR] [endpoint.rs:680] ["backup failed to send response"] [error="TrySendError { kind: Disconnected }"]

If it helps, I do expect my DNS to be pretty reliable: I am not using my ISP's resolver, but Google's public service at 8.8.8.8. And again, if I use the aws s3 cp command-line tool, the upload works without problems.

Versions:

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-517-gaf7bbbe24
Edition: Community
Git Commit Hash: af7bbbe2412f9a0174338526daa01fe270500806
Git Branch: master
UTC Build Time: 2020-05-27 12:50:37
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

wwar@ivory:/mnt/evo860/bin$ ./tikv-server -V
TiKV 
Release Version:   4.1.0-alpha
Edition:           Community
Git Commit Hash:   970a9bf2a9ea782a455ae579ad237aaf6cb1daec
Git Commit Branch: master
UTC Build Time:    2020-05-27 06:17:35
Rust Version:      rustc 1.44.0-nightly (b2e36e6c2 2020-04-22)
Enable Features:   jemalloc portable sse protobuf-codec
Profile:           dist_release

kennytm commented 4 years ago

Backup retry needs tikv/tikv#7917 which has not been merged.

Similarly, TiDB's BR dependency has not been updated to include pingcap/br#298 (perhaps TiDB should display the bundled BR version somewhere 🤔)

kennytm commented 4 years ago

(Should reopen until https://github.com/tidb-challenge-program/bug-hunting-issue/issues/72#issuecomment-635176501 is completed...)

wwar commented 4 years ago

I could still reproduce this using nightly builds yesterday. I hope the bug fixes address the issue correctly; I am not sure when TiDB's BR dependency has been, or will be, updated.

wwar commented 4 years ago

I managed to restore a 16GiB database correctly, but a backup failed. I am going to open a new issue in pingcap/tidb, since I'm not able to reopen this one.