thanks @wwar, your issue is really timely. (。•̀ᴗ-)✧
/bug P1
I retried against nightly builds today. I can still reproduce:
mysql> BACKUP DATABASE `ontime` TO 's3://wwartmp/ontimenew';
ERROR 8124 (HY000): Backup failed: msg:"Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"
mysql> RESTORE DATABASE `ontime` FROM 's3://wwartmp/ontime';
ERROR 8125 (HY000): Restore failed: Cannot read s3://wwartmpontime/1_1260_58_263e789adc07fdc44244ac975c4aff550101495350b7a84d4734b9a1a2e6d51a_write.sst: failed to get object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known: download sst failed
mysql> RESTORE DATABASE `ontime` FROM 's3://wwartmp/ontime';
ERROR 8125 (HY000): Restore failed: Cannot read s3://wwartmpontime/1_2052_31_1ed2a90e32c3e217caa2d6742961d48dd499098321429ff7169099cecc5727e9_default.sst: failed to get object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known: download sst failed
The TiKV logs are not any more descriptive:
[2020/05/27 19:20:54.766 -06:00] [INFO] [endpoint.rs:698] ["run backup task"] [task="BackupTask { start_ts: TimeStamp(0), end_ts: TimeStamp(416973810489622529), start_key: \"74800000000000002F5F720000000000000000\", end_key: \"74800000000000002F5F72FFFFFFFFFFFFFFFF00\", is_raw_kv: false, cf: \"default\" }"]
[2020/05/27 19:21:42.415 -06:00] [INFO] [util.rs:407] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/05/27 19:21:42.415 -06:00] [INFO] [util.rs:407] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/05/27 19:21:42.416 -06:00] [INFO] [util.rs:472] ["connected to PD leader"] [endpoints=http://127.0.0.1:2379]
[2020/05/27 19:21:42.416 -06:00] [INFO] [util.rs:183] ["heartbeat sender and receiver are stale, refreshing ..."]
[2020/05/27 19:21:42.417 -06:00] [WARN] [util.rs:202] ["updating PD client done"] [spend=2.029447ms]
[2020/05/27 19:21:42.417 -06:00] [INFO] [client.rs:402] ["cancel region heartbeat sender"]
[2020/05/27 19:27:33.705 -06:00] [ERROR] [endpoint.rs:284] ["backup save file failed"] [error="Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"]
[2020/05/27 19:27:33.705 -06:00] [ERROR] [endpoint.rs:669] ["backup region failed"] [error="Io(Custom { kind: Other, error: \"failed to put object Error during dispatch: error trying to connect: dns error: failed to lookup address information: Name or service not known\" })"] [end_key=] [start_key=] [region="id: 1062 start_key: 7480000000000000FF2F5F728000000001FFCC555B0000000000FA end_key: 7480000000000000FF2F5F728000000001FFCE8DF20000000000FA region_epoch { conf_ver: 1 version: 34 } peers { id: 1063 store_id: 1 }"]
[2020/05/27 19:27:33.706 -06:00] [ERROR] [service.rs:86] ["backup canceled"] [error=RemoteStopped]
[2020/05/27 19:27:38.879 -06:00] [ERROR] [endpoint.rs:680] ["backup failed to send response"] [error="TrySendError { kind: Disconnected }"]
If it helps, I actually do expect my DNS to be pretty reliable: I am not using my ISP's resolver but Google's public DNS (8.8.8.8). And again, if I use the aws s3 cp command-line tool, the backup works without problem.
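For anyone trying to rule out the OS resolver while reproducing this, here is a quick std-only Rust check of the lookup itself. This is just a sketch; s3.amazonaws.com is the generic endpoint and an assumption on my part, so substitute your bucket's regional endpoint.

use std::net::ToSocketAddrs;

fn main() {
    // to_socket_addrs() wants a host:port pair, hence the :443.
    match "s3.amazonaws.com:443".to_socket_addrs() {
        Ok(addrs) => {
            for addr in addrs {
                println!("resolved: {}", addr);
            }
        }
        // Same class of failure as the "Name or service not known" error above.
        Err(e) => eprintln!("lookup failed: {}", e),
    }
}

If this resolves reliably from the TiKV host while the backup still fails, the problem is more likely in how the client handles transient lookup errors than in the resolver itself.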
Versions:
mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-517-gaf7bbbe24
Edition: Community
Git Commit Hash: af7bbbe2412f9a0174338526daa01fe270500806
Git Branch: master
UTC Build Time: 2020-05-27 12:50:37
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)
wwar@ivory:/mnt/evo860/bin$ ./tikv-server -V
TiKV
Release Version: 4.1.0-alpha
Edition: Community
Git Commit Hash: 970a9bf2a9ea782a455ae579ad237aaf6cb1daec
Git Commit Branch: master
UTC Build Time: 2020-05-27 06:17:35
Rust Version: rustc 1.44.0-nightly (b2e36e6c2 2020-04-22)
Enable Features: jemalloc portable sse protobuf-codec
Profile: dist_release
Backup retry needs tikv/tikv#7917, which has not been merged yet.
Similarly, TiDB's BR dependency has not been updated to include pingcap/br#298 (perhaps TiDB should display the bundled BR version somewhere 🤔)
(This should be reopened until https://github.com/tidb-challenge-program/bug-hunting-issue/issues/72#issuecomment-635176501 is completed...)
I could still reproduce this using nightly builds yesterday. I hope the bug fixes address the issue correctly; I am not sure whether TiDB's BR dependency has been updated yet, or when it will be.
I managed to restore a 16GiB database correctly, but a backup failed. I am going to open a new issue in pingcap/tidb, since I'm not able to reopen this.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. What did you do?
2. What did you expect to see?
It should copy without errors, retrying individual failed requests as necessary.
When trying to reproduce this issue, please try a large TiKV instance (10GiB+) copying from outside AWS.
3. What did you see instead?
The compressed backup size is ~16 GiB, which would take a couple of hours to upload on my internet connection. It consistently fails; after 5 attempts I gave up.
However, if I back up locally first and then use aws s3 cp, it always works. Restores also fail (but work consistently via aws s3 cp).
(Note the error message: there should be an extra / in there between wwartmp and ontime; the path comes out as s3://wwartmpontime/... instead.)
I took a look into this issue, and I think it is because the rusoto library never retries requests, even though the AWS guidelines state that clients should retry. See: https://github.com/rusoto/rusoto/issues/234
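To make concrete what "clients should retry" would mean here, below is a minimal, self-contained sketch of the kind of backoff wrapper I would expect around each put/get request. The names, delays, and attempt counts are made up for illustration; this is not rusoto's or BR's actual code.

use std::thread::sleep;
use std::time::Duration;

// Retry an operation with exponential backoff, which is what the AWS
// guidelines ask clients to do. Sketch only.
fn retry_with_backoff<T, E, F>(max_attempts: u32, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Debug,
{
    let mut delay = Duration::from_millis(100);
    for attempt in 1..=max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts => {
                // A transient DNS failure or dropped connection lands here and is
                // retried instead of failing the whole multi-hour backup.
                eprintln!("attempt {} failed: {:?}; retrying in {:?}", attempt, e, delay);
                sleep(delay);
                delay *= 2;
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("max_attempts must be at least 1")
}

fn main() {
    // Hypothetical stand-in for a put_object call that fails twice, then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(5, || {
        calls += 1;
        if calls < 3 {
            Err("dns error: failed to lookup address information")
        } else {
            Ok(())
        }
    });
    println!("result after {} calls: {:?}", calls, result);
}

With something like this around each S3 request, one failed lookup during a multi-hour upload would cost a short pause instead of aborting the entire backup or restore.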
4. What version of TiDB are you using? (tidb-server -V, or run select tidb_version(); on TiDB)