pingcap / tidb


Region data restore when tikv-server down #7807

Closed victorggsimida closed 5 years ago

victorggsimida commented 5 years ago

Question


Here is the info of one of the regions:

» region 88042
{
  "id": 88042,
  "start_key": "t\x80\x00\x00\x00\x00\x00\x02\xff\xd1_r\x80\x00\x00\x00\x00\xffZ7\xaf\x00\x00\x00\x00\x00\xfa",
  "end_key": "t\x80\x00\x00\x00\x00\x00\x02\xff\xd1r\x80\x00\x00\x00\x00\xff\xc8a\x00\x00\x00\x00\x00\xfa",
  "epoch": { "conf_ver": 260, "version": 4304 },
  "peers": [
    { "id": 45412102, "store_id": 45276315 },
    { "id": 46840252, "store_id": 38566112 }
  ],
  "leader": { "id": 46840252, "store_id": 38566112 },
  "approximate_size": 25
}

This region has replicas on stores 45276315 and 38566112. Store 45276315 is down and its data is lost, while store 38566112 is fine.

operator show

"makeUpOfflineReplica (kind:region,replica, region:88042(4304,260), createAt:2018-09-28 14:38:03.552894614 +0800 CST m=+4420223.190830856, currentStep:0, steps:[add peer 48665174 on store 48166008 remove peer on store 45276315]) ",

But this makeup operator never succeeds, and this has been happening all the time since Sep 8th.

And in the tikv.log on store 38566112:


2018/09/28 14:51:26.031 ERRO scheduler.rs:1152: get snapshot failed for cids=[150279], error Request(message: "peer is not leader" not_leader {region_id: 88042})

How can I resolve this?

These two commands have already been run:

» region --jq=".regions[] | {id: .id, remove_peer: [.peers[].store_id] | select(length>1) | map(if .==(45276315) then . else empty end) | select(length==1)}"
{"id":252164,"remove_peer":[45276315]}
{"id":42513983,"remove_peer":[45276315]}
{"id":265485,"remove_peer":[45276315]}
{"id":44162503,"remove_peer":[45276315]}
{"id":88042,"remove_peer":[45276315]}
{"id":47413206,"remove_peer":[45276315]}
{"id":46937338,"remove_peer":[45276315]}

» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(45276315) then . else empty end) | length>=$total-length) }"
{"id":47413206,"peer_stores":[45276315,45276314]}
{"id":46937338,"peer_stores":[45276315,10]}
{"id":252164,"peer_stores":[45276315,10]}
{"id":42513983,"peer_stores":[45276256,45276315]}
{"id":265485,"peer_stores":[45276256,45276315]}
{"id":44162503,"peer_stores":[38566113,45276315]}
{"id":88042,"peer_stores":[45276315,38566112]}

But it is still the same:

» operator show
[
  "makeUpOfflineReplica (kind:region,replica, region:42513983(51443,1146), createAt:2018-09-28 14:40:16.955593773 +0800 CST m=+4420356.593530015, currentStep:0, steps:[add peer 48665189 on store 48166008 remove peer on store 45276315]) ",
  "makeUpOfflineReplica (kind:region,replica, region:44162503(55206,1924), createAt:2018-09-28 14:40:30.510508351 +0800 CST m=+4420370.148444593, currentStep:0, steps:[add peer 48665190 on store 48166008 remove peer on store 45276315]) ",
  "makeUpOfflineReplica (kind:region,replica, region:47413206(62052,4780), createAt:2018-09-28 14:41:07.655764708 +0800 CST m=+4420407.293700951, currentStep:0, steps:[add peer 48665192 on store 48166008 remove peer on store 45276315]) ",
  "makeUpOfflineReplica (kind:region,replica, region:265485(17047,254), createAt:2018-09-28 14:38:14.553562609 +0800 CST m=+4420234.191498827, currentStep:0, steps:[add peer 48665183 on store 48166008 remove peer on store 45276315]) ",
  "makeUpOfflineReplica (kind:region,replica, region:252164(14168,256), createAt:2018-09-28 14:38:12.050752151 +0800 CST m=+4420231.688688400, currentStep:0, steps:[add peer 48665182 on store 48166008 remove peer on store 45276315]) ",
  "makeUpOfflineReplica (kind:region,replica, region:46937338(50678,2588), createAt:2018-09-28 14:36:08.137799669 +0800 CST m=+4420107.775735895, currentStep:0, steps:[add peer 48665138 on store 48166008 remove peer on store 45276315]) timeout",
  "makeUpOfflineReplica (kind:region,replica, region:88042(4304,260), createAt:2018-09-28 14:38:03.552894614 +0800 CST m=+4420223.190830856, currentStep:0, steps:[add peer 48665174 on store 48166008 remove peer on store 45276315]) "
]
victorggsimida commented 5 years ago

And how can I find which table this region belongs to?

victorggsimida commented 5 years ago

Operator log in pd.log:

2018/09/28 15:00:57.971 cluster.go:449: [info] [region 0xca8f30] operator timeout: makeUpOfflineReplica (kind:region,replica, region:88042(4304,260), createAt:2018-09-28 14:50:17.276408872 +0800 CST m=+4420956.914345118, currentStep:0, steps:[add peer 48665230 on store 48166008 remove peer on store 45276315]) timeout
rleungx commented 5 years ago

Could you please tell us which versions of PD and TiKV you are using? @victorggsimida

victorggsimida commented 5 years ago

@rleungx pd

Release Version: v2.0.5
Git Commit Hash: b64716707b7279a4ae822be767085ff17b5f3fea
Git Branch: release-2.0
UTC Build Time:  2018-07-06 10:27:51

tikv

TiKV
Release Version:   2.1.0-beta
Git Commit Hash:   96022a982b6b34d9cd8b690cf6d4b0b85ffae247
Git Commit Branch: master
UTC Build Time:    2018-08-06 11:53:06
Rust Version:      rustc 1.29.0-nightly (4f3c7a472 2018-07-17)
rleungx commented 5 years ago

Do you only have two replicas, and one of them is down? If so, it seems that with only two replicas and one of them down, the region cannot reach a quorum to elect a new leader and make up the replica. But PD will still think there are two replicas, so it will keep scheduling the operator, which is bound to fail.
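
As a side note (not from the original thread), the configured replica count can be checked with pd-ctl; the PD address below is just a placeholder:

./pd-ctl -u http://<pd-address>:2379 config show    # look for "max-replicas" in the output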

how to find which table this region belongs to

You can use tidb-ctl to get it.

how to resolve this?

You can remove the down peer by using tikv-ctl unsafe-recover. Here is the documentation about how to use it.

victorggsimida commented 5 years ago

@rleungx

tikv-ctl --path ( which store path should I input? )

Store 45276315 is lost; its path does not exist anymore.

rleungx commented 5 years ago

@victorggsimida You need to log in to a store which still exists and use tikv-ctl unsafe-recover remove-fail-stores, specifying some parameters:

victorggsimida commented 5 years ago

@rleungx

  1. tidb-ctl: get region info

./tidb-ctl -H*.*.*.*   -P10083  region -i88042
{
    "region_id": 88042,
    "start_key": "dIAAAAAAAALRX3KAAAAAAFo3rw==",
    "end_key": "dIAAAAAAAALRX3KAAAAAAF/IYQ==",
    "frames": null
}

Is there something wrong with my test?

  2. tikv-ctl

./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores --stores 45276315 --regions 88042
error: Found argument '--stores' which wasn't expected, or isn't valid in this context

The example in the documentation is:

$ tikv-ctl --db /path/to/tikv/db unsafe-recover remove-fail-stores --stores 3 --regions 1001,1002

USAGE: tikv-ctl unsafe-recover remove-fail-stores ...

Another command:

./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores 45276315 88042
thread 'main' panicked at 'called Result::unwrap() on an Err value: "IO error: While lock file: /ssd2/tidb/tikv/deploy2/data/db/LOCK: Resource temporarily unavailable"', libcore/result.rs:945:5
note: Run with RUST_BACKTRACE=1 for a backtrace.

I downloaded the latest version of tikv-ctl, and it has no --stores option.

./tikv-ctl --help
TiKV Ctl
PingCAP
Distributed transactional key value database powered by Rust and Raft

USAGE: tikv-ctl [OPTIONS] [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    --ca-path        set CA certificate path
    --cert-path      set certificate path
    --config         set config for rocksdb
    --db             set rocksdb path
    --decode         decode a key in escaped format
    --encode         encode a key in escaped format
    --to-hex         convert escaped key to hex key
    --to-escaped     convert hex key to escaped key
    --host           set remote host
    --key-path       set private key path
    --pd             pd address
    --raftdb         set raft rocksdb path

rleungx commented 5 years ago

@victorggsimida Sorry for that. There is something wrong with this documentation. You can replace --stores and --regions with -s and -r respectively. Have you ever dropped the table? It seems that the table this region belongs to might have been dropped.

victorggsimida commented 5 years ago

@rleungx So I want to find the table which this region belongs to: the database name and the table name.

victorggsimida commented 5 years ago

@rleungx It still did not work. Is there something wrong with my command? Thank you.

./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores -s 45276315 -r 88042
error: Found argument '-s' which wasn't expected, or isn't valid in this context

USAGE:
    tikv-ctl unsafe-recover remove-fail-stores <stores>...

For more information try --help
rleungx commented 5 years ago

@victorggsimida Could you try using the latest tikv-ctl? It seems that this one is outdated. Also, you can use curl http://{TiDBIP}:10080/schema?table_id={tableID} to find the table name; the ID of this table is 721. If there is no result, you can try admin show ddl jobs; to find the table name.
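
For reference, a rough sketch (not from the original thread) of how the table ID 721 can be read off the base64 start_key that tidb-ctl printed above; the key is a 't' prefix followed by the table ID as a sign-bit-flipped big-endian int64:

echo 'dIAAAAAAAALRX3KAAAAAAFo3rw==' | base64 -d | od -An -tx1
#  74 80 00 00 00 00 00 02 d1 5f 72 80 00 00 00 00 00 5a 37 af
# 0x74 is the 't' prefix; the next 8 bytes (80 00 00 00 00 00 02 d1) with the sign bit flipped give 0x2d1 = 721
curl http://{TiDBIP}:10080/schema?table_id=721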

victorggsimida commented 5 years ago

@rleungx

I tried the latest one again.

md5sum tikv-ctl
c78a5189c3fdf3bfa515897fa541aeb6  tikv-ctl

./tikv-ctl  --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores -s 45276315 -r 88042
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "IO error: While lock file: /ssd2/tidb/tikv/deploy2/data/db/LOCK: Resource temporarily unavailable"', libcore/result.rs:945:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

 ./tikv-ctl RUST_BACKTRACE=1  --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores -s 45276315 -r 88042
error: Found argument 'RUST_BACKTRACE=1' which wasn't expected, or isn't valid in this context
rleungx commented 5 years ago

You should stop the TiKV instance on which you run this command. @victorggsimida

victorggsimida commented 5 years ago

@rleungx
My mistake, I did not tell you that I had already tried with the TiKV shut down.

And it seems that data/db/LOCK will not release itself.

Can I just delete the LOCK file?

rleungx commented 5 years ago

Here is the recovery process:

  1. stop the TiKV whose id is 38566112 in your case
  2. run ./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores -s 45276315 -r 88042 according to your case

can i just delete LOCK file?

You can use lsof to check whether this file is occupied by some process. If it is not occupied, you can delete it.
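
For example, a minimal sketch using the paths from this thread (assuming the TiKV instance on that machine has already been stopped):

lsof /ssd2/tidb/tikv/deploy2/data/db/LOCK
# no output means nothing holds the lock, so the unsafe-recover command from step 2 should be able to open the DB
./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores -s 45276315 -r 88042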

victorggsimida commented 5 years ago

@rleungx Sorry to disturb you again.

  1. The latest tikv-ctl still has no -s option and no option to specify the region id:

    ./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores  45276315
    removing stores [45276315] from configrations...
    success

    ./tikv-ctl --db /ssd2/tidb/tikv/deploy2/data/db unsafe-recover remove-fail-stores -s 45276315
    error: Found argument '-s' which wasn't expected, or isn't valid in this context

    USAGE: tikv-ctl unsafe-recover remove-fail-stores ...


2. After running the command above: successful.

And I still see the region info including the failed store:

» region 88042
{
  "id": 88042,
  "start_key": "t\x80\x00\x00\x00\x00\x00\x02\xff\xd1_r\x80\x00\x00\x00\x00\xffZ7\xaf\x00\x00\x00\x00\x00\xfa",
  "end_key": "t\x80\x00\x00\x00\x00\x00\x02\xff\xd1r\x80\x00\x00\x00\x00\xff\xc8a\x00\x00\x00\x00\x00\xfa",
  "epoch": { "conf_ver": 260, "version": 4304 },
  "peers": [
    { "id": 45412102, "store_id": 45276315 },
    { "id": 46840252, "store_id": 38566112 }
  ],
  "leader": { "id": 46840252, "store_id": 38566112 },
  "approximate_size": 25
}

victorggsimida commented 5 years ago

@rleungx

Thank you for your help. The region has already been removed from the down store.

If I had not handled it manually, it would keep occupying the scheduling process with "makeUpOfflineReplica", and maybe it would affect the region balance and hot region balance.

Thank you again.
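
As a side note (not from the original thread), a stuck operator like this can also be cancelled from pd-ctl so that it stops occupying the scheduler. A sketch, assuming a pd-ctl version that supports the operator subcommand:

» operator show            # list pending operators
» operator remove 88042    # cancel the pending operator on region 88042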