Fail to start a TiDB cluster

fengzhao commented 5 years ago

Bug Report

英文不太好，我就直接上中文了，不过你用中文或英文回复我都可以。谢谢！ Please answer these questions before submitting your issue. Thanks!

What did you do? If possible, provide a recipe for reproducing the error. tidb tikv tird 各自一个节点，tidb为中控机，按照官网 ansible 部署教程启动时，提示连不到 PD。curl 看 PD 时，服务是正常的。

What did you expect to see? 启动成功
What did you see instead? 连不上PD
What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)? Release Version: v2.1.0

fengzhao commented 5 years ago

shenli commented 5 years ago

@fengzhao Please show use the log of PD.

fengzhao commented 5 years ago

➜ log date Thu Dec 13 17:53:20 HKT 2018

➜ log date Thu Dec 13 17:54:00 HKT 2018 ➜ log ➜ log ➜ log tail -300f pd.log 2018/12/13 17:52:47.818 util.go:56: [info] Welcome to Placement Driver (PD). 2018/12/13 17:52:47.819 util.go:57: [info] Release Version: v2.1.0 2018/12/13 17:52:47.819 util.go:58: [info] Git Commit Hash: 757e91d4bf3df047a8953f239c7c2eb2aea6e7e1 2018/12/13 17:52:47.819 util.go:59: [info] Git Branch: release-2.1 2018/12/13 17:52:47.819 util.go:60: [info] UTC Build Time: 2018-11-30 11:36:10 2018/12/13 17:52:47.819 metricutil.go:79: [info] disable Prometheus push client 2018/12/13 17:52:47.819 server.go:104: [info] PD config - { "client-urls": "http://192.168.8.36:2379", "peer-urls": "http://192.168.8.36:2380", "advertise-client-urls": "http://192.168.8.36:2379", "advertise-peer-urls": "http://192.168.8.36:2380", "name": "pd1", "data-dir": "/home/tidb/deploy/data.pd", "initial-cluster": "pd1=http://192.168.8.36:2380", "initial-cluster-state": "new", "join": "", "lease": 1, "log": { "level": "info", "format": "text", "disable-timestamp": false, "file": { "filename": "/home/tidb/deploy/log/pd.log", "log-rotate": true, "max-size": 300, "max-days": 0, "max-backups": 0 } }, "log-file": "", "log-level": "", "tso-save-interval": "3s", "metric": { "job": "pd1", "address": "", "interval": "0s" }, "schedule": { "max-snapshot-count": 3, "max-pending-peer-count": 16, "max-merge-region-size": 0, "max-merge-region-keys": 0, "split-merge-interval": "1h0m0s", "patrol-region-interval": "100ms", "max-store-down-time": "30m0s", "leader-schedule-limit": 4, "region-schedule-limit": 4, "replica-schedule-limit": 8, "merge-schedule-limit": 8, "tolerant-size-ratio": 5, "low-space-ratio": 0.8, "high-space-ratio": 0.6, "disable-raft-learner": "false", "disable-remove-down-replica": "false", "disable-replace-offline-replica": "false", "disable-make-up-replica": "false", "disable-remove-extra-replica": "false", "disable-location-replacement": "false", "disable-namespace-relocation": "false", "schedulers-v2": [ { "type": "balance-region", "args": null, "disable": false }, { "type": "balance-leader", "args": null, "disable": false }, { "type": "hot-region", "args": null, "disable": false }, { "type": "label", "args": null, "disable": false } ] }, "replication": { "max-replicas": 3, "location-labels": "" }, "namespace": {}, "cluster-version": "0.0.0", "quota-backend-bytes": "0 B", "auto-compaction-mode": "periodic", "auto-compaction-retention-v2": "1h", "TickInterval": "500ms", "ElectionInterval": "3s", "PreVote": true, "security": { "cacert-path": "", "cert-path": "", "key-path": "" }, "label-property": null, "WarningMsgs": null, "namespace-classifier": "table" } 2018/12/13 17:52:47.821 server.go:136: [info] start embed etcd 2018/12/13 17:52:47.821 log.go:88: [info] embed: [pprof is enabled under /debug/pprof] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [name = pd1] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [data dir = /home/tidb/deploy/data.pd] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [member dir = /home/tidb/deploy/data.pd/member] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [heartbeat = 500ms] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [election = 3000ms] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [snapshot count = 100000] 2018/12/13 17:52:47.822 log.go:88: [info] etcdserver: [advertise client URLs = http://192.168.8.36:2379] 2018/12/13 17:52:47.823 systime_mon.go:24: [info] start system time monitor 2018/12/13 17:52:47.866 log.go:88: [info] etcdserver: [restarting member 61784f96623c1953 in cluster 7e6abe01b25b1b1c at commit index 16455] 2018/12/13 17:52:47.868 raft.go:656: [info] 61784f96623c1953 became follower at term 4 2018/12/13 17:52:47.868 raft.go:364: [info] newRaft 61784f96623c1953 [peers: [], term: 4, commit: 16455, applied: 0, lastindex: 16455, lastterm: 4] 2018/12/13 17:52:47.869 log.go:88: [info] mvcc: [restore compact to 15100] 2018/12/13 17:52:47.872 log.go:84: [warning] auth: [simple token is not cryptographically signed] 2018/12/13 17:52:47.872 log.go:88: [info] embed: [61784f96623c1953 starting with cors [""]] 2018/12/13 17:52:47.872 log.go:88: [info] embed: [61784f96623c1953 starting with host whitelist [""]] 2018/12/13 17:52:47.872 log.go:88: [info] etcdserver: [starting server... [version: 3.3.0+git, cluster version: to_be_decided]] 2018/12/13 17:52:47.875 log.go:88: [info] embed: [listening for peers on 192.168.8.36:2380] 2018/12/13 17:52:47.875 log.go:88: [info] etcdserver/membership: [added member 61784f96623c1953 [http://192.168.8.36:2380] to cluster 7e6abe01b25b1b1c] 2018/12/13 17:52:47.875 log.go:86: [info] etcdserver/membership: [set the initial cluster version to 3.3] 2018/12/13 17:52:47.875 log.go:88: [info] etcdserver/api: [enabled capabilities for version 3.3] 2018/12/13 17:52:52.868 raft.go:857: [info] 61784f96623c1953 is starting a new election at term 4 2018/12/13 17:52:52.868 raft.go:684: [info] 61784f96623c1953 became pre-candidate at term 4 2018/12/13 17:52:52.868 raft.go:755: [info] 61784f96623c1953 received MsgPreVoteResp from 61784f96623c1953 at term 4 2018/12/13 17:52:52.868 raft.go:669: [info] 61784f96623c1953 became candidate at term 5 2018/12/13 17:52:52.868 raft.go:755: [info] 61784f96623c1953 received MsgVoteResp from 61784f96623c1953 at term 5 2018/12/13 17:52:52.869 raft.go:712: [info] 61784f96623c1953 became leader at term 5 2018/12/13 17:52:52.869 node.go:306: [info] raft.node: 61784f96623c1953 elected leader 61784f96623c1953 at term 5 2018/12/13 17:52:52.869 server.go:166: [info] create etcd v3 client with endpoints [http://192.168.8.36:2379] 2018/12/13 17:52:52.869 log.go:88: [info] etcdserver: [published {Name:pd1 ClientURLs:[http://192.168.8.36:2379]} to cluster 7e6abe01b25b1b1c] 2018/12/13 17:52:52.869 log.go:88: [info] embed: [ready to serve client requests] 2018/12/13 17:52:52.870 log.go:86: [info] embed: [serving insecure client requests on 192.168.8.36:2379, this is strongly discouraged!] 2018/12/13 17:52:52.871 server.go:205: [info] init cluster id 6634201091088044800 2018/12/13 17:52:52.871 namespace_classifier.go:438: [info] load 0 namespacesInfo cost 283.89µs 2018/12/13 17:52:52.872 leader.go:95: [warning] leader is still name:"pd1" member_id:7023451126196083027 peer_urls:"http://192.168.8.36:2380" client_urls:"http://192.168.8.36:2379" , delete and campaign again 2018/12/13 17:52:52.875 tso.go:104: [info] sync and save timestamp: last 2018-12-13 17:50:52.573357585 +0800 HKT save 2018-12-13 17:52:55.874996105 +0800 HKT m=+8.067861500 next 2018-12-13 17:52:52.874996105 +0800 HKT m=+5.067861500 2018/12/13 17:52:52.878 leader.go:263: [info] cluster version is 0.0.0 2018/12/13 17:52:52.878 leader.go:264: [info] PD cluster leader pd1 is ready to serve

fengzhao commented 5 years ago

@shenli @lysu

fengzhao commented 5 years ago

@lysu could you help me slove the problem first ? hurry hurry hurry ! I made a test on my server using docker to deploy tidb . it works ! but when it doesn't work using ansible 。 thx

lysu commented 5 years ago

@fengzhao Hi, our ansible expert are look at this question please wait a mount.

superlzs0476 commented 5 years ago

Hi,May be a pd-server conflict, please confirm that the pd-server has been started before starting the cluster with ansible.

fengzhao commented 5 years ago

should i comment this code ? does it matters ? @lysu

fengzhao commented 5 years ago

Hi,May be a pd-server conflict, please confirm that the pd-server has been started before starting the cluster with ansible.

what does a pd-server conflict mean ? the ansbile start.yml defines the order of the server start . it seems the pd-server start OK . but DB is wating pd-server's http responce . but i use curl to make a request , it works . that's why i can't understand it 。

fengzhao commented 5 years ago

should i comment this code ? does it matters ? @lysu

it works . when i comment the wait for pd up services . it start OK and i can use mysql client to connect it and access the moniter using chrome 。I'm unfamiliar with the ansible。 but according to the code ， it seems to make a http request and until "200 OK" in the response to be searched ?

but i don't konw how ansible make this reqesut . show u consider log the request clearly like the "curl -v " cammond ?

superlzs0476 commented 5 years ago

Login pd node ,execute command "ps -ef |grep pd-server" ,and then output result . What is the time that executed "ansible-playbook start.yml" ?

fengzhao commented 5 years ago

不用质疑时间的问题了，能不能关注一下重点，我发的日志只不过是因为我第二天一大早清空了pd头天的日志，然后在中控机又启动了一下，再贴上pd的日志的，在中控机执行启动命令之后，就去pd server里面看进程和日志了，都正常，然后中控机启动命令执行到等待连接pd的时候挂在那里，一直到连接超时退出，然后我在中控机手动curl向pd发http请求，连接正常。然后我把检查连接的语句注释再重启一下。就启动成功了。

LinuxGit commented 5 years ago

Could you provide the output of the following commands, Thanks.

$ time curl -v "172.16.30.193:2379/pd/api/v1/members"
$ env | grep -i http

fengzhao commented 5 years ago

Could you provide the output of the following commands, Thanks.
$ time curl -v "172.16.30.193:2379/pd/api/v1/members"
$ env | grep -i http

我curl访问pd的url是秒开，连接正常，我不知道你让我查看环境变量是想做什么

fengzhao commented 5 years ago

我对ansible不是很熟悉，我想问问，start.yml 文件中的这段代码，具体是用什么去发起http请求的。 `

- name: wait for PD up
  wait_for: |
    host={{ ansible_host }} port={{ pd_client_port }} state=present
    send='GET /pd/api/v1/members HTTP/1.0\r\n\r\n' search_regex='200 OK'
  when: not enable_tls|default(false)

- name: wait for PD up
  uri:
    url: "https://{{ ansible_host }}:{{ pd_client_port }}/pd/api/v1/members"
    validate_certs: no
    client_cert: "{{ pd_cert_dir }}/pd-server-{{ ansible_host }}.pem"
    client_key: "{{ pd_cert_dir }}/pd-server-{{ ansible_host }}-key.pem"
  register: result
  until: result.status is defined and result.status == 200
  retries: 10
  delay: 5
  when: enable_tls|default(false)

`

LinuxGit commented 5 years ago

Ansible use wait_for module currently. I'd like to check if you have set any env about HTTP, like HTTP_PROXY.

We're testing uri module to check http interface and if we can reproduce this error on TiDB 2.1.

fengzhao commented 5 years ago

我就算有代理，也是在外层的网络设置的透明代理，而且也只是针对部分被墙的网站才走代理。我的节点都是vmware esxi中同一网段内的虚拟机，不会走代理的。真的很希望你们能重现这个问题。 thx

fengzhao commented 5 years ago

没必要造轮子啊，我觉得可以考虑在wait_for 中引requests这个库来发http请求啊。这个库是很优秀的库。我觉得你们可以考虑一下这个。

fengzhao commented 5 years ago

我把以前的老虚拟机删了，重新开了虚拟机，环境是这样的。 192.168.8.35 db 中控机 192.168.8.36 pd 192.168.8.37 kv

inventory.ini
## TiDB Cluster Part
[tidb_servers]
192.168.8.35

[tikv_servers]
192.168.8.37

[pd_servers]
192.168.8.36

[spark_master]

[spark_slaves]

[lightning_server]

[importer_server]

## Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
192.168.8.35

[grafana_servers]
192.168.8.35

# node_exporter and blackbox_exporter servers
[monitored_servers]
192.168.8.35
[alertmanager_servers]

[kafka_exporter_servers]

在start.yml中注释了检查[wait for pd-server up]的任务，然后启动，这次都没到pd-server就已经在[wait for node_exporter up]连接超时。手动 curl 访问，postman，浏览器访问。都是正常。下面是详细截图

LinuxGit commented 5 years ago

Could you run following commands and check if there're any errors?

cd /home/tidb/tidb-ansible/library
mv wait_for.py wait_for.py.bak
cd /home/tidb/tidb-ansible/
mv start.yml start.yml.bak
wget https://raw.githubusercontent.com/LinuxGit/tidb-ansible/Louis/wait_for/start.yml
ansible-playbook start.yml -v

I have a commit, use uri module to check http interface, but not merged yet. And I could not reproduce it on TiDB 2.1.

dveeden commented 3 years ago

Looks like this issue went stale and should be closed

tisonkun commented 3 years ago

closed as stale. Thank you for reporting this issue @fengzhao ! Now we're using TiUP to manage TiDB cluster. You can file an issue there if found any question.

fengzhao commented 3 years ago

TiUP也用过了，也是有些小问题，问题一堆。不太稳定。

tisonkun commented 3 years ago

@fengzhao as mentioned above, you can file an issue there if found any concrete question.

BTW, for user question, we host a Chinese forum https://asktug.com/ you may be interested in.

fengzhao commented 3 years ago

@fengzhao as mentioned above, you can file an issue there if found any concrete question.

BTW, for user question, we host a Chinese forum https://asktug.com/ you may be interested in.

anyway，我依然觉得pingcap是个牛逼的公司，你们在做的东西是技术含量很高的东西。持续关注，我这边中英文均可。

pingcap / tidb

Fail to start a TiDB cluster #8671

Bug Report