pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0
1.22k stars 493 forks source link

Blocked when creating a tidb cluster because the state of getting pd failed #154

Closed cwen0 closed 4 years ago

cwen0 commented 5 years ago

the log of tidb-controller-manager

E1030 03:02:59.483099       1 pd_member_manager.go:206] failed to sync TidbCluster: [test1-exp4-cat0-tidb-cluster/tidb-cluster]'s status, error: Get http://tidb-cluster-pd.test1-exp4-cat0-tidb-cluster:2379/pd/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I1030 03:02:59.484234       1 tidb_cluster_control.go:100] tidbcluster: [test1-exp4-cat0-tidb-cluster/tidb-cluster]'s pd cluster is not running.
E1030 03:03:29.483254       1 pd_member_manager.go:206] failed to sync TidbCluster: [test1-exp4-cat0-tidb-cluster/tidb-cluster]'s status, error: Get http://tidb-cluster-pd.test1-exp4-cat0-tidb-cluster:2379/pd/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I1030 03:03:29.484312       1 tidb_cluster_control.go:100] tidbcluster: [test1-exp4-cat0-tidb-cluster/tidb-cluster]'s pd cluster is not running.

but I check the health state of pd is true, I exec this command in the tidb-controller-manager pod

/ # wget -qO- http://tidb-cluster-pd.test1-exp4-cat0-tidb-cluster:2379/pd/health
[
  {
    "name": "tidb-cluster-pd-0",
    "member_id": 2384070400007428437,
    "client_urls": [
      "http://tidb-cluster-pd-0.tidb-cluster-pd-peer.test1-exp4-cat0-tidb-cluster.svc:2379"
    ],
    "health": true
  }
] 
myonlyzzy commented 5 years ago

@cwen0 @weekface I think this not a bug .maybe two reason for this. 1 tidb-controller-manager have a 2s timeout for pdClient,so if pd cann't return response under 2s. tidb-controller-manager will print a context timeout error,pd that deploy on dind maybe need loog time than 2s. 2 tidb-controller-manager start before pd .so maybe pd not start when tidb-controller-manager send request. so deploy tidb cluster before tidb-controller-manager maybe solve this problem.

weekface commented 5 years ago

@myonlyzzy

We haven't found the real reason. But the http.Client has cache about the TCP connections: https://github.com/golang/go/blob/master/src/net/http/client.go#L30

// The Client's Transport typically has internal state (cached TCP
// connections), so Clients should be reused instead of created as
// needed. Clients are safe for concurrent use by multiple goroutines.

So we remove the client pool and run it over and over again, to see if this is the reason.

2 tidb-controller-manager start before pd .so maybe pd not start when tidb-controller-manager send request. so deploy tidb cluster before tidb-controller-manager maybe solve this problem.

PD is created by tidb-controller-manager, it can't started before tidb-controller-manager.

weekface commented 5 years ago

Blocked again this morning.

xiaojingchen commented 5 years ago

i think is golang bug https://github.com/golang/go/issues/23399

cwen0 commented 5 years ago

@xiaojingchen If this a bug of golang and we can reproduce the problem. we should create a new issue on go repo. but I see this pr( https://go-review.googlesource.com/c/go/+/123435) had investigated this problem.

myonlyzzy commented 5 years ago

@weekface @cwen0 This problem that block create tidb cluster happen on laptop dind ? I deploy on my mbp dind also find this problem. But I find the reason is kube-dns container dnsmasq. I change my k8s dns plugin to coredns and then very thing ok. on gce very thing ok. never block. check your dind kube-dns log.

weekface commented 5 years ago

@myonlyzzy Yes, it happens on my laptop DinD env.

I google this error, there are many DNS issues:

But when this happens, i use kubectl exec to login the tidb-controller-manager Pod, it works as expected.

/ # wget -qO- http://tidb-cluster-pd.test1-exp4-cat0-tidb-cluster:2379/pd/health
[
  {
    "name": "tidb-cluster-pd-0",
    "member_id": 2384070400007428437,
    "client_urls": [
      "http://tidb-cluster-pd-0.tidb-cluster-pd-peer.test1-exp4-cat0-tidb-cluster.svc:2379"
    ],
    "health": true
  }
] 
gregwebs commented 5 years ago

so we can fix this by configuring different DNS in our e2e installer?

tennix commented 5 years ago

After changing the HTTP client timeout to 5 seconds, this problem hasn't occurred anymore. Or at least for a week testing. Before that, it happens frequently in our VM chaos testing environment. But we haven't encountered this issue on GKE cluster.

From Go http standard library documentation:

    // Timeout specifies a time limit for requests made by this
    // Client. The timeout includes connection time, any
    // redirects, and reading the response body. The timer remains
    // running after Get, Head, Post, or Do return and will
    // interrupt reading of the Response.Body.

The timeout includes connection time, redirect time. PD internally uses Etcd which will redirect requests to leader if the client connects to a follower member thus increasing the total time. Considering the network in the above environments, DinD and our VM chaos testing environment has poor network, it's slow and unstable (there are logs showing get tso too slow). It may cost more than 2 seconds to retrieve info from PD for tidb-operator. So we can fix this by increasing HTTP client timeout to 5 seconds or maybe longer.

xiaojingchen commented 5 years ago

the error happen again the operator was blocked 15 minutes to update TidbCluster after a node fails

I1219 10:16:33.404008       1 tidbcluster_control.go:68] TidbCluster: [example/example] updated successfully
I1219 10:16:33.501700       1 tidbcluster_control.go:68] TidbCluster: [example3/example3] updated successfully
I1219 10:16:33.502996       1 tidbcluster_control.go:68] TidbCluster: [example/example] updated successfully
E1219 10:17:08.270822       1 pd_member_manager.go:204] failed to sync TidbCluster: [example/example]'s status, error: Get http://example-pd.example:2379/pd/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1219 10:17:08.271488       1 pd_member_manager.go:204] failed to sync TidbCluster: [example2/example2]'s status, error: Get http://example2-pd.example2:2379/pd/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1219 10:17:08.273779       1 pd_member_manager.go:204] failed to sync TidbCluster: [example3/example3]'s status, error: Get http://example3-pd.example3:2379/pd/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1219 10:33:07.388671       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.389658       1 tidbcluster_control.go:71] failed to update TidbCluster: [example3/example3], error: Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example3/tidbclusters/example3: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.390950       1 tidb_cluster_controller.go:239] TidbCluster: example3/example3, sync failed [Get http://example3-pd.example3:2379/pd/api/v1/schedulers: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example3/tidbclusters/example3: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host], requeuing
E1219 10:33:07.390986       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.391006       1 leaderelection.go:224] error retrieving resource lock tidb-operator/tidb-controller-manager: Get https://10.192.0.1:443/api/v1/namespaces/tidb-operator/endpoints/tidb-controller-manager: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.391672       1 tidbcluster_control.go:71] failed to update TidbCluster: [example2/example2], error: Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example2/tidbclusters/example2: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.391869       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.391908       1 tidb_cluster_controller.go:239] TidbCluster: example2/example2, sync failed [Get http://example2-pd.example2:2379/pd/api/v1/schedulers: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example2/tidbclusters/example2: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host], requeuing
E1219 10:33:07.392255       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.392603       1 tidbcluster_control.go:71] failed to update TidbCluster: [example/example], error: Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example/tidbclusters/example: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.392797       1 tidb_cluster_controller.go:239] TidbCluster: example/example, sync failed [Get http://example-pd.example:2379/pd/api/v1/schedulers: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example/tidbclusters/example: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host], requeuing
E1219 10:33:07.390907       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.393319       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
E1219 10:33:07.393464       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host
I1219 10:33:07.393408       1 event.go:218] Event(v1.ObjectReference{Kind:"TidbCluster", Namespace:"example2", Name:"example2", UID:"5dfd02ef-0375-11e9-8476-525400664c3b", APIVersion:"pingcap.com", ResourceVersion:"14758348", FieldPath:""}): type: 'Warning' reason: 'FailedUpdate' update TidbCluster example2 failed error: Put https://10.192.0.1:443/apis/pingcap.com/v1alpha1/namespaces/example2/tidbclusters/example2: read tcp 10.128.42.238:45929->10.192.0.1:443: read: no route to host

the operator was blocked from 10:17:08 until 10:33:07.

aylei commented 4 years ago

Upstream issue: https://github.com/kubernetes/client-go/issues/374 The fix for golang is still pending review: https://go-review.googlesource.com/c/net/+/173952/

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days