pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0
1.22k stars 493 forks source link

PD url in discovery service need to be fix #4933

Open wxiaomou opened 1 year ago

wxiaomou commented 1 year ago

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.17", GitCommit:"a7736eaf34d823d7652415337ac0ad06db9167fc", GitTreeState:"clean", BuildDate:"2022-12-08T11:41:04Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.16-airbnb0", GitCommit:"3051216402e96974819f485e35143cbe2fdf43a8", GitTreeState:"clean", BuildDate:"2023-02-07T20:29:26Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

TiDB Operator Version: version.Info{GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"d9f634343ef71926b43c85df17458fb55dc955b8", GitTreeState:"clean", BuildDate:"2023-02-15T05:18:23Z", GoVersion:"go1.18.10", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

N/A What's the status of the TiDB cluster pods?

N/A What did you do?

https://github.com/pingcap/tidb-operator/blob/master/pkg/discovery/discovery.go#L271 need to be fixed and include not only PeerMembers but Members as well.

And for https://github.com/pingcap/tidb-operator/blob/master/pkg/discovery/discovery.go#L260, it needs to actually connect to the pd and fetch the member instead of just construct a url from spec

What did you expect to see?

What did you see instead? In the discovery.go when it's generating the list of PD urls it only includes the PeerMembers but not local member. We are deploying tidb across 3 k8s clusters. Does it make sense to include the local PD in the VerifyPDEndpoint response?

var returnPDMember string returnPDMembers := []string{pdURL} for _, peerPDMember := range tc.Status.PD.PeerMembers { if peerPDMember.Health { if len(pdEndpoint.scheme) == 0 { peerPDEndpoint := parsePDURL(peerPDMember.ClientURL) returnPDMember = fmt.Sprintf("%s:%s", peerPDEndpoint.pdMemberName, peerPDEndpoint.pdMemberPort) } else { returnPDMember = peerPDMember.ClientURL } returnPDMembers = append(returnPDMembers, returnPDMember) } }

Also we have Heterogeneous tidb and tikv cluster, currently discovery is generating to url from spec, this gave us wrong pd url in our use case. Does it make sense to actually connect to PD and fetch the PD member?

// if local pd doesn't exist, return target cluster pd peer addr if tc.Heterogeneous() && tc.WithoutLocalPD() { addr := controller.PDPeerFullyDomain(tc.Spec.Cluster.Name, tc.Spec.Cluster.Namespace, tc.Spec.Cluster.ClusterDomain) if pdEndpoint.scheme != "" { addr = fmt.Sprintf("%s://%s", pdEndpoint.scheme, addr) } addr = addr + ":" + pdEndpoint.pdMemberPort return addr, nil }

csuzhangxc commented 1 year ago

https://github.com/pingcap/tidb-operator/blob/master/pkg/discovery/discovery.go#L271 need to be fixed and include not only PeerMembers but Members as well.

Members should already be included in returnPDMembers := []string{pdURL} where pdURL is the URL to the local Service.


And for https://github.com/pingcap/tidb-operator/blob/master/pkg/discovery/discovery.go#L260, it needs to actually connect to the pd and fetch the member instead of just construct a url from spec this gave us wrong pd url in our use case

Can you show the case with the wrong PD URL? Yeah, connecting and fetching members may be better.

wxiaomou commented 1 year ago

Members should already be included in returnPDMembers := []string{pdURL} where pdURL is the URL to the local Service.

For example, we are running tidb in 3 k8s cluster 1a, 1b and 1e. This is from 1e. When render the TiKV start script it uses https://mussel-prod-pd:2379 when it call verifyPdendPoint which is the pdUrl in returnPDMembers := []string{pdURL}

/tikv-server --pd=https://mussel-prod-pd:2379,https://mussel-prod-pd-0.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1a.mussel-prod.tidb.musta.ch:2379,https://mussel-prod-pd-0.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1b.mussel-prod.tidb.musta.ch:2379,https://mussel-prod-pd-1.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1a.mussel-prod.tidb.musta.ch:2379,https://mussel-prod-pd-1.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1b.mussel-prod.tidb.musta.ch:2379 --advertise-addr=mussel-prod-tikv-0.mussel-prod-tikv-peer.tidb-mussel-prod.svc.us-east-1e.mussel-prod.tidb.musta.ch:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=mussel-prod-tikv-0.mussel-prod-tikv-peer.tidb-mussel-prod.svc.us-east-1e.mussel-prod.tidb.musta.ch:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml

@csuzhangxc

wxiaomou commented 1 year ago

Can you show the case with the wrong PD URL? Yeah, connecting and fetching members may be better. For example we have a heterogeneous tidb cluster which just contains tidb instance. So we have two tc in this case

  1. mussel-prod
  2. mussel-prod-heterogeneous

addr := controller.PDPeerFullyDomain(tc.Spec.Cluster.Name, tc.Spec.Cluster.Namespace, tc.Spec.Cluster.ClusterDomain)

so the tc.Spec.Cluster.Name is mussel-prod-heterogeneous which is wrong since PD is in tc mussel-prod. Does this make sense to you @csuzhangxc ?

csuzhangxc commented 1 year ago

Members should already be included in returnPDMembers := []string{pdURL} where pdURL is the URL to the local Service.

For example, we are running tidb in 3 k8s cluster 1a, 1b and 1e. This is from 1e. When render the TiKV start script it uses https://mussel-prod-pd:2379 when it call verifyPdendPoint which is the pdUrl in returnPDMembers := []string{pdURL}

/tikv-server --pd=https://mussel-prod-pd:2379,https://mussel-prod-pd-0.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1a.mussel-prod.tidb.musta.ch:2379,https://mussel-prod-pd-0.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1b.mussel-prod.tidb.musta.ch:2379,https://mussel-prod-pd-1.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1a.mussel-prod.tidb.musta.ch:2379,https://mussel-prod-pd-1.mussel-prod-pd-peer.tidb-mussel-prod.svc.us-east-1b.mussel-prod.tidb.musta.ch:2379 --advertise-addr=mussel-prod-tikv-0.mussel-prod-tikv-peer.tidb-mussel-prod.svc.us-east-1e.mussel-prod.tidb.musta.ch:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=mussel-prod-tikv-0.mussel-prod-tikv-peer.tidb-mussel-prod.svc.us-east-1e.mussel-prod.tidb.musta.ch:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml

@csuzhangxc

is there any problem when using https://mussel-prod-pd:2379 in 1e (for its local member)?

csuzhangxc commented 1 year ago

Can you show the case with the wrong PD URL? Yeah, connecting and fetching members may be better. For example we have a heterogeneous tidb cluster which just contains tidb instance. So we have two tc in this case

  1. mussel-prod
  2. mussel-prod-heterogeneous

addr := controller.PDPeerFullyDomain(tc.Spec.Cluster.Name, tc.Spec.Cluster.Namespace, tc.Spec.Cluster.ClusterDomain)

so the tc.Spec.Cluster.Name is mussel-prod-heterogeneous which is wrong since PD is in tc mussel-prod. Does this make sense to you @csuzhangxc ?

you mean, tc.Spec.Cluster.Name in mussel-prod is mussel-prod-heterogeneous, but not tc.Spec.Cluster.Name in mussel-prod-heterogeneous is mussel-prod?

if PD is in mussel-prod: