plevart opened 1 year ago
Could you show your TidbCluster CR? Or could you set an empty config for PD if no special config is needed? e.g. .spec.pd.config: {}
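For example, something along these lines in the TidbCluster spec (a minimal sketch; only the config line is the suggested change, the replicas line just stands in for whatever your pd section already contains):

```yaml
spec:
  pd:
    # explicitly empty PD config (no special config items needed)
    config: {}
    replicas: 3
```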
Here's my latest version of the TidbCluster that produces the problems described above...
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: galago
  namespace: cpq-db-test
spec:
  timezone: UTC
  version: v6.5.2
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  pvReclaimPolicy: Retain
  startScriptVersion: v2
  imagePullPolicy: IfNotPresent
  podSecurityContext:
    runAsNonRoot: true
  discovery:
    requests:
      cpu: 100m
      memory: 200Mi
    limits:
      cpu: 200m
      memory: 200Mi
  pd:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
              - pd
          topologyKey: kubernetes.io/hostname
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      cpu: 125m
      memory: 500Mi
      storage: 10Gi
    limits:
      cpu: 250m
      memory: 500Mi
    storageClassName: ocs-storagecluster-ceph-rbd
  tidb:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
              - tidb
          topologyKey: kubernetes.io/hostname
    maxFailoverCount: 0
    replicas: 3
    requests:
      cpu: 1000m
      memory: 6Gi
    limits:
      cpu: 1000m
      memory: 6Gi
  tikv:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
              - tikv
          topologyKey: kubernetes.io/hostname
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    scalePolicy:
      scaleInParallelism: 1
      scaleOutParallelism: 1
    requests:
      cpu: 1000m
      memory: 6Gi
      storage: 40Gi
    limits:
      cpu: 1000m
      memory: 6Gi
    storageClassName: ocs-storagecluster-ceph-rbd
I'll try with an empty pd and follow up with a report.
Well, an empty pd would not make sense here. We need 3 replicas, we need to provide the correct storageClassName, and we need to disperse the instances across cluster nodes. And that's about it; there is nothing left to remove from the pd config. I don't think these settings have anything to do with the problem at hand, do you?
I mean .spec.pd.config = {}, but not .spec.pd = {}.
Well, that is already not specified in the TidbCluster I'm trying to use above; there is no .spec.pd.config attribute. Do you mean I should explicitly set it to an empty object {}?
YES, explicitly set it to an empty object {}.
Yes, it worked. PD pods are running now! Cool. Let's see what follows... Should I do the same for tidb and tikv pods too?
It seems tikv pods have the same problem. So should I set .spec.tikv.config = {} too?
Yes, it's better to set .config to {} explicitly if no special config items are needed.
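For reference, a sketch of the same pattern applied to the other components of the CR posted above (everything else left as it is):

```yaml
spec:
  tikv:
    config: {}   # explicitly empty TiKV config
  tidb:
    config: {}   # explicitly empty TiDB config
```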
The operator is now performing a rolling restart of the tikv pods (the old ones that never started up are being terminated and new ones are replacing them). Let's see if the cluster comes up after all...
The tikv pods are now running, but the tidb pods have not been created at all yet. Here's the .status of the TidbCluster:
{
  "clusterID": "7239252567293204966",
  "conditions": [
    {
      "lastTransitionTime": "2023-05-31T07:56:41Z",
      "lastUpdateTime": "2023-05-31T08:09:47Z",
      "message": "TiDB(s) are not healthy",
      "reason": "TiDBUnhealthy",
      "status": "False",
      "type": "Ready"
    }
  ],
  "pd": {
    "image": "pingcap/pd:v6.5.2",
    "leader": {
      "clientURL": "http://galago-pd-0.galago-pd-peer.cpq-db-test.svc:2379",
      "health": true,
      "id": "483099124998360962",
      "lastTransitionTime": "2023-05-31T07:57:17Z",
      "name": "galago-pd-0"
    },
    "members": {
      "galago-pd-0": {
        "clientURL": "http://galago-pd-0.galago-pd-peer.cpq-db-test.svc:2379",
        "health": true,
        "id": "483099124998360962",
        "lastTransitionTime": "2023-05-31T07:57:17Z",
        "name": "galago-pd-0"
      },
      "galago-pd-1": {
        "clientURL": "http://galago-pd-1.galago-pd-peer.cpq-db-test.svc:2379",
        "health": true,
        "id": "8271775992873509316",
        "lastTransitionTime": "2023-05-31T07:57:24Z",
        "name": "galago-pd-1"
      },
      "galago-pd-2": {
        "clientURL": "http://galago-pd-2.galago-pd-peer.cpq-db-test.svc:2379",
        "health": true,
        "id": "11500123880375400105",
        "lastTransitionTime": "2023-05-31T07:57:24Z",
        "name": "galago-pd-2"
      }
    },
    "phase": "Normal",
    "statefulSet": {
      "collisionCount": 0,
      "currentReplicas": 3,
      "currentRevision": "galago-pd-6b6ff75544",
      "observedGeneration": 1,
      "readyReplicas": 3,
      "replicas": 3,
      "updateRevision": "galago-pd-6b6ff75544",
      "updatedReplicas": 3
    },
    "synced": true,
    "volumes": {
      "pd": {
        "boundCount": 3,
        "currentCapacity": "10Gi",
        "currentCount": 3,
        "currentStorageClass": "ocs-storagecluster-ceph-rbd",
        "modifiedCapacity": "10Gi",
        "modifiedCount": 3,
        "modifiedStorageClass": "ocs-storagecluster-ceph-rbd",
        "name": "pd",
        "resizedCapacity": "10Gi",
        "resizedCount": 3
      }
    }
  },
  "pump": {},
  "ticdc": {},
  "tidb": {
    "image": "pingcap/tidb:v6.5.2",
    "phase": "Normal",
    "statefulSet": {
      "collisionCount": 0,
      "currentRevision": "galago-tidb-74686d487b",
      "observedGeneration": 1,
      "replicas": 0,
      "updateRevision": "galago-tidb-74686d487b"
    }
  },
  "tiflash": {},
  "tikv": {
    "bootStrapped": true,
    "image": "pingcap/tikv:v6.5.2",
    "phase": "Normal",
    "statefulSet": {
      "collisionCount": 0,
      "currentReplicas": 3,
      "currentRevision": "galago-tikv-84d4c8959",
      "observedGeneration": 4,
      "readyReplicas": 3,
      "replicas": 3,
      "updateRevision": "galago-tikv-84d4c8959",
      "updatedReplicas": 3
    },
    "stores": {
      "1": {
        "id": "1",
        "ip": "galago-tikv-2.galago-tikv-peer.cpq-db-test.svc",
        "lastTransitionTime": "2023-05-31T08:04:47Z",
        "leaderCount": 1,
        "podName": "galago-tikv-2",
        "state": "Up"
      },
      "4": {
        "id": "4",
        "ip": "galago-tikv-1.galago-tikv-peer.cpq-db-test.svc",
        "lastTransitionTime": "2023-05-31T08:07:06Z",
        "leaderCount": 0,
        "podName": "galago-tikv-1",
        "state": "Up"
      },
      "6": {
        "id": "6",
        "ip": "galago-tikv-0.galago-tikv-peer.cpq-db-test.svc",
        "lastTransitionTime": "2023-05-31T08:09:47Z",
        "leaderCount": 0,
        "podName": "galago-tikv-0",
        "state": "Up"
      }
    },
    "synced": true,
    "volumes": {
      "tikv": {
        "boundCount": 3,
        "currentCapacity": "40Gi",
        "currentCount": 3,
        "currentStorageClass": "ocs-storagecluster-ceph-rbd",
        "modifiedCapacity": "40Gi",
        "modifiedCount": 3,
        "modifiedStorageClass": "ocs-storagecluster-ceph-rbd",
        "name": "tikv",
        "resizedCapacity": "40Gi",
        "resizedCount": 3
      }
    }
  },
  "tiproxy": {}
}
Ok, the reason for the tidb pods not appearing seems to be the following: the StatefulSet for the tidb pods has been created and is now emitting this event:
create Pod galago-tidb-0 in StatefulSet galago-tidb failed error: pods "galago-tidb-0" is forbidden: failed quota: cpq-db-test-quota: must specify limits.cpu for: slowlog; limits.memory for: slowlog; requests.cpu for: slowlog; requests.memory for: slowlog
So I have to configure resource requests/limits for the slowlog container. Let's see...
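If I'm reading the TidbCluster spec correctly, the slowlog sidecar resources go under .spec.tidb.slowLogTailer; a sketch with placeholder values (the field name and the numbers are my assumption, to be verified against the CRD version in use):

```yaml
spec:
  tidb:
    slowLogTailer:
      # resources for the slowlog sidecar container, so the namespace
      # ResourceQuota no longer rejects the tidb pods (values are placeholders)
      requests:
        cpu: 20m
        memory: 50Mi
      limits:
        cpu: 100m
        memory: 50Mi
```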
It seems to be working. Now I just have to tune the tikv/pd/tidb requests/limits to squeeze them into the per-namespace quota imposed by the OpenShift admin. It seems he gave me 24G and not the 24Gi of memory that I requested :-( ...
I have managed to deploy the TidbCluster. Thank you very much, @csuzhangxc, for the assistance. Out of curiosity, what is the difference between not specifying .config and specifying it as an empty object {}? Do different defaults get used in either case? I can see that an empty {} actually generates an empty config-file entry in the ConfigMap. Is not specifying .config any different? Oh, perhaps this is what causes the controller to fail?
In the current implementation, the ConfigMap is only created if .config is set. In fact, we may still create an empty ConfigMap if it is not set.
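For illustration, with .spec.pd.config: {} the rendered PD ConfigMap ends up with an empty config-file entry, roughly like this (the real name carries a hash suffix, and other generated entries are omitted here):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: galago-pd   # actual name includes a hash suffix
  namespace: cpq-db-test
data:
  config-file: ""   # empty because .spec.pd.config is {}
```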
That makes sense now. Thanks!
Hi,
I'm trying to deploy a TidbCluster resource using TiDB Operator 1.4.4 on OpenShift. I get the following messages repeating in the log of tidb-controller-manager:
Nevertheless, the discovery pod gets created and starts up successfully, while the pd pods get created but don't start due to the following pod event:
galago is the name of the TidbCluster, and I don't see a galago-pd ConfigMap created. So do we have a chicken-and-egg problem here? Is tidb-controller-manager not creating the ConfigMap because the pd health checks time out, while the pd pods don't start because the ConfigMap is absent? Or are these "PodSecurity" warnings causing tidb-controller-manager to bail out prematurely? May I add that, regardless of those "warnings", the discovery pod is created with all this "missing" securityContext stuff and it works (it looks like OpenShift adds those by itself):
Any advice on how to work around this problem, or is TiDB Operator 1.4.4 simply not compatible with OpenShift?