pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Unsuccessfully trying to deploy TidbCluster using TiDB Operator 1.4.4 on OpenShift #5034

Open plevart opened 1 year ago

plevart commented 1 year ago

Hi,

I'm trying to deploy a TidbCluster resource using TiDB Operator 1.4.4 on OpenShift. I get the following messages repeating in the log of tidb-controller-manager:

W0530 15:50:16.731498 1 warnings.go:67] would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "discovery" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "instana-instrumentation-init", "discovery" must set securityContext.capabilities.drop=["ALL"]), seccompProfile (pod or containers "instana-instrumentation-init", "discovery" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
W0530 15:50:16.742115 1 warnings.go:67] would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "discovery" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "instana-instrumentation-init", "discovery" must set securityContext.capabilities.drop=["ALL"]), seccompProfile (pod or containers "instana-instrumentation-init", "discovery" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
E0530 15:50:21.753426 1 pd_member_manager.go:205] failed to sync TidbCluster: [cpq-db-test/galago]'s status, error: Get "http://galago-pd.cpq-db-test:2379/pd/api/v1/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers), service cpq-db-test/galago-pd has no endpoints
I0530 15:50:21.754140 1 tidb_cluster_controller.go:131] TidbCluster: cpq-db-test/galago, still need sync: TidbCluster: [cpq-db-test/galago], waiting for PD cluster running, requeuing

Nevertheless, the discovery pod gets created and starts up successfully, while the PD pods get created but don't start, due to the following pod event:

MountVolume.SetUp failed for volume "config" : configmap "galago-pd" not found

galago is the name of the TidbCluster, and I don't see a galago-pd ConfigMap being created. So do we have a chicken-and-egg problem here? tidb-controller-manager not creating the ConfigMap because the PD health checks time out, and the PD pods not starting due to the absence of the ConfigMap? Or are these "PodSecurity" warnings causing tidb-controller-manager to bail out prematurely? May I add that, regardless of those "warnings", the discovery pod is created with all of the "missing" securityContext settings and it works (it looks like OpenShift adds them by itself):

spec:
  securityContext:
    seLinuxOptions:
      level: 's0:c38,c27'
    runAsNonRoot: true
    fsGroup: 1001460000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: discovery
      securityContext:
        capabilities:
          drop:
            - ALL
        runAsUser: 1001460000
        allowPrivilegeEscalation: false

Any advice on how to work around this problem, or is TiDB Operator 1.4.4 simply not compatible with OpenShift?

csuzhangxc commented 1 year ago

Could you show your TidbCluster CR? Or could you set an empty config for PD if no special config is needed? e.g. .spec.pd.config: {}

plevart commented 1 year ago

Here's my latest version of the TidbCluster that produces the above problems...

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: galago
  namespace: cpq-db-test
spec:
  timezone: UTC
  version: v6.5.2
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  pvReclaimPolicy: Retain
  startScriptVersion: v2
  imagePullPolicy: IfNotPresent
  podSecurityContext:
    runAsNonRoot: true
  discovery:
    requests:
      cpu: 100m
      memory: 200Mi
    limits:
      cpu: 200m
      memory: 200Mi
  pd:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - pd
            topologyKey: kubernetes.io/hostname
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      cpu: 125m
      memory: 500Mi
      storage: 10Gi
    limits:
      cpu: 250m
      memory: 500Mi
    storageClassName: ocs-storagecluster-ceph-rbd
  tidb:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - tidb
            topologyKey: kubernetes.io/hostname
    maxFailoverCount: 0
    replicas: 3
    requests:
      cpu: 1000m
      memory: 6Gi
    limits:
      cpu: 1000m
      memory: 6Gi
  tikv:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - tikv
            topologyKey: kubernetes.io/hostname
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    scalePolicy:
      scaleInParallelism: 1
      scaleOutParallelism: 1
    requests:
      cpu: 1000m
      memory: 6Gi
      storage: 40Gi
    limits:
      cpu: 1000m
      memory: 6Gi
    storageClassName: ocs-storagecluster-ceph-rbd

I'll try with an empty pd and follow up with a report.

plevart commented 1 year ago

Well, an empty pd would not make sense here. We need 3 replicas, we need to provide the correct storageClassName, and we need to disperse instances across cluster nodes. That's about it, so there is nothing left to remove from the pd section. I don't think these settings have anything to do with the problem at hand, do you?

csuzhangxc commented 1 year ago

I mean, .spec.pd.config = {}, but not .spec.pd = {}

plevart commented 1 year ago

Well, that is already not specified in the TidbCluster I'm trying to use above. There is no .spec.pd.config attribute. Do you mean I should explicitly set it to an empty object {}?

csuzhangxc commented 1 year ago

YES, explicitly set it to an empty object {}.

plevart commented 1 year ago

Yes, it worked. PD pods are running now! Cool. Let's see what follows... Should I do the same for tidb and tikv pods too?

plevart commented 1 year ago

It seems tikv pods have the same problem. So should I set .spec.tikv.config = {} too?

csuzhangxc commented 1 year ago

Yes, it's better to set .config to {} explicitly if no special config items are needed.
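For example, applied to the CR you posted above (only the config fields shown; everything else stays as is):

spec:
  pd:
    config: {}
  tikv:
    config: {}
  tidb:
    config: {}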

plevart commented 1 year ago

The operator is now performing a rolling restart of the tikv pods (old ones that never started up are being terminated and new ones are replacing them). Let's see if the cluster comes up after all...

plevart commented 1 year ago

The tikv pods are now running, but the tidb pods have not been created at all yet. Here's the .status of the TidbCluster:

{
  "clusterID": "7239252567293204966",
  "conditions": [
    {
      "lastTransitionTime": "2023-05-31T07:56:41Z",
      "lastUpdateTime": "2023-05-31T08:09:47Z",
      "message": "TiDB(s) are not healthy",
      "reason": "TiDBUnhealthy",
      "status": "False",
      "type": "Ready"
    }
  ],
  "pd": {
    "image": "pingcap/pd:v6.5.2",
    "leader": {
      "clientURL": "http://galago-pd-0.galago-pd-peer.cpq-db-test.svc:2379",
      "health": true,
      "id": "483099124998360962",
      "lastTransitionTime": "2023-05-31T07:57:17Z",
      "name": "galago-pd-0"
    },
    "members": {
      "galago-pd-0": {
        "clientURL": "http://galago-pd-0.galago-pd-peer.cpq-db-test.svc:2379",
        "health": true,
        "id": "483099124998360962",
        "lastTransitionTime": "2023-05-31T07:57:17Z",
        "name": "galago-pd-0"
      },
      "galago-pd-1": {
        "clientURL": "http://galago-pd-1.galago-pd-peer.cpq-db-test.svc:2379",
        "health": true,
        "id": "8271775992873509316",
        "lastTransitionTime": "2023-05-31T07:57:24Z",
        "name": "galago-pd-1"
      },
      "galago-pd-2": {
        "clientURL": "http://galago-pd-2.galago-pd-peer.cpq-db-test.svc:2379",
        "health": true,
        "id": "11500123880375400105",
        "lastTransitionTime": "2023-05-31T07:57:24Z",
        "name": "galago-pd-2"
      }
    },
    "phase": "Normal",
    "statefulSet": {
      "collisionCount": 0,
      "currentReplicas": 3,
      "currentRevision": "galago-pd-6b6ff75544",
      "observedGeneration": 1,
      "readyReplicas": 3,
      "replicas": 3,
      "updateRevision": "galago-pd-6b6ff75544",
      "updatedReplicas": 3
    },
    "synced": true,
    "volumes": {
      "pd": {
        "boundCount": 3,
        "currentCapacity": "10Gi",
        "currentCount": 3,
        "currentStorageClass": "ocs-storagecluster-ceph-rbd",
        "modifiedCapacity": "10Gi",
        "modifiedCount": 3,
        "modifiedStorageClass": "ocs-storagecluster-ceph-rbd",
        "name": "pd",
        "resizedCapacity": "10Gi",
        "resizedCount": 3
      }
    }
  },
  "pump": {},
  "ticdc": {},
  "tidb": {
    "image": "pingcap/tidb:v6.5.2",
    "phase": "Normal",
    "statefulSet": {
      "collisionCount": 0,
      "currentRevision": "galago-tidb-74686d487b",
      "observedGeneration": 1,
      "replicas": 0,
      "updateRevision": "galago-tidb-74686d487b"
    }
  },
  "tiflash": {},
  "tikv": {
    "bootStrapped": true,
    "image": "pingcap/tikv:v6.5.2",
    "phase": "Normal",
    "statefulSet": {
      "collisionCount": 0,
      "currentReplicas": 3,
      "currentRevision": "galago-tikv-84d4c8959",
      "observedGeneration": 4,
      "readyReplicas": 3,
      "replicas": 3,
      "updateRevision": "galago-tikv-84d4c8959",
      "updatedReplicas": 3
    },
    "stores": {
      "1": {
        "id": "1",
        "ip": "galago-tikv-2.galago-tikv-peer.cpq-db-test.svc",
        "lastTransitionTime": "2023-05-31T08:04:47Z",
        "leaderCount": 1,
        "podName": "galago-tikv-2",
        "state": "Up"
      },
      "4": {
        "id": "4",
        "ip": "galago-tikv-1.galago-tikv-peer.cpq-db-test.svc",
        "lastTransitionTime": "2023-05-31T08:07:06Z",
        "leaderCount": 0,
        "podName": "galago-tikv-1",
        "state": "Up"
      },
      "6": {
        "id": "6",
        "ip": "galago-tikv-0.galago-tikv-peer.cpq-db-test.svc",
        "lastTransitionTime": "2023-05-31T08:09:47Z",
        "leaderCount": 0,
        "podName": "galago-tikv-0",
        "state": "Up"
      }
    },
    "synced": true,
    "volumes": {
      "tikv": {
        "boundCount": 3,
        "currentCapacity": "40Gi",
        "currentCount": 3,
        "currentStorageClass": "ocs-storagecluster-ceph-rbd",
        "modifiedCapacity": "40Gi",
        "modifiedCount": 3,
        "modifiedStorageClass": "ocs-storagecluster-ceph-rbd",
        "name": "tikv",
        "resizedCapacity": "40Gi",
        "resizedCount": 3
      }
    }
  },
  "tiproxy": {}
}

plevart commented 1 year ago

Ok, the reason for the tidb pods not appearing seems to be the following: the StatefulSet for the tidb pods has been created and is now emitting the following event:

create Pod galago-tidb-0 in StatefulSet galago-tidb failed error: pods "galago-tidb-0" is forbidden: failed quota: cpq-db-test-quota: must specify limits.cpu for: slowlog; limits.memory for: slowlog; requests.cpu for: slowlog; requests.memory for: slowlog

So I have to configure requests/limits for the slowlog container. Let's see...
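If I read the CRD correctly, the slowlog sidecar resources go under spec.tidb.slowLogTailer, so something like this (the values are just placeholders to satisfy the quota):

spec:
  tidb:
    slowLogTailer:
      requests:
        cpu: 20m
        memory: 50Mi
      limits:
        cpu: 100m
        memory: 50Mi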

plevart commented 1 year ago

It seems to be working. Now I just have to tune the tikv/pd/tidb requests/limits to squeeze them into the per-namespace quota imposed by the OpenShift admin. It seems he gave me 24G and not the 24Gi of memory that I requested :-( ...
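(For the record: 24G = 24 × 10⁹ bytes ≈ 22.35Gi, while 24Gi = 24 × 2³⁰ bytes ≈ 25.77G, so the quota falls roughly 1.6Gi short of what I asked for.)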

plevart commented 1 year ago

I have managed to deploy the TidbCluster. Thank you very much @csuzhangxc for the assistance. Out of curiosity, what is the difference between not specifying .config and specifying it as an empty object {}? Do different defaults get used in either case? I can see that an empty {} actually generates an empty config-file entry in the ConfigMap. Is not specifying .config any different? Oh, perhaps this is what causes the controller to fail?

csuzhangxc commented 1 year ago

In the current implementation, the ConfigMap is only created if .config is set. In the future, we may create an empty ConfigMap even if it is not set.
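As a rough sketch, with .spec.pd.config: {} the rendered ConfigMap looks something like this (showing only the config-file entry you mentioned; other generated entries omitted):

apiVersion: v1
kind: ConfigMap
metadata:
  name: galago-pd
  namespace: cpq-db-test
data:
  config-file: ""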

plevart commented 1 year ago

That makes sense now. Thanks!