pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Improve e2e test framework #3652

Open dragonly opened 3 years ago

dragonly commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe:

Currently the e2e cases are flaky and hard to debug, which slows down development, for example by blocking PR merges. We should Make E2E Test Great Again (MEGA)!

Describe the feature you'd like:

A stable, debuggable, easy to write, understandable e2e test framework with no surprises.

Describe alternatives you've considered:

None.

Teachability, Documentation, Adoption, Migration Strategy:

The work should be done in PRs that are as small as possible. We can refer to "Writing good e2e tests for Kubernetes".

Here is the proposed todo list:

shonge commented 3 years ago

We need to print the output of kubectl get po -A when an e2e test exits with an error.
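A minimal Go sketch of this request, assuming the suite can tell whether the current test failed. The names runner, execRunner, and podDumpOnFailure are hypothetical and are not part of tidb-operator; the only thing taken from this thread is the kubectl get po -A command itself.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runner abstracts command execution so the hook can be exercised without a cluster.
// Hypothetical helper type, not part of tidb-operator.
type runner func(name string, args ...string) ([]byte, error)

// execRunner runs the real command and returns stdout and stderr combined.
func execRunner(name string, args ...string) ([]byte, error) {
	return exec.Command(name, args...).CombinedOutput()
}

// podDumpOnFailure returns the output of `kubectl get po -A` under a header,
// but only when the test has failed. On success it does nothing.
func podDumpOnFailure(failed bool, run runner) (string, error) {
	if !failed {
		return "", nil
	}
	out, err := run("kubectl", "get", "po", "-A")
	// Keep whatever was captured, since partial output can still help debugging.
	dump := fmt.Sprintf("==== kubectl get po -A ====\n%s", out)
	if err != nil {
		return dump, fmt.Errorf("kubectl get po -A: %w", err)
	}
	return dump, nil
}
```

In a real suite this would probably be called from an after-each hook with execRunner as the runner and the result written to the test log. With Ginkgo v1, the failed flag could come from ginkgo.CurrentGinkgoTestDescription().Failed, though how the existing framework wires this up is not shown here.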

dragonly commented 3 years ago

> We need to print the output of kubectl get po -A when an e2e test exits with an error.

@shonge Could you PTAL at DumpAllNamespaceInfo? It is called in every ginkgo.Describe, for example at https://github.com/pingcap/tidb-operator/blob/146b2645a95d7d8e884b3620372aaed7d828d9cc/tests/e2e/tidbcluster/tidbcluster.go#L74

DumpAllNamespaceInfo calls DumpAllPodInfoForNamespace, which in turn calls LogPodStates. Here is sample output from a failed Jenkins CI job:

Dec 28 16:31:56.763: INFO: POD NODE PHASE GRACE CONDITIONS
Dec 28 16:31:56.763: INFO: basic-v3-discovery-7b48dc7996-qrng5 tidb-operator-worker3 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:11 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:13 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:13 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:11 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-pd-0 tidb-operator-worker Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:03:47 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:03:48 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:03:48 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:03:47 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-pd-1 tidb-operator-worker2 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:02:33 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:03:04 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:03:04 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:02:33 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-pd-2 tidb-operator-worker3 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:01:59 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:02:01 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:02:01 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:01:59 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tidb-0 tidb-operator-worker2 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:57 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:00:13 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:00:13 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:57 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tidb-1 tidb-operator-worker3 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:57 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:00:13 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:00:13 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:57 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tikv-0 tidb-operator-worker2 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:44 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:46 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:46 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:44 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tikv-1 tidb-operator-worker3 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:51 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:54 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:54 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:51 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tikv-2 tidb-operator-worker Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:49 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:52 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:52 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 15:59:49 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tikv-3 tidb-operator-worker Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:04:12 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:25:36 +0000 UTC ContainersNotReady containers with unready status: [tikv]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:25:36 +0000 UTC ContainersNotReady containers with unready status: [tikv]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:04:12 +0000 UTC }]
Dec 28 16:31:56.763: INFO: basic-v3-tikv-4 tidb-operator-worker2 Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:14:27 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:25:09 +0000 UTC ContainersNotReady containers with unready status: [tikv]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:25:09 +0000 UTC ContainersNotReady containers with unready status: [tikv]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-28 16:14:27 +0000 UTC }]

If this is not what you want, could you please describe exactly the message you are interested in for a failed case?
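In the sample output above, only basic-v3-tikv-3 and basic-v3-tikv-4 report Ready False, which is easy to miss in the full dump. As a rough illustration of a more targeted message, here is a Go sketch that lists only the pods whose Ready condition is false. The pod and condition types are simplified stand-ins invented for this example, not the Kubernetes API types, and unreadyPods is a hypothetical helper.

```go
package main

import "fmt"

// condition is a simplified stand-in for a pod condition (hypothetical type).
type condition struct {
	Type    string // e.g. "Ready", "ContainersReady"
	Status  bool   // true when the condition status is True
	Message string // human-readable detail, empty when the condition holds
}

// pod is a simplified stand-in for the columns LogPodStates prints (hypothetical type).
type pod struct {
	Name       string
	Node       string
	Phase      string
	Conditions []condition
}

// unreadyPods returns one summary line per pod whose Ready condition is reported as false.
// Pods that carry no Ready condition at all are skipped in this sketch.
func unreadyPods(pods []pod) []string {
	var lines []string
	for _, p := range pods {
		for _, c := range p.Conditions {
			if c.Type == "Ready" && !c.Status {
				lines = append(lines, fmt.Sprintf("%s on %s: %s", p.Name, p.Node, c.Message))
			}
		}
	}
	return lines
}
```

Applied to the dump above, a filter like this would reduce the eleven rows to the two tikv pods that are not ready, each with its "containers with unready status" message.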

DanielZhangQD commented 3 years ago

In some cases the TiDB clusters are deployed with Helm; we should change them to use CRs. Some backup & restore cases also use the Helm chart and should likewise be changed to use CRs.