Support multi-datacenter clusters in test runner and script

rzetelskik commented 6 months ago

Description of your changes: This PR adds support for multi-datacenter clusters in our test runner and scripts.

Which issue is resolved by this Pull Request: Prerequisite for https://github.com/scylladb/scylla-operator/pull/1632.

/kind feature /priority important-longterm /cc

scylla-operator-bot[bot] commented 6 months ago

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/scylladb/scylla-operator/pull/1881): > > >**Description of your changes:** WIP > >**Which issue is resolved by this Pull Request:** >Resolves # > >/kind feature >/priority important-longterm >/cc > Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

rzetelskik commented 6 months ago

/cc zimnx tnozicka

rzetelskik commented 5 months ago

https://github.com/scylladb/scylla-operator/issues/1525#issuecomment-2049733594

https://github.com/scylladb/scylla-operator/issues/1694#issuecomment-2049735439

I don't see how these changes could contribute to flakiness, since it's just some changes in machinery, so I'm not investigating.

/test images /retest

rzetelskik commented 5 months ago

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-parallel 9ff84af link true /test e2e-gke-parallel Full PR test history. Your PR dashboard.

Cluster provisioning failed. /test images /retest

rzetelskik commented 5 months ago

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-serial 61326f2 link true /test e2e-gke-serial Full PR test history. Your PR dashboard.

Cluster provisioning failed. /retest

rzetelskik commented 5 months ago

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-parallel-clusterip d3b46ef link true /test e2e-gke-parallel-clusterip Full PR test history. Your PR dashboard.

Known Scylla Manager flake. Already working on it, this PR is unrelated.

/retest

rzetelskik commented 5 months ago

@zimnx thanks for the review, I replied to all of your comments

rzetelskik commented 5 months ago

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-parallel 66167d5 link true /test e2e-gke-parallel ci/prow/e2e-gke-parallel-clusterip 66167d5 link true /test e2e-gke-parallel-clusterip Full PR test history. Your PR dashboard.

known manager flake for both, not investigating since it's unrelated and we're already working on fixing it /retest

rzetelskik commented 4 months ago

/hold for https://github.com/scylladb/scylla-operator/pull/1901 to rebase

rzetelskik commented 4 months ago

/hold cancel

rzetelskik commented 4 months ago

I'll test the changes against a multi-datacenter e2e after we'll have gone through all review iterations, I takes too long to do this every time /hold

rzetelskik commented 4 months ago

lgtm, but you need to fix the CI failure

https://github.com/scylladb/scylla-operator-release/pull/206 sent a PR to fix the typo on the CI side

tnozicka commented 4 months ago

https://github.com/scylladb/scylla-operator-release/pull/206 landed /retest

tnozicka commented 4 months ago

/lgtm

rzetelskik commented 4 months ago

Waiting for cluster to be provisioned...
Cluster provisioning failed. Exiting.
Missing kubeconfigs.
Usage: /usr/bin/bash kubeconfig [kubeconfig ...]

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command ci/prow/e2e-gke-release-script-latest d03fb8e link unknown /test e2e-gke-release-script-latest Full PR test history. Your PR dashboard.

Waiting for cluster to be provisioned...
Cluster provisioning failed. Exiting.
Missing kubeconfigs.
Usage: /usr/bin/bash kubeconfig [kubeconfig ...]

@tnozicka should I fix this in scylla-operator-release (pass kubeconfig to funcs) or make them discover kubeconfigs here?

tnozicka commented 4 months ago

I'd use the env vars for KUBECONFIGS to match how everything else handles KUBECONFIG. If you get KUBECONFIGS it wins over a KUBECONFIG, if you get KUBECONFIG, translate it to KUBECONFIGS[0] so you can use KUBECONFIGS consistently, if needed.

rzetelskik commented 4 months ago

I'd use the env vars for KUBECONFIGS to match how everything else handles KUBECONFIG. If you get KUBECONFIGS it wins over a KUBECONFIG, if you get KUBECONFIG, translate it to KUBECONFIGS[0] so you can use KUBECONFIGS consistently, if needed.

You can't really pass/test arrays as env vars that way, so best I could do here is to do this for KUBECONFIG_DIR on sourcing e2e lib. Unless you know a reasonable workaround for this.

tnozicka commented 4 months ago

You can't really pass/test arrays as env vars that way, so best I could do here is to do this for KUBECONFIG_DIR on sourcing e2e lib.

sounds good

rzetelskik commented 4 months ago

You can't really pass/test arrays as env vars that way, so best I could do here is to do this for KUBECONFIG_DIR on sourcing e2e lib. sounds good

ok, done

@tnozicka I realised I haven't passed kubeconfigs to the e2e pod in this PR. Should we land this as a starting point regardless? As I'm trying to run a multi-dc e2e test I'll probably bump into some other issues, but I think the baseline for framework etc is solid.

tnozicka commented 4 months ago

I am fine with followups

/approve /lgtm

scylla-operator-bot[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/scylladb/scylla-operator/blob/master/OWNERS)~~ [tnozicka,zimnx] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

rzetelskik commented 4 months ago

/hold cancel

scylladb / scylla-operator

Support multi-datacenter clusters in test runner and script #1881