openshift / svt

add rosa and hypershift support #762

Closed qiliRedHat closed 1 year ago

qiliRedHat commented 1 year ago

Changes included in this PR:

  1. Replace kubeadmin with a more common 'admin' user
  2. Add generate_auth_files.sh to generate the kubeconfig, admin, and users files for Prow-installed rosa clusters and Flexy-installed clusters
  3. Update the other files to support the changes above
  4. Update README.md to document rosa support
  5. Add a Dockerfile for running reliability tests in a container in the future
  6. A bug fix for cronjob.sh
qiliRedHat commented 1 year ago

generate_auth_files.sh test result: rosa

% ./generate_auth_files.sh     
TOKEN and SERVER are provided for Prow provisioned rosa cluster.
Logged into "https://api.build03.ky4t.p1.openshiftapps.com:6443" as "qili" using the token provided.

You have access to the following projects and can switch between them with 'oc project <projectname>':

  * ci-op-lcfgvbqd
    ci-op-m7ncgdvd

Using project "ci-op-lcfgvbqd".
Now using project "ci-op-m7ncgdvd" on server "https://api.build03.ky4t.p1.openshiftapps.com:6443".
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init), inject-cli (init)
kubeconfig file is created
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init), inject-cli (init)
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init), inject-cli (init)
admin file is created
users file is created
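
The first log line above suggests the script takes a dedicated path when both TOKEN and SERVER are set. A minimal sketch of that branch follows, written in Python for illustration only (the real script is shell). The function name `oc_login_args` is hypothetical, and returning `None` merely stands in for the Flexy-installed path, whose details are not shown in this thread. The `--token` and `--server` flags of `oc login` are the only CLI options assumed.

```python
import os
from typing import List, Mapping, Optional


def oc_login_args(env: Mapping[str, str]) -> Optional[List[str]]:
    """Build the `oc login` argv for a Prow-provisioned rosa cluster.

    Assumption: TOKEN and SERVER are the two inputs, as the log line
    "TOKEN and SERVER are provided" suggests. None means "not the Prow
    rosa path"; what happens then is outside this sketch.
    """
    token = env.get("TOKEN")
    server = env.get("SERVER")
    if token and server:
        return ["oc", "login", f"--token={token}", f"--server={server}"]
    return None


if __name__ == "__main__":
    args = oc_login_args(os.environ)
    # Report only which path was chosen; never echo the token itself.
    print("Prow rosa path" if args else "TOKEN/SERVER not both set")
```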
qiliRedHat commented 1 year ago

Test result on classic rosa - staging https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39801/rehearse-39801-periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-rosa-4.12-reliability-perfscale-reliability-rosa-staging/1669279168463900672

2023-06-15 21:02:31,735 - INFO - Reliability test results:
[Function]               |     Total|    Passed|    Failed|Failure Rate|
-----------------------------------------------------------------------
[delete_all_projects]    |         3|         3|         0|      0.0%|
[check_operators]        |         1|         1|         0|      0.0%|
[oc_task]                |         1|         1|         0|      0.0%|
[verify_project_deletion]|        11|        11|         0|      0.0%|
[kubectl_task]           |         1|         1|         0|      0.0%|
[new_project]            |         5|         5|         0|      0.0%|
[check_all_projects]     |         1|         1|         0|      0.0%|
[cronjob.sh -n 10]       |         1|         1|         0|      0.0%|
[cronjob.sh -c]          |         1|         1|         0|      0.0%|
[new_app]                |         4|         4|         0|      0.0%|
[load_app]               |        40|        40|         0|      0.0%|
[build]                  |         1|         1|         0|      0.0%|
[check_pods]             |         2|         2|         0|      0.0%|
[delete_project]         |         5|         5|         0|      0.0%|
[scale_deployment]       |         4|         4|         0|      0.0%|
[cronjob.sh -d]          |         1|         1|         0|      0.0%|
-----------------------------------------------------------------------
qiliRedHat commented 1 year ago

@svetsa-rh The tests have been run on rosa. Provisioning of rosa hypershift is currently blocked; I will run a test on rosa hypershift once it is unblocked.

svetsa-rh commented 1 year ago

@qiliRedHat For Rosa on prow, tests by default are run using rosa-admin user (not sure if this is tied to user profile), which is granted cluster-admin role. Will your scripts be compatible with it?

Do you have prow runs with these scripts or these changes are necessary before you can try on prow?

qiliRedHat commented 1 year ago

> @qiliRedHat For Rosa on prow, tests by default are run using rosa-admin user (not sure if this is tied to user profile), which is granted cluster-admin role. Will your scripts be compatible with it?
>
> Do you have prow runs with these scripts or these changes are necessary before you can try on prow?

@svetsa-rh Thanks for the info. Yes, the admin user for rosa is rosa-admin, which generate_auth_files.sh retrieves when TOKEN and SERVER are provided. The main purpose of this PR is to handle that case: use a more common admin user instead of the hardcoded kubeadmin, which is used for Flexy-installed clusters. The test result above is from a run against the rosa cluster provisioned by Prow.

svetsa-rh commented 1 year ago

@qiliRedHat Thanks for clarifying, can you share the link to prow job and artifacts - just want to check the logs and verify.

qiliRedHat commented 1 year ago

> @qiliRedHat Thanks for clarifying, can you share the link to prow job and artifacts - just want to check the logs and verify.

@svetsa-rh I have added the Prow job link. The Prow job only runs a wait step, which keeps the cluster from being destroyed while I run the reliability test. But as I said in our meeting yesterday, there is an issue with keeping the wait step running for 7 days: the rehearse job can run for only 4 hours before it times out.

INFO[2023-06-15T09:42:30Z] ci-operator version v20230614-f253d86bf     
...
INFO[2023-06-15T10:43:28Z] Running multi-stage phase test               
INFO[2023-06-15T10:43:28Z] Running step perfscale-reliability-rosa-staging-openshift-qe-perfscale-wait.
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:164","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2023-06-15T13:42:30Z"}
INFO[2023-06-15T13:42:30Z] Received signal.                              signal=interrupt
INFO[2023-06-15T13:42:30Z] error: Process interrupted with signal interrupt, cancelling execution...
INFO[2023-06-15T13:42:30Z] cleanup: Deleting release pod release-images-latest
INFO[2023-06-15T13:42:30Z] Step perfscale-reliability-rosa-staging-openshift-qe-perfscale-wait failed after 2h59m2s.
INFO[2023-06-15T13:42:30Z] cleanup: Deleting pods with label ci.openshift.io/multi-stage-test=perfscale-reliability-rosa-staging
INFO[2023-06-15T13:51:00Z] Step phase test failed after 3h7m32s.
INFO[2023-06-15T13:51:00Z] Running multi-stage phase post
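
The durations in the log above can be cross-checked against its timestamps. The sketch below does that arithmetic; it assumes the first `ci-operator version` line (09:42:30Z) marks the start of the job, which the log does not state explicitly. The helper names `parse_ts` and `fmt` are illustrative only.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp in the form used by the log lines above."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def fmt(delta: timedelta) -> str:
    """Format a duration in the 4h0m0s style the log uses."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h{minutes}m{seconds}s"


job_start = parse_ts("2023-06-15T09:42:30Z")    # assumed job start
step_start = parse_ts("2023-06-15T10:43:28Z")   # wait step begins
timeout_hit = parse_ts("2023-06-15T13:42:30Z")  # entrypoint timeout fires
phase_end = parse_ts("2023-06-15T13:51:00Z")    # test phase reported failed

print(fmt(timeout_hit - job_start))   # 4h0m0s, the entrypoint timeout
print(fmt(timeout_hit - step_start))  # 2h59m2s, the wait step duration
print(fmt(phase_end - step_start))    # 3h7m32s, the test phase duration
```

Under that assumption, the wait step was cut short by the 4-hour limit on the whole job rather than by anything inside the step itself.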
qiliRedHat commented 1 year ago

rosa hypershift -staging https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39801/rehearse-39801-periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-rosa-4.12-reliability-perfscale-reliability-rosa-hypershift-staging/1670663438147260416

2023-06-19 14:25:29,586 - INFO - Reliability test results:
[Function]               |     Total|    Passed|    Failed|Failure Rate|
-----------------------------------------------------------------------
[delete_all_projects]    |         2|         2|         0|      0.0%|
[check_operators]        |         1|         1|         0|      0.0%|
[oc_task]                |         1|         1|         0|      0.0%|
[verify_project_deletion]|         4|         4|         0|      0.0%|
[kubectl_task]           |         1|         1|         0|      0.0%|
[new_project]            |         3|         3|         0|      0.0%|
[cronjob.sh -n 10]       |         1|         1|         0|      0.0%|
[cronjob.sh -c]          |         1|         1|         0|      0.0%|
[new_app]                |         2|         2|         0|      0.0%|
[cronjob.sh -d]          |         1|         1|         0|      0.0%|
[delete_project]         |         3|         3|         0|      0.0%|
[load_app]               |        20|        20|         0|      0.0%|
[scale_deployment]       |         4|         4|         0|      0.0%|
-----------------------------------------------------------------------
qiliRedHat commented 1 year ago

@svetsa-rh Both rosa and rosa hypershift have been tested; this PR is ready for review.

svetsa-rh commented 1 year ago

@qiliRedHat Given the discussion with Sai, should we wait to review/merge this in lieu of using newer version?

qiliRedHat commented 1 year ago

> @qiliRedHat Given the discussion with Sai, should we wait to review/merge this in lieu of using newer version?

Sorry, what do you mean by newer version?

qiliRedHat commented 1 year ago

@svetsa-rh Thanks for the comments! I have made changes to fix the issues you mentioned above; please take a look. As there is no change to the main logic, I will run one more test on classic OCP once you clarify what you meant by 'newer version'.

svetsa-rh commented 1 year ago

> @qiliRedHat Given the discussion with Sai, should we wait to review/merge this in lieu of using newer version?
>
> Sorry, what do you mean by newer version?

@qiliRedHat My bad, I got reliability test and router perf test mixed up. I will review your changes soon.

svetsa-rh commented 1 year ago

> @svetsa-rh Thanks for the comments! I have made changes to fix the issues you mentioned above; please take a look. As there is no change to the main logic, I will run one more test on classic OCP once you clarify what you meant by 'newer version'.

Sure, go ahead and run. I think all my comments are minor.

qiliRedHat commented 1 year ago

@svetsa-rh Test result from classic OCP

2023-06-28 15:09:07,254 - INFO - Reliability test results:
[Function]               |     Total|    Passed|    Failed|Failure Rate|
-----------------------------------------------------------------------
[delete_all_projects]    |         3|         3|         0|      0.0%|
[check_operators]        |         9|         9|         0|      0.0%|
[verify_project_deletion]|         8|         8|         0|      0.0%|
[oc_task]                |         9|         9|         0|      0.0%|
[kubectl_task]           |         9|         9|         0|      0.0%|
[new_project]            |         5|         5|         0|      0.0%|
[check_all_projects]     |         1|         1|         0|      0.0%|
[new_app]                |         4|         4|         0|      0.0%|
[cronjob.sh -n 10]       |         1|         1|         0|      0.0%|
[cronjob.sh -c]          |         4|         4|         0|      0.0%|
[load_app]               |        40|        40|         0|      0.0%|
[build]                  |         1|         1|         0|      0.0%|
[check_pods]             |         2|         2|         0|      0.0%|
[delete_project]         |         5|         5|         0|      0.0%|
[scale_deployment]       |         4|         4|         0|      0.0%|
[cronjob.sh -d]          |         1|         1|         0|      0.0%|
-----------------------------------------------------------------------
qiliRedHat commented 1 year ago

Test on latest code https://redhat-internal.slack.com/archives/C0266JJ4XM5/p1687944106682559?thread_ts=1687941235.083889&cid=C0266JJ4XM5

2023-06-28 09:21:46,528 - INFO - Reliability test results:
[Function]               |     Total|    Passed|    Failed|Failure Rate|
-----------------------------------------------------------------------
[delete_all_projects]    |        16|        16|         0|      0.0%|
[check_operators]        |         1|         1|         0|      0.0%|
[oc_task]                |         1|         1|         0|      0.0%|
[verify_project_deletion]|        64|        64|         0|      0.0%|
[kubectl_task]           |         1|         1|         0|      0.0%|
[new_project]            |        32|        32|         0|      0.0%|
[check_all_projects]     |        16|        16|         0|      0.0%|
[new_app]                |        32|        32|         0|      0.0%|
[load_app]               |       320|       320|         0|      0.0%|
[build]                  |        16|        16|         0|      0.0%|
[check_pods]             |        32|        32|         0|      0.0%|
[delete_project]         |        33|        32|         1|      3.0%|
-----------------------------------------------------------------------
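
The one non-zero failure rate in the table above (`delete_project`: 1 failure in 33 runs, shown as 3.0%) can be re-derived from the row. The sketch below parses a row in this layout and recomputes the rate; the function names are hypothetical, and rounding to one decimal place is an assumption inferred from the displayed values, not taken from the tool's source.

```python
from typing import Tuple


def parse_row(line: str) -> Tuple[str, int, int, int, float]:
    """Split one result row into (name, total, passed, failed, rate_percent)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name = cells[0].strip("[]")
    total, passed, failed = (int(cell) for cell in cells[1:4])
    rate = float(cells[4].rstrip("%"))
    return name, total, passed, failed, rate


def failure_rate(failed: int, total: int) -> float:
    """Failure rate in percent, rounded to one decimal (assumed rounding)."""
    return round(100.0 * failed / total, 1) if total else 0.0


row = "[delete_project]         |        33|        32|         1|      3.0%|"
name, total, passed, failed, reported = parse_row(row)

# 1 / 33 is about 3.03 percent, which rounds to the reported 3.0.
print(name, failure_rate(failed, total), reported)
```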

Negative path

start.sh logs will be saved to start_20230628_092612.log.
mkdir: cannot create directory ‘test’: File exists
Test folder test is created.
Preparing venv.
Python 3.10.8
/root/reliability/reliability-v2-414-haproxy
Generating reliabilty configuration file.
[ERROR] 'utils/path_to_auth_files//admin' is not a valid admin file path. Please provide the absolute path to the folder contains admin file with -p.
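
The error above is what a check on the `-p` folder would produce when given a relative path. A minimal sketch of such a check is shown here in Python for illustration; the real check lives in the shell scripts and may differ. The function name `check_admin_folder` is hypothetical, and requiring both an absolute folder and an existing `admin` file is an assumption based only on the wording of the error message.

```python
import os
import tempfile
from typing import Optional


def check_admin_folder(folder: str) -> Optional[str]:
    """Return an error message if `folder` is unusable, else None.

    Assumed rule: the folder must be an absolute path and must contain
    a regular file named `admin`.
    """
    admin_path = os.path.join(folder, "admin")
    if not os.path.isabs(folder) or not os.path.isfile(admin_path):
        return f"[ERROR] '{admin_path}' is not a valid admin file path."
    return None


if __name__ == "__main__":
    # Negative path: a relative folder, as in the log above.
    print(check_admin_folder("utils/path_to_auth_files/"))

    # Positive path: an absolute temporary folder holding an admin file.
    with tempfile.TemporaryDirectory() as folder:
        open(os.path.join(folder, "admin"), "w").close()
        print(check_admin_folder(folder))
```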
qiliRedHat commented 1 year ago

@svetsa-rh PTAL again with the latest test results.

qiliRedHat commented 1 year ago

@svetsa-rh README format is updated. https://github.com/openshift/svt/blob/9405be683140cb1a3481506aa3d97aeb362045df/reliability-v2/README.md Please have a final review. Thanks for all the comments!

qiliRedHat commented 1 year ago

@svetsa-rh Please give a final review, thanks!

svetsa-rh commented 1 year ago

/lgtm