keycloak / keycloak-benchmark

Keycloak Benchmark
https://www.keycloak.org/keycloak-benchmark/
Apache License 2.0

Daily scheduled cluster deletion job fails #811

Closed andyuk1986 closed 2 months ago

andyuk1986 commented 2 months ago

Closes #810

Please find the successful run with these changes here: https://github.com/andyuk1986/keycloak-benchmark/actions/runs/9070651045

ryanemerson commented 2 months ago

Thanks for the PR @andyuk1986. Looking at the logs of your successful run, it seems like `rosa logs` is still returning a non-zero exit code due to a 404 for the cluster; however, your action passes because this exit code is swallowed by the `wait` command. From the docs:

If n is not given, all currently active child processes are waited for, and the return status is zero.

My issue with using wait for this purpose is that we now have the output of destroy.sh and rosa logs combined, instead of being output sequentially. I think a simpler solution is to ensure that rosa logs always returns a 0 exit code, e.g.

```
rosa logs uninstall --debug -c ${CLUSTER_NAME} > "$(custom_date)_delete-cluster.log" || true
```
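The `wait` behaviour quoted above can be demonstrated with a short standalone script: `wait <pid>` propagates the child's exit code, while a bare `wait` always returns 0 regardless of how the children exited.

```shell
# Stand-in for a failing `rosa logs` call: a subshell exiting non-zero.
(exit 42) &
wait "$!"
status_with_pid=$?          # 42: the child's real exit code is propagated

(exit 42) &
wait                        # no argument: waits for all children
status_without_pid=$?       # 0: the failure is swallowed

echo "with pid: ${status_with_pid}, without pid: ${status_without_pid}"
```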
andyuk1986 commented 2 months ago

@ryanemerson the thing is that, with the simplest solution, we will never get the uninstall logs, because the uninstalled cluster is no longer there; that's why the error was being thrown. When the actions run sequentially, the cluster is deleted first, and then we try to get the uninstall logs from it, and the command complains that the cluster doesn't exist.

That's why I made it work in parallel, so that while the cluster is uninstalling we record the logs to the file (I added --watch there for following the logs). When the cluster is uninstalled successfully, the logs command I had finishes with an Info message, not an Error message. I tried that with the gh-ryan-a cluster deletion yesterday and got the following logs at the end:

```
time=2024-05-14T00:43:00+02:00 level=debug msg=Response body follows
time=2024-05-14T00:43:00+02:00 level=debug msg={ "kind": "Error", "id": "404", "href": "/api/clusters_mgmt/v1/errors/404", "code": "CLUSTERS-MGMT-404", "reason": "Cluster '2b7r83i86c62iskvidlm3kuuro0164qd' not found", "operation_id": "e1c0996e-e621-4308-8683-d27bb44eeacf" }
time=2024-05-14T00:43:00+02:00 level=debug msg=Bearer token expires in 1m45.262401399s
time=2024-05-14T00:43:00+02:00 level=debug msg=Got tokens on first attempt
time=2024-05-14T00:43:00+02:00 level=debug msg=Request method is GET
time=2024-05-14T00:43:00+02:00 level=debug msg=Request URL is 'https://api.openshift.com/api/clusters_mgmt/v1/clusters/2b7r83i86c62iskvidlm3kuuro0164qd/status'
time=2024-05-14T00:43:00+02:00 level=debug msg=Request header 'Accept' is 'application/json'
time=2024-05-14T00:43:00+02:00 level=debug msg=Request header 'Authorization' is omitted
time=2024-05-14T00:43:00+02:00 level=debug msg=Request header 'User-Agent' is 'ROSACLI/1.2.23 OCM-SDK/0.1.347'
```

ryanemerson commented 2 months ago

Sorry, you're right @andyuk1986. I saw that we were still getting debug log output from the `rosa logs uninstall` command, but it's only the calls to AWS, not the actual uninstall information.

In that case +1 to Kamesh's timeout suggestion.

I would also suggest that you start watching the logs before you call destroy.sh to make sure nothing is lost in the unlikely event that the default tail limit is reached before rosa logs is executed.
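Putting both suggestions together, the job could look roughly like the sketch below. This is not the PR's actual implementation: `rosa` is replaced by a stub executable so the snippet is self-contained, `custom_date` is a stand-in for the repo's helper, the destroy step is omitted, and the 10-second cap is a placeholder for whatever timeout value is chosen.

```shell
# Create a stub `rosa` on PATH so the sketch runs without the real CLI.
mkdir -p stubbin
printf '#!/bin/sh\necho "stub: rosa $*"\n' > stubbin/rosa
chmod +x stubbin/rosa
PATH="$PWD/stubbin:$PATH"

custom_date() { date +%Y%m%d-%H%M%S; }   # stand-in for the repo helper
CLUSTER_NAME=gh-keycloak-a
LOG_FILE="$(custom_date)_delete-cluster.log"

# 1. Start following the uninstall logs BEFORE the destroy begins,
#    bounded by `timeout` so a missing cluster cannot hang the job.
timeout 10 rosa logs uninstall --debug --watch -c "${CLUSTER_NAME}" \
  > "${LOG_FILE}" &
LOGS_PID=$!

# 2. ./destroy.sh would run here (omitted in the sketch).

# 3. Wait for the watcher by PID; tolerate the expected 404 once the
#    cluster is gone, or exit code 124 if the timeout fires.
wait "${LOGS_PID}" || true
```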

andyuk1986 commented 2 months ago

@ryanemerson thanks a lot for your comment. I have updated the PR with the timeout implementation suggested by Kamesh, and I will also start watching the logs before starting the cluster destroy process. The only thing I just noticed is that --debug enables debug mode, but the debug logs are not saved to the file. I checked the logs for today's cluster creation and they contain only two lines:

```
INFO: Loading cluster 'gh-keycloak-a'
INFO: Cluster 'gh-keycloak-a' has been successfully installed
```

So I need to fix that as well.
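One possible cause of the missing debug lines (an assumption, not confirmed in this thread) is that the CLI writes them to stderr, while a plain `> file` redirection captures only stdout. A minimal illustration with a stand-in function:

```shell
# Stand-in for the CLI: INFO lines on stdout, debug lines on stderr
# (assumed behaviour, used here only to illustrate the redirection).
rosa_stub() {
  echo "INFO: Loading cluster 'gh-keycloak-a'"
  echo "time=... level=debug msg=Request method is GET" >&2
}

rosa_stub > stdout-only.log 2>/dev/null   # debug lines are lost
rosa_stub > both-streams.log 2>&1         # 2>&1 captures them as well
```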

andyuk1986 commented 2 months ago

@kami619 @ryanemerson the PR is ready for review.