layer5io / meshery-performance-action

GitHub Action for pipelining microservices and Kubernetes performance testing with Meshery
https://layer5.io/projects/nighthawk
Apache License 2.0

Add retries and confirmations to ensure CNCF runners and machines are removed. #58

Open gyohuangxin opened 2 years ago

gyohuangxin commented 2 years ago

Description

Some CNCF runners are not being removed after the tests finish, and their number gradually increases over time. We can delete them manually, but it would be better to make sure they are removed properly.

The same thing happens with the deletion of Equinix servers.

Expected Behavior

We should add retries and confirmations to ensure CNCF runners and machines are removed.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

leecalcote commented 1 year ago

Uh-oh. We do need to complete this item.

vielmetti commented 1 year ago

It's possible to create machines on Equinix Metal in such a way that there's a termination time associated with them. See the "termination_time" field at

https://deploy.equinix.com/developers/api/metal/#tag/Devices/operation/createDevice

in the Equinix Metal API reference.

(That's not a substitute for cleanup, but it could backstop any other efforts if there's a bug somewhere else).
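As a rough sketch of that backstop, the request body sent to createDevice could carry a termination_time computed from a time-to-live. The function names, the two-hour TTL, and the placeholder hostname below are assumptions for illustration, not taken from this repository. The other fields the API requires are left out, and the timestamp helper assumes GNU date.

```shell
# Print an ISO 8601 UTC timestamp ttl_seconds after base_epoch
# (base_epoch defaults to the current time). Relies on GNU date's -d @EPOCH.
termination_time_after() {
  local ttl_seconds="$1"
  local base_epoch="${2:-$(date -u +%s)}"
  date -u -d "@$((base_epoch + ttl_seconds))" +%Y-%m-%dT%H:%M:%SZ
}

# Print a partial createDevice request body. Only hostname and
# termination_time are shown; the other fields the API requires are omitted.
device_payload() {
  local hostname="$1"
  local termination_time="$2"
  printf '{"hostname":"%s","termination_time":"%s"}\n' "$hostname" "$termination_time"
}

# Example: a machine that would be terminated two hours from now.
# "cil-runner-example" is a placeholder hostname.
device_payload "cil-runner-example" "$(termination_time_after 7200)"
```

The TTL should comfortably exceed the longest expected test run so that the backstop never terminates a machine that is still in use.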

vielmetti commented 1 year ago

There was a short-lived API outage yesterday, described at

https://status.equinixmetal.com/incidents/h30n2jlr5d3p

which may have impacted manual deletion of these systems. Please retry if you were affected by this. As of this writing, there are 48 systems deployed.

gyohuangxin commented 1 year ago

@vielmetti I'm still unable to access the management UI.

vielmetti commented 1 year ago

@gyohuangxin can you open up a ticket with our support team? I'll share your UI issue with the team, but it may be something specific to your account.

vielmetti commented 1 year ago

@gyohuangxin Can you please ask someone else on the project to help you clean up the idle and stranded resources while we sort out your access problems?

vielmetti commented 1 year ago

The code that notices that a deprovision failed is here

https://github.com/layer5io/meshery-smp-action/blob/862c5283953f1b5a3a607c9e1f00461f98a4b4d5/.github/workflows/scripts/stop-cil-runner.sh#L19

It logs an error:

echo "ERROR: Failed to remove CNCF CIL machine: $hostname, device id: $device_id."

and then exits without retrying. If the deletion fails for any temporary reason, the machines will live forever until someone removes them manually.

Where does this error log go? If it's published somewhere we could look for patterns.
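One way the retry and confirmation requested in this issue could look is sketched below. This is not the code in stop-cil-runner.sh. The helpers delete_device and device_gone are hypothetical and would have to be written against the Equinix Metal API. In particular, device_gone should succeed only when the API positively confirms that the device no longer exists, so that an API outage is not mistaken for a successful deletion.

```shell
# Run a command up to max_attempts times, sleeping delay_seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $attempt/$max_attempts failed: $*" >&2
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Request deletion, then confirm the device is really gone.
# delete_device and device_gone are hypothetical helpers supplied by the caller.
# A failed delete is tolerated here because the confirmation decides the result.
remove_and_confirm() {
  local device_id="$1"
  delete_device "$device_id" || true
  device_gone "$device_id"
}
```

With those helpers defined, the script could call `retry 5 30 remove_and_confirm "$device_id"` and print the existing ERROR message only after every attempt has failed. The attempt count and delay are illustrative values.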

leecalcote commented 1 year ago

@Revolyssup, will you please add this to tomorrow’s CI meeting? @edwvilla’s help here is much appreciated. Let’s ensure that we have a quick review and resolution. // @gyohuangxin

leecalcote commented 1 year ago

All existing servers were manually deprovisioned today. A fresh batch of servers, provisioned by the workflow schedule, is now running. Let's see whether those servers are automatically deprovisioned when their tasks complete.

leecalcote commented 1 year ago

Yes, it seems that the test servers are successfully deprovisioned at the end of the test. 👍