Open gyohuangxin opened 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.
Uh-oh. We do need to complete this item.
It's possible to create machines on Equinix Metal in such a way that there's a termination time associated with them. See the "termination_time" field at
https://deploy.equinix.com/developers/api/metal/#tag/Devices/operation/createDevice
in the Equinix Metal API reference.
(That's not a substitute for cleanup, but it could backstop any other efforts if there's a bug somewhere else).
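For illustration, here is a minimal sketch of what setting that field could look like from a shell script. Only the "termination_time" field and the createDevice operation come from the API reference linked above. The helper names (termination_time_in, device_payload), the plan, metro, and operating_system values, and the METAL_AUTH_TOKEN and PROJECT_ID variables are placeholders I made up, and the timestamp helper assumes GNU date.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the plan/metro/operating_system values are placeholders.

# Print an ISO 8601 UTC timestamp N hours from now (GNU date syntax assumed).
termination_time_in() {
  local hours="$1"
  date -u -d "+${hours} hours" +"%Y-%m-%dT%H:%M:%SZ"
}

# Print a createDevice request body that carries a termination_time,
# so the machine is reclaimed even if the normal cleanup never runs.
device_payload() {
  local hostname="$1" termination_time="$2"
  printf '{"hostname":"%s","plan":"c3.small.x86","metro":"da","operating_system":"ubuntu_22_04","termination_time":"%s"}\n' \
    "$hostname" "$termination_time"
}

# Example call (not executed here); METAL_AUTH_TOKEN and PROJECT_ID are assumed to be set:
# curl -fsS -X POST "https://api.equinix.com/metal/v1/projects/${PROJECT_ID}/devices" \
#   -H "X-Auth-Token: ${METAL_AUTH_TOKEN}" -H "Content-Type: application/json" \
#   -d "$(device_payload ci-runner-1 "$(termination_time_in 4)")"
```

The termination window should comfortably exceed the longest expected test run, since a machine that expires mid-test would fail the job.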
There was a short-lived API outage yesterday, described at
https://status.equinixmetal.com/incidents/h30n2jlr5d3p
which may have impacted manual deletion of these systems. Please retry if you were affected by this. As of this writing, there are 48 systems deployed.
@vielmetti I'm still having trouble accessing the management UI:
@gyohuangxin can you open up a ticket with our support team? I'll share your UI issue with the team, but it may be something specific to your account.
@gyohuangxin Can you please task someone else on the project to assist you with cleaning up the idle and stranded resources while we sort out your access problems?
The code that notices that a deprovision failed is here
It logs an error:
echo "ERROR: Failed to remove CNCF CIL machine: $hostname, device id: $device_id."
and then exits without retrying. If the deprovision fails for any temporary reason, the machine lives on indefinitely until someone removes it manually.
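One way to address this would be to wrap the delete step in a retry loop with backoff, and only log the error once every attempt has failed. The sketch below is a suggestion, not the project's current code: retry and delete_device are hypothetical helper names, METAL_AUTH_TOKEN is an assumed variable, and I'm assuming the delete can be issued with curl against the Equinix Metal devices endpoint (the real script may use a different client).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to max_attempts times,
# sleeping between attempts and doubling the delay after each failure.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "WARN: attempt $attempt failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Assumed wrapper around the Equinix Metal "delete device" call.
delete_device() {
  local device_id="$1"
  curl -fsS -X DELETE -H "X-Auth-Token: ${METAL_AUTH_TOKEN}" \
    "https://api.equinix.com/metal/v1/devices/${device_id}"
}

# Usage in place of a single delete attempt (not executed here):
# if ! retry 5 10 delete_device "$device_id"; then
#   echo "ERROR: Failed to remove CNCF CIL machine: $hostname, device id: $device_id."
#   exit 1
# fi
```

With 5 attempts and an initial delay of 10 seconds, the waits are 10, 20, 40, and 80 seconds, which should be enough to ride out a brief API hiccup while keeping the existing error message as the last resort.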
Where does this error log go? If it's published somewhere we could look for patterns.
@Revolyssup, will you please add this to tomorrow’s CI meeting? @edwvilla’s help here is much appreciated. Let’s ensure that we have a quick review and resolution. // @gyohuangxin
All existing servers were manually deprovisioned today. A fresh batch of newly provisioned servers is now running from the workflow schedule. Let's see if those servers are automatically deprovisioned on completion of their task.
Yes, it seems that the test servers are successfully deprovisioned at end of test. 👍
Description
Some CNCF runners are not being removed after tests finish, and their number gradually increases over time. We can delete them manually, but it's better to make sure they are properly removed.
The same thing happens with Equinix server deletion:
Expected Behavior
We should add retries and confirmations to ensure CNCF runners and machines are removed.
Screenshots/Logs
Environment: